Intelligent subsystems
Load management;
Environment control;
Lighting control;
Security;
Safety;
Access control;
Voice communication;
Entertainment.
Several other applications that can make use of the
communications that exist outside the home include:
Home banking;
Information services;
Low-cost implementation;
Communication by integrating the household management system into external communications facilities.
IEEE 1394
In order to combine entertainment, communication, and
computing electronics in consumer multimedia, digital
interfaces have been created. Such is the case of IEEE
1394, which was conceived by Apple Computer as a desktop
LAN, and then was standardized by the IEEE 1394 working group.
IEEE 1394 can be described as a low-cost digital inter-
face with the following characteristics:
High reliability;
Health-related websites.
Health applications for today's household are very limited in their range. In some countries, smart cards carry patients' data for billing and insurance companies or health consultancy software for private diagnosis and information about certain diseases. In the future, expert systems will enable medical advice from each home without leaving one's bed.
Home Services. Home services consist of systems that
support home security, safety, meal preparation, heating,
cooling, lighting, and laundry.
Currently, home services comprise only special devices
such as those in a networked kitchen. Future scenarios
project comprehensive home automation with intercon-
nected kitchen appliances, audio and video electronics,
and other systems like heating or laundry. Some proto-
types by the German company Miele (called Miele @home)
showed early in the development of smart homes that
the TV can control the washing machine. The interconnection to out-of-home cable TV or telephone networks enables remote-control services (e.g., security). The Internet refrigerator by NCR, which orders needed groceries without human interaction, received much media attention.
Key areas comprise:
Adult content.
Education. Education refers to all applications that
train and educate members of the household in special
skills or knowledge. In an increasingly dynamic private
environment, this function will gain in importance.
Distance Learning (DL) is frequently a self-selected
activity for students with work and family commitments.
Effects of social isolation should thus be limited. For
instance, DL can facilitate daycare arrangements. In
some circumstances, exclusion from the social network of
the face-to-face classroom can be one of the drawbacks
of DL (21). The private household uses this type of education to train special skills it is interested in, using off-line computer-based training (CBT) software on CD-ROM or DVD to improve, for example, a foreign language for the next holiday abroad or naval rules in order to pass the sailing exam. In addition, electronically accessible libraries and content on the Internet open the field of self-education to the private sphere. The use of artificial intelligence will substitute for human teachers as far as possible and make them more efficient for special tasks. Virtual reality will help by visualizing and demonstrating complex issues. Increasingly, colleges and universities offer DL classes based on strong demand from traditional and nontraditional students. Besides the added flexibility and the benefit for students who are reluctant to speak up in class, DL benefits those students living far from the place of instruction. Dholakia et al. (22) found that DL has the potential to reduce or modify student commutes.
OUTLOOK
Figure 2 summarizes the home services and shows some of the expected developments for the next several years. It outlines three possible scenarios (status quo 2004, scenario 2007, and scenario 2010) based on the assessment of past, current, and future trends and developments of services.
BIBLIOGRAPHY
1. H. Stipp, Should TV marry PC? American Demographics, July 1998, pp. 16–21.
2. Computer companies muscle into field of consumer electronics, Providence Sunday Journal, January 11, 2004, 15.
3. Consumer electronics show is packed with Jetson-style gadgets, Providence Journal, January 10, 2004, B1, 8.
4. F. Cairncross, The Death of Distance, Boston, MA: Harvard Business School Press, 1997.
5. E. Schonfeld, Don't just sit there, do something. E-company, 2000, pp. 155–164.
6. Time Warner is pulling the plug on a visionary foray into interactive TV, Providence Journal, May 11, 1997, A17.
7. N. Lockwood Tooher, The next big thing: Interactivity, Providence Journal, February 14, 2000, A1, 7.
8. S. Levy, The new digital galaxy, Newsweek, May 31, 1999, 57–63.
9. B. Lee, Personal Technology, in Red Herring, 119, 56–57, 2002.
10. A. Higgins, Jetsons, Here we come!, in Machine Design, 75 (7): 52–53, 2003.
11. L. Kolbe, Informationstechnik für den privaten Haushalt (Information Technology for the Private Household), Heidelberg: Physica, 1997.
12. W. Brenner and L. Kolbe, Information processing in the private household, in Telemat. Informat., 12 (2): 97–110, 1995.
13. I. Miles, Home Informatics, Information Technology and the Transformation of Everyday Life, London, 1988.
14. A. T. Kearney, Cambridge business school mobile commerce study 2002. Available: http://www.atkearney.com/main.taf?p=1,5,1,106. Jan. 10, 2004.
15. Economist 2001: Looking for the pot of gold, in Economist, Special supplement: The Internet, untethered, October 13, 2001, 11–14.
16. O. Jüptner, Over five billion text messages sent in UK. Available: http://www.e-gateway.net/infoarea/news/news.cfm?nid=2415. January 9, 2003.
17. Jupiter Communications Company, Consumer Information Appliance, 5 (2): 223, 1994.
18. P. Grimaldi, Net retailers have season of success, Providence Journal, December 27, 2003, B1, 2.
19. J. Lee, An end-user perspective on file-sharing systems, in Communications of the ACM, 46 (2): 49–53, 2003.
20. N. Mundorf, Distance learning and mobility, in IFMO, Institut für Mobilitätsforschung, Auswirkungen der virtuellen Mobilität (Effects of virtual mobility), 2004, pp. 257–272.
21. N. Dholakia, N. Mundorf, R. R. Dholakia, and J. J. Xiao, Interactions of transportation and telecommunications behaviors, University of Rhode Island Transportation Center Research Report 536111.
FURTHER READING
N. Mundorf and P. Zoche, Nutzer, private Haushalte und Informationstechnik, in P. Zoche (ed.), Herausforderungen für die Informationstechnik, Heidelberg, 1994, pp. 61–69.
A. Reinhardt, Building the data highway, in Byte International Edition, March 1994, pp. 46–74.
F. Van Rijn and R. Williams, Concerning home telematics, Proc. IFIP TC 9, 1988.
NORBERT MUNDORF
University of Rhode Island
Kingston, Rhode Island
LUTZ KOLBE
REMOTE SENSING INFORMATION
PROCESSING
INTRODUCTION
With the rapid advance of sensors for remote sensing, including radar, microwave, multispectral, hyperspectral, and infrared sensors, the amount of data available has increased dramatically, and detailed or specific information must be extracted from it. Information processing, which makes extensive use of powerful computers and of techniques in computer science and engineering, has played a key role in remote sensing. In this article, we will review some major topics in information processing, including image processing and segmentation, pattern recognition and neural networks, data and information fusion, knowledge-based systems, image mining, image compression, and so on. References 1–5 provide some useful references on information processing in remote sensing.
In remote sensing, the large amount of data makes it necessary to perform some type of transform that preserves the essential information while considerably reducing the amount of data. In fact, most remote sensing image
data are redundant, correlated, and noisy. Transform
methods can help in three ways: by effective data repre-
sentation, effective feature extraction, and effective image
compression. Component analysis is key to transform
methods. Both principal component analysis and inde-
pendent component analysis will be examined for remote
sensing.
IMAGE PROCESSING AND IMAGE SEGMENTATION
The motivation to enhance the noisy images sent back from satellites in the early 1960s has had significant impact on subsequent progress in digital image processing. For example, digital filtering such as Wiener filtering allows us to restore the original image from its noisy versions. Some newer image processing methods, such as wavelet transforms and morphological methods, have been useful for remote sensing images. One important activity in remote sensing is the speckle reduction of SAR (synthetic aperture radar) images. The speckle appearing in SAR images is caused by the coherent interference of waves reflected from many elementary scatterers. The statistics of SAR speckle have been well studied (6). Over 100 articles have been published on techniques to remove the speckle. One of the best-known techniques is the Lee filter, which makes use of the local statistics (7). More recent studies of the subject are reported in Refs. 8 and 9.
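As a concrete illustration, the following is a minimal sketch of a local-statistics filter in the spirit of the Lee filter, not the exact formulation of Ref. 7; the window size, the noise-variance estimate, and the function name are illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_statistics_filter(img, window=7, noise_var=None):
    """Lee-style despeckling: shrink each pixel toward its local mean.

    The gain k is near 0 in homogeneous regions (output ~ local mean,
    speckle suppressed) and near 1 where local variance is high
    (edges and detail preserved).
    """
    img = np.asarray(img, dtype=np.float64)
    local_mean = uniform_filter(img, size=window)
    local_sq_mean = uniform_filter(img * img, size=window)
    local_var = np.maximum(local_sq_mean - local_mean ** 2, 0.0)
    if noise_var is None:
        # crude estimate: average local variance over the whole image
        noise_var = local_var.mean()
    k = np.clip((local_var - noise_var) / np.maximum(local_var, 1e-12), 0.0, 1.0)
    return local_mean + k * (img - local_mean)
```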
Image restoration in remote sensing is required to
remove the effects of atmospheric and other interference,
as well as the noise introduced by the sensors. A good example is the restoration of images from the Hubble Space
Telescope.
Image segmentation attempts to define the regions and the boundaries of the regions. Techniques have been developed that preserve the edges and smooth out the individual regions. Image segmentation may involve pixel-by-pixel classification, which often requires using pixels of known classification for training. Segmentation may also involve region growing, which is essentially unsupervised. A good example is to extract precisely the region of Lake Mulargia on the island of Sardinia in Italy (10). The original image is shown in Fig. 1a. The segmentation result is shown in Fig. 1b, from which the exact size of the lake can be determined. For land-based remote sensing images, pixel-by-pixel classification allows us to determine precisely the area covered by each crop and to assess the changes from one month to the next during the growing season. Similarly, flood damage can be determined from the satellite images. These examples are among the many applications of image segmentation.
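A minimal sketch of intensity-based region growing from a seed pixel is given below; the tolerance test (absolute difference from the running region mean) and the 4-connected neighborhood are simplifying assumptions, not the ARMA-model-based method of Ref. 10.

```python
import numpy as np
from collections import deque

def region_grow(img, seed, tol=10.0):
    """Grow a region from `seed` (row, col), adding 4-connected neighbors
    whose intensity stays within `tol` of the current region mean."""
    img = np.asarray(img, dtype=np.float64)
    mask = np.zeros(img.shape, dtype=bool)
    mask[seed] = True
    region_sum, region_n = img[seed], 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < img.shape[0] and 0 <= cc < img.shape[1] and not mask[rr, cc]:
                if abs(img[rr, cc] - region_sum / region_n) <= tol:
                    mask[rr, cc] = True
                    region_sum += img[rr, cc]
                    region_n += 1
                    queue.append((rr, cc))
    return mask  # boolean image of the grown region, e.g., the lake area
```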
PATTERN RECOGNITION AND NEURAL NETWORKS
A major topic in pattern recognition is feature extraction.
An excellent discussion of feature extraction and selection
problem in remote sensing with multispectral and hyper-
spectral images is given by Landgrebe (5). In remote
sensing, features are usually taken from the measurements of spectral bands, which means 6 to 8 features in multispectral data but a feature vector dimension of several hundred in hyperspectral image data. With a limited number of training samples, increasing the feature dimension in hyperspectral images may actually degrade the classification performance, which is referred to as the Hughes phenomenon. Reference 5 presents procedures to reduce such phenomena. Neural networks have found many uses in remote sensing, especially in pattern classification. The back-propagation trained network, the radial basis function network, and the support vector machine are the three best-performing neural networks for classification. A good discussion of statistical and neural network methods in remote sensing classification is contained in Ref. 11 as well as in many other articles that appear in Refs. 3, 4, and 12. A major advantage of neural networks is that learning is from the training data only, and no assumption of a data model such as a probability density is required. Also, it has been found that combining two neural network classifiers, such as combining SOM, the self-organizing map, with a radial basis function network, can achieve better classification than either one used alone (13).
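As an illustration of pixel-by-pixel classification with one of the classifiers mentioned above, the sketch below trains a support vector machine on labeled spectral vectors using the scikit-learn library; the file names, band count, and class labels are placeholders, and the kernel settings are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# X_train: (n_labeled_pixels, n_bands) spectral feature vectors of known class
# y_train: (n_labeled_pixels,) class labels, e.g., crop types
X_train = np.load("training_spectra.npy")   # hypothetical training data
y_train = np.load("training_labels.npy")

clf = SVC(kernel="rbf", C=10.0, gamma="scale")  # RBF support vector machine
clf.fit(X_train, y_train)

# classify every pixel of an image cube of shape (rows, cols, n_bands)
cube = np.load("image_cube.npy")
labels = clf.predict(cube.reshape(-1, cube.shape[-1])).reshape(cube.shape[:2])
```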
One problem that is fairly unique and significant to remote sensing image recognition is the use of contextual information in pattern recognition. In remote sensing image data, there is a large amount of contextual information that must be used to improve the classification. The usual procedure for contextual pattern recognition is to work with image models that exploit the contextual dependence. Markov random field models are the most popular, and with only a slightly increased amount of computation, classification performance can be improved with the use of such models (2,4,12).
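One simple way to use such a model is iterated conditional modes (ICM) with a Potts-type smoothness prior; the sketch below assumes per-pixel class scores (for example, log-probabilities from a pixelwise classifier), and the smoothness weight and iteration count are illustrative, not values prescribed by the references above.

```python
import numpy as np

def icm_potts(unary, beta=1.0, n_iter=5):
    """unary: (rows, cols, n_classes) per-pixel class scores (higher is better).
    Each sweep re-labels every pixel to maximize its own score plus
    beta times the number of 4-neighbors already carrying that label."""
    labels = unary.argmax(axis=2)
    rows, cols, n_classes = unary.shape
    for _ in range(n_iter):
        for r in range(rows):
            for c in range(cols):
                votes = np.zeros(n_classes)
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= rr < rows and 0 <= cc < cols:
                        votes[labels[rr, cc]] += 1
                labels[r, c] = np.argmax(unary[r, c] + beta * votes)
    return labels
```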
Another pattern recognition topic is change detection. The chapters by Serpico and Bruzzone (14) and Moser et al. (15) are recommended reading.
DATA FUSION AND KNOWLEDGE-BASED SYSTEMS
In remote sensing, there are often data from several sensors or sources. There is no optimum or well-accepted approach to the fusion problem. Approaches can range from the more theoretic, like consensus theory (16), to fuzzy logic, neural networks, and multistrategy learning, to fairly ad hoc techniques in some knowledge-based systems that combine or merge information from different sources. In some cases, fusion at the decision level can be more effective than fusion at the data or feature level, but the other way can be true in other cases. Readers are referred to the chapters by Solaiman (17), Benediktsson and Kanellopoulos (18), and Binaghi et al. (19) for detailed discussion.
IMAGE MINING, IMAGE COMPRESSION,
AND WAVELET ANALYSIS
To extract certain desired information from a large remote
sensing image database, we have the problem of data
mining. For remote sensing images, Aksoy et al. (20) des-
cribe a probabilistic visual grammar to automatically
analyze complex query scenarios using spatial relation-
ships of regions, and to use it for content-based image retrieval and classification. Their hierarchical scene modeling bridges the gap between feature extraction and
semantic interpretation. For image compression of hyper-
spectral images, Qian et al. (21) provide a survey of major
approaches including vector quantization, discrete cosine
transform and wavelet transform, and so on. A more com-
prehensive survey of remote sensing image compression is
provided by Aiazzi et al.(22). Besides its important cap-
ability in image compression, wavelet transform has a
major application in de-noising SAR images (23). The
wavelet analysis of SAR images can also be used for
near-real-time quick look screening of satellite data,
data reduction, and edge linking (24).
COMPONENT ANALYSIS
Transform methods are often employed in remote sens-
ing. A key to the transform methods is the component
analysis, which, in general, is the subspace analysis. In
remote sensing, component analysis includes the principal
component analysis (PCA), curvilinear component analysis
(CCA), and independent component analysis (ICA). The
three component analysis methods are different concep-
tually. PCA looks for the principal components according to the second-order statistics. CCA performs a nonlinear feature space transformation while trying to preserve as much as possible of the original data information in the lower-dimensional space; see Ref. 25. ICA looks for
independent components from the original data assumed
to be linearly mixed from several independent sources.
Nonlinear PCA that makes use of the higher-order statis-
tical information (26) can provide an improvement over the
linear PCA that employs only the second-order covariance
information.
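A minimal sketch of PCA applied to pixel spectra follows: the spectral covariance matrix is eigendecomposed and each pixel is projected onto the leading eigenvectors; the number of retained components is an arbitrary choice.

```python
import numpy as np

def pca_transform(cube, n_components=3):
    """cube: (rows, cols, n_bands) image; returns (rows, cols, n_components)
    principal-component images based only on second-order statistics."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(np.float64)
    X -= X.mean(axis=0)                       # center each band
    cov = np.cov(X, rowvar=False)             # (bands, bands) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]  # leading eigenvectors
    return (X @ top).reshape(rows, cols, n_components)
```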
ICA is a useful extension of the traditional PCA.
Whereas PCA attempts to decorrelate the components in
a vector, ICA methods are to make the components as
independent as possible. There are currently many
approaches available for ICA (27). ICA applications in
remote sensing have become a new topic in recent years.
Figure 1. Original image of the Lake Mulargia region in Italy (a) and the result of region growing to extract the lake area (b).
S. Chiang et al. employed ICA in AVIRIS (airborne
visible infrared imaging spectrometer) data analysis (28). T. Tu used a noise-adjusted version of fast independent component analysis (NAFICA) for unsupervised signature extraction and separation in hyperspectral images (29). With remote sensing in mind, we developed a new ICA method that makes use of the higher-order statistics. The work is quite different from that of Cardoso (30). We name it the joint cumulant ICA (JC-ICA) algorithm (31,32). It can be implemented efficiently by a neural network. Experimental evidence (31) shows that, for SAR image pixel classification, a small subset of ICA features performs a few percentage points better than the use of the original data or of PCA features. The significant component images obtained by ICA have less speckle noise and are more informative. Furthermore, for hyperspectral images, ICA can be useful for selecting or reconfiguring spectral bands so that the desired objects in the images may be enhanced (32). Figures 2 and 3 show, respectively, an original AVIRIS image and the enhanced image using the JC-ICA approach. The latter has more of the desired details.
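The JC-ICA algorithm itself is described in Refs. 31 and 32; purely to illustrate applying a generic ICA implementation (not JC-ICA) to spectral bands, a sketch using scikit-learn's FastICA is given below, with the file name and the number of components chosen arbitrarily.

```python
import numpy as np
from sklearn.decomposition import FastICA

cube = np.load("hyperspectral_cube.npy")      # hypothetical (rows, cols, bands) array
rows, cols, bands = cube.shape
X = cube.reshape(-1, bands)                   # one spectrum per pixel

ica = FastICA(n_components=8, random_state=0)
scores = ica.fit_transform(X)                 # independent-component scores per pixel
component_images = scores.reshape(rows, cols, -1)  # view each component as an image
```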
CONCLUSION
In this article, an overview is presented of a number of
topics and issues on information processing for remote
sensing. One common theme is the effective use of comput-
ing power to extract the desired information fromthe large
amount of data. The progress in computer science and
engineering certainly presents many new and improved
procedures for information processing in remote sensing.
BIBLIOGRAPHY
1. J. A. Richards and X. Jia, Remote sensing digital image analysis, 3rd ed., New York: Springer, 1991.
2. R. A. Schowengerdt, Remote sensing: Models and methods for image processing, New York: Academic Press, 1977.
3. C. H. Chen (ed.), Information processing for remote sensing, Singapore: World Scientific Publishing, 1999.
4. C. H. Chen (ed.), Frontiers of remote sensing information processing, Singapore: World Scientific Publishing, 2003.
5. D. Landgrebe, Signal theory methods in multispectral remote sensing, New York: Wiley, 2003.
6. J. S. Lee, et al., Speckle filtering of synthetic aperture radar images: a review, Remote Sens. Rev., 8: 313–340, 1994.
7. J. S. Lee, Digital image enhancement and noise filtering by use of local statistics, IEEE Trans. Pattern Anal. Machine Intell., 2 (2): 165–168, 1980.
8. J. S. Lee, Speckle suppression and analysis for synthetic aperture radar images, Opt. Eng., 25 (5): 636–643, 1996.
9. J. S. Lee and M. Grunes, Polarimetric SAR speckle filtering and terrain classification: an overview, in C. H. Chen (ed.), Information processing for remote sensing, Singapore: World Scientific Publishing, 1999.
10. P. Ho and C. H. Chen, On the ARMA model based region growing method for extracting lake region in a remote sensing image, SPIE Proc., Sept. 2003.
11. J. A. Benediktsson, On statistical and neural network pattern recognition methods for remote sensing applications, in C. H. Chen et al. (eds.), Handbook of Pattern Recognition and Computer Vision, 2nd ed., Singapore: World Scientific Publishing, 1999.
12. E. Binaghi, P. Brivio, and S. B. Serpico (eds.), Geospatial Pattern Recognition, Research Signpost, 2002.
Figure 2. An AVIRIS image of Moffett Field.
Figure 3. Enhanced image using JC-ICA.
13. C. H. Chen and B. Sherestha, Classification of multi-source remote sensing images using self-organizing feature map and radial basis function networks, Proc. of IGARSS 2000.
14. S. B. Serpico and L. Bruzzone, Change detection, in C. H. Chen (ed.), Information processing for remote sensing, Singapore: World Scientific Publishing, 1999.
15. G. Moser, F. Melgani, and S. B. Serpico, Advances in unsupervised change detection, Chapter 18 in C. H. Chen (ed.), Frontiers of remote sensing information processing, Singapore: World Scientific Publishing, 2003.
16. J. A. Benediktsson and P. H. Swain, Consensus theoretic classification methods, IEEE Trans. Syst. Man Cybernet., 22 (4): 688–704, 1992.
17. B. Solaiman, Information fusion for multispectral image classification post processing, in C. H. Chen (ed.), Information processing for remote sensing, Singapore: World Scientific Publishing, 1999.
18. J. A. Benediktsson and I. Kanellopoulos, Information extraction based on multisensor data fusion and neural networks, in C. H. Chen (ed.), Information processing for remote sensing, Singapore: World Scientific Publishing, 1999.
19. E. Binaghi, et al., Approximate reasoning and multistrategy learning for multisource remote sensing data interpretation, in C. H. Chen (ed.), Information processing for remote sensing, Singapore: World Scientific Publishing, 1999.
20. S. Aksoy, et al., Scene modeling and image mining with a visual grammar, Chapter 3 in C. H. Chen (ed.), Frontiers of remote sensing information processing, Singapore: World Scientific Publishing, 2003.
21. S. Qian and A. B. Hollinger, Lossy data compression of 3-dimensional hyperspectral imagery, in C. H. Chen (ed.), Information processing for remote sensing, Singapore: World Scientific Publishing, 1999.
22. B. Aiazzi, et al., Near-lossless compression of remote-sensing data, Chapter 23 in C. H. Chen (ed.), Frontiers of remote sensing information processing, Singapore: World Scientific Publishing, 2003.
23. H. Xie, et al., Wavelet-based SAR speckle filters, Chapter 8 in C. H. Chen (ed.), Frontiers of remote sensing information processing, Singapore: World Scientific Publishing, 2003.
24. A. Liu, et al., Wavelet analysis of satellite images in ocean applications, Chapter 7 in C. H. Chen (ed.), Frontiers of remote sensing information processing, Singapore: World Scientific Publishing, 2003.
25. M. Lennon, G. Mercier, and M. C. Mouchot, Curvilinear component analysis for nonlinear dimensionality reduction of hyperspectral images, SPIE Remote Sensing Symposium Conference 4541, Image and Signal Processing for Remote Sensing VII, Toulouse, France, 2001.
26. E. Oja, The nonlinear PCA learning rule in independent component analysis, Neurocomputing, 17 (1): 1997.
27. A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, New York: Wiley, 2001.
28. S. Chiang, et al., Unsupervised hyperspectral image analysis using independent component analysis, Proc. of IGARSS 2000, Hawaii, 2000.
29. T. Tu, Unsupervised signature extraction and separation in hyperspectral images: A noise-adjusted fast independent component analysis approach, Opt. Eng., 39: 2000.
30. J. Cardoso, High-order contrasts for independent component analysis, Neural Comput., 11: 157–192, 1999.
31. X. Zhang and C. H. Chen, A new independent component analysis (ICA) method and its application to remote sensing images, J. VLSI Signal Proc., 37 (2/3): 2004.
32. X. Zhang and C. H. Chen, On a new independent component analysis (ICA) method using higher order statistics with application to remote sensing image, Opt. Eng., July 2002.
C. H. CHEN
University of Massachusetts,
Dartmouth
North Dartmouth, Massachusetts
TRANSACTION PROCESSING
A business transaction is an interaction in the real world,
usually between an enterprise and a person, in which
something, such as money, products, or information, is
exchanged (1). It is often called a computer-based transac-
tion, or simply a transaction, when some or all of the work is done by computers. Similar to traditional
computer programs, a transaction program includes func-
tions of input and output and routines for performing
requested work. A transaction can be issued interactively
by users through a Structured Query Language (SQL) or
some sort of forms. A transaction can also be embedded in
the application program written in a high-level language
such as C, Pascal, or COBOL.
A transaction processing (TP) system is a computer
system that processes the transaction programs. A collec-
tion of such transaction programs designed to perform the
functions necessary to automate given business activities is often called an application program (application software). Figure 1 shows a transaction processing system. The transaction programs are submitted to clients, and the requests will be scheduled by the transaction processing monitor and
then processed by the servers. A TP monitor is a piece of
software that connects multiple clients to multiple servers
to access multiple data resources (databases) in TP sys-
tems. One objective of the TP monitor is to optimize the use
of system and network resources when clients and servers
execute on different processors.
TP is closely associated with database systems. In fact,
most earlier TP systems, such as banking and airlines
reservation systems, are database systems, in which
data resources are organized into databases and TP is
supported by database management systems (DBMSs).
In traditional database systems, transactions are usually
simple and independent, and they are characterized as
short duration in that they will be finished within minutes (probably seconds). Traditional transaction systems have some limitations for many advanced applications such as cooperative work, in which transactions need to cooperate with each other. For example, in cooperative environments, several designers might work on the same project. Each designer starts up a cooperative transaction. Those cooperative transactions jointly form a transaction group. Cooperative transactions in the same transaction group may read or update each other's uncommitted (unfinished) data. Therefore, cooperative transactions may be interdependent. Currently, some research work on advanced TP has been conducted in several related areas such as computer-supported cooperative work (CSCW) and groupware, workflow, and advanced transaction models (2–6). In this article, we will first discuss traditional transaction concepts and then examine some advanced transaction models.
Because of recent developments in laptop or notebook
computers and low-cost wireless digital communication,
mobile computing began to emerge in many applications.
As wireless computing leads to situations where machines
and data no longer have fixed locations in the network, distributed transactions will be difficult to coordinate, and data consistency will be difficult to maintain. In this article, we will also briefly discuss the problems and possible solutions in mobile transaction processing.
This paper is organized as follows. First, we will introduce traditional database TP, including concurrency control and recovery in centralized database TP. The next section covers the topics on distributed TP. Then, we discuss advanced TP and define an advanced transaction model and a correctness criterion. Mobile TP is also presented. Finally, future research directions are included.
DATABASE TRANSACTION PROCESSING
As database systems are the earlier form of TP systems, we will start with database TP.
Database Transactions
A database system refers to a database and the access
facilities (DBMS) to the database. One important job of
DBMSs is to control and coordinate the execution of con-
current database transactions.
A database is a collection of related data items that
satisfy a set of integrity constraints. The database should
reflect the relevant state as a snapshot of the part of the real world it models. It is natural to assume that the states of the database are constrained to represent the legal (permissible) states of the world. The set of integrity constraints, such as functional dependencies, referential integrity, inclusion, exclusion constraints, and some other user-defined constraints, is identified in the process of information analysis of the application domain. These constraints
represent real-world conditions or restrictions (7). For
example, functional dependencies specify some constraints
between two sets of attributes in a relation schema,
whereas referential integrity constraints specify con-
straints between two sets of attributes from different rela-
tions. For detailed definitions and discussions on various
constraints, we refer readers to Refs. 7 and 8. Here, we
illustrate only a few constraints with a simple example.
Suppose that a relational database schema has the following two table structures for Employee and Department, with attributes like Name and SSN:
Employee (Name, SSN, Bdate, Address, Dnumber)
Department (Dname, Dnumber, Dlocation)
Name: employee name
SSN: social security number
Bdate: birth date
Address: living address
Dnumber: department number
Dname: department name
Dlocation: department location
Each employee has a unique social security number (SSN)
that can be used to identify the employee. For each SSN
value in the Employee table, there will be only one asso-
ciated value for Bdate, Address, and Dnumber in the table,
respectively. In this case, there are functional dependen-
cies from SSN to Bdate, Address, Dnumber. If any Dnum-
ber value in the Employee relation has the same Dnumber
value inthe Department relation, there will be areferential
integrity constraint from Employees Dnumber to Depart-
ments Dnumber.
A database is said to be consistent if it satisfies a set of
integrity constraints. It is assumed that the initial state of
the database is consistent. As an empty database always
satises all constraints, often it is assumed that the initial
state is an empty database. It is obvious that a database
system is not responsible for possible discrepancies
between a state of the real world and the corresponding
state of the database if the existing constraints were inade-
quately identified in the process of information analysis. The values of data items can be queried or modified by a set of application programs or transactions. As the states of the
database corresponding to the states of the real world are
consistent, a transaction can be regarded as a transforma-
tion of a database from one consistent state to another
consistent state. Users' access to a database is facilitated by the software system called a DBMS, which provides services for maintaining consistency, integrity, and security of the database. Figure 2 illustrates a simplified database system. The transaction scheduler provides functions
for transaction concurrency control, and the recovery man-
ager is for transaction recovery in the presence of failures,
which will be discussed in the next section.
The fundamental purpose of the DBMS is to carry out
queries and transactions. A query is an expression, in a
suitable language, that determines a portion of the data
contained in the database (9). A query is considered as
a read-only transaction. The goal of query processing is
extracting information from a large amount of data to assist
a decision-making process. A transaction is a piece of
programming that manipulates the database by a sequence
of read and write operations.
read(X) or R(X), which transfers the data item X from the
database to a local buffer of the transaction
write(X) or W(X), which transfers the data item X from
the local buffer of the transaction back to the data-
base
In addition to read and write operations, a transaction
starts with a start (or begin) operation and ends with a
commit operation when the transaction succeeds or an
abort when the transaction fails to finish. The following
example shows a transaction transferring funds between
two bank accounts (start and end operations are omitted).
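A minimal sketch of such a fund transfer, written with the read and write operations defined above, is shown below; the account identifiers, the amount, and the db helper object are illustrative and not tied to any particular DBMS interface.

```python
def transfer(db, source_acct, dest_acct, amount=100):
    """One transaction moving `amount` between two accounts.
    The start/begin and commit operations are omitted, as in the text."""
    a = db.read(source_acct)   # read(A): copy item A into the local buffer
    a = a - amount
    db.write(source_acct, a)   # write(A): copy the buffer back to the database
    b = db.read(dest_acct)     # read(B)
    b = b + amount
    db.write(dest_acct, b)     # write(B)
```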
Figure 1. TP monitor between clients and data resources.
Figure 2. A simplified database system (transactions T1, ..., Tn; transaction manager with transaction scheduler and recovery manager; DBMS; database).
ACTIVE CONTOURS: SNAKES

$$E_{\text{int}} = \underbrace{w_1 \left| \frac{\partial \Gamma(s)}{\partial s} \right|^{2}}_{\text{membrane}} + \underbrace{w_2 \left| \frac{\partial^{2} \Gamma(s)}{\partial s^{2}} \right|^{2}}_{\text{thin plate}} \qquad (3)$$

where Γ(s) denotes the curve and w_1 and w_2 are the weights.
Practically, when $w_1 \gg w_2$, the curve is allowed to kink, whereas $w_1 \ll w_2$ forces the curve to bend slowly. A common practice is to use different weights for each control point, so that both $w_1$ and $w_2$ become functions of s. This approach allows parts of the snake to have corners and allows the other parts to be smooth.
SNAKE EVOLUTION
Figure 1. The convolution mask to detect (a) the vertical edges and (b) the horizontal edges. (c) An input image. Resulting (d) vertical edges after convolving (a) with (c), and (e) horizontal edges after convolving (b) with (c). (f) The gradient magnitude image generated using (d) and (e) for highlighting the boundary of the object.
Figure 2. (a) The edges obtained by applying the Canny edge detector (3) with different thresholds. Note the ambiguity of the features that will guide the snake evolution. (b) The inside and outside regions defined by the snake.
Figure 3. (a) The membrane function and (b) the thin plate function, which are regularization filters of order 1 and 2, respectively.
Figure 4. The evolution of an initial snake using the gradient magnitude image shown in Fig. 1(f) as its feature.
The motion of each control point, which is governed by Equation (1), evolves the underlying curve to a new configuration. This process is shown in Fig. 4. Computing the motion of a control point $s_i$ requires the evaluation of the first- and the second-order curve derivatives in a neighborhood $\Gamma(s_i)$. An intuitive approach to evaluate the curve
derivatives is to use the finite difference approximation:

$$\frac{\partial \Gamma(s_i)}{\partial s} \approx \frac{\Gamma(s_i) - \Gamma(s_{i-1})}{d} \qquad (4)$$

$$\frac{\partial^{2} \Gamma(s_i)}{\partial s^{2}} \approx \frac{\Gamma(s_{i+1}) - 2\,\Gamma(s_i) + \Gamma(s_{i-1})}{d^{2}} \qquad (5)$$
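A minimal sketch of Equations (4) and (5) for a closed contour sampled at N control points follows; np.roll supplies the neighboring points, and a uniform spacing d between neighbors is assumed for simplicity.

```python
import numpy as np

def contour_derivatives(points, d=1.0):
    """points: (N, 2) array of control points (x, y) on a closed curve.
    Returns finite-difference estimates of the first and second curve
    derivatives at every control point, as in Equations (4) and (5)."""
    prev_pts = np.roll(points, 1, axis=0)    # Gamma(s_{i-1})
    next_pts = np.roll(points, -1, axis=0)   # Gamma(s_{i+1})
    first = (points - prev_pts) / d
    second = (next_pts - 2.0 * points + prev_pts) / (d ** 2)
    return first, second
```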
The finite difference approximation, however, is not applicable in regions where two control points overlap, resulting in zero Euclidean distance between the neighboring control points: d = 0. Hence, a special handling of the displacement between the control points is required, so that they do not overlap during their motion. Another approach to compute the derivatives is to fit a set of polynomial functions to neighboring control points and to compute the derivatives from these continuous functions. In the snake literature, the parametric spline curve is the most common polynomial approximation to define the contour from a set of control points. As shown in Fig. 5, the spline curve naturally provides a smooth and continuous approximation of the object contour. This property waives the requirement to include regularization terms in the energy function. Hence, the complexity of the snake energy formulation is simplified.
The complexity of the snake formulation can also be
reduced by using a greedy algorithm (4,5). The greedy
algorithms move the control points on an individual basis
by finding a set of local solutions to the regular snake energy, Equation (1). Computing the energy locally at each control point requires analytical equations to evaluate the snake regularity and curvature (5). An alternative greedy formulation is to move each control point individually based on the similarity in appearance between local regions defined inside and outside of the curve (4). Practically, if the appearance of the outside is similar to that of the inside, the control point is moved inside; if not, it is moved outside. In either approach, the assumption is that each local solution around a control point is correct and contributes to the global solution defined by the object boundary.
DISCUSSION
The snakes framework has been very useful to overcome the limitations of the segmentation and tracking methods for cases when the features generated from the image are not distinctive. In addition, its parametric form results in a compact curve representation that provides a simple technique to compute geometric features of the curve, such as the curvature, and the moments of the object region, such as the object area.
The algorithms developed for the snake framework perform in near real time when the initial snake is placed close to the objects of interest. Their performance, however, degrades in the presence of background clutter. To overcome this limitation, researchers have proposed various shape models to be included in the external energy term.
One of the main concerns about snakes is the number of control points chosen initially to represent the object shape. Selection of too few control points may not define the object, whereas selecting too many control points may not converge to a solution. For instance, if the object circumference is 50 pixels and the snake is initialized with 100 control points, the snake iterations will enter a resonant state caused by the regularity constraint, which prevents the control points from overlapping. A heuristic solution to this problem would be to add or remove control points when such cases are observed during the snake evolution.
Images composed of multiple objects require initialization of several independent snakes surrounding each object. Multiple snakes are required because both the finite difference approximation and the splines prevent the snake from changing its topology by splitting one curve into two or merging two curves into one. For topology-changing curves, we refer the reader to the article on Level Set Methods.
BIBLIOGRAPHY
1. M. Kass, A. Witkin, and D. Terzopoulos, Snakes: active contour models, Internat. Conf. of Comp. Vision, London, UK, pp. 259–268, 1987.
2. A. Blake and M. Isard, Active Contours: The Application of Techniques from Graphics, Vision, Control Theory and Statistics to Visual Tracking of Shapes in Motion, New York: Springer, 2000.
3. J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Machine Intell., 8 (6): 679–698, 1986.
4. R. Ronfard, Region based strategies for active contour models, Internat. J. Comp. Vision, 13 (2): 229–251, 1994.
5. D. Williams and M. Shah, A fast algorithm for active contours and curvature estimation, Comp. Vision Graphics Imag. Process., 55 (1): 14–26, 1992.
ALPER YILMAZ
The Ohio State University
Columbus, Ohio
Figure 5. Spline function estimated from four control points. The gray lines denote the control polygons connecting the control points.
COLOR PERCEPTION
INTRODUCTION
Color as a human experience is an outcome of three con-
tributors: light, the human eye, and the neural pathways of
the human brain. Factors such as the medium through
which the light is traveling, the composition of the light
itself, and anomalies in the human eye/brain systems are
important contributors. The human visual system, which
includes the optical neural pathways and the brain,
responds to an extremely limited part of the electromag-
netic (EM) spectrum, approximately 380 nm to 830 nm but
concentrated almost entirely on 400 nm to 700 nm. We are
blind, basically, to the rest of the EM spectrum, in terms of
vision. For normal observers, this wavelength range
roughly corresponds to colors ranging from blue to red
(as shown in Fig. 1). The red end of the spectrum is
associated with long wavelengths (toward 700 nm) and
the blue end with short wavelengths (400 nm).
COLOR VISION
The structure in the eye that enables color vision is the
retina, which contains the necessary color sensors. Light
passes through the cornea, lens, and iris; the functionality
of these is roughly comparable with the similar parts of
most common cameras. The pupillary opening functions in
a fashion similar to the aperture in a camera and results in
the formation of an upside-down image of the outside world
on the back face of the eye, the retina, a dense collection of
photoreceptors. Normal human color vision is enabled by
four different photoreceptors in the retina. They are called
the rods and the L, M, and S cones (for long, medium, and
short wavelength sensitive); each has a different spectral
sensitivity within the range of approximately 400 nm to
700 nm (1). Figure 2 shows the (normalized) spectral sen-
sitivities of the cones. Note that color as we know it is
specifically a human experience and that different species of animals respond differently to spectral stimuli. In other
words, a bee would see the same spectral stimulus drama-
tically differently than a human would. In fact, bees are
known to have their spectral sensitivities shifted toward
the lower wavelengths, which gives them the ability to see
ultraviolet light. The following salient points must be con-
sidered for a holistic understanding of human color percep-
tion:
1. The rods are activated for vision at low luminance
levels (about 0.1 lux) at significantly lower spatial
resolution than the cones. This kind of vision is called
scotopic vision. The corresponding spectral sensitiv-
ity function is shown in Fig. 3. At these luminances,
normal humans do not have any perception of color.
This lack is demonstrated easily by trying to look at a
colored painting at low luminance levels. Moreover,
at these luminance levels, our visual acuity is extre-
mely poor. This specific property of the visual system
is a function of the low spatial density of the rod
photoreceptors in the foveal region of the retina.
2. The cones are activated only at significantly higher
luminance levels (about 10 lux and higher), at which
time, the rods are considered to be bleached. This type
of vision is referred to as photopic vision. The corre-
sponding sensitivity function that corresponds to
luminance sensitivities is shown in Fig. 3. The green
curve is called the luminous efficiency curve. In this
article we will consider photopic vision only. Inter-
estingly, the retinal density of the three types of cones
is not uniform across the retina; the S cones are far less numerous than the L or M cones. The human retina
has an L, M, S cone proportion as high as 40:20:1,
respectively, although some estimates (2,3) put them
at 12:6:1. This proportion is used accordingly in com-
bining the cone responses to create the luminous
efficiency curve. However, the impact of these proportions on visual experiences is not considered a significant factor and is under investigation (4). What is referred to as mesopic vision occurs at mid-luminance
levels, when the rods and cones are active simulta-
neously.
3. The pupillary opening, along with independently
scalable gains on the three cone outputs, permits
operation over a wide range of illuminant variations,
both in relative spectral content and in magnitude.
4. Color stimuli are different from color experiences. For
the purposes of computational color science, the dif-
ferences between color measurements and color
experiences sometimes may not be considered, but
often the spatial and temporal relations among sti-
muli need to be taken into account.
L, M, and S cone functions may be represented as functions of wavelength as l(λ), m(λ), and s(λ). In the presence of a light source (illuminant), represented by i(λ), the reflectance function of an arbitrary surface [described by r(λ)] is modified in a wavelength-selective fashion to create a stimulus i_r(λ) to the eye given by

$$i_r(\lambda) = i(\lambda)\, r(\lambda) \qquad (1)$$

Let us denote the cone functions as a vector given by

$$\mathbf{lms}(\lambda) = [\, l(\lambda),\ m(\lambda),\ s(\lambda) \,] \qquad (2)$$

We denote the signals measured in the cones of the eye by $\mathbf{c} = [c_l, c_m, c_s]$, given by

$$\mathbf{c} = \int_{\lambda = 380\,\mathrm{nm}}^{\lambda = 830\,\mathrm{nm}} \mathbf{lms}(\lambda)\, i_r(\lambda)\, d\lambda \qquad (3)$$
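A minimal numerical version of Equations (1)–(3) is sketched below, approximating the integral by a sum over sampled wavelengths; the sampling step and the array names (lms, illum, refl) are assumptions, and real cone-fundamental data would have to be supplied.

```python
import numpy as np

def cone_response(lms, illum, refl, d_lambda=1.0):
    """lms: (n_wavelengths, 3) sampled l(lambda), m(lambda), s(lambda);
    illum: (n_wavelengths,) illuminant i(lambda);
    refl: (n_wavelengths,) surface reflectance r(lambda).
    Returns c = [c_l, c_m, c_s] from Eq. (3), with i_r = i * r from Eq. (1)."""
    stimulus = illum * refl                                    # Equation (1)
    return (lms * stimulus[:, None]).sum(axis=0) * d_lambda   # Equation (3)
```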
On an aside, although it is shown here that a color stimulus is formed by the process of reflection of the spectral energy i(λ) of an illuminant off a surface r(λ), it is only one manifestation of the cause of a color stimulus. The sources of all colors at an elemental level may be grouped into 15 basic causes in five groups: (1) vibrations, rotations, and excitations, as in flames, neon lights, and so on; (2) ligand-field effects in transition metal compounds like turquoise, including impurities, as in rubies or emeralds; (3) molecular orbitals of organic compounds like chlorophyll and charge-transfer compounds like sapphire; (4) energy bands in brass, gold, diamonds, and so on; and (5) geometric and physical optical effects like interference, diffraction, scattering, refraction, and so forth. The Nassau book on this topic is a thorough reference (5). In the case of emissive sources of stimuli (e.g., traffic lights or television sets), Equation (3) is rewritten as

$$\mathbf{c} = \int_{\lambda = 380\,\mathrm{nm}}^{\lambda = 830\,\mathrm{nm}} \mathbf{lms}(\lambda)\, e(\lambda)\, d\lambda \qquad (4)$$

where e(λ) denotes the spectral stimulus that excites the cones.
A glance at the L, M, and S cone functions in Fig. 2
clearly highlights that the three measurements c, resulting
from stimulating these cones, are not going to reside in an orthogonal three-dimensional color space; there will be
correlation among L, M, and S. As a result of psychophy-
sical testing, it is understood that human color vision
operates on the basis of opponent color theory, which
was first proposed by Ewald Hering in the latter half of the nineteenth century (2,6,7). Hering used a simple experi-
ment to provide the primary proof of opponent color pro-
cesses in the human visual system. An older hypothesis
about the human visual system (works of Hermann von
Helmholtz and James Maxwell in the mid-nineteenth cen-
tury) suggested that the human visual system perceives
colors in three independent dimensions (each correspond-
ing to the three known color-sensitive pigments in the eye,
roughly approximated by red, green, and blue axes).
Although conceptually correct, this hypothesis could not
explain some of the effects (unique hues and afterimages)
that Hering observed. In a series of published works, Her-
ing suggested that the visual system does not see colors as a
combination of red and green (a reddish-green color). In
fact, a combination of a red stimulus with a green stimulus
produces no hue sensation at all. He suggested also that the
human visual system has two different chromatic dimensions (one corresponding to an orthogonal orientation of a red–green axis and another with a yellow–blue axis), not three.
Figure 1. Electromagnetic spectrum, showing the limited range of human vision and color perception.
Figure 2. Cone sensitivity functions of normal human observers.
Figure 3. Sensitivity functions for photopic and scotopic vision in normal human observers.
These concepts have been validated by many researchers in the decades since and form the framework of modern color theories. Similarly, staring at the bright set of headlights of an approaching car leaves a black or dark image after the car passes by, which illustrates that humans also see colors along a luminance dimension (absence or presence of white). This finding has formed the backbone of modern color theory. Opponent color theory suggests that, at least at the first stage, human color vision is based on simple linear operations on the signals measured by the L, M, and S cones. In other words, from the L, M, and S cone stimulations c, three resulting signals are computed that perform the task of reducing the interdependence between the measurements, somewhat orthogonalizing the space and hence reducing the amount of information transmitted through the neural system from the eye to the brain (see Fig. 4). This functionality is enabled by the extensive neural system in the retina of the eye. The opponent colors result in opponent cone functions (plotted in Fig. 5), which clearly suggests a fundamental conclusion of modern human color perception research: the three independent axes are luminance, redness–greenness, and yellowness–blueness (8). In other words, we have three sets of opposing color perceptions: black and white, red and green, and yellow and blue. In the figure, the red–green process appears as it does because colors on the ends of the spectrum appear similar (deep red is similar to purple); the hue wraps around and often is portrayed in a circle.
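A minimal sketch of such a linear opponent transform is given below; the specific weights are illustrative only (the true weighting depends on the cone proportions discussed earlier and is not prescribed here), and the transform is applied to the cone vector c = [L, M, S].

```python
import numpy as np

# Illustrative (not standardized) opponent weights applied to c = [L, M, S].
OPPONENT = np.array([
    [1.0,  1.0,  0.0],   # achromatic / luminance channel: L + M
    [1.0, -1.0,  0.0],   # red-green channel: L - M
    [0.5,  0.5, -1.0],   # yellow-blue channel: (L + M)/2 - S
])

def to_opponent(c):
    """Map cone stimulations [L, M, S] to [luminance, red-green, yellow-blue]."""
    return OPPONENT @ np.asarray(c, dtype=np.float64)
```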
Not surprisingly, much of the field of color science has been involved in trying to determine relationships between stimuli entering the eye and overall color experiences. This determination requires the ability to isolate color stimuli not just from their spatial relationships but also from the temporal relationships involved. Additionally, it requires a clear understanding of perceptual color appearance phenomena. As data have become available via a variety of experiments, the linear relationships between cone signals and color specifications have needed to be revised into a complex set of nonlinear relationships. Note that most color appearance phenomena have been developed for simplistic viewing fields, whereas all of our color experiences are based on complex images that involve complex illumination settings. It is instructive, nonetheless, to visit some common local color appearance descriptors that do not take into account spatial relationships such as the surroundings around a viewed scene. The terminology used below is used commonly in the color appearance work published by the Commission Internationale de l'Eclairage in its standard documents (e.g., see Ref. 9) and in some popular textbooks on this subject (2,10). Groups involved in color ordering work and computational color technology, however, may define these terms differently based on their specific needs.
Hue
Hue is defined as the property of a color that describes it as a red, green, yellow, or blue or a combination of these unique hues. By definition, grays are not associated with any hue. A hue scale is defined typically as an angle. Figure 6 has
magenta/violet hues at one end and reds at the other.
Brightness
The property of a color that makes it appear to emit more or
less light.
Lightness
The property of a color that describes its brightness relative
to that of the brightest white object in the visual field. A
typical lightness scale is shown in Fig. 7.
Figure 4. Weighted linear combinations of the L, M, S cone stimulations result in opponent color functions and an achromatic signal.
Figure 5. Normalized functions showing resultant red–green and yellow–blue sensitivities, along with the luminance channel.
Figure 6. A typical hue scale.
Colorfulness
The property of a color that makes it appear more or less
chromatic.
Chroma
The property of a color that describes its colorfulness
relative to the brightness of the brightest white object in
the visual field. In general, the relationship that exists between brightness and lightness is comparable with the relationship between colorfulness and chroma. Figure 8 shows a chroma scale for a hue of red and yellow. Note that in this example the chroma of yellows extends much farther
than that of reds, as yellows appear much brighter than
reds in nature and in most display systems.
Saturation
The property of a color that describes its colorfulness in
proportion to its brightness. A typical saturation scale
(shown here for a red and yellow hue) is displayed in
Fig. 9. Note that, by definition, saturation is normalized and hence, unlike chroma, the same scale exists for both reds and yellows (and for other hues as well).
To aid in understanding these properties, Fig. 10 shows the locus of lines with constant hue, saturation, lightness, and chroma if we fix the brightness of white. By definition, saturation and hue are independent of the lightness of white.
Related and Unrelated Colors
In its simplest form, a color can be assumed to be indepen-
dent of everything else in the viewing eld. Consider, for
illustrative purposes, a small patch of a color displayed in a
dark room on a monitor with a black background. This color is
observed devoid of any relationships. This setup is typically
the only one where colors are unrelated and are associated
with attributes like brightness, hue, and saturation.
Related colors, on the other hand, are observed in relation-
ship with their surrounding and nearby colors. A simple
example involves creating an image with a patch of brown
color on a background with increasing white brightness,
from black to white. The brown color is observed as a bright yellow color when on a black background but as a muddy brown on the brightest white background, which illustrates
its relationship to the background (for neutral background
colors). In practice, related colors are of great importance
and are associated with perceptual attributes such as hue,
lightness, and chroma, which are attributes that require
relationships with the brightness of white. To specify a
color completely, we need to define its brightness, lightness,
colorfulness, chroma, and hue.
Metamerism
According to Equation (3), if we can control the stimulus
that enters the eye for a given color, then to match two
colors we merely need to match their resulting stimulus
measurements c. In other words, two different spectra can
be made to appear the same. Such stimuli are called meta-
mers. If $\mathbf{c}_1 = \mathbf{c}_2$, then

$$\int_{\lambda = 380\,\mathrm{nm}}^{\lambda = 830\,\mathrm{nm}} \mathbf{lms}(\lambda)\, i_{r_1}(\lambda)\, d\lambda \;=\; \int_{\lambda = 380\,\mathrm{nm}}^{\lambda = 830\,\mathrm{nm}} \mathbf{lms}(\lambda)\, i_{r_2}(\lambda)\, d\lambda \qquad (5)$$
Different manifestations of the above equality carry different names: observer, illuminant, and object metamerism, depending on whether equal stimuli c result from changing the sensor functions lms(λ), the light i(λ), or the surface r(λ). So, two completely different spectral stimuli can be made to generate the same cone stimuli for the same observer, a property that color engineers in fields ranging from textiles to televisions have considered a blessing for decades, because changes of pigments, dyes, phosphors, and color filters can achieve a consistent perception of colors across various media.
Figure 7. A typical lightness scale, with black at one end and the brightest possible white at the other.
Figure 8. A typical chroma scale for red and yellow, starting with zero chroma on the left and moving to maximum chroma on the right.
Figure 9. A typical saturation scale for red, starting with zero saturation on the left and moving to a saturation of 1 for pure color on the right.
Figure 10. Loci of constant hue, saturation, lightness, and chroma shown in a perceptual color space.
Equal colors are called
metameric. Consider, for example, two color samples that have reflectance functions $r_1(\lambda)$ and $r_2(\lambda)$, as in Fig. 11. When plotted on a wavelength scale, it may appear that these two reflectance functions must result in completely different perceptions to the observer. However, if we were to apply the same illuminant and observer sensitivity functions to these otherwise different colors, they result in identical colors being perceived by the eye. These two colors (reflectance functions) hence are called metamers, and this is an example of object metamerism.
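Continuing the earlier numerical sketch of Equation (3), object metamerism can be checked directly: two sampled reflectances are metameric under a given illuminant for the assumed observer when their cone responses agree, as in Equation (5). The helper below reuses the hypothetical cone_response function defined earlier.

```python
import numpy as np

def are_object_metamers(lms, illum, refl_1, refl_2, tol=1e-6):
    """True if the two reflectances produce the same cone vector c under
    `illum` (Equation (5)), using the cone_response sketch defined earlier."""
    c1 = cone_response(lms, illum, refl_1)
    c2 = cone_response(lms, illum, refl_2)
    return np.allclose(c1, c2, atol=tol)
```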
On a similar note, consider two patches of color with reflectance functions $r_3(\lambda)$ and $r_4(\lambda)$ being viewed under identical illumination conditions by two different observers (observers whose cone functions are not the same), as shown in Fig. 12. One observer would view these patches as being the same (they are metameric), whereas the other would view this exact same pair as distinctly different, resulting in observer metamerism. This kind of metamerism is relatively common, because most, if not all, concepts related to color are built around a standard or average observer, whereas in fact significant variation exists between observers. The final type of metamerism, illuminant metamerism, consists of metameric colors that arise from the same observer and reflectance but different lights.
Adaptation
Arguably, the most remarkable capability of the human
visual system is its ability to adapt to changes in the
illuminant. This ability may be classified broadly as lightness and chromatic adaptation. The resulting effect is that despite changes in the spectral content of the light or its absolute power, the visual system maintains a quite constant overall perception. However, certain limitations apply to these abilities; these changes are limited mainly to changes in natural illuminants and objects. This occurrence is to be expected given the types of illuminants with which humans have been most familiar.
This ability is explained best by means of an example. As one walks out of a relatively dark movie theater into bright afternoon sunlight, it takes only a few seconds for the visual system to adapt to as much as two orders of magnitude change in the intensity of the illuminant, without change in visual experience. Here, the cones are the dominant photoreceptors, and the rods have become bleached (unable to replenish their photopigments). This type of adaptation is referred to as luminance or light adaptation. Similarly, entering a dimly lit movie theater from the bright sunny outdoors again requires time to adapt to the dark conditions, after which our visual system has adapted well to the surroundings. This, however, takes slightly longer than in the former situation because now the rods need to become active, requiring them to unbleach, which is a comparatively longer process. This kind of adaptation is called dark adaptation. The ability to dark and light adapt gives us reasonable visual capability in varying illuminant conditions while taking maximal advantage of the otherwise limited dynamic range of the photoreceptors themselves.
A second, and perhaps the most fascinating, mode of adaptation is called chromatic adaptation. This term refers to the ability of the visual system to maintain color perception under small, but significant, changes in the spectral content of the illuminant. A newspaper seems to maintain its mostly white background independent of whether we look at it outdoors under an overcast sky, indoors under a fluorescent lamp, or under an incandescent lamp. Consider looking at a bowl of fruit that contains a red apple, a yellow banana, and other fruit under an incandescent illuminant. The apple will appear red and the banana yellow. Changing the illuminant to a typical fluorescent lamp, which greatly alters the spectral content of the light, does not appear to change the color of the apple or the banana after a few seconds of adaptation. The human visual system maintains its perception; our visual system has adapted chromatically.
Figure 11. Metameric reflectances $r_1(\lambda)$ and $r_2(\lambda)$. Although their reflectance functions differ, under the illuminant $i(\lambda)$ their stimulation of the cones is identical and the color perceived is the same.
Figure 12. Different observer cone functions ($L_0, M_0, S_0$ versus $L_1, M_1, S_1$) showing observer variation.
Interestingly, our ability to adapt to changes in the spectral content of the illuminant is limited mostly to changes in natural illuminants such as sunlight.
Color Constancy
The phenomenon of objects maintaining their appearance under varying illuminants is referred to as color constancy. For example, a dress that looks red in the store might look nothing like red under street lighting (e.g., sodium-vapor lamps); the visual system cannot adapt as well to a sodium-vapor lamp as it can to a fluorescent or incandescent lamp, and thus it renders the perception of the dress fabric color inconsistently. This subject is interesting given the formation model described in Equation (3). The information the eye receives from the object changes with the illuminant although, given the premise of color constancy, the net result of the visual system needs to stay the same. Strict constancy, however, is known to be untrue for humans: we take informational cues from the color of the illuminant and perform some form of chromatic adaptation (11). The study of color constancy provides clues that describe how the human visual system operates and is used often by computational color technologists in maintaining numerical color constancy.
Color inconstancy is a battle that textile and paint manufacturers, camera and display manufacturers, and printer manufacturers regularly have to fight because significant changes may take place between illuminants in retail stores and in the home, or between hardcopy and on-screen imaging. In each case, the color data reside in a different color space with its own color appearance models. Moreover, illuminants are difficult to control because we typically have mixed illuminants (not just one specific type) in whatever surrounding we are in.
After Images
When stimulated for an extended period of time by a strong stimulus, the human visual system adapts, and when the source of this stimulus is removed, a negative after image appears for a short period of time, most commonly attributed to sensor fatigue. Many forms of after images have been demonstrated. The most commonly known type of after image is the one formed via color responses, known as chromatic after images. For example, if individuals fix their gaze on a picture of a brightly colored set of squares for some time and then a plain white stimulus is presented quickly to the eye, the individuals experience a negative image of corresponding opponent colors. Other forms of after images and visual stimulus adaptation that may be of interest to the reader have been demonstrated by Fairchild and Johnson (12).
SIMPLE COLOR APPEARANCE PHENOMENA
Specifying a color merely by its physical attributes has its advantages but has drawbacks, too, when describing the appearance of colors, especially in somewhat complex scenes. We illustrate these difficulties via some examples.
Simultaneous Lightness and Color Contrast
Consider an achromatic color placed in a relatively simple
scene as shown in the upper half of Fig. 13. Both central
squares seem to have the same luminance. However, if we
reduce the luminance of the background in one half and
increase it in the other while keeping the same central
achromatic patches as shown in the lower half of the figure, one patch appears brighter than the other although in fact they have exactly the same luminance. This occurrence may be attributed to the presence of reinforcing and inhibiting ON-OFF receptive fields that work locally to enhance differences. A similar phenomenon occurs when we change the color of the background. The achromatic patches would seem to take on the opponent color of the background and no longer retain their achromatic nature. In Fig. 14, all four inner squares are the same achromatic color. However, each square contrasts with its surrounding, resulting in the appearance of opponent colors. Note that the upper left square has a greenish tint, the upper right a bluish tint, the bottom left a yellowish tint, and the bottom right a reddish tint.

Figure 13. A simple example that demonstrates simultaneous lightness contrast. Notice that the same gray patches as in the upper half of the image seem to have different brightnesses when the background luminance is changed.

Figure 14. A simple example that demonstrates simultaneous color contrast. Notice that the same gray patches as in the upper half of the image seem to have different hues when the background color is changed.
Lightness and Chromatic Crispening
The difference between two colors that are only slightly different is heightened if the color of the background that surrounds them lies between those of the two patches. Figure 15 illustrates this phenomenon for lightness. Note that the difference between the two patches is hardly noticeable when the background is white or black, but when the luminance of the background is in between the colors of the patches, the difference is accentuated greatly.
Only a few color appearance phenomena have been addressed in this article. We have looked at some phenomena that are easy to observe and do not need in-depth study. Color appearance phenomena are the primary drivers of image quality assessments in the imaging industry. The interested reader is referred to books on this topic included in the Bibliography.
ORDER IN COLOR PERCEPTION
From the preceding sections on the various color appear-
ance terms, one may gather that many potential candidates
exist for ordering color perceptions. One means is based
on the somewhat-orthogonal dimensions of redness-greenness, yellowness-blueness, and luminance. These axes may be placed along Euclidean axes, as shown in Fig. 16. Color spaces with such an ordering of axes form the basis of all computational color science.
Another method for ordering color perceptions could be based on hue, chroma, and lightness, which again could be placed along the Euclidean axes as in Fig. 17. It turns out that the two orderings are related to each other: The hue-chroma-lightness plot is simply another representation of the opponent-color plot. Such a relationship was found also by extensive studies on color orderings performed by Munsell in the early 1900s. In his publications, Munsell proposed a color ordering in which the spacing between each color and its neighbor would be perceived as equal. This resulted in a color space referred to as the Munsell color solid, which to date is the most organized, successful, and widely used color order system. Munsell proposed a notation for colors that specifies their exact location in the color solid. A vertical value (V) scale in ten steps denotes the luminance axis. Two color samples along the achromatic axis (denoted by the letter N for neutrals) are ordered such that they are spaced uniformly in terms of our perception; for example, a sample with a value of 4 would correspond to one that is half as bright as one with a value of 8. Munsell defined basic hues (H) of red (R), yellow (Y), green (G), blue (B), and purple (P) and combinations (RP for red-purples and so on) that traverse the circumference of a circle, as shown in Fig. 18.

Figure 15. An example that demonstrates lightness crispening. Notice that the difference between the gray patches in the white and black backgrounds is hardly noticeable, but when the lightness of the background is in between that of the two patches, the appearance of the difference is accentuated.

Figure 16. A plot of lightness, redness-greenness, and yellowness-blueness ordered along Euclidean axes.

A circle of constant radius defines
the locus of colors with the same chroma (C), or deviation from the achromatic axis. Increasing radii denote higher chroma colors on an open-ended scale. In this fashion, a color is denoted by H V/C (Hue Value/Chroma). For example, 5GY6/10 denotes a hue of 5GY (a green-yellow midway between a green and a yellow) at value 6 and chroma 10. Most modern computational color models and color spaces are based on the fundamentals of the Munsell color order system.
The NCS color order system is another ordering scheme, much more recent and gaining acceptance (13). The NCS color ordering system is based on the Hering opponent-color theory. The perceptual axes used in the NCS are blackness-whiteness, redness-greenness, and yellowness-blueness; these colors are perceived as being pure (see Fig. 19). Whiteness-blackness describes the z-dimension, whereas the elementary colors (red, green, yellow, and blue) are arranged such that they divide the x-y plane into four quadrants. Between two unique hues, the space is divided into 100 steps. A color is identified by its blackness (s), its chromaticness (c), and its hue. For example, a color notated by 3050-Y70R denotes a color with a blackness value of 30 (on a scale of 0 to 100), a chromaticness of 50 (an open-ended scale), and a hue described as a yellow with 70% red in its mixture. A good reference that details the history and science of color order systems was published recently by Kuehni (14).
CONCLUSIONS
The multitude of effects and phenomena that need to be explored in color vision and perception is profound. One would imagine that color science, a field with such everyday impact and so interwoven with spoken and written languages, would be thoroughly understood and formalized by now. But the mechanisms of vision and the human brain are so involved that researchers have only begun unraveling the complexities involved. Starting from the works of ancient artisans and scientists and passing through the seminal works of Sir Isaac Newton in the mid-1600s to the works of the most recent researchers in this field, our knowledge of the complexities of color has increased greatly, but much remains to be understood.

Figure 17. A plot of lightness, chroma, and hue ordered along the Euclidean axes.

Figure 18. A plot of a constant value plane (left) that shows the various hue divisions of a constant chroma circle in the Munsell notation, alongside a constant hue plane (right).

Figure 19. A schematic plot of the NCS color space.
BIBLIOGRAPHY
1. G. Wyszecki and W. S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulae, 2nd ed. New York: Wiley-Interscience, 2000.
2. M. D. Fairchild, Color Appearance Models, 2nd ed. New York: John Wiley & Sons, 2005.
3. A. Roorda and D. Williams, The arrangement of the three cone classes in the living human eye, Nature, 397: 520-522, 1999.
4. D. Brainard, A. Roorda, Y. Yamauchi, J. Calderone, A. Metha, M. Neitz, J. Neitz, D. Williams, and G. Jacobs, Consequences of the relative numbers of L and M cones, J. Optical Society of America A, 17: 607-614, 2000.
5. K. Nassau, The Physics and Chemistry of Color: The Fifteen Causes of Color. New York: John Wiley & Sons, 1983.
6. R. G. Kuehni, Color: An Introduction to Practice and Principles, 2nd ed. New York: Wiley-Interscience, 2004.
7. R. S. Berns, Billmeyer and Saltzman's Principles of Color Technology, 3rd ed. New York: John Wiley & Sons, 2000.
8. P. K. Kaiser and R. Boynton, Human Color Vision, 2nd ed. Optical Society of America, 1996.
9. Commission Internationale de l'Eclairage, A Color Appearance Model for Colour Management Systems: CIECAM02, CIE Pub. 159, 2004.
10. R. Hunt, The Reproduction of Colour, 6th ed. New York: John Wiley & Sons, 2004.
11. D. Jameson and L. Hurvich, Essay concerning color constancy, Ann. Rev. Psychol., 40: 1-22, 1989.
12. M. Fairchild and G. Johnson, On the salience of novel stimuli: Adaptation and image noise, IS&T 13th Color Imaging Conference, 2005, pp. 333-338.
13. A. Hård and L. Sivik, NCS - Natural Color System: A Swedish standard for color notation, Color Res. Applicat., 6(3): 129-138, 1981.
14. R. G. Kuehni, Color Space and Its Divisions: Color Order from Antiquity to the Present. New York: Wiley-Interscience, 2003.
RAJEEV RAMANATH
Texas Instruments Incorporated
Plano, Texas
MARK S. DREW
Simon Fraser University
Vancouver, British Columbia,
Canada
CONTOUR TRACKING
Object tracking is a fundamental area of research that finds application in a wide range of problem domains including object recognition, surveillance, and medical imaging. The main goal of tracking an object is to generate a trajectory from a sequence of video frames. In its simplest form, an object trajectory is constituted from the spatial positions of the object centroid and resides in a three-dimensional space defined by the image and time coordinates. In the case when the changes in the object size and orientation are tracked also, such as by a bounding box around the object, the dimensionality of the trajectory is increased by two and includes the scale and orientation, in addition to time and image dimensions.
A trajectory in a higher dimensional space provides a
more descriptive representation of the object and its
motion. Depending on the application domain, an increase
in the trajectory dimensionality may be desirable. For
instance, in the context of motion-based object recognition,
a trajectory that encodes the changes in the object shape
over a time period increases the recognition accuracy. The
additional information encoded in the trajectory also
provides a means to identify the actions performed by the objects, such as in sign language recognition, where the shape of the hands and their interactions define the sign language vocabulary.
The most informative trajectory is the one that encodes the deformation in the object shape. This task requires tracking the area occupied by the object from one video frame to the next. A common approach in this regard is to track the contour of an object, which is known also as contour evolution. The contour evolution process is achieved by minimizing a cost function that is constituted of competing forces trying to contract or expand the curve. The equilibrium of the forces in the cost function concludes the evolution process. These forces include regularization terms, image-based terms, and other terms that attract the contour to a desired configuration. The latter of these terms traditionally encodes a priori shape configurations that may be provided ahead of time.
REPRESENTING THE OBJECT AND ITS CONTOUR
The object contour is a directional curve placed on the boundary of the object silhouette [see Fig. 1(c) and (d)]. The contours are used either in a contour-based representation or as a boundary condition in a region-based representation. The region-based representation uses the distance transform, the Poisson equation, or the medial axis. The distance transform assigns each silhouette pixel its shortest distance from the object contour (1). In a similar vein, the Poisson equation assigns the mean of the distances computed by random walks reaching the object contour (2). The medial axis generates skeletal curve segments that lie on the locus of circles that are tangent to the object contour at two or more points [see Fig. 1(c)]. Use of the contour as a boundary condition requires explicit detection of the object and prohibits defining a cost function that evolves an initial contour to its final configuration. Hence, in the remainder of the text, we will discuss the contour-based representation and related contour evolution techniques.
A contour can be represented explicitly or implicitly. Explicit representations define the underlying curve parametrically and perform tracking by changing the parameters that, in turn, evolve the contour. Parametric representations require analytical expressions that provide a means to compute the geometric features used during the contour evolution. The most common parametric representation in the context of contour tracking uses a set of control points positioned on the object boundary. The use of different control points for different objects generates a unique coordinate system for each object, which is referred to as the Lagrangian coordinates [see Fig. 1(e)]. In the Lagrangian coordinates, the relations between the control points play an important role in computing the geometric properties of the underlying curve. These relations can be realized by either the finite difference approximation or the finite element analysis. The finite difference approximation treats each control point individually and assumes that they are connected by lines. On the contrary, the finite element analysis defines the relations by a linear combination of a set of functions referred to as splines. The splines generate continuous curves that have parametric forms. Their parametric nature permits the computation of the geometric curve features analytically. The contour tracking in these representations is achieved by moving the control points from one place to another. For more information, we refer the reader to the article on Snakes: Active Contours.
Contrary to the explicit representation, the implicit contour representations for different objects lie in the same Cartesian coordinates, namely the Eulerian coordinates (grid). The contour in the Eulerian coordinates is defined based on the values at the grid positions. For instance, one common approach used in fluid dynamics research, which investigates the motion of a fluid in an environment, is to use a volumetric representation. In volumetric representation, each grid cell is considered a unit volume that can be filled with water, such that inside the contour (or surface in higher dimensions) the unit volumes are filled, whereas outside they are empty. In the field of computer vision, the most common implicit representation is the level-set method. In the level-set method, the grid positions are assigned a signed Euclidean distance from the closest contour point. This method is similar to the distance transform discussed for representing regions, with the difference of including a sign. The sign is used to label the inside and outside of the contour, such that grid positions inside the closed contour are positive, whereas the outside grid positions are negative. The signed distances uniquely
locate the contour, such that it resides on the zero crossings in the grid. The zero crossings are referred to as the zero level set. The evolution of the contour is governed by changing the grid values based on the speed computed at each grid position. For more information, we refer the reader to the article on the Level-Set Methods.
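As a concrete illustration of the signed-distance convention described above, the following minimal NumPy sketch builds a level-set grid for a circular contour. The function name and the circle test case are hypothetical, not from the original article.

```python
import numpy as np

def circle_level_set(shape, center, radius):
    """Signed distance grid for a circular contour.

    Positive inside the contour, negative outside; the zero level set
    (the zero crossings of phi) locates the contour itself.
    """
    rows, cols = np.indices(shape)
    dist_to_center = np.hypot(rows - center[0], cols - center[1])
    phi = radius - dist_to_center   # > 0 inside, < 0 outside, = 0 on the contour
    return phi

phi = circle_level_set((100, 100), center=(50, 50), radius=20)
inside = phi > 0                    # pixels enclosed by the contour
```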
THE STATE SPACE MODELS FOR CONTOUR TRACKING
The state space models define the object contour by a set of states, $X_t: t = 1, 2, \ldots$. Tracking then is achieved by updating the contour state in every frame:

$$X_t = f_t(X_{t-1}) + W_t \qquad (1)$$

where $W_t$ is white noise. This update eventually maximizes the posterior probability of the contour. The posterior probability depends on the prior contour state and the current likelihood, which is defined in terms of the image measurements $Z_t$. A common measurement used for contour tracking is the distance of the contour from the edges in the image.
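A deliberately simplified sketch of the predict-correct cycle behind Equation (1) is given below, assuming a constant-velocity model for the control points and treating the nearest observed edge positions as the measurement $Z_t$. The function names and the blending gain are illustrative assumptions; a real tracker would use a Kalman or particle filter as discussed next.

```python
import numpy as np

def predict(points, velocities, noise_std=1.0):
    """Prediction step of Equation (1): X_t = f_t(X_{t-1}) + W_t,
    here with a constant-velocity model and white Gaussian noise W_t."""
    noise = np.random.normal(scale=noise_std, size=points.shape)
    return points + velocities + noise

def correct(predicted, observed_edges, gain=0.5):
    """Correction step: blend the prediction with the image measurement Z_t
    (here, the nearest observed edge position for each control point)."""
    return predicted + gain * (observed_edges - predicted)

# points, velocities, and observed_edges are (N, 2) arrays of control points.
```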
State space model-based contour tracking involves two major steps. The first step predicts the current location of the contour, such as the new position of each control point, and the second step corrects the estimated state according to the image observations. The state prediction and correction are performed by using various statistical tools. Among others, Kalman filtering and particle filtering are the most common statistical tools. Computationally, the Kalman filter is more attractive because only one instance of the object state is required to perform prediction and correction. However, the Kalman filter assumes that the object state is Gaussian distributed, which may result in a poor estimation of state variables that are not Gaussian distributed. Particle filtering overcomes this limitation by representing the distribution of the object state by a set of samples, referred to as particles (3). Each particle has an associated weight that defines the importance of that particle. Keeping a set of samples for representing the current state requires maintaining and updating all the instances during the correction step, which is a computationally complex task.
Tracking the object contour using the state space methods involves careful selection of the state variables that represent the object shape and motion. For this purpose, Terzopoulos and Szeliski (4) use a spring model to govern the contour motion. In this model, the state variables include the stiffness of the spring placed at each control point. Once the object state is estimated, a correction is made by evaluating the gradient magnitude from the image. Isard and Blake (5) model the shape and rigid motion by using two state variables that correspond to the spline parameters and the affine motion. The image measurements used to correct the estimated state include the edges in the image observed in the normal direction to the contour [see Fig. 2(a)]. This approach has been extended recently to include nonrigid contour deformations that are computed after the rigid object state is recovered (6).
DIRECT MINIMIZATION OF THE COST FUNCTION
The methods falling under this category iteratively evolve the contour by minimizing an associated cost function. The cost function is constituted of the optical flow field or the appearance observed inside and outside the object, and it is minimized by a greedy algorithm or a gradient descent method.
The contour tracking based on the optical flow field exploits the constancy of the brightness of a pixel in time:

$$I_{t+1}(x, y) - I_t(x + u, y + v) = 0 \qquad (2)$$

where I is the imaging function, t denotes the frame number, and (u, v) is the optical flow vector. The optical flow during the contour evolution can be computed by searching for a similar color in a neighborhood of each pixel (7). Once the flow vectors for all the object pixels are computed, the cost of moving the contour can be evaluated by accumulating the brightness similarities using Equation (2). Tracking
Figure 1. Possible representations for the object shape given in (a): (b) object silhouette, (c) skeleton, and (d) its contour. Representing the contour by using (e) a set of control points in the Lagrangian coordinates and (f) level sets in the Eulerian coordinates.
Figure 2. Edge observations along the contour normals. (Reprinted with permission from the IEEE.)
results of this approach are shown in Fig. 3(a). An alternative approach to computing the optical flow is to adopt a morphing equation that morphs the intensities in the previous frame to the intensities in the current frame (8). The intensity morphing equation, however, needs to be coupled with a contour tracking function, such that the intensities are morphed for the contour pixels in the previous and the current frame. The speed of the contour is computed according to the difference between the intensities of the corresponding pixels. For instance, if the difference is high, then the contour moves with the maximum speed in its normal direction, and the morphing function is evaluated by considering the new position of the contour. The tracking results using this approach are shown in Fig. 3(b). The cost function based on the optical flow also can be written in terms of the common motion constraint (10). The common motion constraint assumes that the motion inside the contour is homogeneous, such that the contour is evolved to a new position if the difference between neighboring motion vectors is high.
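The brightness-constancy cost of Equation (2) can be sketched as follows for a set of pixels on or inside the contour; the array layout and the rounding of the flow vectors are illustrative assumptions rather than the article's prescription.

```python
import numpy as np

def brightness_constancy_cost(I_prev, I_next, pixels, flow):
    """Accumulate the brightness-constancy residual of Equation (2).

    pixels: (N, 2) integer (row, col) positions in the current frame
    flow:   (N, 2) optical flow vectors (u, v) for those pixels
    """
    moved = pixels + np.round(flow).astype(int)          # corresponding positions
    moved[:, 0] = np.clip(moved[:, 0], 0, I_prev.shape[0] - 1)
    moved[:, 1] = np.clip(moved[:, 1], 0, I_prev.shape[1] - 1)
    residual = I_next[pixels[:, 0], pixels[:, 1]] - I_prev[moved[:, 0], moved[:, 1]]
    return np.abs(residual).sum()
```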
In contrast to the cost functions using brightness constancy, the statistics computed inside and outside the object contour impose a less strict constraint. An important requirement of statistics-based methods is the initialization of the contour in the first frame to generate the appearance statistics. Region statistics can be computed by piecewise stationary color models generated from the subregions around each control point (11). This model can be extended to include the texture statistics generated from a band around the contour (9). Using a band around the contour combines image gradient-based and region statistics-based contour tracking methods into a single framework, such that when the width of the band is set to one, the cost function is evaluated by image gradients. The contour tracking results using region statistics are shown in Fig. 3(c).
THE SHAPE PRIORS
Including a shape model in the contour cost function improves the estimated object shape. A common approach to generate a shape model of a moving object is to estimate the shape distribution associated with the contour deformations from a set of contours extracted online or offline. The shape distribution can be in the form of a Gaussian distribution, a set of eigenvectors, or a kernel density estimate. The cost function associated with these distributions contains contour probabilities conditioned on the estimated shape distribution.
For the explicit contour representations, the shape model is generated using the spatial-position statistics of the control points. A simple shape prior in this context is to use a Gaussian distribution (10):
$$p(\mathbf{x}_i) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x_i - \mu_{x_i})^2}{2\sigma_{x_i}^2} - \frac{(y_i - \mu_{y_i})^2}{2\sigma_{y_i}^2}\right) \qquad (3)$$

where $\mu$ denotes the mean, $\sigma$ denotes the standard deviation, and $\mathbf{x}_i = (x_i, y_i)$ is the position of the ith control point.
Before modeling, this approach requires registration of the contours to eliminate the translational effects. Registration can be performed by mean normalization of all the contours.
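A minimal sketch of fitting and evaluating the per-control-point Gaussian prior of Equation (3) is shown below, assuming the training contours are stored as a (K, N, 2) array and registered by mean normalization as described above; the function names are hypothetical.

```python
import numpy as np

def gaussian_shape_prior(contours):
    """Fit the per-control-point Gaussian prior of Equation (3).

    contours: (K, N, 2) array of K training contours, each with N control points.
    Returns per-point means and standard deviations.
    """
    registered = contours - contours.mean(axis=1, keepdims=True)  # remove translation
    mu = registered.mean(axis=0)             # (N, 2) per-point mean positions
    sigma = registered.std(axis=0) + 1e-6    # (N, 2) per-point standard deviations
    return mu, sigma

def log_prior(contour, mu, sigma):
    """Log of the product of per-point Gaussian densities for one registered contour."""
    z = (contour - contour.mean(axis=0) - mu) / sigma
    return -0.5 * np.sum(z ** 2) - np.sum(np.log(sigma * np.sqrt(2 * np.pi)))
```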
Figure 3. Tracking results of the methods proposed in (a) Ref. 7, (b) Ref. 8, and (c) Ref. 9. (Reprinted with permission from the IEEE.)

An alternative shape model can be computed by applying principal component analysis (PCA) to the vectors of the control points. The PCA generates a new coordinate system that emphasizes the differences between the contours, such that selecting a subset of principal components (eigenvectors with the highest eigenvalues) models the underlying contour distribution. Given an input contour, the distance is computed by first reconstructing the input using a linear combination of the selected principal components and then evaluating the Euclidean distance between the input vector and the reconstructed contour. The weights in the linear combination are computed by projecting the input contour to the principal components.
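The PCA shape distance just described can be sketched with a few lines of NumPy; the number of retained components and the function names are illustrative choices, not prescribed by the article.

```python
import numpy as np

def pca_shape_model(contours, n_components=5):
    """Build a PCA shape model from flattened control-point vectors."""
    X = contours.reshape(len(contours), -1)       # (K, 2N) contour vectors
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]                # principal components (rows)

def shape_distance(contour, mean, components):
    """Project onto the components, reconstruct, and return the residual norm."""
    x = contour.ravel() - mean
    weights = components @ x                      # projection coefficients
    reconstruction = components.T @ weights
    return np.linalg.norm(x - reconstruction)
```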
The shape priors generated for implicit contour representations do not model the contour shape explicitly. This property provides the flexibility to model objects with two or more split regions. Considering the level-set representation, which defines the contour by zero crossings on the level-set grid, a shape model can be generated by modeling the distance values at each grid position by a Gaussian distribution (9). This results in two level-set functions for each set of contours, as shown in Fig. 4(a)-(g), that correspond to the mean and the standard deviation of the distances from the object boundary.
DISCUSSION
Compared with tracking the centroid, a bounding box, or a
bounding ellipse, contour tracking provides more detailed
object shape and motion that is required in certain applica-
tion domains. For instance, the contour trackers commonly
are used in medical imaging, where a more detailed ana-
lysis of the motion of an organ, such as the heart, is
required. This property, however, comes at the expense
of computational cost, which is evident from the iterative
updates performed on all the grid positions or the control
points, depending on the contour representation chosen. In
cases when the domain of tracking does not tolerate high
computational costs, such as real-time surveillance, the
contour trackers may be less attractive. This statement,
however, will change in coming years, considering the
ongoing research on developing evolution strategies that
will have real-time performance (12).
The design of a contour tracker requires the selection of a contour representation. Depending on the application domain, both the implicit and explicit representations have advantages and disadvantages. For instance, although the implicit representations, such as the level-set method, inherently can handle breaking and merging of the contours, the explicit representations require including complex mechanisms to handle topology changes. In addition, the implicit representations naturally extend tracking from two-dimensional contours to three- or higher-dimensional surfaces. The implicit representation, however, requires re-initialization of the grid at each iteration, which makes it a computationally demanding procedure compared with an explicit representation.
The choice of the cost function is another important step in the design of a contour tracker and is independent of the contour representation. The cost functions traditionally include terms related to the contour smoothness, image observations, and additional constraints. Among these three terms, recent research concentrates on developing cost functions that effectively use the image observations while adding constraints such as shape priors. In particular, research on the use of innovative constraints to guide the evolution of the contour is not concluded. One such constraint is the use of shape priors, which becomes especially important in the case of an occlusion, during which parts of the tracked object are not observed. Improved tracking during an occlusion is shown in Fig. 4(h), where using the shape priors successfully resolves the occlusion.
As with other object tracking approaches, in a contour tracking framework the start or end of an object trajectory plays a critical role in its application to real-world problems. The starting of a contour trajectory requires segmentation of the object when it is first observed. The segmentation can be performed by using a contour segmentation framework, as discussed in the chapter on Level-Set Methods, or by using the background subtraction method, which labels the pixels as foreground or background depending on their similarity to learned background models. Most segmentation approaches, however, do not guarantee an accurate object shape and, hence, may result in poor tracking performance.
BIBLIOGRAPHY
1. A. Rosenfeld and J. Pfaltz, Distance functions in digital pictures, Pattern Recognition, vol. 1, 1968, pp. 33-61.
2. L. Gorelick, M. Galun, W. Sharon, R. Basri, and A. Brandt, Shape representation and classification using the Poisson equation, IEEE Conf. on Computer Vision and Pattern Recognition, 2004.
3. H. Tanizaki, Non-Gaussian state-space modeling of nonstationary time series, J. American Statistical Association, 82: 1032-1063, 1987.
Figure 4. (a)-(e) A sequence of level sets generated from a walking action. (f) Mean level set and (g) standard deviation level set. (h) Tracking results for an occluded person using the shape model given in (f) and (g). (Reprinted with permission from the IEEE.)
4. D. Terzopoulos and R. Szeliski, Tracking with Kalman snakes, in A. Blake and A. Yuille (eds.), Active Vision. MIT Press, 1992.
5. M. Isard and A. Blake, Condensation - conditional density propagation for visual tracking, Int. Jrn. on Computer Vision, 29(1): 5-28, 1998.
6. J. Shao, F. Porikli, and R. Chellappa, A particle filter based non-rigid contour tracking algorithm with regulation, Int. Conf. on Image Processing, 2006.
7. A. Mansouri, Region tracking via level set PDEs without motion computation, IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(7): 947-961, 2002.
8. M. Bertalmio, G. Sapiro, and G. Randall, Morphing active contours, IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(7): 733-737, 2000.
9. A. Yilmaz, X. Li, and M. Shah, Contour based object tracking with occlusion handling in video acquired using mobile cameras, IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(11): 1531-1536, 2004.
10. D. Cremers and C. Schnorr, Statistical shape knowledge in variational motion segmentation, Elsevier Jrn. on Image and Vision Computing, 21: 77-86, 2003.
11. R. Ronfard, Region based strategies for active contour models, Int. Jrn. on Computer Vision, 13(2): 229-251, 1994.
12. Y. Shi and W. Karl, Real-time tracking using level sets, IEEE Conf. on Computer Vision and Pattern Recognition, 2005, pp. 34-41.
ALPER YILMAZ
The Ohio State University
Columbus, Ohio
EDGE DETECTION IN GRAYSCALE, COLOR,
AND RANGE IMAGES
INTRODUCTION
In digital images, an edge is one of the most essential and primitive features for human and machine perception; it provides an indication of the physical extent of objects in an image. Edges are defined as significant local changes (or discontinuities) in brightness or color in an image. Edges often occur at the boundaries between two different regions. Edge detection plays an important role in computer vision and image processing. It is used widely as a fundamental preprocessing step for many computer vision applications, including robotic vision, remote sensing, fingerprint analysis, industrial inspection, motion detection, and image compression (1,2). The success of high-level computer vision processes relies heavily on good output from the low-level processes such as edge detection. Because edge images are binary, edge pixels are marked with a value equal to 1, whereas others are 0; edge detection sometimes is viewed as an information reduction process that provides boundary information of regions by filtering out unnecessary information for the next steps of processing in a computer vision system (3).
Many edge detection algorithms have been proposed in the last 50 years. This article presents the important edge detection techniques for grayscale, color, and range images.
EDGE AND EDGE TYPES
Several definitions of edge exist in the computer vision literature. The simplest definition is that an edge is a sharp discontinuity in a gray-level image (4). Rosenfeld and Kak (5) defined an edge as an abrupt change in gray level or in texture where one region ends and another begins. An edge point is a pixel in an image where a significant local intensity change takes place. An edge fragment is an edge point and its orientation. An edge detector is an algorithm that produces a set of edges from an image. The term edge is used for either edge points or edge fragments (6-8).
Edge types can be classified as step edge, line edge, ramp edge, and roof edge (7-9). A step edge is an ideal type that occurs when the image intensity changes significantly from one value on one side to a different value on the other. Line edges occur at the places where the image intensity changes abruptly to a different value, stays at that value for the next small number of pixels, and then returns to the original value. However, in real-world images, step edges and line edges are rare because of various lighting conditions and noise introduced by image-sensing devices. The step edges often become ramp edges, and the line edges become roof edges in real-world images (7,8). Figure 1(a)-(d) shows one-dimensional profiles of step, line, ramp, and roof edges, respectively.
EDGE DETECTION METHODS IN GRAY-LEVEL IMAGES
Because edges are, by definition, image pixels that have abrupt changes (or discontinuities) in image intensity, the derivatives of the image intensity function (10) can be used to measure these abrupt changes. As shown in Fig. 2(a) and (b), the first derivative of the image intensity function has a local peak near the edge points. Therefore, edges can be detected by thresholding the first derivative values of the image function or by the zero crossings of the second derivative of the image intensity, as shown in Fig. 2(c).
Edge detection schemes based on the derivatives of the image intensity function are very popular, and they can be categorized into two groups: gradient-based and zero-crossing-based (or Laplacian) methods. The gradient-based methods find the edges by searching for the maxima (maximum or minimum) in the first derivatives of the image function. The zero-crossing (Laplacian) methods detect the edges by searching for the zero crossings in the second derivatives of the image function.
Gradient-Based Edge Detection
An edge is associated with a local peak in the first derivative. One way to detect edges in an image is to compute the gradient of local intensity at each point in the image. For an image f(x, y), with x and y the row and the column coordinates, respectively, its two-dimensional gradient is defined as a vector with two elements:

$$\mathbf{G} = \begin{bmatrix} G_x \\ G_y \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f(x, y)}{\partial x} \\[2ex] \dfrac{\partial f(x, y)}{\partial y} \end{bmatrix} = \begin{bmatrix} [\,f(x + dx, y) - f(x, y)\,]/dx \\ [\,f(x, y + dy) - f(x, y)\,]/dy \end{bmatrix} \qquad (1)$$

where $G_x$ and $G_y$ measure the change of pixel values in the x- and y-directions, respectively. For digital images, dx and dy can be considered in terms of numbers of pixels between two points, and the derivatives are approximated by differences between neighboring pixels. Two simple approximation schemes of the gradient for dx = dy = 1 are

$$G_x \approx f(x + 1, y) - f(x, y), \quad G_y \approx f(x, y + 1) - f(x, y)$$
$$G_x \approx f(x + 1, y) - f(x - 1, y), \quad G_y \approx f(x, y + 1) - f(x, y - 1) \qquad (2)$$
These derivative approximation schemes are called the first difference and the central difference, respectively. The convolution masks for the first difference and the central difference can be represented as follows, respectively (11):

$$G_x = \begin{bmatrix} -1 & 0 \\ 1 & 0 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 & 1 \\ 0 & 0 \end{bmatrix};$$

$$G_x = \begin{bmatrix} 0 & -1 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad G_y = \begin{bmatrix} 0 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$

The first difference masks cause edge location bias because the zero crossings of the vertical and horizontal masks lie at different positions. On the other hand, the central difference masks avoid this position mismatch problem because of the common center of the horizontal and vertical masks (11). Many edge detectors have been designed using convolution masks of size 3 × 3 or even larger.
Two important quantities of the gradient are the magnitude and the direction of the gradient. The magnitude of the gradient $G_m$ is calculated by

$$G_m = \sqrt{G_x^2 + G_y^2} \qquad (3)$$

To avoid the square root computation, the gradient magnitude is often approximated by

$$G_m \approx |G_x| + |G_y| \qquad (4)$$

or

$$G_m \approx \max(|G_x|, |G_y|) \qquad (5)$$

The direction of the gradient is given by

$$\theta_g = \begin{cases} \tan^{-1}\!\left(\dfrac{G_y}{G_x}\right) & \text{if } G_x \neq 0 \\[1ex] 0^{\circ} & \text{if } G_x = 0 \text{ and } G_y = 0 \\[0.5ex] 90^{\circ} & \text{if } G_x = 0 \text{ and } G_y \neq 0 \end{cases} \qquad (6)$$

where the angle $\theta_g$ is measured with respect to the x-axis.
Figure 1. One-dimensional profiles of four different edge types: (a) step edge, (b) line edge, (c) ramp edge, and (d) roof edge.

Figure 2. Edge detection by the derivative operators: (a) 1-D profile of a smoothed step edge, (b) the first derivative of a step edge, and (c) the second derivative of a step edge.

In general, a gradient-based edge detection procedure consists of the following three steps:
1. Smoothing filtering: Smoothing is used as a preprocessing step to reduce the noise or other fluctuations in the image. Gaussian filtering (10-13) is a well-known low-pass filter, and the σ parameter controls the strength of the smoothing. However, smoothing filtering can also blur sharp edges, which may contain important features of objects in the image.
2. Differentiation: Local gradient calculation is implemented by convolving the image with two masks, $G_x$ and $G_y$, which are defined by a given edge detector. Let us denote the convolution of a mask $M_{k \times k}$ and an
image $f_{m \times n}$ as $N_{m \times n}$. For every pixel (i, j) in the image f, we calculate

$$N(i, j) = M * f(i, j) = \sum_{k_2 = -k/2}^{k/2}\;\sum_{k_1 = -k/2}^{k/2} M(k_1, k_2)\, f(i - k_1, j - k_2) \qquad (7)$$

for $1 \le i \le m$, $1 \le j \le n$.
3. Detection: Detecting edge points based on the local gradients. The most common approach is to apply a threshold to the image gradients. If the magnitude of the gradient at an image pixel is above a threshold, the pixel is marked as an edge. Some techniques, such as the hysteresis method by Canny (3), use multiple thresholds to find edges. A thinning algorithm is also applied to remove unnecessary edge points after thresholding as necessary (14,15).
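The three steps above can be sketched in NumPy as follows, using a simple box filter in place of the Gaussian, the Sobel masks defined later in this section, and a fixed threshold. This is an illustrative implementation under those assumptions, not the article's reference code.

```python
import numpy as np

def convolve2d(image, kernel):
    """Direct implementation of the convolution in Equation (7), with zero padding."""
    k = kernel.shape[0] // 2
    padded = np.pad(image.astype(float), k)
    flipped = kernel[::-1, ::-1]                  # convolution flips the mask
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(flipped * padded[i:i + 2 * k + 1, j:j + 2 * k + 1])
    return out

def gradient_edges(image, threshold=20.0):
    """Smooth, compute Sobel gradients, and threshold the gradient magnitude."""
    blur = np.ones((3, 3)) / 9.0                  # crude smoothing stand-in for a Gaussian
    gx_mask = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 4.0
    gy_mask = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]]) / 4.0
    smoothed = convolve2d(image, blur)
    gx, gy = convolve2d(smoothed, gx_mask), convolve2d(smoothed, gy_mask)
    magnitude = np.hypot(gx, gy)                  # Equation (3)
    direction = np.degrees(np.arctan2(gy, gx))    # Equation (6), computed via atan2
    return magnitude > threshold, magnitude, direction
```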
Roberts Cross Edge Detector. The Roberts cross edge detector is the simplest edge detector. It rotates the first difference masks by an angle of 45°. Its mathematical expressions are (10,12)

$$G_x = f(x, y + 1) - f(x + 1, y), \quad G_y = f(x, y) - f(x + 1, y + 1) \qquad (8)$$

$G_x$ and $G_y$ can be represented by the following convolution masks:

$$G_x = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, \quad G_y = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$
These two convolution masks respond maximally to edges in diagonal directions (45°) in the pixel grid. Each mask is simply rotated 90° from the other. The magnitude of the gradient is calculated by Equation (3). To avoid the square root computation, a computationally simple form of the Roberts cross edge detector is the Roberts absolute value estimation of the gradient given by Equation (4).
The main advantage of the Roberts cross edge operator is its fast computational speed. With a 2 × 2 convolution mask, only four pixels are involved in the gradient computation. But the Roberts cross edge operator has several undesirable characteristics. Because it is based on the first difference, its 2 × 2 diagonal masks lie off grid and cause edge location bias. The other weak points are that it is sensitive to noise and that its response to a gradual (or blurry) edge is weak. Figure 3(b)-(g) shows the Roberts cross edge maps with various threshold values. The experiment shows that the Roberts cross edge detector is sensitive to noise at low threshold values. As the threshold value increases, noise pixels are removed, but real edge pixels with weak responses are also removed (11).
Prewitt Edge Operator. The Prewitt edge operator uses the central difference to approximate differentiation. The two Prewitt convolution masks for the x- and y-directions are shown below:

$$G_x = \frac{1}{3}\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}, \quad G_y = \frac{1}{3}\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}$$
Because the $G_x$ and $G_y$ masks share a common center, the Prewitt operator has less edge-location bias compared with the first difference-based edge operators. It also accomplishes noise reduction in the orthogonal direction by means of local averaging (11).
The local gradient at pixel (x, y) can be estimated by convolving the image with the two Prewitt convolution masks $G_x$ and $G_y$, respectively. Mathematically we have

$$G_x = \frac{1}{3}\big([\,f(x - 1, y + 1) + f(x, y + 1) + f(x + 1, y + 1)\,] - [\,f(x - 1, y - 1) + f(x, y - 1) + f(x + 1, y - 1)\,]\big)$$
$$G_y = \frac{1}{3}\big([\,f(x + 1, y - 1) + f(x + 1, y) + f(x + 1, y + 1)\,] - [\,f(x - 1, y - 1) + f(x - 1, y) + f(x - 1, y + 1)\,]\big) \qquad (9)$$
Figure 3. Roberts cross edge maps using various threshold values: as the threshold value increases, noise pixels are removed, but real edge pixels with weak responses are also removed.
The Prewitt operators can be extended to detect edges tilting at 45° and 135° by using the following two masks:

$$\begin{bmatrix} -1 & -1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & -1 & 0 \end{bmatrix}$$
Figure 4(b) shows the vertical edge map generated by the Prewitt $G_x$ mask, and the horizontal edge map generated by the Prewitt $G_y$ mask is shown in Fig. 4(c). Combining these two horizontal and vertical edge maps, the complete edge map is generated as shown in Fig. 4(d).
Sobel Edge Detector. The Sobel gradient edge detector is very similar to the Prewitt edge detector except that it puts more emphasis on pixels close to the center of the masks. The Sobel edge detector (10,12) is one of the most widely used classic edge methods. Its x- and y-directional 3 × 3 convolution masks are as follows:

$$G_x = \frac{1}{4}\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad G_y = \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$
The local gradient at pixel (x, y) can be estimated by convolving the image with the two Sobel convolution masks $G_x$ and $G_y$, respectively:

$$G_x = \frac{1}{4}\big([\,f(x - 1, y + 1) + 2f(x, y + 1) + f(x + 1, y + 1)\,] - [\,f(x - 1, y - 1) + 2f(x, y - 1) + f(x + 1, y - 1)\,]\big)$$
$$G_y = \frac{1}{4}\big([\,f(x + 1, y - 1) + 2f(x + 1, y) + f(x + 1, y + 1)\,] - [\,f(x - 1, y - 1) + 2f(x - 1, y) + f(x - 1, y + 1)\,]\big) \qquad (10)$$
The Sobel operator puts more weight on the pixels near the center pixel of the convolution mask. Both masks are applied to every pixel in the image for the calculation of the gradient in the x- and y-directions. The gradient magnitude is calculated by Equation (3). The Sobel edge detectors can be extended to detect edges at 45° and 135° by using the two masks below:

$$\begin{bmatrix} -2 & -1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 2 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 & 2 \\ -1 & 0 & 1 \\ -2 & -1 & 0 \end{bmatrix}$$
Figure 5(j)-(l) shows the edge detection results generated by the Sobel edge detector with the threshold value T = 20. Figure 5 also shows the performance of each gradient-based edge detector in the presence of noise. To evaluate the noise sensitivity of each edge detector, 5% and 10% Gaussian noise are added to the original image as shown in Fig. 5(b) and (c), respectively. For fair comparisons, a fixed threshold value (T = 20) is used for all edge detectors. Figure 5(e) and (f) shows that the Roberts cross edge detector is sensitive to noise. On the other hand, the Sobel and the Prewitt edge detectors are less sensitive to noise. The Sobel operators provide both differencing and smoothing effects at the same time. Because of these characteristics, the Sobel operators are widely used for edge extraction. The smoothing effect is achieved through the involvement of the 3 × 3 neighbors, which makes the operator less sensitive to noise. The differencing effect is achieved through the higher gradient magnitude obtained by involving more pixels in the convolution in comparison with the Roberts operator. The Prewitt operator is similar to the Sobel edge detector, but the Prewitt operator's response to diagonal edges is weak compared with the response of the Sobel edge operator. Prewitt edge operators generate slightly fewer edge points than do Sobel edge operators.
Non-Maximum Suppression: A Postprocessing Step After the Gradient Operation. One difficulty in gradient edge detectors (and also in many other edge detectors) is how to select the best threshold (16) used to obtain the edge points. If the threshold is too high, real edges in the image can be missed. If the threshold is too low, nonedge points such as noise are detected as edges. The selection of the threshold critically affects the edge output of an edge operator. Another problem related to the gradient edge detectors is that edge outputs from the gradient-based methods appear to be several pixels wide rather than a single pixel (see Figs. 3-5). This problem arises because most edges in real-world images are not step edges and the grayscales around the edges change gradually from low to high intensity, or vice versa. So a thinning process such as nonmaximum suppression (14,15,17-19) may be needed after the edge detection.
The method of nonmaximum suppression is used to remove weak edges by suppressing the pixels with nonmaximum magnitudes in each cross section of the edge direction (15). Here, we introduce the algorithm proposed by Rosenfeld and Thurston (14,15,17-19). Let $\theta(i)$ denote the edge direction at pixel i, and let $G_m(i)$ denote the gradient magnitude at i.
Figure 4. Prewitt edge maps: (a) original image, (b) vertical edge map generated by $G_x$, (c) horizontal edge map generated by $G_y$, and (d) complete edge map, T = 15.
Step 0: For every edge point p(x, y), do the following steps.
Step 1: Find the two nearest neighboring pixels $q_1$ and $q_2$ that are perpendicular to the edge direction associated with the edge pixel p.
Step 2: If $|\theta(p) - \theta(q_1)| \le \alpha$ and $|\theta(p) - \theta(q_2)| \le \alpha$, then go to Step 3; else return to Step 0.
Step 3: If $G_m(p) \le G_m(q_1)$ or $G_m(p) \le G_m(q_2)$, then suppress the edge at pixel p.
Figure 6(a) shows how to choose $q_1$ and $q_2$ when the edge direction at pixel p is vertical (top-to-down), which is indicated by an arrow. Four edge directions are often used in the nonmaximum suppression method: 0°, 45°, 90°, and 135°, and all edge orientations are quantized to these four directions. Figure 5(d) and (g) shows the results after nonmaximum suppression is applied to the edge images in Fig. 5(c) and (f), respectively. Figure 5(d) has 71.6% fewer edge pixels compared with the edge pixels in Fig. 5(c). Figure 5(g) has 63.6% fewer edge pixels compared with the edge pixels in Fig. 5(f). In our experiments, more than 50% of the edge points that were generated by the gradient-based edge operators are considered as nonlocal maxima and suppressed.
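A simplified NumPy sketch of nonmaximum suppression along the quantized gradient direction is shown below. It keeps a pixel only when its magnitude is at least as large as both neighbors across the edge, which captures the spirit of Steps 1-3 rather than being a literal transcription; the direction-to-offset mapping is an assumption.

```python
import numpy as np

def non_maximum_suppression(magnitude, direction_deg):
    """Keep a pixel only if its gradient magnitude is a local maximum along
    the (quantized) gradient direction; directions are quantized to
    0, 45, 90, or 135 degrees, as in the text."""
    offsets = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}
    quantized = (np.round(direction_deg / 45.0) * 45) % 180
    out = np.zeros_like(magnitude, dtype=bool)
    for i in range(1, magnitude.shape[0] - 1):
        for j in range(1, magnitude.shape[1] - 1):
            di, dj = offsets[int(quantized[i, j])]
            q1 = magnitude[i + di, j + dj]       # neighbor on one side
            q2 = magnitude[i - di, j - dj]       # neighbor on the other side
            out[i, j] = magnitude[i, j] >= q1 and magnitude[i, j] >= q2
    return out
```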
Figure 5. Performance evaluation in the presence of noise for gradient-based edge operators: (a) original image, (b) 5% Gaussian noise added, (c) 10% Gaussian noise added, (d)-(f) Roberts edge maps, (g)-(i) Prewitt edge maps, and (j)-(l) Sobel edge maps. The same threshold (T = 20) is used for fair comparison.

Second-Derivative Operators
The gradient-based edge operators discussed above produce a large response across an area where an edge is present. We use an example to illustrate this problem. Figure 2(a) shows a cross cut of the smooth step edge, and its first derivative (or gradient) is shown in Fig. 2(b). After a threshold is applied to the magnitude of the gradient, all pixels above the
threshold, for example, the pixels in Fig. 2(b) between
x0 ~ x1 and x2 ~ x3 are considered as edge points. As a
result, too many edge points occur, which causes an edge
localization problem: Where is the true location of the edge?
Therefore, a good edge detector should not generate
multiple responses to a single edge. To solve this problem,
the second-derivative-based edge detectors have been
developed.
Laplacian Edge Detector. The Laplacian of a function f(x, y) is given by

$$\nabla^2 f(x, y) = \frac{\partial^2 f(x, y)}{\partial x^2} + \frac{\partial^2 f(x, y)}{\partial y^2} \qquad (11)$$
Because the Laplacian edge detector defines an edge as the zero crossing of the second derivative, a single response is generated for a single edge location, as observed in Fig. 2(c). The discrete Laplacian convolution mask is constructed as follows.
For a digital image, $G_x$ is approximated by the difference

$$\frac{\partial f(x, y)}{\partial x} = G_x \approx f(x + 1, y) - f(x, y),$$

so

$$\frac{\partial^2 f(x, y)}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial f(x, y)}{\partial x}\right) = \frac{\partial G_x}{\partial x} \approx \frac{\partial f(x + 1, y)}{\partial x} - \frac{\partial f(x, y)}{\partial x} = [\,f(x + 2, y) - f(x + 1, y)\,] - [\,f(x + 1, y) - f(x, y)\,] = f(x + 2, y) - 2f(x + 1, y) + f(x, y) \qquad (12)$$

This approximation is centered at the pixel (x + 1, y). By replacing x + 1 with x, we have

$$\frac{\partial^2 f(x, y)}{\partial x^2} = f(x + 1, y) - 2f(x, y) + f(x - 1, y) = \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}$$

Similarly,

$$\frac{\partial^2 f(x, y)}{\partial y^2} = f(x, y + 1) - 2f(x, y) + f(x, y - 1) = \begin{bmatrix} 1 & -2 & 1 \end{bmatrix} \qquad (13)$$

By combining the x and y second partial derivatives, the Laplacian convolution mask can be approximated as follows:

$$\nabla^2 f(x, y) = \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 & -2 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
Other Laplacian convolution masks are constructed similarly by using different derivative estimations and different mask sizes (11). Two other 3 × 3 Laplacian masks are

$$\begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} -1 & 2 & -1 \\ 2 & -4 & 2 \\ -1 & 2 & -1 \end{bmatrix}$$
Figure 6. Experiments with nonmaximum suppression: (a) an example of how to select $q_1$ and $q_2$ when the edge direction is top-to-down, (b) and (e) original input images, (c) and (f) Sobel edge maps (T = 20) before nonmaximum suppression, and (d) and (g) edge maps after nonmaximum suppression is applied to (c) and (f), respectively.
After the convolution of an image with a Laplacian mask, edges are found at the places where the convolved image values change sign from positive to negative (or vice versa) passing through zero. The Laplacian edge detector is omnidirectional (isotropic), and it highlights edges in all directions. Another property of the Laplacian edge detector is that it generates closed curves of the edge contours. But the Laplacian edge detector alone is seldom used in real-world computer vision applications because it is too sensitive to image noise. As shown in Fig. 7(c), even a very small local peak in the first derivative has its second derivative cross through zero. The Laplacian edge detector may therefore generate spurious edges because of image noise. To avoid the effect of noise, Gaussian filtering is often applied to an image before the Laplacian operation.
Marr-Hildreth: Laplacian of Gaussian. Marr and Hildreth (13) combined the Gaussian noise filter with the Laplacian into one edge detector called the Laplacian of Gaussian (LoG). They provided the following strong arguments for the LoG:
1. Edge features in natural images can occur at various scales and with different sharpness. The LoG operator can be applied to detect edges at multiple scales.
2. Some form of smoothing is essential for removing noise in many real-world images. The LoG is based on the filtering of the image with a Gaussian smoothing filter. The Gaussian filtering reduces the noise sensitivity problem when zero crossing is used for edge detection.
3. A zero crossing in the LoG is isotropic and corresponds to an extreme value in the first derivative.
Theoretically, the convolution of the input image f(x, y) with a two-dimensional (2-D) Gaussian filter G(x, y) can be expressed as

$$S(x, y) = G(x, y) * f(x, y), \quad \text{where } G(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (14)$$
Then, the Laplacian (the second derivative) of the convolved image S is obtained by

$$\nabla^2 S(x, y) = \nabla^2 [G(x, y) * f(x, y)] = [\nabla^2 G(x, y)] * f(x, y) \qquad (15)$$

The order of the Laplacian and convolution operations can be interchanged because these two operations are linear and shift invariant (11), as shown in Equation (15). Computing the Laplacian of the Gaussian-filtered image $\nabla^2 [G(x, y) * f(x, y)]$ yields the same result as convolving the image with the Laplacian of the Gaussian filter, $[\nabla^2 G(x, y)] * f(x, y)$. The Laplacian of the Gaussian filter $\nabla^2 G(x, y)$ is defined as follows:

$$\mathrm{LoG}(x, y) = \nabla^2 G(x, y) = -\frac{1}{\pi \sigma^4}\left[1 - \frac{x^2 + y^2}{2\sigma^2}\right] e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (16)$$
where σ is the standard deviation, which also determines the spatial scale of the Gaussian. Figure 8(a) shows a one-dimensional (1-D) LoG profile and (b) shows a 2-D LoG profile. A 2-D LoG operation can be approximated by a convolution kernel. The size of the kernel is determined by the scale parameter σ. A discrete kernel that approximates the LoG with σ = 1.4 is shown in Fig. 9. In summary, edge detection using the LoG consists of the following two steps:

1. Convolve the image with a 2-D LoG mask at a selected scale σ.
2. Consider as edges only those zero-crossing pixels whose corresponding first derivatives are above a threshold. To find a zero crossing, we need to check four cases for the signs of the two opposing neighboring pixels: up/down, right/left, and the two diagonals.
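The two steps above can be sketched in Python as follows, assuming SciPy's gaussian_laplace for the LoG convolution; the function name, the default parameter values, and the restriction of the zero-crossing test to horizontal and vertical neighbor pairs (diagonals omitted for brevity) are illustrative choices rather than part of the original description.

```python
import numpy as np
from scipy import ndimage

def log_edges(image, sigma=1.4, grad_thresh=10.0):
    """Marr-Hildreth LoG edge detection sketch.

    Step 1: convolve with a LoG kernel at scale sigma.
    Step 2: keep zero-crossing pixels whose first-derivative magnitude
    exceeds grad_thresh.
    """
    img = image.astype(float)
    log = ndimage.gaussian_laplace(img, sigma=sigma)

    # Detect sign changes between each pixel and its right/bottom neighbors.
    zero_cross = np.zeros_like(log, dtype=bool)
    zero_cross[:, :-1] |= (log[:, :-1] * log[:, 1:]) < 0   # left/right pair
    zero_cross[:-1, :] |= (log[:-1, :] * log[1:, :]) < 0   # up/down pair

    # Gradient magnitude of the smoothed image, used to suppress
    # weak zero crossings caused by noise.
    smoothed = ndimage.gaussian_filter(img, sigma=sigma)
    gx = ndimage.sobel(smoothed, axis=1)
    gy = ndimage.sobel(smoothed, axis=0)
    grad_mag = np.hypot(gx, gy)

    return zero_cross & (grad_mag > grad_thresh)
```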
Figure 7. Illustration of spurious edges generated by zero crossings: (a) 1-D profile of a step edge with noise, (b) the first derivative of the step edge, and (c) the second derivative of the step edge. The zero crossings of the second derivative create several spurious edge points (x_0, x_1, x_2, and x_3).

Note that the results of edge detection depend largely on the σ value of the LoG. The convolution mask is larger for a larger σ, which is suitable for detecting large-scale edges. Figure 10 shows edge maps generated by the
LoG operator with various scales. Figure 10(b)-(e) show the LoG edge maps with σ = 1.0, 1.5, 2.0, and 2.5, respectively. In Fig. 10(f), two different scales, σ_1 = 0.7 and σ_2 = 2.3, are used, and the result is obtained by selecting the edge points that occurred at both scales. Figure 10(g) and (h) use the gradient magnitude threshold to reduce noise and broken contours.

In comparison with the edge images based on the gradient methods in Figs. 3-5, the edge maps from the LoG are thinner than those of the gradient-based edge detectors. Because of the smoothing filter in the LoG operator, its edge images (Fig. 10) are robust to noise; however, sharp corners are lost at the same time. Another interesting feature of the LoG is that it generates closed edge contours. However, spurious edge loops may appear in the outputs, and edge locations may shift at large scales (8). Both the LoG and the Laplacian edge detectors are isotropic filters, and it is not possible to extract edge orientation information directly from these edge operators. Postprocesses such as nonmaximum suppression and hysteresis thresholding are therefore not applicable.
Advanced Edge Detection Method: Canny Edge Detection

The Canny edge detector (3,16,20) is one of the most widely used edge detectors. In 1986, John Canny proposed the following three design criteria for an edge detector:

1. Good localization: The edge location found by the edge operator should be as close to the actual location in the image as possible.
2. Good detection with a low error rate: An optimal edge detector should respond only to true edges. In other words, no spurious edges should be found, and true edges should not be missed.
3. Single response to a single edge: The optimal edge detector should not respond with multiple edge pixels to the place where only a single edge exists.
Following these design criteria, Canny developed an optimal edge detection algorithm based on a step edge model with white Gaussian noise. The Canny edge detector involves the first derivative of a Gaussian smoothing filter with standard deviation σ. The choice of σ for the Gaussian filter controls the filter width and the amount of smoothing (11). The steps of the Canny edge detector are described as follows:
1. Smoothing using Gaussian filtering: A 2-D Gaussian filter G(x, y) given by Equation (14) is applied to the image to remove noise. The standard deviation σ of this Gaussian filter is a scale parameter of the edge detector.

2. Differentiation: Compute the gradient G_x in the x-direction and G_y in the y-direction using any of the gradient operators (Roberts, Sobel, Prewitt, etc.). The magnitude G_m and direction θ_g of the gradient can be calculated as

G_m = \sqrt{G_x^2 + G_y^2}, \quad θ_g = \tan^{-1}(G_y / G_x)    (16)
3. Nonmaximum suppression: Apply the nonmaximum suppression operation (see the section on Nonmaximum Suppression for details) to remove spurious edges.

4. Hysteresis process: The hysteresis step is the unique feature of the Canny edge operator (20). The hysteresis process uses two thresholds, a low threshold t_low and a high threshold t_high. The high threshold is usually two or three times larger than the low threshold. If the magnitude of the edge value at a pixel p is lower than t_low, then the pixel p is immediately marked as a non-edge point. If the magnitude of the edge value at pixel p is greater than t_high, then it is immediately marked as an edge. Then any pixel that is connected to a marked edge pixel p and whose edge magnitude is greater than the low threshold t_low is also marked as an edge pixel. The edge marking of the hysteresis process is implemented recursively. This hysteresis procedure has the effect of reducing broken edge contours and of removing noise in an adaptive manner (11).

Figure 8. (a) a 1-D LoG profile and (b) a 2-D LoG profile.

 0   1   1   2   2   2   1   1   0
 1   2   4   5   5   5   4   2   1
 1   4   5   3   0   3   5   4   1
 2   5   3 -12 -24 -12   3   5   2
 2   5   0 -24 -40 -24   0   5   2
 2   5   3 -12 -24 -12   3   5   2
 1   4   5   3   0   3   5   4   1
 1   2   4   5   5   5   4   2   1
 0   1   1   2   2   2   1   1   0

Figure 9. A discrete kernel that approximates the LoG with σ = 1.4.
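The recursive edge marking of step 4 above can equivalently be expressed with connected components, as in the following sketch; the function name hysteresis_threshold, the 8-connectivity choice, and the use of scipy.ndimage.label are assumptions for illustration, not the original formulation.

```python
import numpy as np
from scipy import ndimage

def hysteresis_threshold(grad_mag, t_low, t_high):
    """Keep weak edge pixels (>= t_low) only if they are connected, through
    other weak or strong pixels, to at least one strong pixel (>= t_high)."""
    strong = grad_mag >= t_high
    candidate = grad_mag >= t_low              # strong pixels are also candidates

    # Label 8-connected components of the candidate mask.
    labels, num = ndimage.label(candidate, structure=np.ones((3, 3)))

    # A component survives if it contains at least one strong pixel.
    keep = np.zeros(num + 1, dtype=bool)
    keep[np.unique(labels[strong])] = True
    keep[0] = False                            # background label

    return keep[labels]
```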
The performance of the Canny operator is determined by three parameters: the standard deviation σ of the Gaussian filter and the two thresholds t_low and t_high used in the hysteresis process. The noise elimination and the localization error are both related to the standard deviation σ of the Gaussian filter: if σ is larger, noise elimination is increased, but the localization error can also be more serious. Figures 11 and 12 illustrate the results of the Canny edge operator. Figure 11 demonstrates the effect of various σ values with the same upper and lower threshold values. As the σ value increases, noise pixels are removed, but sharp corners are lost at the same time. The effect of the hysteresis threshold is shown in Fig. 12. Figure 12(a) and (c) are edge maps obtained with the hysteresis threshold. Edge maps with a hysteresis threshold have fewer broken contours than edge maps with one threshold: compare Fig. 12(a) with Fig. 12(b). Table 1 summarizes the computational cost of each edge operator used on gray images. The computational cost of the Canny operator is higher than those of the other operators, but the Canny operator generates a more detailed edge map in comparison with the edge maps generated by the other edge operators.
EDGE DETECTION IN COLOR IMAGES

What Is a Color Image?

An image pixel in a color image is represented by a vector that consists of three components. Several different ways exist to represent a color image, such as RGB, YIQ, HSI, and CIE L*u*v*. The RGB, the YIQ, and the HSI are the color models most often used for image processing (10).

Figure 10. The LoG operator edge maps: (b) σ = 1.0, (c) σ = 1.5, (d) σ = 2.0, (e) σ = 2.5, (f) two scales used, σ_1 = 0.7 and σ_2 = 2.3, (g) σ = 2.0 and T = 15, and (h) σ = 2.0 and T = 20.

Figure 11. The effect of the standard deviation σ of the Gaussian smoothing filter in the Canny operator: (b) σ = 0.5, (c) σ = 1.0, (d) σ = 1.5, (e) σ = 2.0, (f) σ = 2.5, and (g) σ = 3.0, with T_high = 100 and T_low = 40.

The commonly known RGB color model consists of three color components: red (R), green (G), and blue (B). The RGB color model corresponds most closely to the physical sensors for colored light in most color CCD sensors (21). RGB is a commonly used color model for the digital pictures acquired by digital cameras (22). The three components in the RGB model are highly correlated, so if the light changes, all three components change accordingly (23).
The YIQ color space is the standard for color television broadcasts in North America. The YIQ color space is obtained from the RGB color model by the linear transformation shown below (10):

\begin{bmatrix} Y \\ I \\ Q \end{bmatrix} =
\begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.275 & -0.321 \\ 0.212 & -0.523 & 0.311 \end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}    (17)

In the YIQ color space, Y measures the luminance of the color value, and I and Q are two chromatic components called in-phase and quadrature. The main advantage of the YIQ color model in image processing is that the luminance (Y) and the two chromatic components (I and Q) are separated (10,24).
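To make Equation (17) concrete, the following minimal NumPy sketch applies the transform per pixel; the function name rgb_to_yiq and the assumption that the input is an H × W × 3 array are illustrative choices, not part of the original text.

```python
import numpy as np

# RGB -> YIQ transform matrix from Equation (17).
RGB_TO_YIQ = np.array([
    [0.299,  0.587,  0.114],
    [0.596, -0.275, -0.321],
    [0.212, -0.523,  0.311],
])

def rgb_to_yiq(image):
    """Convert an H x W x 3 RGB image to YIQ by applying Equation (17) to every pixel."""
    return image.astype(float) @ RGB_TO_YIQ.T
```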
The HSI (hue, saturation, intensity) model, also known as HSB (hue, saturation, brightness), and its generalized form, the HSV (hue, saturation, value) model, are also used frequently in color image processing (24-26). The HSI model corresponds more closely to the human perception of color qualities. In the HSI model, hue, the dominant color, is represented by an angle in a color circle where the primary colors are separated by 120°, with red at 0°, green at 120°, and blue at 240°. Saturation is the purity of the color: high saturation values are assigned to pure spectral colors and low values to mixed shades. The intensity is associated with the overall brightness of a pixel (21).
Color Edge Detection versus Gray-Level Edge Detection

The use of color in edge detection implies that more information about edges is available in the process, and the edge detection results should be better than those for gray-level images. Novak and Shafer (27) found that 90% of edge pixels are about the same in edge images obtained from gray-level and from color images. But the remaining 10% of edge pixels are left undetected when only gray-level information is used (28). This is because an edge may not be detected in the intensity component of a low-contrast image, yet it can be detected in the chromatic components. For some applications, these undetected edges may contain crucial information for a later processing step, such as edge-based image segmentation or matching (29). Figure 13 demonstrates a typical example of the differences between color edge detectors and gray-level edge detectors. The gray-level Sobel edge detector missed many (or all) real edge pixels because of the low contrast in the intensity component, as shown in Fig. 13(c) and (d).

A color image is considered as a two-dimensional array f(x, y) with three components for each pixel. One major concern in color edge detection is the high computational complexity, which increases significantly in comparison with edge detection in gray-value images (see Table 2 for the computational cost comparison between the color Sobel operator and the gray-level Sobel operator for various image sizes).
Table 1. Computational cost comparison for gray-image edge operators (computational time in seconds)

Edge operator      680 by 510    1152 by 864    1760 by 1168
Roberts            0.032         0.125          0.219
Prewitt            0.062         0.234          0.422
Sobel              0.047         0.187          0.343
Roberts + NMS*     0.141         0.438          0.1
Prewitt + NMS*     0.14          0.516          1.188
Sobel + NMS*       0.109         0.531          1.234
LoG                0.5           1.453          2.984
Canny              0.469         1.531          3.172

*NMS = nonmaximum suppression.

Figure 12. The effect of the hysteresis threshold in the Canny operator, with fixed σ = 1.5: (a) t_high = 100, t_low = 20; (b) t_high = t_low = 100; (c) t_high = 60, t_low = 20; and (d) t_high = 60, t_low = 60.

Definition of Edge in Color Images

The approaches for detecting edges in color images depend on the definition of an edge. Several definitions have been
proposed, but a precise definition of a color edge has not been established so far (30). G. S. Robinson (24) defined a color edge as the place where a discontinuity occurs in the image intensity. Under this definition, edge detection would be performed in the intensity channel of a color image in the HSI space. But this definition provides no explanation of possible discontinuities in the hue or saturation values. Another definition is that an edge in a color image is where a discontinuity occurs in one color component. This definition leads to various edge detection methods that perform edge detection in all three color components and then fuse these edges into an output edge image (30). One problem facing this type of edge detection method is that the edges in each color component may be localized inaccurately. The third definition of color edges is based on the calculation of gradients in all three color components. This type of multidimensional gradient method combines the three gradients into one to detect edges. The sum of the absolute values of the gradients is often used to combine the gradients.

To date, most color edge detection methods have been based on differential grayscale edge detection methods, such as finding the maximum in the first derivative or the zero crossing in the second derivative of the image function. One difficulty in extending these methods to color images originates from the fact that the color image has vector values. The monochromatic-based definition lacks consideration of the relationship among the three color components. After the gradient is calculated for each component, the question of how to combine the individual results remains open (31).

Because pixels in a color image are represented by a vector-valued function, several researchers have proposed vector-based edge detection methods (32-36). Cumani (32,35) and Zenzo (36) defined edge points at the locations of directional maxima of the contrast function. Cumani suggested a method to calculate a local measure of directional contrast based on the gradients of the three color components.
Color Edge Detection Methods

Monochromatic-Based Methods. The monochromatic-based methods extend the edge detection methods developed for gray-level images to each color component. The results from all color components are then combined to generate the final edge output. The following introduces commonly used methods.

Method 1: the Sobel operator and multidimensional gradient method

(i) Apply the Sobel operator to each color component.
(ii) Calculate the mean of the gradient magnitude values in the three color components.
(iii) An edge exists if the mean of the gradient magnitudes exceeds a given threshold (28,30).

Note that the sum of the gradient magnitudes in the three color components can also be used instead of the mean in Step (ii); a minimal sketch of this method is given below.
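As a rough illustration of Method 1, the sketch below applies the Sobel operator per channel and thresholds the mean gradient magnitude; the function name color_sobel_edges, the default threshold, and the use of SciPy's ndimage.sobel are illustrative assumptions rather than part of the original description.

```python
import numpy as np
from scipy import ndimage

def color_sobel_edges(rgb, threshold=20.0):
    """Monochromatic color edge detection (Method 1): apply Sobel to each channel,
    average the gradient magnitudes, and threshold the mean magnitude."""
    rgb = rgb.astype(float)
    mags = []
    for c in range(rgb.shape[2]):                  # loop over R, G, B
        gx = ndimage.sobel(rgb[:, :, c], axis=1)
        gy = ndimage.sobel(rgb[:, :, c], axis=0)
        mags.append(np.hypot(gx, gy))              # Step (i)
    mean_mag = np.mean(mags, axis=0)               # Step (ii)
    return mean_mag > threshold                    # Step (iii)
```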
Method 2: the Laplacian and fusion method

(i) Apply the Laplacian mask or the LoG mask to each color component.
(ii) An edge exists at a pixel if it has a zero crossing in at least one of the three color components (28,31). (A sketch of this fusion rule follows.)
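A corresponding sketch of Method 2, again with assumed names (color_log_zero_crossings) and an assumed LoG implementation via scipy.ndimage.gaussian_laplace; for brevity it checks zero crossings only between horizontal and vertical neighbor pairs.

```python
import numpy as np
from scipy import ndimage

def color_log_zero_crossings(rgb, sigma=1.5):
    """Monochromatic color edge detection (Method 2): mark a pixel as an edge
    if any of the three channels has a LoG zero crossing there."""
    rgb = rgb.astype(float)
    edges = np.zeros(rgb.shape[:2], dtype=bool)
    for c in range(rgb.shape[2]):
        log = ndimage.gaussian_laplace(rgb[:, :, c], sigma=sigma)
        zc = np.zeros_like(edges)
        zc[:, :-1] |= (log[:, :-1] * log[:, 1:]) < 0   # horizontal sign changes
        zc[:-1, :] |= (log[:-1, :] * log[1:, :]) < 0   # vertical sign changes
        edges |= zc                                    # fuse: zero crossing in any channel
    return edges
```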
Table 2. Computational cost comparison: the gray-level Sobel edge detector versus the color Sobel edge detector (computational time in seconds)

Operator                    680 by 510    1152 by 864    1760 by 1168
Gray-level Sobel operator   0.047         0.187          0.343
Color Sobel operator        0.172         0.625          1.11
Figure 13. Color versus gray edge detectors: (a) color image [used with permission from John Owens at the University of California, Davis], (b) edge map generated by the color edge detector, and (c) and (d) edge maps generated by the gray-level Sobel operator.
Experimental results with the multidimensional Sobel operator are shown in Fig. 14(b) and (c). The color Sobel operator generates more detailed edge maps [Fig. 14(b) and (c)] compared with the edge map generated by the gray-level Sobel operator in Fig. 14(d). But the computational cost of the color Sobel operator is roughly three times that of the gray-level Sobel operator, as shown in Table 2.
Vector-Based Methods. Color Variants of the Canny Edge Operator. Kanade (37) introduced an extension of the Canny edge detector (3) for edge detection in color images. Let a vector C = (r(x, y), g(x, y), b(x, y)) represent a color image in the RGB color space. The partial derivatives of the color vector can be expressed by a Jacobian matrix J as below:

J = \left[ \frac{\partial C}{\partial x}, \frac{\partial C}{\partial y} \right] = \begin{bmatrix} \partial r/\partial x & \partial r/\partial y \\ \partial g/\partial x & \partial g/\partial y \\ \partial b/\partial x & \partial b/\partial y \end{bmatrix} = (G_x, G_y)    (18)
The direction θ and the magnitude G_m of a color edge are given by

\tan(2θ) = \frac{2\, G_x \cdot G_y}{\|G_x\|^2 - \|G_y\|^2}    (19)

G_m = \sqrt{ \|G_x\|^2 \cos^2 θ + 2\, G_x \cdot G_y \sin θ \cos θ + \|G_y\|^2 \sin^2 θ }    (20)
where G_x and G_y are the partial derivatives of the three color components and \|\cdot\| is a mathematical norm. Several variations exist based on different norms, such as the L_1-norm (sum of the absolute values), the L_2-norm (Euclidean distance), and the L_∞-norm. Cumani (32,35) instead defines color edges through a local measure of directional contrast: the squared local contrast of the color image f at a point P in the direction of a unit vector u = (u_1, u_2) is

S(P; u) = u^T J u = (u_1, u_2) \begin{pmatrix} E & F \\ F & G \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = E u_1^2 + 2 F u_1 u_2 + G u_2^2

where

J = \nabla f (\nabla f)^T = \begin{pmatrix} E & F \\ F & G \end{pmatrix}, \quad E = \sum_{i=1}^{3} \left( \frac{\partial f_i}{\partial x} \right)^2, \quad F = \sum_{i=1}^{3} \frac{\partial f_i}{\partial x} \frac{\partial f_i}{\partial y}, \quad G = \sum_{i=1}^{3} \left( \frac{\partial f_i}{\partial y} \right)^2    (21)
The maximal value λ_+ of S(P, u) is

λ_+ = \frac{(E + G) + \sqrt{(E - G)^2 + 4 F^2}}{2}    (22)

This maximum occurs when u_+ is the corresponding eigenvector (35):

u_+ = \left( \sqrt{\frac{1 + C}{2}},\; \pm\sqrt{\frac{1 - C}{2}} \right), \quad where \quad C = \frac{E - G}{\sqrt{(E - G)^2 + (2F)^2}}    (23)
Figure 14. Experiments with the color Sobel operator: (a) color input; (b) and (c) color Sobel operator with T = 15 and T = 20, respectively; and (d) gray Sobel operator with T = 20.

Edge points are defined at the locations where λ_+ has a local maximum along the direction of the unit vector u_+. So the
zero crossings of λ_+ in the direction of u_+ are candidates for edges (33). To find the zero crossings of λ_+, the directional derivative is defined as

\nabla λ_+ \cdot u_+ = \nabla S(P; u_+) \cdot u_+ = \frac{\partial E}{\partial x} u_1^3 + \left( \frac{\partial E}{\partial y} + 2 \frac{\partial F}{\partial x} \right) u_1^2 u_2 + \left( 2 \frac{\partial F}{\partial y} + \frac{\partial G}{\partial x} \right) u_1 u_2^2 + \frac{\partial G}{\partial y} u_2^3    (24)

where E, F, and G are defined in Equation (21). Finally, edge points are determined by computing the zero crossing of \nabla λ_+ \cdot u_+ and the sign of \nabla λ_+ \cdot u_+ along a curve tangent to u_+ at the point P.
Cumani tested this edge detector on color images in the RGB space with the assumption that a Euclidean metric exists for the vector space (32). Figure 16(b), (c), and (d) show the edge detection results generated by the Cumani color edge operator at different scales. The Cumani color edge operator appears to generate more detailed edge images in comparison with the edge images generated by the monochromatic-based methods.
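The core of the contrast computation in Equations (21) and (22) can be sketched as follows; note that this produces only the contrast map λ_+, whereas the full operator locates zero crossings of Equation (24) along u_+, which is omitted here. The function name cumani_contrast and the use of Sobel derivatives are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

def cumani_contrast(rgb):
    """Maximal squared local contrast lambda_+ of Eq. (22), built from the
    per-channel derivatives E, F, G of Eq. (21). Thresholding lambda_+ (or
    locating its local maxima along u_+) gives candidate color edges."""
    rgb = rgb.astype(float)
    E = F = G = 0.0
    for c in range(rgb.shape[2]):
        fx = ndimage.sobel(rgb[:, :, c], axis=1)   # df_i/dx
        fy = ndimage.sobel(rgb[:, :, c], axis=0)   # df_i/dy
        E = E + fx * fx
        F = F + fx * fy
        G = G + fy * fy
    lam = 0.5 * (E + G + np.sqrt((E - G) ** 2 + 4.0 * F ** 2))   # Eq. (22)
    return lam
```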
EDGE DETECTION METHODS IN RANGE IMAGES

Range images are a special class of digital images. Range images encode the distance information from the sensor to the objects, so pixel values in range images are related directly to the positions of the surface geometry. Therefore, range images provide an explicit encoding of the local structure and geometry of the objects in the scene. Edge detection methods developed for intensity images focus mainly on the detection of step edges. In range images, it is possible to detect correctly both step edges and roof edges because of the available depth information.

Hoffman and Jain (38) described three edge types in range images: step edges, roof edges, and smooth edges. Step edges are composed of pixels at which the depth values are significantly discontinuous compared with those of their neighboring pixels. Roof edges are where the depth values are continuous, but the directions of the surface normal change abruptly. Smooth edges are related to discontinuities in the surface curvature, but smooth edges occur relatively seldom in range images. Step edges in range images can be detected with ordinary gradient edge operators, but roof edges are difficult to detect (39). Thus, an edge detection method for a range image must take into account both types of edges: discontinuities in depth and discontinuities in the direction of the surface normal.
Edge Detection Using Orthogonal Polynomials

Besl and Jain (40,41) proposed a method that uses orthogonal polynomials for estimating the derivatives in range images. To estimate the derivatives, they used locally fit discrete orthogonal polynomials with an unweighted least-squares technique. Using smooth second-order polynomials, range images were approximated locally. This method provided a smoothing effect and computational efficiency by obtaining the coefficients of the polynomials directly. But unweighted least squares could cause errors in image differentiation. To overcome this problem, a weighted least-squares approach was proposed by Baccar and colleagues (42,43).
Figure 15. Experiments with the color Canny operator: color Canny edge maps (a) σ = 0.8, t_high = 50, t_low = 10; (b) σ = 0.8, t_high = 100, t_low = 10; (c) σ = 1.0, t_high = 100, t_low = 10; and (d) the edge map generated by the gray-level Canny operator with σ = 1.5, t_high = 100, t_low = 20.

Figure 16. Experiments with the Cumani operator: (a) color input image and the Cumani edge maps (b) σ = 1.0, (c) σ = 1.2, and (d) σ = 1.5.

Extraction of Step Edges. The step edge detection method proposed by Baccar and colleagues (42) is based on the use of locally fit discrete orthogonal polynomials with a weighted least-squares technique. For the weighted least-squares approximation W(x), a one-dimensional Gaussian kernel of unit amplitude with zero mean and
standard deviation σ is used. The two-dimensional Gaussian kernel at the center of the window can be represented by the product of W(x) and W(y). Let W(x) = e^{-x^2/(2\sigma^2)}. Then a one-dimensional set of second-degree orthogonal polynomials w_0, w_1, w_2 is defined as

w_0(x) = 1, \quad w_1(x) = x, \quad w_2(x) = x^2 - A, \quad where \quad A = \frac{\sum_x x\, W(x)\, w_1(x)\, w_0(x)}{\sum_x W(x)\, w_0^2(x)}    (25)
A locally approximated range image \hat{r}(x, y) is calculated with a second-degree Gaussian-weighted orthogonal polynomial as follows (42):

\hat{r}(x, y) = \sum_{i + j \le 2} a_{ij}\, w_i(x)\, w_j(y)
 = a_{00} + a_{10} w_1(x) + a_{01} w_1(y) + a_{11} w_1(x) w_1(y) + a_{20} w_2(x) + a_{02} w_2(y)
 = a_{00} + a_{10} x + a_{01} y + a_{11} xy + a_{20}(x^2 - A) + a_{02}(y^2 - A)
 = a_{10} x + a_{01} y + a_{11} xy + a_{20} x^2 + a_{02} y^2 + [a_{00} - A(a_{02} + a_{20})]    (26)
At a differentiable point of the surface, the quantities that determine the surface normal are the coefficients

a_{ij} = \frac{1}{\delta_i \delta_j} \sum_{x, y} r(x, y)\, W(x)\, W(y)\, w_i(x)\, w_j(y), \quad where \quad \delta_i = \sum_{x = -M}^{M} W(x)\, w_i^2(x)    (27)
The partial derivatives of the approximated range image \hat{r}(x, y) are defined by the following equations:

\hat{r}_x(x, y) = a_{10} + a_{11} y + 2 a_{20} x, \qquad \hat{r}_y(x, y) = a_{01} + a_{11} x + 2 a_{02} y    (28)

At the center of the discrete window, for (x, y) = (0, 0), the partial derivatives are computed by

\hat{r}_x(0, 0) = a_{10}, \qquad \hat{r}_y(0, 0) = a_{01}    (29)

The gradient magnitude at the center of this discrete window is \sqrt{a_{10}^2 + a_{01}^2}. The step edge image, g_step(x, y), is obtained directly from the coefficients of the polynomials.
Extraction of Roof Edges. Roof edges are defined as discontinuities in the orientation of the surface normals of objects in a range image. The quantity that defines the surface normal at a differentiable point of a surface is defined in Equation (27). The approximation to the surface normal, which is the angle between the two projections of the normal vector n on the (x, z)- and (y, z)-planes, is computed using

g(x, y) = \tan^{-1}(r_y / r_x), \quad where \quad n = (-r_x, -r_y, 1)^T / (1 + r_x^2 + r_y^2)^{1/2}    (30)

The partial derivatives r_x and r_y of the function r(x, y) are calculated using the same Gaussian-weighted least squares as in Equation (28). The quantity g(x, y) represents the surface normal, and a 5 × 5 median filter is applied to produce the final surface normal image. The roof edge image g_roof(x, y) is computed from the final surface normal image using the weighted Gaussian approach (42,43). The final edge map is generated by a fusion step that combines the step edge image g_step(x, y) and the roof edge image g_roof(x, y), followed by a morphological step (42).
Edge Detection via Normal Changes

Sze et al. (44), as well as Mallat and Zhong (45), presented an edge detection method for range images based on normal changes. They pointed out that the depth changes of a point in a range image with respect to its neighbors are not sufficient to detect all existing edges and that normal changes are much more significant than depth changes. Therefore, the step and roof edges in range images are identified by detecting significant normal changes.

Let p(u, v) be a point on a differentiable surface S, with p(u, v) = (u, v, f(u, v)). If we denote p_u and p_v as the partial derivatives of p(u, v) in the u- and v-directions, respectively, then the partial derivatives of p(u, v) are given as follows (44):

p_u = \frac{\partial p(u, v)}{\partial u} = (1, 0, f_u(u, v)), \qquad p_v = \frac{\partial p(u, v)}{\partial v} = (0, 1, f_v(u, v))    (31)

The normal of the tangent plane at p(u, v) is defined as

N(u, v) = \frac{p_u \times p_v}{|p_u \times p_v|}    (32)
Substituting Equation (31) into Equation (32), the normal N(u, v) can be rewritten as

N(u, v) = \left( \frac{-f_u}{\sqrt{1 + f_u^2 + f_v^2}},\; \frac{-f_v}{\sqrt{1 + f_u^2 + f_v^2}},\; \frac{1}{\sqrt{1 + f_u^2 + f_v^2}} \right) = (n_1(u, v),\, n_2(u, v),\, n_3(u, v)),

where f_u = \partial f(u, v)/\partial u and f_v = \partial f(u, v)/\partial v    (33)
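To make Equation (33) concrete, the following sketch computes the unit normal field of a range image using simple finite differences in place of the orthogonal-polynomial or wavelet-based derivative estimates discussed in the steps below; the function name and the use of numpy.gradient are illustrative assumptions.

```python
import numpy as np

def surface_normals(range_image):
    """Unit surface normals of a range image z = f(u, v), following Eq. (33).
    Finite differences stand in for the derivative-estimation schemes
    (orthogonal polynomials or wavelets) discussed in the text."""
    f = range_image.astype(float)
    fv, fu = np.gradient(f)                  # derivatives along rows (v) and columns (u)
    denom = np.sqrt(1.0 + fu ** 2 + fv ** 2)
    normals = np.stack([-fu / denom, -fv / denom, 1.0 / denom], axis=-1)
    return normals                           # H x W x 3 field (n1, n2, n3)
```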
Steps for edge detection in range images via normal changes are summarized as follows (44):

1. Calculate the normal of every point in a range image: The partial derivatives are essential to derive the normal value at each data point, as shown in Equation (33). However, the partial derivatives in Equation (33) cannot be computed directly because the range image data points are discrete. To calculate the normals on a set of discrete data points, the locally fit discrete orthogonal polynomials, originally proposed by Besl and Jain (40,41) and explained earlier, can be used. Other existing methods are the orthogonal wavelet-based approach (46) and the nonorthogonal wavelet-based approach (45).

2. Find the significant normal changes as edge points: Using the dyadic wavelet transform proposed by Mallat and Zhong (45), the significant normal changes (or local extrema) are selected as edge points. The dyadic wavelet transform of f(x, y) at scale 2^j along the x- and y-directions can be expressed as

W^1_{2^j} f(x, y) = f * \frac{1}{2^{2j}} \psi^1\!\left( \frac{x}{2^j}, \frac{y}{2^j} \right), \qquad W^2_{2^j} f(x, y) = f * \frac{1}{2^{2j}} \psi^2\!\left( \frac{x}{2^j}, \frac{y}{2^j} \right)    (34)
where \psi^1(x, y) = \partial \theta(x, y)/\partial x, \psi^2(x, y) = \partial \theta(x, y)/\partial y, and \theta(x, y) is a smoothing function that satisfies the following conditions: its integral over the full domain is equal to 1, and it converges to 0 at infinity. The dyadic wavelet transform of the vector of normal changes N(u, v) at scale 2^j is given by

W_j N(u, v) = W^1_{2^j} N(u, v)\, du + W^2_{2^j} N(u, v)\, dv    (35)

where W^i_{2^j} N(u, v) = (W^i_{2^j} n_1(u, v),\; W^i_{2^j} n_2(u, v),\; W^i_{2^j} n_3(u, v)), i = 1, 2.
Their associated weights can be the normal changes (or gradients) of N(u, v) along the du- and dv-directions. Two important values for edge detection can be calculated as follows. The magnitude of the dyadic wavelet transform of N(u, v) at scale 2^j is computed as

M_{2^j} N(u, v) = \sqrt{ |W^1_{2^j} N(u, v)|^2 + |W^2_{2^j} N(u, v)|^2 }    (36)

and the angle with respect to the du-direction is

A_{2^j} N(u, v) = \mathrm{argument}\left( |W^1_{2^j} N(u, v)| + i\, |W^2_{2^j} N(u, v)| \right)    (37)
Every point in the range image will thus be associated with two important values: the magnitude of the normal changes with respect to certain of its neighbors and the direction tendency of the point (44). Edge points can then be detected by thresholding the normal changes. Experimental results are provided for synthetic and real 240 × 240 range images in Ref. 44. Three different methods, quadratic surface fitting and the orthogonal and nonorthogonal wavelet-based approaches, were used to calculate the normal values for comparison purposes. After the normal values are determined, the dyadic transforms proposed by Mallat and Zhong (45) are applied to detect the normal changes at every point in the range image. In their experiments, the nonorthogonal wavelet-based approach used to estimate the normal values generated the best results in comparison with the other methods.
CONCLUSION

Edge detection has been studied extensively over the last 50 years, and many algorithms have been proposed. In this article, we introduced the fundamental theories and the popular technologies for edge detection in grayscale, color, and range images. More recent edge detection work can be found in Refs. 16, 30, 47, and 48. We did not touch on the topic of evaluating edge detectors; interested readers can find such research work in Refs. 49-53.
BIBLIOGRAPHY
1. Z. He and M. Y. Siyal, Edge detection with BP neural networks, Signal Processing Proc. ICSP'98, 2: 382-384, 1998.
2. E. R. Davies, Circularity - a new principle underlying the design of accurate edge orientation operators, Image and Vision Computing, 2(3): 134-142, 1984.
3. J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Machine Intell., 8(6): 679-698, 1986.
4. Z. Hussain, Digital Image Processing - Practical Application of Parallel Processing Techniques, Chichester: Ellis Horwood, 1991.
5. A. Rosenfeld and A. C. Kak, Digital Picture Processing, New York: Academic Press, 1982.
6. R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, Reading, MA: Addison-Wesley, 1992.
7. R. Jain, R. Kasturi, and B. G. Schunck, Machine Vision, New York: McGraw-Hill, 1995.
8. Y. Lu and R. C. Jain, Behavior of edges in scale space, IEEE Trans. Pattern Anal. Machine Intell., 11(4): 337-356, 1989.
9. M. Shah, A. Sood, and R. Jain, Pulse and staircase edge models, Computer Vis. Graphics Image Proc., 34: 321-343, 1986.
10. R. Gonzalez and R. Woods, Digital Image Processing, Reading, MA: Addison-Wesley, 1992.
11. P. Mlsna and J. Rodriguez, Gradient and Laplacian-Type Edge Detection, Handbook of Image and Video Processing, New York: Academic Press, 2000.
12. E. R. Davies, Machine Vision, New York: Academic Press, 1997.
13. D. Marr and E. C. Hildreth, Theory of edge detection, Proc. Roy. Soc. London, Series B, 207(1167): 187-217, 1980.
14. L. Kitchen and A. Rosenfeld, Non-maximum suppression of gradient magnitude makes them easier to threshold, Pattern Recog. Lett., 1(2): 93-94, 1982.
15. K. Paler and J. Kittler, Grey level edge thinning: a new method, Pattern Recog. Lett., 1(5): 409-416, 1983.
16. S. Wang, F. Ge, and T. Liu, Evaluating edge detection through boundary detection, EURASIP J. Appl. Signal Process., 2006: 1-15, 2006.
17. T. Pavlidis, Algorithms for Graphics and Image Processing, New York: Springer, 1982.
18. L. Kitchen and A. Rosenfeld, Non-maximum suppression of gradient magnitude makes them easier to threshold, Pattern Recogn. Lett., 1(2): 93-94, 1982.
19. J. Park, H. C. Chen, and S. T. Huang, A new gray-level edge thinning method, Proc. of the ISCA 13th International Conference: Computer Applications in Industry and Engineering, 2000, pp. 114-119.
20. J. R. Parker, Home page, University of Calgary. Available: http://pages.cpsc.ucalgary.co/~parker/501/edgedetect.pdf.
21. S. Wesolkowski and E. Jernigan, Color edge detection in RGB using jointly Euclidean distance and vector angle, Vision Interface '99, Trois-Rivières, Canada, 1999, pp. 9-16.
22. H. D. Cheng, X. H. Jiang, Y. Sun, and J. Wang, Color image segmentation: advances and prospects, Pattern Recogn., 34: 2259-2281, 2001.
23. M. Pietikäinen, S. Nieminen, E. Marszalec, and T. Ojala, Accurate color discrimination with classification based on feature distributions, Proc. 13th International Conference on Pattern Recognition, Vienna, Austria, vol. 3, 1996, pp. 833-838.
24. G. Robinson, Color edge detection, Optical Engineering, 16(5): 126-133, 1977.
25. T. Carron and P. Lambert, Color edge detector using jointly hue, saturation and intensity, ICIP '94, Austin, Texas, 1994, pp. 977-981.
26. P. Tsang and W. Tang, Edge detection on object color, IEEE International Conference on Image Processing, C, 1996, pp. 1049-1052.
27. C. Novak and S. Shafer, Color edge detection, Proc. of DARPA Image Understanding Workshop, vol. I, Los Angeles, CA, 1987, pp. 35-37.
28. A. Koschan, A comparative study on color edge detection, Proc. 2nd Asian Conference on Computer Vision ACCV'95, vol. III, Singapore, 1995, pp. 574-578.
29. J. Fan, W. Aref, M. Hacid, and A. Elmagarmid, An improved isotropic color edge detection technique, Pattern Recogn. Lett., 22: 1419-1429, 2001.
30. A. Koschan and M. Abidi, Detection and classification of edges in color images, IEEE Signal Proc. Mag., Jan.: 64-73, 2005.
31. T. Huntsberger and M. Descalzi, Color edge detection, Pattern Recogn. Lett., 3: 205-209, 1985.
32. A. Cumani, Edge detection in multispectral images, CVGIP: Graph. Models Image Proc., 53(1): 40-51, 1991.
33. L. Shafarenko, M. Petrou, and J. Kittler, Automatic watershed segmentation of randomly textured color images, IEEE Trans. Image Process., 6: 1530-1544, 1997.
34. Y. Yang, Color edge detection and segmentation using vector analysis, Master's Thesis, University of Toronto, Canada, 1995.
35. A. Cumani, Efficient contour extraction in color image, Proc. of 3rd Asian Conference on Computer Vision, vol. 1, 1998, pp. 582-589.
36. S. Zenzo, A note on the gradient of a multi-image, CVGIP, 33: 116-125, 1986.
37. T. Kanade, Image understanding research at CMU, Proc. Image Understanding Workshop, vol. II, 1987, pp. 32-40.
38. R. Hoffman and A. Jain, Segmentation and classification of range images, IEEE Trans. on PAMI, 9(5): 643-649, 1989.
39. N. Pal and S. Pal, A review on image segmentation techniques, Pattern Recogn., 26(9): 1277-1294, 1993.
40. P. Besl and R. Jain, Invariant surface characteristics for 3D object recognition in range images, Comp. Vision, Graphics Image Process., 33: 33-80, 1984.
41. P. Besl and R. Jain, Segmentation through variable-order surface fitting, IEEE Trans. Pattern Anal. Mach. Intell., 10(3): 167-192, 1988.
42. M. Baccar, L. Gee, R. Gonzalez, and M. Abidi, Segmentation of range images via data fusion and morphological watersheds, Pattern Recogn., 29(10): 1673-1687, 1996.
43. R. G. Gonzalez, M. Baccar, and M. A. Abidi, Segmentation of range images via data fusion and morphological watersheds, Proc. of the 8th Scandinavian Conf. on Image Analysis, vol. 1, 1993, pp. 21-39.
44. C. Sze, H. Liao, H. Hung, K. Fan, and J. Hsieh, Multiscale edge detection on range images via normal changes, IEEE Trans. Circuits Syst. II: Analog Digital Signal Process., 45(8): 1087-1092, 1998.
45. S. Mallat and S. Zhong, Characterization of signals from multiscale edges, IEEE Trans. Pattern Anal. Machine Intell., 14(7): 710-732, 1992.
46. J. W. Hsieh, M. T. Ko, H. Y. Mark Liao, and K. C. Fan, A new wavelet-based edge detector via constrained optimization, Image Vision Comp., 15: 511-527, 1997.
47. R. Zhang, G. Zhao, and L. Su, A new edge detection method in image processing, Proceedings of ISCIT 2005, 2005, pp. 430-433.
48. S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu, Statistical edge detection: learning and evaluating edge cues, IEEE Trans. Pattern Anal. Machine Intell., 25(1): 57-73, 2003.
49. M. C. Shin, D. B. Goldgof, K. W. Bowyer, and S. Nikiforou, Comparison of edge detection algorithms using a structure from motion task, IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, 31(4): 589-601, 2001.
50. T. Peli and D. Malah, A study of edge detection algorithms, Comput. Graph. Image Process., 20(1): 1-21, 1982.
51. P. Papachristou, M. Petrou, and J. Kittler, Edge postprocessing using probabilistic relaxation, IEEE Trans. Syst., Man, Cybern. B, 30: 383-402, 2000.
52. M. Basu, Gaussian-based edge detection methods: a survey, IEEE Trans. Syst., Man, Cybern., Part C: Appl. Rev., 32(3): 2002.
53. S. Wang, F. Ge, and T. Liu, Evaluating edge detection through boundary detection, EURASIP J. Appl. Signal Proc., 2006: 1-15, 2006.
JUNG ME PARK
YI LU MURPHEY
University of Michigan-Dearborn
Dearborn, Michigan
FACE RECOGNITION TECHNIQUES
INTRODUCTION TO FACE RECOGNITION
Biometrics¹ is becoming a buzzword due to the increasing demand for user-friendly systems that provide both secure and efficient services. Currently, one needs to remember numbers and/or carry IDs all the time, for example, a badge for entering an office building, a password for computer access, a password for ATM access, and a photo-ID and an airline ticket for air travel. Although very reliable methods of biometric personal identification exist, e.g., fingerprint analysis and iris scans, these methods rely on the cooperation of the participants, whereas a personal identification system based on analysis of frontal or profile images of the face is often effective without the participant's cooperation or knowledge. It is due to this important aspect, and the fact that humans carry out face recognition routinely, that researchers started an investigation into the problem of machine perception of human faces. In Fig. 1, we illustrate the face recognition task, of which the important first step of detecting facial regions from a given image is shown in Fig. 2.
After 35 years of investigation by researchers from various disciplines (e.g., engineering, neuroscience, and psychology), face recognition has become one of the most successful applications of image analysis and understanding. One obvious application for face recognition technology (FRT) is law enforcement. For example, police can set up cameras in public areas to identify suspects by matching their images against a watch-list facial database. Often, low-quality video and small facial images pose significant challenges for these applications. Other interesting commercial applications include intelligent robots that can recognize human subjects and digital cameras that offer automatic focus/exposure based on face detection. Finally, image searching techniques, including those based on facial image analysis, have been the latest trend in the booming Internet search industry. Such a wide range of applications poses a wide range of technical challenges and requires an equally wide range of techniques from image processing, analysis, and understanding.
The Problem of Face Recognition

Face perception is a routine task of the human perception system, although building a similarly robust computer system is still a challenging task. Human recognition processes use a broad spectrum of stimuli, obtained from many, if not all, of the senses (visual, auditory, olfactory, tactile, etc.).

In many situations, contextual knowledge is also applied (e.g., the context plays an important role in recognizing faces in relation to where they are supposed to be located). However, the human brain has its limitations in the total number of persons that it can accurately remember. A key advantage of a computer system is its capacity to handle large numbers of facial images.
A general statement of the problem of the machine recognition of faces can be formulated as follows: Given still or video images of a scene, identify or verify one or more persons in the scene using a stored database of faces. Available collateral information, such as race, age, gender, facial expression, or speech, may be used to narrow the search (enhancing recognition). The solution to the problem involves face detection (recognition/segmentation of face regions from cluttered scenes), feature extraction from the face regions (eyes, nose, mouth, etc.), and recognition or identification (Fig. 3).
Brief Development History

The earliest work on face recognition in psychology can be traced back at least to the 1950s (4) and to the 1960s in the engineering literature (5). Some of the earliest studies include work on the facial expression of emotions by Darwin (6) [see also Ekman (7)] and on facial profile-based biometrics by Galton (8). But research on automatic machine recognition of faces really started in the 1970s after the seminal work of Kanade (9) and Kelly (10). Over the past 30 years, extensive research has been conducted by psychophysicists, neuroscientists, and engineers on various aspects of face recognition by humans and machines.

Psychophysicists and neuroscientists have been concerned with issues such as whether face perception is a dedicated process [this issue is still being debated in the psychology community (11,12)] and whether it is done holistically or by local feature analysis. With the help of powerful engineering tools such as functional MRI, new theories continue to emerge (13). Many of the hypotheses and theories put forward by researchers in these disciplines have been based on rather small sets of images. Nevertheless, many of the findings have important consequences for engineers who design algorithms and systems for the machine recognition of human faces.
Until recently, most of the existing work formulated the recognition problem as recognizing 3-D objects from 2-D images. As a result, earlier approaches treated it as a 2-D pattern recognition problem. During the early and middle 1970s, typical pattern classification techniques were used that measured attributes of features (e.g., the distances between important points) in faces or face profiles (5,9,10). During the 1980s, work on face recognition remained largely dormant. Since the early 1990s, research interest in FRT has grown significantly. One can attribute this growth to several reasons: the increase in interest in commercial opportunities, the availability of real-time hardware, and the emergence of surveillance-related applications.
¹ Biometrics: the study of automated methods for uniquely recognizing humans based on one or more intrinsic physical or behavioral traits.
Over the past 18 years, research has focused on how to make face recognition systems fully automatic by tackling problems such as the localization of a face in a given image or video clip and the extraction of features such as eyes, mouth, and so on. Meanwhile, significant advances have been made in the design of classifiers for successful face recognition. Among appearance-based holistic approaches, eigenfaces (14,15) and Fisherfaces (16-18) have proved to be effective in experiments with large databases. Feature-based graph matching approaches (19) have also been successful. Compared with holistic approaches, feature-based methods are less sensitive to variations in illumination and viewpoint and to inaccuracy in face localization. However, the feature extraction techniques needed for this type of approach are still not reliable or accurate (20).
During the past 8-15 years, much research has been concentrated on video-based face recognition. The still-image problem has several inherent advantages and disadvantages. For applications such as airport surveillance, the automatic location and segmentation of a face could pose serious challenges to any segmentation algorithm if only a static picture of a large, crowded area is available. On the other hand, if a video sequence is available, segmentation of a moving person can be accomplished more easily using motion as a cue. In addition, a sequence of images might help to boost the recognition performance if we can use all these images effectively. But the small size and low image quality of faces captured from video can increase the difficulty in recognition significantly.
More recently, significant advances have been made in 3-D-based face recognition. Although it is known that face recognition using 3-D images has more advantages than face recognition using a single 2-D image or a sequence of 2-D images, no serious effort was made on 3-D face recognition until recently. This delay was caused mainly by the feasibility, complexity, and computational cost of acquiring 3-D data in real time. Now, the availability of cheap, real-time 3-D sensors (21) makes it much easier to apply 3-D face recognition.

Recognizing a 3-D object from its 2-D images poses many challenges. The illumination and pose problems are two prominent issues for appearance-based or image-based approaches (22). Many approaches have been proposed to handle these issues, and the key here is to model the 3-D geometry and reflectance properties of a face. For example, 3-D textured models can be built from given 2-D images, and the images can then be used to synthesize images under various poses and illumination conditions for recognition or animation. By restricting the image-based 3-D object modeling to the domain of human faces, fairly good reconstruction results can be obtained using state-of-the-art algorithms. Other potential applications in which modeling is crucial include computerized aging, where an appropriate model needs to be built first and then a set of model parameters is used to create images that simulate the aging process.
Figure 1. An illustration of the face recognition task [55]: given an input facial image (left column: many variants of the facial image are
used to illustrate image appearance change due to natural variations in lighting and pose, and electronic modifications that simulate more
complex variations), matching it against a database of facial images (center column), and finally outputting the matched database image
and/or the ID of the input image (right column).
Methods for Machine Recognition of Faces

As illustrated in Fig. 4, the problem of automatic face recognition involves three key steps/subtasks:

1. Detection and coarse normalization of faces
2. Feature extraction and accurate normalization of faces
3. Identification and/or verification

Sometimes, different subtasks are not totally separated. For example, facial features (eyes, nose, mouth) are often used for both face recognition and face detection. Face detection and feature extraction can be achieved simultaneously, as indicated in Fig. 4. Depending on the nature of the application, e.g., the sizes of the training and testing databases, the clutter and variability of the background, noise, occlusion, and speed requirements, some subtasks can be very challenging. A fully automatic face recognition system must perform all three subtasks, and research on each subtask is critical. This is not only because the techniques used for the individual subtasks need to be improved, but also because they are critical in many different applications (Fig. 5). For example, face detection is needed to initialize face tracking, and extraction of facial features is needed for recognizing human emotion, which in turn is essential in human-computer interaction (HCI) systems. Without considering feature locations, face detection is declared successful if the presence and rough location of a face have been correctly identified.
Face Detection and Feature Extraction
Segmentation/Detection. Up to the mid-1990s, most
work on segmentation was focused on single-face segmen-
tation from a simple or complex background. These
approaches included using a whole-face template, a
deformable feature-based template, skin color, and a
neural network.
Figure 2. Detection/segmentation/recognition of facial regions in an image.
Significant advances have been made in recent years in achieving automatic face detection under various conditions. Compared with feature-based methods and template-matching methods, appearance- or image-based methods (53) that train machine systems on large numbers of samples have achieved the best results (refer to Fig. 4). This may not be surprising, since face objects are complicated, different from non-face objects, and yet very similar to each other. Through extensive training, computers can become good at detecting faces.
Feature Extraction. The importance of facial features for face recognition cannot be overstated. Many face recognition systems need facial features in addition to the holistic face, as suggested by studies in psychology. It is well known that even holistic matching methods, e.g., eigenfaces (46) and Fisherfaces (43), need accurate locations of key facial features such as the eyes, nose, and mouth to normalize the detected face (54).

Three types of feature extraction methods can be distinguished:

1. Generic methods based on edges, lines, and curves
2. Feature-template-based methods that are used to detect facial features such as eyes
3. Structural matching methods that take into consideration geometrical constraints on the features
Early approaches focused on individual features; for example, a template-based approach is described in Ref. 36 to detect and recognize the human eye in a frontal face. These methods have difficulty when the appearances of the features change significantly, e.g., closed eyes, eyes with glasses, or an open mouth. To detect the features more reliably, recent approaches use structural matching methods, for example, the active shape model (ASM), which represents any face shape (a set of landmark points) via a mean face shape and principal components obtained through training (46). Compared with earlier methods, these recent statistical methods are much more robust in terms of handling variations in image intensity and in feature shape. The advantages of using the so-called analysis-through-synthesis approach come from the fact that the solution is constrained by a flexible statistical model. To account for texture variation, the ASM model has been expanded into statistical appearance models, including a flexible appearance model (28) and an active appearance model (AAM) (29). In Ref. 29, the proposed AAM combined a model of shape variation (i.e., ASM) with a model of the appearance variation of shape-normalized (shape-free) textures. A training set of 400 images of faces, each labeled manually with 68 landmark points and approximately 10,000 intensity values sampled from facial regions, was used. To match a given image with the model, an optimal vector of parameters (displacement parameters between the face region and the model, parameters for linear intensity adjustment, and the appearance parameters) is searched for by minimizing the difference between the synthetic image and the given image. After matching, a best-fitting model is constructed that gives the locations of all the facial features so that the original image can be reconstructed. Figure 4 illustrates the optimization/search procedure used to fit the model to the image.

Figure 3. Configuration of a generic face recognition/processing system. We use a dotted line to indicate cases when both face detection and feature extraction work together to achieve accurate face localization and reliable feature extraction [e.g., (3)].

Figure 4. Multiresolution search from a displaced position using a face model (30).
Face Recognition

As suggested, three types of FRT systems have been investigated: recognition based on still images, recognition based on a sequence of images, and, more recently, recognition based on 3-D images. All types of FRT technologies have their advantages and disadvantages. For example, video-based face recognition can use temporal information to enhance recognition performance. Meanwhile, the quality of video is low, and the face regions are small under typical acquisition conditions (e.g., in surveillance applications). Rather than presenting all three types of FRT systems, we focus on still-image-based FRT systems, which form the foundation for machine recognition of faces. For details on all three types of FRT systems, please refer to a recent review article (the first chapter in Ref. 3).

Face recognition is such an interesting and challenging problem that it has attracted researchers from different fields: psychology, pattern recognition, neural networks, computer vision, and computer graphics. Often, a single system involves techniques motivated by different principles. To help readers who are new to this field, we present a class of linear projection/subspace algorithms based on image appearance. The implementation of these algorithms is straightforward, yet they are very effective under constrained situations. These algorithms helped to revive research activities in the 1990s with the introduction of eigenfaces (14,15) and are still being researched actively for continuous improvement.
Eigenface and the Projection-Based Appearance Methods. The first successful demonstration of the machine recognition of faces was made by Turk and Pentland (15) using eigenpictures (also known as eigenfaces) for face detection and identification. Given the eigenfaces, every face in the database is represented as a vector of weights obtained by projecting the image onto a subset of all eigenface components (i.e., a subspace) by a simple inner product operation. When a new test image whose identification is required is given, the new image is also represented by its vector of weights. The test image is identified by locating the image in the database whose weights are the closest (in Euclidean distance) to the weights of the test image. By using the observation that the projections of a facial image and a nonface image are different, a method to detect the presence of a face in a given image is obtained. Turk and Pentland illustrate their method using a large database of 2500 facial images of 16 subjects, digitized at all combinations of three head orientations, three head sizes, and three lighting conditions.
In brief, eigenpictures/eigenfaces are effective low-dimensional representations of facial images based on Karhunen-Loeve (KL) or principal component analysis (PCA) projection (14). Mathematically speaking, sample facial images (2-D matrix format) can be converted into vector representations (1-D format). After collecting enough sample vectors, one can perform statistical analysis (i.e., PCA) to construct new orthogonal bases and then represent the samples in the coordinate system defined by these new bases. More specifically, a mean-subtracted sample vector x can be expressed as a linear combination of the orthogonal bases Φ_i (typically m << n):

x = \sum_{i=1}^{n} a_i Φ_i \approx \sum_{i=1}^{m} a_i Φ_i    (1)

obtained via solving the eigenproblem

C Φ = Φ Λ    (2)

where C is the covariance matrix of the input x and Λ is a diagonal matrix consisting of the eigenvalues λ_i. The main point of eigenfaces is that they are an efficient representation of the original images (i.e., from n coefficients to m coefficients). Figure 5 shows a real facial image and several reconstructed images based on varying numbers of leading principal components Φ_i that correspond to the largest eigenvalues λ_i. As can be seen from the plots, 300 leading principal components are sufficient to represent the original image of size 48 × 42 (2016).²
² Please note that the reconstructed images are obtained by converting from the 1-D vector format back to the 2-D matrix format and adding back the average/mean facial image.
Figure 5. Original image [size 48 × 42 (i.e., 2016)] and the reconstructed images using 300, 200, 100, 50, 20, and 10 leading components, respectively (32).
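A minimal sketch of the eigenface pipeline described above, assuming vectorized training faces in the rows of a NumPy array; the eigenfaces are obtained from an SVD of the mean-subtracted data (equivalent to solving the eigenproblem of Equation (2)), and identification is by nearest neighbor in the projection space. The function names and the choice of m = 50 components are illustrative.

```python
import numpy as np

def train_eigenfaces(faces, m=50):
    """faces: N x d array, each row a vectorized training face.
    Returns the mean face and the m leading eigenfaces (rows of the result)."""
    mean = faces.mean(axis=0)
    X = faces - mean                          # mean-subtracted samples
    # SVD of the data matrix yields the eigenvectors of the covariance matrix C.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return mean, Vt[:m]                       # m leading eigenfaces (m x d)

def project(face, mean, eigenfaces):
    """Weight vector a_i of Eq. (1): inner products with the eigenfaces."""
    return eigenfaces @ (face - mean)

def identify(test_face, gallery_weights, mean, eigenfaces):
    """Nearest neighbor in eigenface space (Euclidean distance)."""
    w = project(test_face, mean, eigenfaces)
    dists = np.linalg.norm(gallery_weights - w, axis=1)
    return int(np.argmin(dists))              # index of the best-matching gallery face
```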
Another advantage of using such a compact representation is the reduced sensitivity to noise. Some of this noise could be caused by small occlusions, as long as the topological structure does not change. For example, good performance against blurring and partial occlusion has been demonstrated in many eigenpicture-based systems and was also reported in Ref. 18 (Fig. 6), which should not come as a surprise because the images reconstructed using PCA are much better than the original distorted images in terms of global appearance (Fig. 7).
In addition to eigenfaces, other linear projection algorithms exist, including ICA (independent component analysis) and LDA/FLD (linear discriminant analysis / Fisher's linear discriminant analysis), to name a few. In all these projection algorithms, classification is performed (1) by projecting the input x into a subspace via a projection/basis matrix P_proj (P_proj is Φ for eigenfaces, W for Fisherfaces with pure LDA projection, and WΦ for Fisherfaces with sequential PCA and LDA projections; these three bases are shown for visual comparison in Fig. 8),

z = P_proj x    (3)

and (2) by comparing the projection coefficient vector z of the input with all the prestored projection vectors of labeled classes to determine the input class label. The vector comparison varies in different implementations and can influence the system's performance dramatically (33).
Figure 7. Reconstructed images using 300 PCA projection coefficients for the electronically modified images (Fig. 6) (26).

Figure 8. Different projection bases constructed from a set of 444 individuals, where the set is augmented by adding noise and mirroring. (Improved reconstruction for facial images outside the training set using an extended training set that adds mirror-imaged faces was suggested in Ref. 14.) The first row shows the first five pure LDA basis images W, the second row shows the first five subspace LDA basis images WΦ, and the average face and the first four eigenfaces Φ are shown on the third row (18).

Figure 6. Electronically modified images that have been identified correctly using the eigenface representation (18).
For example, PCA algorithms can use either the angle or the Euclidean distance (weighted or unweighted) between two projection vectors. For LDA algorithms, the distance can be unweighted or weighted.
Face recognition systems using LDA/FLD (called Fisherfaces in Ref. 16) have also been very successful (16,17,34). LDA training is carried out via scatter matrix analysis (35). For an M-class problem, the within-class and between-class scatter matrices S_w, S_b are computed as follows:

S_w = \sum_{i=1}^{M} \Pr(ω_i)\, C_i    (4)

S_b = \sum_{i=1}^{M} \Pr(ω_i)\, (m_i - m_0)(m_i - m_0)^T

where Pr(ω_i) is the prior class probability, usually replaced by 1/M in practice with the assumption of equal priors. Here S_w is the within-class scatter matrix, showing the average scatter C_i of the sample vectors x of the different classes ω_i around their respective means m_i: C_i = E[(x(ω) - m_i)(x(ω) - m_i)^T | ω = ω_i]. Similarly, S_b is the between-class scatter matrix, which represents the scatter of the conditional mean vectors m_i around the overall mean vector m_0.
A commonly used measure to quantify the discriminatory
power is the ratio of the determinant of the between-class
scatter matrixof the projectedsamples tothe determinant of
the within-class scatter matrix: JT jT
T
S
b
Tj=jT
T
S
w
Tj.
The optimal projection matrix W that maximizes JT
can be obtained by solving a generalized eigenvalue
problem:
S
b
W S
w
WL
L
(5)
There are several ways to solve the generalized eigenproblem of Equation (5). One is to compute directly the inverse of S_w and solve a nonsymmetric (in general) eigenproblem for the matrix S_w^{-1} S_b. But this approach is numerically unstable because it involves the direct inversion of a potentially very large matrix that probably is close to being singular. A stable method to solve this equation is to solve the eigenproblem for S_w first (32,35) [i.e., to remove the within-class variations (whitening)]. Because S_w is a real symmetric matrix, an orthonormal W_w and a diagonal Λ_w exist such that S_w W_w = W_w Λ_w. After whitening, the input x becomes y:

y = Λ_w^{-1/2} W_w^T x    (6)
The between-class scatter matrix for the new variable y can be constructed similarly to Equation (4):

S_b^y = Σ_{i=1}^{M} Pr(ω_i) (m_i^y − m_0^y)(m_i^y − m_0^y)^T    (7)
Now the purpose of FLD/LDA is to maximize the class separation of the now-whitened samples y, which leads to another eigenproblem: S_b^y W_b = W_b Λ_b. Finally, we apply the change of variables to y:

z = W_b^T y    (8)

Combining Equations (6) and (8), we have the following relationship: z = W_b^T Λ_w^{-1/2} W_w^T x, and W simply is

W = W_w Λ_w^{-1/2} W_b    (9)
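The two-step solution of Equations (6) through (9) can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the cited systems' code: S_w and S_b are the scatter matrices of Equation (4) computed elsewhere, no regularization of S_w is applied, and small eigenvalues of S_w are assumed to have been discarded beforehand.

    import numpy as np

    def lda_bases(Sw, Sb):
        # Solve S_b W = S_w W Lambda via whitening of S_w (Equations 6-9).
        lw, Ww = np.linalg.eigh(Sw)            # S_w W_w = W_w Lambda_w
        whiten = Ww @ np.diag(lw ** -0.5)      # W_w Lambda_w^{-1/2}
        Sb_y = whiten.T @ Sb @ whiten          # between-class scatter of the whitened y
        lb, Wb = np.linalg.eigh(Sb_y)          # S_b^y W_b = W_b Lambda_b
        order = np.argsort(lb)[::-1]           # keep the most discriminative directions first
        W = whiten @ Wb[:, order]              # W = W_w Lambda_w^{-1/2} W_b (Equation 9)
        return W                               # columns of W project an input x via z = W^T x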
To improve the performance of LDA-based systems, a regularized subspace LDA system that unifies PCA and LDA was proposed in Refs. 26 and 32. Good generalization capability of this system was demonstrated by experiments on new classes/individuals without retraining the PCA bases Φ, and sometimes even the LDA bases W. Although the reason for not retraining PCA is obvious, it is interesting to test the adaptive capability of the system by fixing the LDA bases when images from new classes are added.³ The fixed PCA subspace of dimensionality 300 was trained from a large number of samples. An augmented set of 4056 mostly frontal-view images, constructed from the original 1078 images of 444 individuals by adding noisy and mirrored images, was used in Ref. 32. At least one of the following three characteristics separates this system from other LDA-based systems (33,34): (1) the unique selection of the universal face subspace dimension, (2) the use of a weighted distance measure, and (3) a regularized procedure that modifies the within-class scatter matrix S_w. The authors selected the dimensionality of the universal face subspace based on the characteristics of the eigenvectors (face-like or not) instead of the eigenvalues (18), as is done commonly. Later, it was concluded in Ref. 36 that the global face subspace dimensionality is on the order of 400 for large databases of 5000 images. The modification of S_w into S_w + δI has two motivations: first, to resolve the issue of small sample size (37), and second, to prevent the significantly discriminative information⁴ contained in the null space of S_w (38) from being lost.
To handle the nonlinearity caused by the pose, illumination, and expression variations present in facial images, the above linear subspace methods have been extended to kernel faces (39-41) and tensorfaces (42).
Categorization of Still-Image-Based FRT. Many methods of face recognition have been proposed during the past 30 years by researchers with different backgrounds. Because of this fact, the literature on face recognition is vast and diverse. To provide a clear and high-level categorization, we follow a guideline suggested by the psychological study of how humans use holistic and local features. Specifically, we have the following categorization (see Table 1 for more information):
³ This makes sense because the final classification is carried out in the projection space z by comparison with the pre-stored projection vectors using a nearest-neighbor rule.
⁴ The null space of S_w contains important discriminant information because the ratio of the determinants of the scatter matrices would be maximized in the null space.
1. Holistic matching methods. These methods use the whole face region as the raw input to a recognition system. One of the most widely used representations of the face region is eigenpictures (14), which are based on principal component analysis.
2. Feature-based (structural) matching methods. Typically, in these methods, local features such as the eyes, nose, and mouth are first extracted, and their locations and local statistics (geometric and/or appearance) are fed into a structural classifier.
3. Hybrid methods. Just as the human perception system uses both local features and the whole face region to recognize a face, a machine recognition system should use both. One can argue that these methods could potentially offer the best of both types of methods.
Within each of these categories, additional classification is possible. Using subspace analysis, many face recognition techniques have been developed: eigenfaces (15), which use a nearest-neighbor classifier; feature-line-based methods, which replace the point-to-point distance with the distance between a point and the feature line linking two stored sample points (46); Fisherfaces (16,18,34), which use Fisher's linear discriminant analysis (57); Bayesian methods, which use a probabilistic distance metric (43); and SVM methods, which use a support vector machine as the classifier (44). Using higher-order statistics, independent component analysis is argued to have more representative power than PCA, and, in theory, it can provide better recognition performance than PCA (47). Being able to offer potentially greater generalization through learning, neural network/learning methods have also been applied to face recognition. One example is the probabilistic decision-based neural network method (48), and another is the evolution pursuit method (45).
Most earlier methods belong to the category of structural matching methods, which use the width of the head, the distances between the eyes and from the eyes to the mouth, and so on (10), or the distances and angles between eye corners, mouth extrema, nostrils, and chin top (9). Recently, a mixture-distance-based approach using manually extracted distances was reported. Without finding the exact locations of facial features, hidden Markov model-based methods use strips of pixels that cover the forehead, eyes, nose, mouth, and chin (51,52). Reference 52 reported better performance than Ref. 51 by using the KL projection coefficients instead of the strips of raw pixels. One of the most successful systems in this category is the graph matching system (39), which is based on the Dynamic Link Architecture (58). Using an unsupervised learning method based on a self-organizing map, a system based on a convolutional neural network has been developed (53).
In the hybrid method category, we have the modular eigenface method (54), a hybrid representation based on PCA and local feature analysis (55), a flexible appearance model-based method (28), and a recent development (56) along this direction. In Ref. 54, the use of hybrid features obtained by combining eigenfaces and other eigenmodules is explored: eigeneyes, eigenmouth, and eigennose. Although experiments show only slight improvements over holistic eigenfaces or eigenmodules based on structural matching, we believe that these types of methods are important and deserve further investigation. Perhaps many relevant problems need to be solved before fruitful results can be expected (e.g., how to arbitrate optimally between the use of holistic and local features).
Many types of systems have been applied successfully to
the task of face recognition, but they all have some advan-
tages and disadvantages. Appropriate schemes should be
chosen based on the specific requirements of a given task.
COMMERCIAL APPLICATIONS AND ADVANCED
RESEARCH TOPICS
Face recognition is a fascinating research topic. On one hand, many algorithms and systems have reached a certain level of maturity after 35 years of research. On the other hand, the success of these systems is limited by the conditions imposed by many real applications. For example, automatic focus/exposure based on face detection has been built into digital cameras. However, recognition of face images acquired in an outdoor environment with changes in illumination and/or pose remains a largely unsolved problem. In other words, current systems are still far from the capability of the human perception system.
Commercial Applications of FRT
In recent years, we have seen significant advances in automatic face detection under various conditions. Consequently, many commercial applications have emerged. For example, face detection has been employed for automatic exposure/focus in digital cameras, such as the PowerShot SD800 IS from Canon and the FinePix F31fd from Fuji. These smart cameras can zero in automatically on faces, and photos will be properly exposed. In general, face detection technology is integrated into the camera's processor for increased speed. For example, the FinePix F31fd can identify faces and optimize image settings in as little as 0.05 seconds.
One interesting application of face detection technology is the passive driver monitor system installed on the 2008 Lexus LS 600h L. The system uses a camera on the steering column to keep an eye on the driver. If he or she is looking away from the road ahead and the pre-collision system detects something ahead of the car (through stereo vision and radar), the system will sound a buzzer, flash a light, and even apply a sharp tap on the brakes.
Finally, a popular application of face detection technology is image-based search of the Internet or of photo albums. Many companies (Adobe, Google, Microsoft, and many start-ups) have been working on various prototypes. Often, the bottleneck for such commercial applications is the difficulty of detecting and recognizing faces. Despite the advances, today's recognition systems have limitations. Many factors exist that could defeat these systems: facial expression, aging, glasses, and shaving.
Nevertheless, face detection/recognition has been criti-
cal for the success of intelligent robots that may provide
important services in the future. Prototypes of intelligent
robots have been built, including Honda's ASIMO and Sony's QRIO.
Advanced Research Topics
To build a machine perception system someday that is close to, or even better than, the human perception system, researchers need to look at both aspects of this challenging problem: (1) the fundamental aspect of how the human perception system works and (2) the systematic aspect of how to improve system performance based on the best theories and technologies available. To illustrate the fascinating characteristics of the human perception system and how it differs from currently available machine perception systems, we plot the negative and upside-down photos of a person in Fig. 9. It is well known that negative or upside-down photos make human perception of faces more difficult (59). Also, we know that no difference exists in terms of information (bits used to encode images) between a digitized normal photo and a digitized negative or upside-down photo (except the sign and orientation information).
From the system perspective, many research challenges remain. For example, recent system evaluations (60) suggested at least two major challenges: the illumination variation problem and the pose variation problem. Although many existing systems build in some sort of performance invariance by applying pre-processing, such as histogram equalization or pose learning, significant illumination or pose change can cause serious performance degradation. In addition, face images could be partially occluded, or the system may need to recognize a person from an image in the database that was acquired some time ago. In an extreme scenario, for example, the search for missing children, the time interval could be up to 10 years. Such a scenario poses a significant challenge to building a robust system that can tolerate the variations in appearance across many years.
Real problems exist when face images are acquired under uncontrolled and uncooperative environments, for example, in surveillance applications. Although illumination and pose variations are well-defined and well-researched problems, other problems can be studied systematically, for example, through mathematical modeling. Mathematical modeling allows us to describe physical entities mathematically and hence to transfer the physical
Table 1. Categorization of Still-Image-Based Face Recognition Techniques
Approach / Representative Work
Holistic methods
  Principal Component Analysis
    Eigenfaces: Direct application of PCA (15)
    Probabilistic Eigenfaces: Two-class problem with prob. measure (43)
    Fisherfaces/Subspace LDA: FLD on eigenspace (16,18)
    SVM: Two-class problem based on SVM (44)
    Evolution Pursuit: Enhanced GA learning (45)
    Feature Lines: Point-to-line distance based (46)
    ICA: ICA-based feature analysis (47)
  Other Representations
    LDA/FLD: FLD/LDA on raw images (17)
    PDBNN: Probabilistic decision-based NN (18)
    Kernel faces: Kernel methods (39-41)
    Tensorfaces: Multilinear analysis (42,49)
Feature-based methods
    Pure geometry methods: Earlier methods (9,10); recent methods (20,50)
    Dynamic Link Architecture: Graph matching methods (19)
    Hidden Markov Model: HMM methods (51,52)
    Convolutional Neural Network: SOM-learning-based CNN methods (53)
Hybrid methods
    Modular Eigenfaces: Eigenfaces and eigenmodules (54)
    Hybrid LFA: Local feature analysis method (55)
    Shape-normalized: Flexible appearance models (28)
    Component-based: Face region and components (56)
Figure 9. Typical limitation of the human perception system with negative and upside-down photos: it is difficult, or takes much longer, for us to recognize people from such photos. Interestingly, we can eventually manage to overcome this limitation when recognizing famous people (President Bill Clinton in this case).
phenomena into a series of numbers (3). The decomposition of a face image into a linear combination of eigenfaces is a classic example of mathematical modeling. In addition, mathematical modeling can be applied to handle the issues of occlusion, low resolution, and aging.
As an application of image analysis and understanding, machine recognition of faces benefits tremendously from advances in many relevant disciplines. To conclude our article, we list these disciplines for further reading and briefly mention their direct impact on face recognition.
α_min ≤ α_med ≤ α_max.
Triangle handedness φ. Let Z_i = x_i + j y_i be the complex number corresponding to the location (x_i, y_i) of point P_i, i = 1, 2, 3. Define Z_21 = Z_2 − Z_1, Z_32 = Z_3 − Z_2, and Z_13 = Z_1 − Z_3. Let the triangle handedness be φ = sign(Z_21 × Z_32), where sign is the signum function and × is the cross product of two complex numbers. Points P_1, P_2, and P_3 are noncollinear points, so φ = 1 or −1.
Triangle direction η. Search the minutiae in the image from top to bottom and left to right. If the minutia is the start point, then ψ = 1; otherwise ψ = 0. Let η = 4ψ_1 + 2ψ_2 + ψ_3, where ψ_i is the ψ value of point P_i, i = 1, 2, 3, and 0 ≤ η ≤ 7.
Maximum side λ. Let λ = max{L_i}, where L_1 = |Z_21|, L_2 = |Z_32|, and L_3 = |Z_13|.
Minutiae density χ. In a local area (32 × 32 pixels) centered at the minutia P_i, if χ_i minutiae exist, then the minutiae density for P_i is χ_i. The minutiae density χ is a vector consisting of all χ_i.
Ridge counts ξ. Let ξ_1, ξ_2, and ξ_3 be the ridge counts of sides P_1P_2, P_2P_3, and P_3P_1, respectively. Then ξ is a vector consisting of all ξ_i.
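The purely geometric features above are straightforward to compute once the three minutiae locations are available. The following sketch is illustrative only; the representation of a minutia as a complex coordinate plus a start-point flag is an assumption of this example, and the angle, density, and ridge-count features are omitted.

    def triplet_features(p, start_flags):
        # p: list of three complex numbers z_i = x_i + 1j*y_i (minutiae locations).
        # start_flags: psi_i in {0, 1} for each point, in the same order.
        z21, z32, z13 = p[1] - p[0], p[2] - p[1], p[0] - p[2]
        # cross product of two complex numbers = imaginary part of conj(z21) * z32
        cross = (z21.conjugate() * z32).imag
        phi = 1 if cross > 0 else -1                                   # triangle handedness
        eta = 4 * start_flags[0] + 2 * start_flags[1] + start_flags[2] # direction code, 0..7
        lam = max(abs(z21), abs(z32), abs(z13))                        # maximum side length
        return phi, eta, lam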
During the offline hashing process, the above features are computed for each template fingerprint (33) and a hash table H(α_min, α_med, φ, η, λ, χ, ξ) is generated. During the online hashing process, the same features are computed for each query fingerprint and compared with the features represented by H. If the difference in features is small enough, then the query fingerprint is probably the same as the stored fingerprints that have similar features.
Figure 10 is an example of two corresponding triangles in a pair of fingerprints that are two impressions of the same finger. In the first impression, three noncollinear minutiae A, B, and C are picked randomly to form a triangle ΔABC. The features of this triangle are α_min = 30°, α_med = 65°, φ = 1, η = 6, λ = |AC|, χ = (0, 0, 0), ξ = (6, 5, 12). Similarly, three noncollinear minutiae a, b, and c in the second impression form Δabc. Its features are α_min = 31°, α_med = 63°, φ = 1, η = 6, λ = |ac|, χ = (0, 2, 0), ξ = (6, 5, 12). If the error between these two triangles is within the error tolerance (34), then the two triangles are considered corresponding triangles. The output of this process, carried out for all the triplets, is a list of hypotheses, which is sorted in descending order of the number of potential corresponding triangles. The top T hypotheses are the input to the verification process.
PERFORMANCE EVALUATION
Two classes in the ngerprint recognition systems exist:
match and nonmatch. Let s and n denote match and
non-match. Assume that x is the similarity score. Then,
f(x|s) is the probability density function given s is true, and
f(x|n) is the probability density function given n is true.
Figure 11 is an example of these two distributions. For a
criterion k, one can dene
Figure 10. An example of two corresponding triangles in a pair of fingerprints.
Figure 11. Densities of the match scores f(x|s) and the nonmatch scores f(x|n), plotted as probability versus similarity score; the decision criterion k divides the scores into the Hit, FA, FR, and CR regions.
Hit: the probability that x is above k given s, where Hit = ∫_k^∞ f(x|s) dx.
False alarm: the probability that x is above k given n, where FA = ∫_k^∞ f(x|n) dx.
False rejection: the probability that x is below k given s, where FR = ∫_{−∞}^{k} f(x|s) dx.
Correct rejection: the probability that x is below k given n, where CR = ∫_{−∞}^{k} f(x|n) dx.
A receiver operating characteristic (ROC) curve is a graphical plot whose x-axis is the false alarm (FA) rate and whose y-axis is the hit (Hit) rate. An ROC curve is used to evaluate the performance of a recognition system because it represents the changes in the FA rate and the Hit rate under different discrimination criteria (thresholds). A detection error tradeoff (DET) curve plots the FA rate against the false rejection (FR) rate as the discrimination criterion changes. A confidence interval is an interval within which the estimated quantity is likely to lie. The ISO standard performance testing report gives the details of the recommendations and requirements for performance evaluations (9).
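The error rates above can be estimated empirically by sweeping the criterion k over observed similarity scores, which is how ROC points are usually produced in practice. The sketch below is an illustrative computation, not tied to any particular evaluation protocol; the score arrays and threshold list are assumed inputs.

    import numpy as np

    def roc_points(match_scores, nonmatch_scores, thresholds):
        # Empirical Hit and FA rates for each criterion k in thresholds.
        match = np.asarray(match_scores)
        nonmatch = np.asarray(nonmatch_scores)
        points = []
        for k in thresholds:
            hit = np.mean(match >= k)        # estimate of P(x >= k | s)
            fa = np.mean(nonmatch >= k)      # estimate of P(x >= k | n)
            points.append((fa, hit))         # FR = 1 - hit and CR = 1 - fa follow directly
        return points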
Public fingerprint databases are used to evaluate the performance of different fingerprint recognition algorithms and fingerprint acquisition sensors. The National Institute of Standards and Technology (NIST) provides several special fingerprint databases with different acquisition methods and scenarios. Most images in these databases are rolled fingerprints that are scanned from paper cards. The NIST special database 24 is a digital video of live-scan fingerprint data. Details of the NIST special databases can be found in Ref. 35. Since 2000, the fingerprint verification competition (FVC) has provided public databases for a competition that is held every 2 years. For each competition, four disjoint databases are created that are collected with different sensors and technologies. The performance of the different recognition algorithms is addressed in the reports of each competition (36-39). The Fingerprint Vendor Technology Evaluation (FpVTE) 2003 was conducted by NIST to evaluate the performance of fingerprint recognition systems. A total of 18 companies competed in the FpVTE, and 34 systems were examined using U.S. government data. The performance of the different recognition algorithms is discussed in Ref. 40.
PERFORMANCE PREDICTION
Several research efforts exist for analyzing the perfor-
mance of ngerprint recognition. Galton (41) assumes
that 24 independent square regions could cover a nger-
print and that he could reconstruct correctly any of the
regions with a probability of 1/2 by looking at the sur-
rounding ridges. Accordingly, the Galton formulation of
the distinctiveness of a ngerprint is given by (1/16)
(1/256) (1/2)
24
, where 1/16 is the probability of the
occurrence of a ngerprint type and 1/256 is the prob-
ability of the occurrence of the correct number of ridges
entering and exiting each of the 24 regions. Pankanti et al.
(8) present a ngerprint individuality model that is based
on the analysis of feature space and derive an expression
to estimate the probability of false match based on the
minutiae between two ngerprints. It measures the
amount of information needed to establish correspon-
dence between two ngerprints. Tan and Bhanu (42)
present a two-point model and a three-point model to
estimate the error rate for the minutiae-based ngerprint
recognition. Their approach not only measures the posi-
tion and orientation of the minutiae but also the relations
between different minutiae to nd the probability of cor-
respondence between ngerprints. They allowthe overlap
of uncertainty area of any two minutiae. Tabassi et al. (43)
and Wein and Baveja (44) use the ngerprint image
quality to predict the performance. They dene the quality
as an indication of the degree of separation between
the match score and non-match score distributions. The
farther these two distributions are from each other, the
better the system performs.
Predicting large-population recognition performance from a small template database is another important topic in fingerprint performance characterization. Wang and Bhanu (45) present an integrated model that considers data distortion to predict fingerprint identification performance on large populations. Learning is incorporated in the prediction process to find the optimal small gallery size. The Chernoff and Chebychev inequalities are used as a guide to obtain the small gallery size given the margin of error and the confidence interval. The confidence interval describes the uncertainty associated with the estimation; it gives an interval within which the true algorithm performance for a large population is expected to fall, along with the probability that it is expected to fall there.
FINGERPRINT SECURITY
Traditionally, cryptosystems use secret keys to protect
information. Assume that we have two agents, called Alice
and Bob. Alice wants to send a message to Bob over
the public channel. Eve, the third party, eavesdrops over
the public channel and tries to gure out what Alice and
Bobare sayingto eachother. WhenAlice sends amessage to
Bob, she uses a secret encryption algorithm to encrypt the
message. After Bob gets the encrypted message, he will
use a secret decryption algorithm to decrypt the message.
The secret keys are used in the encryption and decryption
processes. Because the secret keys can be forgotten, lost,
and broken, biometric cryptosystems are possible for
security.
Uludag et al. (46) propose a cryptographic construct that combines the fuzzy vault with fingerprint minutiae data to protect information. The procedure for constructing the fuzzy vault can be described with the example of Alice and Bob. Alice places a secret value k in a vault and locks it using an unordered set A. She selects a polynomial p of variable x whose coefficients encode k. Then, she computes the polynomial projections for the elements of A and adds some randomly generated chaff points that do not lie on p to arrive at the final point set R. Bob uses an unordered set B to unlock the vault, which is possible only if B overlaps with A to a great extent. He tries to learn k, that is, to find p. By using error-correction coding, he can reconstruct p. Uludag et al. (46) also present a curve-based transformed minutia representation for securing the fuzzy fingerprint vault.
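A toy version of the locking step can make the construction concrete. The sketch below is a simplified illustration of the idea rather than the scheme of Ref. 46: it works over the integers instead of a finite field, omits the error-correction coding used for unlocking, and the chaff generation and parameter names are assumptions of this example.

    import random

    def lock_vault(secret_coeffs, locking_set, num_chaff=50, value_range=1000):
        # secret_coeffs: coefficients of the secret polynomial p (they encode the key k).
        # locking_set: unordered set A of feature values (e.g., numbers derived from minutiae).
        def p(x):
            return sum(c * x ** i for i, c in enumerate(secret_coeffs))
        genuine = [(a, p(a)) for a in locking_set]            # points that lie on p
        chaff = []
        while len(chaff) < num_chaff:
            x, y = random.randrange(value_range), random.randrange(value_range)
            if x not in locking_set and y != p(x):            # chaff must not lie on p
                chaff.append((x, y))
        vault = genuine + chaff
        random.shuffle(vault)                                 # hide which points are genuine
        return vault

Unlocking then amounts to selecting the vault points whose x-values match Bob's set B and fitting a polynomial to them, which succeeds only when B shares enough elements with A.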
We know that a biometric is permanently associated with a user, so if the biometric is lost, the biometric recognition system is compromised forever. Also, a user can be tracked by cross-matching across all the potential uses of the fingerprint biometric, such as access to the house, bank account, vehicle, and laptop computer. Ratha et al. (47) present a solution to overcome these problems in fingerprint recognition systems. Instead of storing the original biometric image, the authors apply a one-way transformation function to the biometric. Then, they store the transformed biometric and the transformation to preserve privacy. If a biometric is lost, then it can be restored by applying a different transformation function. For different applications of the same biometric, they use different transformation functions to avoid a user being tracked through cross-matching.
Like other security systems, fingerprint sensors are prone to spoofing by fake fingerprints molded from artificial materials. Parthasaradhi et al. (48) developed an anti-spoofing method that is based on the distinctive moisture pattern of live fingers contacting fingerprint sensors. This method uses the physiological process of perspiration to determine the liveness of a fingerprint. First, they extract the gray values along the ridges to form a signal; this process maps a 2-D fingerprint image into a signal. Then, they calculate a set of features that are represented by a set of dynamic measurements. Finally, they use a neural network to perform classification (live vs. not live).
CONCLUSIONS
Because of their characteristics, such as distinctiveness, permanence, and collectability, fingerprints have been used widely for recognition for more than 100 years. Fingerprints have three levels of features: pattern, point, and shape. With improvements in sensor resolution, more and better fingerprint features can be extracted to improve fingerprint recognition performance. Three kinds of approaches exist to solve the fingerprint identification problem: (1) repeat the verification procedure for each fingerprint in the database and select the best match; (2) perform fingerprint classification followed by verification; and (3) perform fingerprint indexing followed by verification. Fingerprint verification is done by matching a query fingerprint with a template. Feature extraction, matching, classification, indexing, and performance prediction are the basic problems of fingerprint recognition.
Fingerprint performance prediction, security, liveness detection, and cancelable biometrics are important current research problems. The area of biometric cryptosystems, especially fingerprint cryptosystems, is an upcoming area of interest because the traditional secret key can be forgotten, lost, or broken.
BIBLIOGRAPHY
1. R. Wang and B. Bhanu, Predicting fingerprint biometric performance from a small gallery, Pattern Recognition Letters, 28 (1): 40-48, 2007.
2. E. R. Henry, Classification and Uses of Fingerprints. George Routledge and Sons, 1900.
3. G. T. Candela, P. J. Grother, C. I. Watson, R. A. Wilkinson, and C. L. Wilson, PCASYS - A pattern-level classification automation system for fingerprints, NIST Technical Report NISTIR 5467, 1995.
4. D. Maio, D. Maltoni, R. Cappelli, J. L. Wayman, and A. K. Jain, FVC2000: fingerprint verification competition, IEEE Trans. Pattern Analysis and Machine Intelligence, 24 (3): 402-412, 2002.
5. A. K. Jain, Y. Chen, and M. Demirkus, Pores and ridges: High resolution fingerprint matching using level 3 features, IEEE Trans. Pattern Analysis and Machine Intelligence, 29 (1): 15-27, 2007.
6. http://authentec.com.
7. Scientific Working Group on Friction Ridge Analysis, Study and Technology (SWGFAST), Available: http://fingerprint.nist.gov/standard/cdeffs/Docs/SWGFAST_Memo.pdf.
8. D. Maltoni, D. Maio, A. K. Jain, and S. Prabhakar, Handbook of Fingerprint Recognition. New York: Springer, 2003.
9. ISO/IEC 19795-1, Information Technology - Biometric Performance Testing and Reporting - Part 1: Principles and Framework. ISO/IEC JTC1/SC37 N908, 2006.
10. X. Tan and B. Bhanu, A robust two step approach for fingerprint identification, Pattern Recognition Letters, 24 (13): 2127-2134, 2003.
11. R. S. Germain, A. Califano, and S. Colville, Fingerprint matching using transformation parameter clustering, IEEE Computational Science and Engineering, 4 (4): 42-49, 1997.
12. A. M. Bazen, G. T. B. Verwaaijen, S. H. Gerez, L. P. J. Veelenturf, and B. J. van der Zwaag, A correlation-based fingerprint verification system, Proc. IEEE Workshop on Circuits Systems and Signal Processing, Utrecht, Holland, 2000, pp. 205-213.
13. X. Tan and B. Bhanu, Fingerprint matching by genetic algorithm, Pattern Recognition, 39 (3): 465-477, 2006.
14. B. Bhanu and X. Tan, Computational Algorithms for Fingerprint Recognition. Kluwer Academic Publishers, 2003.
15. B. Bhanu and X. Tan, Learned templates for feature extraction in fingerprint images, Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Hawaii, 2001, Vol. 2, pp. 591-596.
16. X. Jiang and W. Y. Yau, Fingerprint minutiae matching based on the local and global structures, Proc. IEEE Int. Conf. Pattern Recognition, Barcelona, Spain, 2000, pp. 1038-1041.
17. Z. M. Kovacs-Vajna, A fingerprint verification system based on triangular matching and dynamic time warping, IEEE Trans. Pattern Analysis and Machine Intelligence, 22 (11): 1266-1276, 2000.
18. U. Halici and G. Ongun, Fingerprint classification through self-organizing feature maps modified to treat uncertainties, Proc. IEEE, 84 (10): 1497-1512, 1996.
19. R. Cappelli, D. Maio, and D. Maltoni, Fingerprint classification based on multi-space KL, Proc. Workshop Autom. Identific. Adv. Tech., 1999, pp. 117-120.
20. N. K. Ratha and R. Bolle, Automatic Fingerprint Recognition Systems. Springer, 2003.
21. A. Senior, A combination fingerprint classifier, IEEE Trans. Pattern Analysis and Machine Intelligence, 23 (10): 1165-1174, 2001.
22. Y. Yao, G. L. Marcialis, M. Pontil, P. Frasconi, and F. Roli, Combining flat and structured representations for fingerprint classification with recursive neural networks and support vector machines, Pattern Recognition, 36 (2): 397-406, 2003.
23. X. Tan, B. Bhanu, and Y. Lin, Fingerprint classification based on learned features, IEEE Trans. on Systems, Man and Cybernetics, Part C, Special Issue on Biometrics, 35 (3): 287-300, 2005.
24. K. Karu and A. K. Jain, Fingerprint classification, Pattern Recognition, 29 (3): 389-404, 1996.
25. Y. Qi, J. Tian, and R. W. Dai, Fingerprint classification system with feedback mechanism based on genetic algorithm, Proc. Int. Conf. Pattern Recog., 1: 163-165, 1998.
26. R. Cappelli, A. Lumini, D. Maio, and D. Maltoni, Fingerprint classification by directional image partitioning, IEEE Trans. Pattern Analysis and Machine Intelligence, 21 (5): 402-421, 1999.
27. M. Kamijo, Classifying fingerprint images using neural network: Deriving the classification state, Proc. Int. Conf. Neural Network, 3: 1932-1937, 1993.
28. F. Su, J. A. Sun, and A. Cai, Fingerprint classification based on fractal analysis, Proc. Int. Conf. Signal Process., 3: 1471-1474, 2000.
29. M. S. Pattichis, G. Panayi, A. C. Bovik, and S. P. Hsu, Fingerprint classification using an AM-FM model, IEEE Trans. Image Process., 10 (6): 951-954, 2001.
30. S. Bernard, N. Boujemaa, D. Vitale, and C. Bricot, Fingerprint classification using Kohonen topologic map, Proc. Int. Conf. Image Process., 3: 230-233, 2001.
31. A. K. Jain and S. Minut, Hierarchical kernel fitting for fingerprint classification and alignment, Proc. IEEE Int. Conf. Pattern Recognition, 2: 469-473, 2002.
32. S. M. Mohamed and H. O. Nyongesa, Automatic fingerprint classification system using fuzzy neural techniques, Proc. Int. Conf. Fuzzy Systems, 1: 358-362, 2002.
33. B. Bhanu and X. Tan, Fingerprint indexing based on novel features of minutiae triplets, IEEE Trans. Pattern Analysis and Machine Intelligence, 25 (5): 616-622, 2003.
34. X. Tan and B. Bhanu, Robust fingerprint identification, Proc. IEEE Int. Conf. on Image Processing, New York, 2002, pp. 277-280.
35. http://www.itl.nist.gov/iad/894.03/databases/defs/dbases.html.
36. http://bias.csr.unibo.it/fvc2000/.
37. http://bias.csr.unibo.it/fvc2002/.
38. http://bias.csr.unibo.it/fvc2004/.
39. http://bias.csr.unibo.it/fvc2006/.
40. C. Wilson, R. A. Hicklin, M. Bone, H. Korves, P. Grother, B. Ulery, R. Micheals, M. Zoepfl, S. Otto, and C. Watson, Fingerprint Vendor Technology Evaluation 2003: Summary of Results and Analysis Report, 2004.
41. F. Galton, Finger Prints. McMillan, 1892.
42. X. Tan and B. Bhanu, On the fundamental performance for fingerprint matching, Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Madison, Wisconsin, 2003, pp. 18-20.
43. E. Tabassi, C. L. Wilson, and C. I. Watson, Fingerprint image quality, National Institute of Standards and Technology International Report 7151, 2004.
44. L. M. Wein and M. Baveja, Using fingerprint image quality to improve the identification performance of the U.S. visitor and immigrant status indicator technology program, Proc. National Academy of Sciences, 102 (21): 7772-7775, 2005.
45. R. Wang and B. Bhanu, Learning models for predicting recognition performance, Proc. IEEE Int. Conf. on Computer Vision, Beijing, China, 2005, pp. 1613-1618.
46. U. Uludag, S. Pankanti, and A. K. Jain, Fuzzy vault for fingerprints, Proc. Audio- and Video-based Biometric Person Authentication, Rye Brook, New York, 2005, pp. 310-319.
47. N. K. Ratha, S. Chikkerur, J. H. Connell, and R. M. Bolle, Generating cancelable fingerprint templates, IEEE Trans. Pattern Analysis and Machine Intelligence, 29 (4): 561-572, 2007.
48. S. T. V. Parthasaradhi, R. Derakhshani, L. A. Hornak, and S. A. C. Schuckers, Time-series detection of perspiration as a liveness test in fingerprint devices, IEEE Trans. on Systems, Man, and Cybernetics - Part C: Applications and Reviews, 35 (3): 335-343, 2005.
RONG WANG
BIR BHANU
University of California
Riverside, California
LEVEL SET METHODS
The shape of a real-world object can be represented by a parametric or a nonparametric contour in the image plane. The parametric contour representation is referred to as an explicit representation and is defined in the Lagrangian coordinate system. In this coordinate system, two different objects have two different representations stemming from the different sets of control points that define the object contours. The control points constitute the finite elements, which are a common formalism used to represent shapes in Lagrangian coordinates. In contrast to the parametric Lagrangian representation, the nonparametric representation defines the object contour implicitly in Eulerian coordinates, which remain the same for different objects. The level set method is a nonparametric representation defined in the Eulerian coordinate system and is commonly used in the computer vision community to represent the shape of an object in an image.
The level set method was introduced in the field of fluid dynamics by the seminal work of Osher and Sethian in 1988 (1). After its introduction, it has been applied successfully in the fields of fluid mechanics, computational physics, computer graphics, and computer vision. In the level set representation, the value of each grid point (pixel) is traditionally set to the Euclidean distance between the grid point and the contour. Hence, moving the contour from one configuration to another is achieved by changing the distance values at the grid points. During its motion, the contour can implicitly change its topology by splitting into two disjoint contours or by merging from two contours into one.
The implicit nature of the representation becomes essential for handling topology changes when an initial configuration is required to solve a time-dependent problem. Upon initialization, the level set converges to a solution by re-evaluating the values at the grid points iteratively. This iterative procedure is referred to as contour evolution.
CONTOUR EVOLUTION

Without loss of generality, I will discuss the level set formalism in the two-dimensional image coordinates (x, y); it can be extended to higher dimensions without complicating the formulation. Let there be a closed contour Γ defined in the image coordinates as shown in Fig. 1(a). The contour can be visualized as the boundary of an object in an image. To represent the evolution of the contour, the level set formalism introduces two additional dimensions that define the surface z and the time t ∈ [0, T):

z = Φ(x, y, t)

The surface dimension, z ∈ R, encodes the signed Euclidean distance from the contour. More specifically, the value of z inside the closed contour has a negative sign, and outside the contour it has a positive sign [see Fig. 1(b)]. At any given time instant t, the cross section of the surface at z = 0 corresponds to the contour, which is also referred to as the zero-level set.

The time dimension t is introduced as an artificial dimension to account for the iterative approach to finding the steady state of an initial configuration. In computer vision, the initial configuration of the level set surface relates to an initial hypothesis of the object boundary, which is evolved iteratively to the correct object boundary. The iterations are governed by a velocity field, which specifies how the contour moves in time. In general, the velocity field is defined by the domain-related physics. For instance, in fluid mechanics, the velocity of the contour is defined by the physical characteristics of the fluid and the environment in which it will dissolve; whereas in computer vision, the velocity is defined by the appearance of the objects in the image.

At each iteration, the equations that govern the contour evolution are derived from the zero-level set at time t: Φ(x(t), t) = 0, where x = (x, y). Because the moving contour is always at the zero level, the rate of contour motion in time is given by:

∂Φ(x(t), t)/∂t = 0    (1)

By applying the chain rule, Equation (1) becomes Φ_t + ∇Φ(x(t), t) · x'(t) = 0. In Fig. 2, a contour is shown that evolves using the curvature flow, which moves rapidly in the areas where the curvature is high.

An important aspect of the level set method is its ability to compute the geometric properties of the contour Γ from the level set grid by implicit differentiation. For instance, the contour normal is computed by n = ∇Φ/|∇Φ|. Hence, dividing both sides by |∇Φ| and replacing x'(t) · n by F results in the well-known level set evolution equation:

Φ_t + |∇Φ| F = 0    (2)

which evolves the contour in the normal direction with speed F. The sign of F defines whether the contour will move inward or outward. Particularly, a negative F value shrinks the contour and a positive F value expands the contour. In computer vision, F is computed at each grid point based on its appearance and the priors defined from the inside and the outside of the contour. As Equation (2) includes only first-order derivatives, it can be written as a Hamilton-Jacobi equation Φ_t + H(φ_x, φ_y) = 0, where H(φ_x, φ_y) = |∇Φ| F. Based on this observation, numerical approximations of the implicit derivatives can be computed by using the forward Euler time discretization (see Ref. 2 for more details).
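A single evolution step of Equation (2) can be written compactly on a discrete grid. The sketch below is a minimal forward-Euler sketch using central differences; the speed array F, the time step, and the grid spacing are assumed inputs, and no upwind scheme or reinitialization is performed, so it is meant only to illustrate the update rule.

    import numpy as np

    def evolve_step(phi, F, dt=0.1):
        # One forward-Euler update of phi_t + |grad(phi)| * F = 0.
        gy, gx = np.gradient(phi)               # central-difference derivatives
        grad_norm = np.sqrt(gx ** 2 + gy ** 2)
        return phi - dt * F * grad_norm         # negative F shrinks the contour, positive F expands it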
REINITIALIZATION OF THE LEVEL SET FUNCTION

To eliminate numeric instabilities during the contour evolution, it is necessary to reinitialize the level set function at each evolution iteration. This requirement, however, is a major limitation of the level set framework when fast convergence is required, such as in video-based surveillance. An intuitive approach to reinitializing the level set grid is to follow a two-step procedure. The first step is to recover the contour by detecting the zero crossings. The second step is to regenerate the level set surface to preserve the geometric properties of the new contour. In computer vision, this second step is carried out by applying the Euclidean distance transform to find the distance of each grid point from the recovered contour.

Application of the distance transform to update the level set is a time-consuming operation, especially considering that it needs to be done at every iteration. One way to reduce the complexity of the level set update is to apply the narrow band approach (3). The narrow band approach performs reinitialization only in a neighborhood defined by a band. This band also defines the limits in which the distance transform is applied. The procedure involves recovering the contour, positioning the band limits from the extracted contour, reinitializing the values residing in the band, and updating the level set bounded by the band. The narrow band approach still must recover the contour and reinitialize the level set function before a solution is reached; hence, the error during the evolution may accumulate from one iteration to the next.
An alternate approach to level set reinitialization is to use an additional partial differential equation (PDE) evaluated on the level set function. In this approach, after the PDE that operates on the level set moves the contour, a second PDE re-evaluates the grid values while preserving the zero crossings and geometric properties of the contour (4). The level set reinitialization PDE is given by:

φ_t = sign(φ_0)(1 − |∇φ|)

where the sign function defines the update of all the levels except for the zero level. An example sign function is

sign_ε(φ_0) = φ_0 / sqrt(φ_0² + ε²)
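For completeness, one iteration of this reinitialization PDE can be sketched as follows. The sketch makes illustrative assumptions: phi0 is the level set function before reinitialization, eps smooths the sign function, and simple central differences stand in for the upwind discretization normally used.

    import numpy as np

    def reinit_step(phi, phi0, dt=0.1, eps=1.0):
        # One update of phi_t = sign(phi0) * (1 - |grad(phi)|).
        sign = phi0 / np.sqrt(phi0 ** 2 + eps ** 2)   # smoothed sign function
        gy, gx = np.gradient(phi)
        grad_norm = np.sqrt(gx ** 2 + gy ** 2)
        return phi + dt * sign * (1.0 - grad_norm)    # drives |grad(phi)| toward 1 away from the zero level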
FAST MARCHING LEVEL SET

Constraining the contour to either shrink or expand eliminates the limitations posed by the level set reinitialization. In this case, the sign of F in Equation (2) is constant. Practically, the constant sign guarantees a single visit to each level set grid point. This property results in a very useful formulation referred to as the fast marching level set.

Compared with the aforementioned approaches, the fast marching method moves the contour one pixel at a time; hence, the distance traveled by the contour becomes constant, d = 1. The traveled distance also is related to the speed of the contour, F(x, y), and the time it takes to travel it, T(x, y). Constrained by these observations, the
Figure 1. (a) The contour is defined in the spatial image coordinates (x, y). Each black dot represents a grid point on the plane. (b) The level set function defining the contour given in (a). The blue-colored grid points denote positive values, and the red-colored grid points denote negative values. The distance of a grid point from the contour is expressed by the variation in the color value from light to dark.
Figure 2. A contour moving with the curvature flow in the level set framework. Note the fast evolution in the regions where the contour bends more than in the other regions.
level set becomes a stationary formulation given by 1 = F|∇T|, such that at time t the contour is defined by {Γ | T(x, y) = t}. Based on the stationary formulation, the time required to move the contour is computed by T(x, y) = 1/F(x, y).

Comparatively, the implementation of the fast marching method is easier than its alternatives. Given an initial contour, the pixels on the contour are labeled as active, the pixels within the reach of the contour in the next iteration are labeled as alive (pixels with a distance of 1), and the pixels that the contour cannot reach are labeled as faraway. The iterations start by computing the time T(x, y) required to travel to each alive pixel. From among the alive pixels, the pixel with the lowest arrival time changes its status to active (the new contour position). This change introduces a new set of alive points. This process is performed iteratively until F = 0 for all alive pixels. See Fig. 3 for a visualization of this iterative process.
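The active/alive/faraway bookkeeping maps naturally onto a priority queue. The sketch below is a simplified sketch of the idea rather than a full fast marching solver: it propagates arrival times of 1/F pixel by pixel on a 4-connected grid and ignores the quadratic update used in accurate implementations, and the function and parameter names are assumptions of this example.

    import heapq
    import numpy as np

    def fast_march(F, seeds):
        # F: 2-D array of speeds (> 0 where the contour may travel).
        # seeds: list of (row, col) pixels on the initial contour (arrival time 0).
        T = np.full(F.shape, np.inf)                  # faraway pixels start at infinity
        heap = []
        for r, c in seeds:
            T[r, c] = 0.0
            heapq.heappush(heap, (0.0, r, c))         # the initial active pixels
        while heap:
            t, r, c = heapq.heappop(heap)
            if t > T[r, c]:
                continue                              # stale entry; this pixel was already finalized
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < F.shape[0] and 0 <= nc < F.shape[1] and F[nr, nc] > 0:
                    t_new = t + 1.0 / F[nr, nc]       # time to travel one pixel at speed F
                    if t_new < T[nr, nc]:
                        T[nr, nc] = t_new             # neighbor becomes (or stays) alive
                        heapq.heappush(heap, (t_new, nr, nc))
        return T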
AN EXAMPLE: OBJECT SEGMENTATION

The level set method has become a very successful tool for solving problems ranging from noise removal, image segmentation, and registration to object tracking and stereo. A common property in all these domains is that the problem is formulated by a partial differential equation that is solved using the level set formalism.

In the context of level sets, the cost or gain function is required to be in the form of a Hamilton-Jacobi equation, φ_t + H(φ_x, φ_y) = 0, which allows only first-order differentiation of φ. Hence, given a task, the first goal is to come up with a cost function that can be converted to a Hamiltonian H(φ_x, φ_y). Let us consider the object segmentation problem as an example. In the following discussion, a region-based object segmentation approach will be considered that is formulated in terms of the appearance similarity inside and outside of the contour (see Fig. 4). Practically, the appearance of the object inside and outside the contour should be different; hence, minimizing this similarity would result in the segmentation of the object. Let us assume the contour is initialized outside the object (inside initialization is also possible), such that the appearance of the region outside the contour serves as a prior. Using this prior, the likelihood of the object boundary can be defined in terms of the probability of the inside pixels given the outside prior: p(I(R_inside) | R_outside). Maximizing the gain function formulated using this likelihood measure segments the object as follows:

E(Γ) = ∫_{x ∈ R_inside} log p(I(x) | R_outside) dx    (3)

Equation (3), however, is not in the form of a Hamiltonian. Application of Green's theorem converts this gain function
Figure 3. The fast marching level set iterations. The black nodes denote the active contour pixels, the red nodes denote the alive level set grid points, the blue nodes are the nodes that are not considered during the iteration (faraway), and the white nodes denote already-visited nodes.
Figure 4. (a) The inside and outside regions defined by the object contour. (b) Some segmentation results. (c) The zero level set during the segmentation iterations based on the appearance similarity between the inside and outside regions.
to a Hamiltonian and results in the level set propagation speed given by F(x, y) = log p(I(x, y) | R_outside) [see Ref. 5, Chapter 5, for the application of Green's theorem]. In Fig. 5, several segmentation results are shown for different domains.
domains.
DISCUSSION
Because of its robustness, its efciency, and its applicability
to a wide range of problems, the level set methodhas become
very popular among many researchers in different elds.
The HamiltonJacobi formulation of the level set can be
extended easily to higher dimensions. It can model any
topology including sharp corners, and can handle the
changes to that topology during its evolution. The level
set methodhas overcomethecontourinitializationproblems
associated with the classic active contour approach to seg-
ment images, such that the initial contour can include part
of the object and part of the background simultaneously (see
Fig. 5 for possible contour initialization). These properties,
however, come at the price of requiring careful thinking to
formulate the problem to be solved using the level set
framework. Especially, converting a computer vision pro-
blem to a PDE may require considerable attention.
On the down side, because of the reinitialization of the
level set after eachiteration, the level set method becomes a
computationally expensive minimization technique. The
narrow band approach is proposed to overcome this
limitation by bounding the region for reinitialization to a
band around the contour. Despite reduced complexity, the
narrow band method remains unsuitable for tasks that
require realtime processing, such as object tracking in a
surveillance scenario. The other possible solution to the
complexity problem is the fast marching approach, which
works in real time when implemented carefully. However,
the fast marching method evolves the contour only in one
direction. Hence, the initial position of the contour becomes
very important. For instance, the contour has to be either
outside or inside the object of interest, and the initializa-
tions that involve both inside and outside of the object may
not converge.
BIBLIOGRAPHY
1. S. Osher and J. Sethian, Fronts propagating with curvature dependent speed: Algorithms based on Hamilton-Jacobi formulations, J. Computat. Phys., 79: 12-49, 1988.
2. J. Sethian, Level Set Methods: Evolving Interfaces in Geometry, Fluid Mechanics, Computer Vision and Material Sciences, Cambridge, UK: Cambridge University Press, 1996.
3. D. Chopp, Computing minimal surfaces via level set curvature flow, J. Computat. Phys., 106: 77-91, 1993.
4. M. Sussman, P. Smereka, and S. Osher, A level set approach for computing solutions to incompressible two-phase flow, J. Computat. Phys., 114: 146-159, 1994.
5. A. Yilmaz, Object Tracking and Activity Recognition in Video Acquired Using Mobile Cameras. PhD Thesis, Orlando, FL: University of Central Florida, 2004.
ALPER YILMAZ
The Ohio State University
Columbus, Ohio
Figure 5. Initialization of the contour: (a) the contour is completely inside the object, (b) completely outside the object, (c) the contour includes both inside and outside regions of the object.
MEDICAL IMAGE PROCESSING
INTRODUCTION
The discovery of x rays in 1895 by Professor Wilhelm Conrad Röntgen led to a transformation of medicine and science. In medicine, x rays provided a noninvasive way to visualize the internal parts of the body. A beam of radiation passing through the body is absorbed and scattered by tissue and bone structures in the path of the beam to varying extents depending on their composition and the energy level of the beam. The resulting absorption and scatter patterns are captured by a film that is exposed during imaging to produce an image of the tissues and bone structures. By using varying energy levels and different sources of radiant energy, radiographic images can be produced for different tissues, organs, and bone structures.
Simple planar x-ray imaging, the main radiologic imaging method used for most of the last century, produced high-quality analog two-dimensional (2-D) projected images of three-dimensional (3-D) organs. Over the last few decades, increasingly sophisticated methods of diagnosis have been made possible by using different types of radiant energy, including x rays, gamma rays, radio waves, and ultrasound waves. The introduction of the first x-ray computed tomography (x-ray CT) scanner in the early 1970s totally changed the medical imaging landscape. The CT scanner uses instrumentation and computer technology for image reconstruction to produce images of cross sections of the human body. With the clinical experience accumulated over the years and the establishment of its usefulness, the CT scanner became very popular.
The exceptional multidimensional digital images of internal anatomy produced by medical imaging technology can be processed and manipulated using a computer to visualize subtle or hidden features that are not easily visible. Medical image analysis and processing algorithms that enhance the features of interest for easy analysis and quantification are rapidly expanding the role of medical imaging beyond noninvasive examination to a tool for aiding surgical planning and intraoperative navigation. Extracting information about the shape details of anatomical structures, for example, enables careful preoperative planning of surgical procedures.
In medical image analysis, the goal is to accurately and efficiently extract information from medical images to support a range of medical investigations and clinical activities from diagnosis to surgery. The extraction of information about anatomical structures from medical images is fairly complex. This complexity has led to many algorithms that have been proposed specifically for biomedical applications, such as the quantification of tissue volumes (1), diagnosis (2), localization of pathology (3), study of anatomical structures (4), treatment planning (5), partial volume correction of functional imaging data (6), and computer-integrated surgery (7,8).
Technological advances in medical imaging modalities have provided doctors with significant capabilities for accurate noninvasive examination. In modern medical imaging systems, the ability to effectively process and analyze medical images to accurately extract, quantify, and interpret information to achieve an understanding of the internal structures being imaged is critical in order to support a spectrum of medical activities from diagnosis, to radiotherapy, to surgery. Advances in computer technology and microelectronics have made significant computational power available in small desktop computers. This capability has spurred the development of software-based biomedical image analysis methods, such as image enhancement and restoration, segmentation of internal structures and features of interest, image classification, and quantification.
In the next section, different medical imaging modalities are discussed before the most common methods for medical image processing and analysis are presented. An exhaustive survey of such methods can be found in Refs. 9 and 10.
ACQUISITION OF MEDICAL IMAGES
A biomedical image analysis system comprises three major elements: an image acquisition system, a computer for processing the acquired information, and a display system for visualizing the processed images. In medical image acquisition, the primary objective is to capture and record information about the physical and possibly functional properties of organs or tissues by using either external or internal energy sources or a combination of these energy sources.
Conventional Radiography
In conventional radiography, a beam of x rays from an external source passing through the body is differentially absorbed and scattered by internal structures. The amount of absorption depends on the composition of these structures and on the energy of the x-ray beam. Conventional imaging methods, which are still the most commonly used diagnostic imaging procedures, form a projection image on standard radiographic film. With the advent of digital imaging technology, radiographs, which are x-ray projections, are increasingly being viewed, stored, transported, and manipulated digitally. Figure 1 shows a normal chest x-ray image.
Computed Tomography
The realization that x-ray images taken at different angles contain sufficient information for uniquely determining the internal structures led to the development of x-ray CT scanners in the 1970s, which essentially reconstruct accurate cross-sectional images from x-ray radiographs. The conventional x-ray CT consists of a rotating frame that has an x-ray source at one end and, at the other end, an array of detectors that accurately measure the total attenuation along the path of the x ray. A fan beam of x rays is created as the rotating frame spins the x-ray tube and detectors around the patient. During the 360° rotation, the detector captures numerous snapshots of the attenuated x-ray beam corresponding to a single slice of tissue whose thickness is determined by the collimation of the x-ray beam. This information is then processed by a computer to generate a 2-D image of the slice. Multiple slices are obtained by moving the patient in incremental steps. In the more recent spiral CT (also known as the helical CT), projection acquisition is carried out in a spiral trajectory as the patient moves continuously through the scanner. This process results in faster scans and higher definition of internal structures, which enables greater visualization of blood vessels and internal tissues. CT images of a normal liver and brain are shown in Fig. 2. In comparison with conventional x-ray imaging, CT imaging is a major breakthrough. It can image structures with subtle differences in x-ray absorption capacity even when they are almost obscured by a structure that strongly absorbs x-radiation. For example, the CT can image the internal structures of the brain, which is enclosed by the skull [as shown in Fig. 2(b)], whereas the x ray fails to do so.
Magnetic Resonance Imaging
Besides x-ray-based methods, other important medical imaging modalities include MRI, PET, and SPECT. MRI is based on the principles of nuclear magnetic resonance (NMR), a spectroscopic technique used to obtain microscopic chemical and physical information about molecules. An MRI can produce high-quality multidimensional images of the inside of the human body, as shown in Fig. 3, providing both structural and physiologic information about internal organs and tissues. Unlike the CT, which depicts the x-ray opacity of the structure being imaged, MRIs depict the density as well as biochemical properties based on physiologic function, including blood flow and oxygenation. A major advantage of the MRI is fast signal acquisition with very high spatial resolution.
Radiographic imaging modalities such as those based on x rays provide anatomical information about the body but not functional or metabolic information about an organ or tissue. In addition to anatomical information, MRI methods are capable of providing some functional and metabolic information. Nuclear medicine-based imaging systems image the distribution of radioisotopes within specific organs of interest following injection or inhalation of radiopharmaceuticals that are metabolized by the tissue, which then becomes a source of radiation. The images acquired by these systems provide a direct representation of the metabolism or function of the tissue or organ being imaged, as the tissue itself is the source of radiation used in the imaging process.
Figure 1. X-ray image of a chest.
Figure 2. CT images of normal liver and brain.
Figure 3. MRI of the brain.
SPECT and PET are nuclear medicine-based imaging systems.
SPECT systems use gamma cameras to image photons that are emitted during radioactive decay. Like the x-ray CT, many SPECT systems rotate a gamma camera around the object being imaged and process the acquired projections using a tomographic reconstruction algorithm to yield a 3-D reconstruction. SPECT systems do not provide images of anatomical structures with resolution as good as CT or MR images, but they show the distribution of radioactivity in the tissue, which represents a specific metabolism or blood flow, as shown in Fig. 4(a). PET systems, like SPECT, also produce images of the body by detecting the emitted radiation; PET scans are increasingly combined with CT or MRI scans so as to provide both anatomical and metabolic information. Some slices of a PET brain image are shown in Fig. 4(b).
Ultrasound Imaging
Ultrasoundor acoustic imaging is anexternal source-based
imaging method. Ultrasound imaging produces images of
organs and tissues by using the absorption and reection of
ultrasoundwaves traveling throughthe body(Fig. 5). It has
been successfully used for imaging of anatomical struc-
tures, blood ow measurements, and tissue characteriza-
tion. A major advantage of this method, which does not
involve electromagnetic radiation, is that it is almost non-
intrusive, and hence, the examined structures can be sub-
jected to uninterrupted and long-term observations while
the subject does not suffer any ill effects.
IMAGE ENHANCEMENT AND RESTORATION
In image enhancement, the purpose is to process an acquired image to improve the contrast and visibility of the features of interest. The contrast and visibility of images actually depend on the imaging modality and on the nature of the anatomical regions. Therefore, the type of image enhancement to be applied has to be chosen suitably. Image restoration also leads to image enhancement; generally, it involves mean-squared error operations and other methods that are based on an understanding of the type of degradation inherent in the image. However, procedures that reduce noise tend to reduce detail, whereas those that enhance detail also increase the noise. Image enhancement methods can be broadly classified into two categories: spatial-domain methods (11) and frequency-domain methods (12).
Spatial-domain methods involve manipulation on a pixel-by-pixel basis and include histogram-based methods, spatial filtering, and so on. The histogram provides information about the distribution of pixel intensities in an image and is normally expressed as a 2-D graph that shows the frequency of occurrence of each gray-level value in the image. Histogram equalization is a commonly used technique that involves spreading out or stretching the gray levels to ensure that they are redistributed as evenly as possible (12,13). With this operation, minor variations in structures or features are better visualized within regions that originally looked uniform. Figure 6 shows the result of applying histogram equalization to a CT image.
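To make the idea concrete, the following short Python sketch (using only NumPy and assuming a hypothetical 8-bit image array named img) performs global histogram equalization by remapping gray levels through the normalized cumulative histogram. It is an illustration of the technique described above, not code from any of the referenced systems.

import numpy as np

def equalize_histogram(img, levels=256):
    # Histogram of gray-level occurrences.
    hist, _ = np.histogram(img.flatten(), bins=levels, range=(0, levels))
    # Cumulative distribution, rescaled to the full gray-level range.
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min()) * (levels - 1)
    # Map each original gray level through the equalizing transfer function.
    return cdf[img].astype(np.uint8)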
Figure 4. SPECT and PET sequences of a normal brain.
Figure 5. Ultrasound image of a normal liver.
In some instances, however, global equalization across all possible pixel values could result in loss of important details and/or high-frequency information. For such situations, local histogram modifications, including localized or regional histogram equalization, can be applied to obtain good results (14). In spatial filtering, pixel values in an image are replaced with some function of the pixel and its neighbors. Image averaging is a form of spatial filtering where each pixel value is replaced by the mean of its neighboring pixels and the pixel itself. Edges in an image can be enhanced easily by subtracting the mean of the neighborhood from each pixel value. However, this approach usually increases the noise in the image. Figure 7 shows the results of applying local spatial-domain methods.
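As a minimal illustration of these neighborhood operations, the sketch below (assuming a 2-D NumPy array img of gray values; size and amount are arbitrary illustrative parameters) uses SciPy's uniform filter for image averaging and forms a simple edge-enhanced image by adding back the difference between each pixel and its local mean.

import numpy as np
from scipy import ndimage

def smooth_and_sharpen(img, size=3, amount=1.0):
    img = img.astype(float)
    # Image averaging: replace each pixel by the mean of its neighborhood.
    mean = ndimage.uniform_filter(img, size=size)
    # High-frequency (edge) component: pixel value minus neighborhood mean.
    edges = img - mean
    # Unsharp-style enhancement: add the edge component back to the image.
    sharpened = img + amount * edges
    return mean, sharpened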
Frequency-domain methods are usually faster and simpler to implement than spatial-domain methods (12). The processing is carried out in the Fourier domain to remove or reduce image noise, enhance edge details, and improve contrast. A low-pass filter can be used to suppress noise by removing high frequencies in the image, whereas a high-pass filter can be used to enhance the high frequencies, resulting in an increase of detail and noise. Different filters can be designed to selectively enhance or suppress features or details of interest.
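The following sketch illustrates frequency-domain filtering with NumPy's FFT routines: an ideal low-pass mask suppresses high spatial frequencies, and its complement acts as a high-pass filter. The cutoff fraction is an arbitrary illustrative parameter, and real systems would normally use smoother filter profiles.

import numpy as np

def fourier_filter(img, cutoff=0.1):
    F = np.fft.fftshift(np.fft.fft2(img))
    rows, cols = img.shape
    y, x = np.ogrid[:rows, :cols]
    # Distance of each frequency sample from the zero-frequency center.
    dist = np.hypot(y - rows / 2, x - cols / 2)
    radius = cutoff * min(rows, cols)
    lowpass_mask = dist <= radius
    # Low-pass result keeps frequencies inside the radius; high-pass keeps the rest.
    lowpassed = np.fft.ifft2(np.fft.ifftshift(F * lowpass_mask)).real
    highpassed = np.fft.ifft2(np.fft.ifftshift(F * (~lowpass_mask))).real
    return lowpassed, highpassed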
MEDICAL IMAGE SEGMENTATION
A fundamental operation in medical image analysis is the segmentation of anatomical structures. It is not surprising that segmentation of medical images has been an important research topic for a long time. It essentially involves partitioning an image into distinct regions by grouping together neighboring pixels that are related. The extraction or segmentation of structures from medical images and the reconstruction of a compact geometric representation of these structures is fairly complex because of the complexity and variability of the anatomical shapes of interest. Inherent shortcomings of the acquired medical images, such as sampling artifacts, spatial aliasing, partial volume effects, noise, and motion, may cause the boundaries of structures to be indistinct. Furthermore, each imaging modality, with its own characteristics, could produce images that are quite different when imaging the same structures. Thus, it is challenging to accurately extract the boundaries of the same anatomical structures. Traditional image processing methods are not easily applied for analyzing medical images unless supplemented with considerable amounts of expert intervention (15).
There has been a significant body of work on algorithms for the segmentation of anatomical structures and other regions of interest that aim to assist and automate specific radiologic tasks (16).
Figure 6. CT images of the liver and the associated histograms: (a) original image and (b) enhanced image.
Figure 7. Ultrasound images of liver. (a) Original image. (b) The result of image averaging, where the image becomes smoother. (c) The result of edge enhancement, where the image becomes sharper but with increased noise levels.
These methods vary depending on the specific application, imaging modality, and other factors. Currently, no single segmentation method yields acceptable results for every type of medical image. Although general methods can be applied to a variety of data (10,15,17), methods that are tailored to particular applications can often achieve better performance by taking into account the specific nature of the image modality. In the following subsections, the most commonly used segmentation methods are briefly introduced.
Thresholding
Thresholding is a very simple approach for segmentation that attempts to partition images by grouping pixels with similar intensities, or a range of intensities, into one class and the remaining pixels into another class. It is often effective for segmenting images with structures that have contrasting intensities (18,19). A simple approach to thresholding involves analyzing the histogram and setting the threshold value to a point between two major peaks in the histogram distribution. Although there are automated methods for selecting the threshold (20), thresholding is usually performed interactively based on visual assessment of the resulting segmentation. Figure 8 shows the result of a thresholding operation on an MR image of the brain, where only pixels within a certain range of intensities are displayed.
The main limitations of thresholding are that only two classes are generated and that it typically does not take into account the spatial characteristics of an image, which makes it sensitive to noise and to the intensity inhomogeneities that tend to occur in many medical imaging modalities. Nevertheless, thresholding is often used as an initial step in a sequence of image-processing operations.
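A minimal sketch of intensity thresholding in Python/NumPy follows: a binary mask is produced by keeping pixels whose intensities fall within a chosen range, for example between two histogram peaks. The function name and the example threshold values are illustrative placeholders only.

import numpy as np

def threshold_segment(img, low, high):
    # Pixels within [low, high] form one class; all other pixels form the second class.
    return (img >= low) & (img <= high)

# Example (illustrative values for an 8-bit MR slice):
# mask = threshold_segment(mr_slice, low=80, high=160)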
Region-Based Segmentation
Region growing algorithms (21) have proven to be an effective approach for image segmentation. The basic approach in these algorithms is to start from a seed point or region that is considered to be inside the object to be segmented. Neighboring pixels with similar properties are evaluated to determine whether they should also be considered part of the object, and those that should be are added to the region. The process continues as long as new pixels are added to the region. Region growing algorithms differ in the criteria used to decide whether a pixel should be included in the region, the strategy used to select neighboring pixels to evaluate, and the criteria that stop the growing. Figure 9 illustrates several examples of region growing-based segmentation. Like thresholding-based segmentation, region growing is seldom used alone but usually as part of a set of image-processing operations, particularly for segmenting small and simple structures such as tumors and lesions (23,24). Disadvantages of region growing methods include the need for manual intervention to specify the initial seed point and sensitivity to noise, which can cause extracted regions to have holes or even to become disconnected.
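The sketch below outlines a basic intensity-based region growing procedure of the kind described above (not the ITK implementation used for Fig. 9): starting from a seed pixel, 4-connected neighbors are added while their intensity stays within a tolerance of the seed value. The tolerance and the seed coordinates are assumptions supplied by the user.

import numpy as np
from collections import deque

def region_grow(img, seed, tol=10.0):
    grown = np.zeros(img.shape, dtype=bool)
    seed_value = float(img[seed])
    queue = deque([seed])
    grown[seed] = True
    while queue:
        r, c = queue.popleft()
        # Examine the 4-connected neighbors of the current pixel.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < img.shape[0] and 0 <= nc < img.shape[1] and not grown[nr, nc]:
                # Inclusion criterion: intensity close to the seed intensity.
                if abs(float(img[nr, nc]) - seed_value) <= tol:
                    grown[nr, nc] = True
                    queue.append((nr, nc))
    return grown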
Region splitting and merging algorithms (25,26) evaluate the homogeneity of a region based on different criteria such as the mean, variance, and so on. If a region of interest is found to be inhomogeneous according to some similarity constraint, it is split into two or more regions. Since some neighboring regions may have identical or similar properties after splitting, a merging operation is incorporated that compares neighboring regions and merges them if necessary.
Segmentation Through Clustering
Segmentation of images can be achieved through clustering of pixel data values or feature vectors whose elements consist of the parameters to be segmented. Examples of multidimensional feature vectors include the red, green, and blue (RGB) components of each image pixel and the different attenuation values for the same pixel in dual-energy x-ray images. Such datasets are very useful, as each dimension of the data allows different distinctions to be made about each pixel in the image.
In clustering, the objective is to group similar feature vectors that are close together in the feature space into a single cluster, whereas others are placed in different clusters. Clustering is thus a form of classification.
Figure 8. Binary thresholding of MR image of the brain.
Sonka and Fitzpatrick (27) provide a thorough review of classification methods, many of which have been applied in object recognition, registration, segmentation, and feature extraction. Classification algorithms are usually categorized as unsupervised or supervised. In supervised methods, sample feature vectors exist for each class (i.e., a priori knowledge), and the classifier merely decides how to classify new data based on these samples. In unsupervised methods, there is no a priori knowledge, and the algorithms, which are based on cluster analysis, examine the data to determine natural groupings or classes.
Unsupervised Clustering. Unlike supervised classification, very few inputs are needed for unsupervised classification, as the data are clustered into groupings without any user-defined training. In most approaches, an initial set of groupings or classes is defined. However, the initial set could be inaccurate and possibly split along two or more actual classes. Thus, additional processing is required to correctly label these classes.
The k-means clustering algorithm (28,29) partitions the data into k clusters by optimizing an objective function defined over the feature vectors of the clusters in terms of similarity and distance measures. The objective function used is usually the sum of squared errors based on the Euclidean distance measure. In general, an initial set of k clusters with arbitrary centroids is first created by the k-means algorithm. The centroids are then modified using the objective function, resulting in new clusters. The k-means clustering algorithm has been applied in medical imaging for segmentation/classification problems (30,31). However, its performance is limited when compared with that achieved using more advanced methods.
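A compact sketch of the k-means procedure described above, applied to an array of feature vectors (e.g., pixel intensities or RGB values reshaped to shape (n_pixels, n_features)); this is a plain NumPy illustration rather than the specific algorithms of Refs. 30 and 31.

import numpy as np

def kmeans(features, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    # Initialize centroids from k randomly chosen feature vectors.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every feature vector to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the vectors assigned to it.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = features[labels == i].mean(axis=0)
    return labels, centroids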
Whereas the k-means clustering algorithm uses fixed (hard) assignments that relate a data point to a single cluster, the fuzzy k-means clustering algorithm (32) (also known as fuzzy c-means) uses a membership value that can be updated based on the distribution of the data. Essentially, the fuzzy k-means method enables any data sample to belong to any cluster with different degrees of membership. Fuzzy partitioning is carried out through an iterative optimization of the objective function, which is also a sum of squared errors based on the Euclidean distance measure, weighted by the degree of membership in a cluster. Fuzzy k-means algorithms have been applied successfully in medical image analysis (33–36), most commonly for segmentation of MRI images of the brain. Pham and Prince (33) were the first to use adaptive fuzzy k-means in medical imaging. In Ref. 34, Boudraa et al. segment multiple sclerosis lesions, whereas the algorithm of Ahmed et al. (35) uses a modified objective function of the standard fuzzy k-means algorithm to compensate for inhomogeneities. Current fuzzy k-means methods use adaptive schemes that iteratively vary the number of clusters as the data are processed.
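For comparison, the core updates of the standard fuzzy k-means (fuzzy c-means) iteration can be sketched as follows: memberships are inversely related to distances from the cluster centers, and centers are membership-weighted means. This is a generic illustration with fuzziness exponent m, not the adaptive or bias-field-compensated variants cited above.

import numpy as np

def fuzzy_cmeans(features, k, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    n = len(features)
    # Random initial membership matrix, rows normalized to sum to one.
    u = rng.random((n, k))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        um = u ** m
        # Cluster centers: membership-weighted means of the feature vectors.
        centers = (um.T @ features) / um.sum(axis=0)[:, None]
        # Distances from every feature vector to every center (avoid divide-by-zero).
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)), normalized per sample.
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
    return u, centers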
Supervised Clustering. Supervised methods use sample feature vectors (known as training data) whose classes are known. New feature vectors are classified into one of the known classes on the basis of how similar they are to the known sample vectors. Supervised methods assume that classes in multidimensional feature spaces can be described by multivariate probability density functions. The probability or likelihood of a data point belonging to a class is related to its distance from the class center in the feature space. Bayesian classifiers adopt such probabilistic approaches and have been applied to medical images, usually as part of more elaborate approaches (37,38). The accuracy of such methods depends very much on having good estimates for the mean (center) and covariance matrix of each class, which in turn requires large training datasets.
When the training data are limited, it may be better to use minimum distance or nearest neighbor classifiers that merely assign unknown data to the class of the sample vector that is closest in the feature space, usually measured in terms of the Euclidean distance. In the k-nearest neighbor method, the class of the unknown data is the class of the majority of its k nearest neighbors. A similar approach, the Parzen windows classifier, labels the unknown data with the class of the majority of the samples in a volume centered about the unknown data. The nearest neighbor and Parzen windows methods may seem easier to implement because they do not require a priori knowledge of the class distributions, but their performance depends strongly on the number of data samples available. These supervised classification approaches have been used in various medical imaging applications (39,40).
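A minimal k-nearest neighbor classifier along the lines described above (plain NumPy, Euclidean distance). Here train_X and train_y are assumed names for the labeled training feature vectors and their classes, and query is a single unknown feature vector.

import numpy as np

def knn_classify(train_X, train_y, query, k=5):
    # Euclidean distances from the query vector to every training vector.
    dists = np.linalg.norm(train_X - query, axis=1)
    # Classes of the k closest training samples.
    nearest = train_y[np.argsort(dists)[:k]]
    # Majority vote among the k nearest neighbors.
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[counts.argmax()]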
Figure 9. Segmentation results of region growing with various seed points obtained by using Insight Toolkit (22).
Model Fitting Approaches
Model fitting is a segmentation method in which attempts are made to fit simple geometric shapes to the locations of extracted features in an image (41). The techniques and models used are usually specific to the structures that need to be segmented, to ensure good results. Prior knowledge about the anatomical structure to be segmented enables the construction of shape models. In Ref. 42, active shape models are constructed from a set of training images. These models can be fitted to an image by adjusting some parameters and can also be supplemented with textural information (43). Active shape models have been widely used for medical image segmentation (44–46).
Deformable Models
Segmentation techniques that combine deformable models with local edge extraction have achieved considerable success in medical image segmentation (10,15). Deformable models are capable of accommodating the often significant variability of biological structures. Furthermore, different regularizers can easily be incorporated into deformable models to obtain better segmentation results for specific types of images. In comparison with other segmentation methods, deformable models can be considered high-level segmentation methods (15).
Deformable models (15) are referred to by different names in the literature. In 2-D segmentation, deformable models are usually referred to as snakes (47,48), active contours (49,50), balloons (51), and deformable contours (52). In 3-D segmentation, they are usually referred to as active surfaces (53) and deformable surfaces (54,55).
Deformable models were first introduced into computer vision by Kass et al. (47) as snakes or active contours; they are now known as parametric deformable models because of their explicit representation as parameterized contours in a Lagrangian framework. By designing a global shape model, boundary gaps are easily bridged, and overall consistency is more likely to be achieved. Parametric deformable models are commonly used when some prior information about the geometrical shape is available, which can preferably be encoded using a small number of parameters. They have been used extensively, but their main drawback is the inability to adapt to changes in topology (15,48).
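As a point of reference, one common form of the energy functional minimized by the classical snake of Kass et al. (47) can be written, for a parameterized contour C(s) with elasticity and rigidity weights \alpha and \beta and an image-derived potential P that is small on edges (for example, P = -|\nabla(G_\sigma * I)|^2 for an image I smoothed by a Gaussian G_\sigma), as

  E(C) = \int_0^1 \Big[ \tfrac{1}{2}\big( \alpha\,|C'(s)|^2 + \beta\,|C''(s)|^2 \big) + P\big(C(s)\big) \Big] \, ds

The first two terms regularize the contour, whereas the potential term pulls it toward object boundaries.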
Geometric deformable models are represented implicitly as a level set of a higher-dimensional scalar level set function and evolve in an Eulerian fashion (56,57). Geometric deformable models were introduced more recently by Caselles et al. (58) and by Malladi et al. (59). A major advantage of these models over parametric deformable models is their topological flexibility, which results from the implicit representation. During the past decade, tremendous efforts have been made on various medical image segmentation applications based on level set methods (60). Many new algorithms have been reported to improve the precision and robustness of level set methods. For example, Chan and Vese (61) proposed an active contour model that can detect objects whose boundaries are not necessarily defined by gray-level gradients.
When applied for segmentation, an initialization of the deformable model is needed. It can be selected manually or generated by using other low-level methods such as thresholding or region growing. An energy functional is designed so that the model lies on the object boundary when the functional is minimized. Yan and Kassim (62,63) proposed the capillary active contour for magnetic resonance angiography (MRA) image segmentation. Inspired by capillary action, a novel energy functional is formulated, which is minimized when the active contour snaps to the boundary of blood vessels. Figure 10 shows the segmentation results of MRA using a special geometric deformable model, the capillary active contour.
MEDICAL IMAGE REGISTRATION
Multiple images of the same subject, acquired possibly using different medical imaging modalities, contain useful information that is usually of a complementary nature. Proper integration of the data, together with tools for visualizing the combined information, offers potential benefits to physicians. For this integration to be achieved, the separately acquired images must be spatially aligned; this process is called image registration. Registration involves determining a transformation that relates the position of features in one image to the position of the corresponding features in another image. To determine this transformation, which is also known as spatial mapping, registration algorithms use geometrical features such as points, lines, and surfaces that correspond to the same physical entity visible in both images. After accurate registration, the different images share the same coordinate system, so that each set of points in one image occupies the same volume as the corresponding set of points in another image. In addition to combining images of the same subject from different modalities, other applications of image registration include aligning temporal image sequences to compensate for motion of the subject between scans and image guidance during medical procedures.
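As a concrete illustration of the spatial mapping involved, the sketch below applies a 2-D rigid body transformation (rotation by theta plus translation) to point coordinates. In an actual registration algorithm, an optimizer would search for the rotation and translation that maximize a similarity measure between the transformed and reference images; the function and parameter names here are illustrative only.

import numpy as np

def rigid_transform_points(points, theta, tx, ty):
    # points: array of shape (n, 2) holding (x, y) coordinates.
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Rotate the points and then translate them.
    return points @ R.T + np.array([tx, ty])

# Example: rotate feature points by 5 degrees and shift them by (2, -3) pixels.
# aligned = rigid_transform_points(landmarks, np.deg2rad(5.0), 2.0, -3.0)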
Figure 10. Different views of the MRA segmentation results using the capillary active contour.
The evaluation of the transformation parameters can be computationally intensive, but it can be simplified by assuming that the structures of interest do not deform or distort between image acquisitions. However, many organs do deform during image acquisition, for example, during the cardiac and respiratory cycles. Many medical image registration algorithms calculate rigid body or affine transformations, and thus, their applicability is restricted to parts of the body where deformation is small. As bones are rigid, rigid body registration is widely used where the structures of interest are either bone or are enclosed in bone. The brain, which is enclosed by the skull, is reasonably nondeformable, and several registration approaches have been applied to the rigid body registration of brain images. Figure 11 shows an example of rigid registration of brain MR images. Image registration based on rigid body transformations is widely used for aligning multiple 3-D tomographic images of the same subject acquired using different modalities (intermodality registration) or using the same modality (intramodality registration) (64–66).
The problem of aligning images of structures with deformed shapes is an active area of research. Such deformable (nonaffine) registration algorithms usually involve the use of an initial rigid body or affine transformation to provide a starting estimate (65–68).
After registration, fusion is required for the integrated display. Some fusion methods involve the direct combination of multiple registered images. One example is the superimposition of PET data on MRI data to provide a single image containing both structure and function (69). Another example involves the combined use of MR and CT to delineate tumor tissues (70).
Atlas-Based Approaches
In atlas-based registration and segmentation of medical images, prior anatomical and/or functional knowledge is exploited. The atlas is a reference image in which the objects of interest have been carefully segmented. In this method (65,66,71), the objective is essentially to carry out a nonrigid registration between the image of the patient and the atlas. The first step, known as atlas warping, involves finding a transformation that maps the atlas image to the target image to be segmented. The warping usually consists of a combination of linear and nonlinear transformations to ensure good registration despite anatomical variability. Even with nonlinear registration methods, accurate atlas-based segmentation of complex structures is difficult because of anatomical variability, but these approaches are generally suited for segmentation of structures that are stable across large numbers of people. Probabilistic atlases (72–75) help to overcome anatomical variability.
CONCLUDING REMARKS
Significant advances in medical imaging modalities have led to several new methodologies that provide significant capabilities for noninvasive and accurate examination of anatomical, physiological, metabolic, and functional structures and features. Three- and four-dimensional medical images contain a significant amount of information about the structures being imaged. Sophisticated software-based image processing and analysis methods enhance the information acquired by medical imaging equipment, improve the visibility of features of interest, and thereby enable visual examination, diagnosis, and analysis. There remain, however, several challenges in medical imaging, ranging from accurate analysis of cardiac motion and tumors to nonaffine registration applications involving organs other than the brain (76,77). Also, new imaging modalities, such as optical (78), microwave, and electrical impedance (79) imaging methods, hold the promise of breakthroughs when new associated algorithms for processing and analyzing the acquired information become available.
ACKNOWLEDGMENTS
The authors would like to thank the Department of Nuclear Medicine at the Singapore General Hospital for providing the PET image and Prof. S. C. Wang of the Department of Diagnostic Imaging at the National University Hospital for providing the images used in this chapter. The authors are also grateful to Dr. P. K. Sadasivan for his valuable inputs.
Figure 11. (a) Result of direct overlapping of brain MR images without registration. (b) Result of overlapping of images after rigid registration.
BIBLIOGRAPHY
1. S. M. Lawrie and S. S. Abukmeil, Brain abnormality in schizophrenia: A systematic and quantitative review of volumetric magnetic resonance imaging studies, Br. J. Psychiat., 172: 110–120, 1998.
2. P. Taylor, Computer aids for decision-making in diagnostic radiology: a literature review, Br. J. Radiol., 68 (813): 945–957, 1995.
3. A. P. Zijdenbos and B. M. Dawant, Brain segmentation and white matter lesion detection in MR images, Crit. Rev. Biomed. Engineer., 22 (6): 401–465, 1994.
4. A. Worth, N. Makris, V. Caviness, and D. Kennedy, Neuroanatomical segmentation in MRI: technological objectives, Internat. J. Patt. Recog. Artificial Intell., 11: 1161–1187, 1997.
5. V. Khoo, D. P. Dearnaley, D. J. Finnigan, A. Padhani, S. F. Tanner, and M. O. Leach, Magnetic resonance imaging (MRI): Considerations and applications in radiotherapy treatment planning, Radiother. Oncol., 42: 1–15, 1997.
6. H. Muller-Gartner, J. Links, J. Prince, R. Bryan, E. McVeigh, J. P. Leal, C. Davatzikos, and J. Frost, Measurement of tracer concentration in brain gray matter using positron emission tomography: MRI-based correction for partial volume effects, J. Cerebral Blood Flow Metabol., 12 (4): 571–583, 1992.
7. N. Ayache, P. Cinquin, I. Cohen, L. Cohen, F. Leitner, and O. Monga, Segmentation of complex three-dimensional medical objects: A challenge and a requirement for computer-assisted surgery planning and performance, in R. Taylor, S. Lavallee, G. Burdea, and R. Mosges (eds.), Computer-Integrated Surgery, Cambridge, MA: The MIT Press, 1996, pp. 59–74.
8. W. E. L. Grimson, G. J. Ettinger, T. Kapur, M. E. Leventon, W. M. Wells, and R. Kikinis, Utilizing segmented MRI data in image-guided surgery, Internat. J. Patt. Recog. Artificial Intell., 11 (8): 1367–1397, 1997.
9. D. L. Pham, C. Xu, and J. L. Prince, Current methods in medical image segmentation, Annu. Rev. Biomed. Eng., 2: 315–337, 2000.
10. J. S. Duncan and N. Ayache, Medical image analysis: Progress over two decades and the challenges ahead, IEEE Trans. Pattern Anal. and Machine Intell., 22 (1): 85–106, 2000.
11. P. Perona and J. Malik, Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. and Machine Intell., 12: 629–639, 1990.
12. R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed., Upper Saddle River, NJ: Prentice Hall, 2002.
13. T. Acharya and A. K. Ray, Image Processing: Principles and Applications, New York: Wiley-Interscience, 2005.
14. K. Zuiderveld, Contrast limited adaptive histogram equalization, in P. Heckbert (ed.), Graphic Gems IV, New York: Academic Press, 1994.
15. T. McInerney and D. Terzopoulos, Deformable models in medical image analysis: A survey, Med. Image Anal., 1 (2): 91–108, 1996.
16. A. Dhawan, Medical Image Analysis, New York: Wiley, 2003.
17. J. Suri, K. Liu, S. Singh, S. Laxminarayana, and L. Reden, Shape recovery algorithms using level sets in 2-D/3-D medical imagery: A state-of-the-art review, IEEE Trans. Inform. Technol. Biomed., 6: 8–28, 2002.
18. P. Sahoo, S. Soltani, and A. Wong, A survey of thresholding techniques, Comput. Vision Graph. Image Process., 42 (2): 233–260, 1988.
19. M. Sezgin and B. Sankur, Survey over image thresholding techniques and quantitative performance evaluation, J. Electron. Imaging, 13 (1): 146–168, 2004.
20. N. Otsu, A threshold selection method from gray level histograms, IEEE Trans. Sys., Man Cybernet., 9: 62–66, 1979.
21. R. Adams and L. Bischof, Seeded region growing, IEEE Trans. Pattern Anal. Mach. Intell., 16 (6): 641–647, 1994.
22. L. Ibanez, W. Schroeder, L. Ng, and J. Cates, The ITK Software Guide, Kitware Inc., 2003. Available: http://www.itk.org.
23. P. Gibbs, D. Buckley, S. Blackband, and A. Horsman, Tumour volume determination from MR images by morphological segmentation, Phys. Med. Biol., 41: 2437–2446, 1996.
24. S. Pohlman, K. Powell, N. Obuchowski, W. Chilcote, and S. Broniatowski, Quantitative classification of breast tumors in digitized mammograms, Med. Phys., 23: 1337–1345, 1996.
25. R. Ohlander, K. Price, and D. Reddy, Picture segmentation using recursive region splitting method, Comput. Graph. Image Proc., 8: 313–333, 1978.
26. S.-Y. Chen, W.-C. Lin, and C.-T. Chen, Split-and-merge image segmentation based on localized feature analysis and statistical tests, CVGIP: Graph. Models Image Process., 53 (5): 457–475, 1991.
27. M. Sonka and J. M. Fitzpatrick (eds.), Handbook of Medical Imaging, vol. 2, ser. Medical Image Processing and Analysis, Bellingham, WA: SPIE Press, 2000.
28. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Upper Saddle River, NJ: Prentice-Hall, 1988.
29. S. Z. Selim and M. Ismail, K-means-type algorithms: A generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. and Machine Intell., 6: 81–87, 1984.
30. M. Singh, P. Patel, D. Khosla, and T. Kim, Segmentation of functional MRI by k-means clustering, IEEE Trans. Nucl. Sci., 43: 2030–2036, 1996.
31. A. P. Dhawan and L. Arata, Knowledge-based 3-D analysis from 2-D medical images, IEEE Eng. Med. Biol. Mag., 10: 30–37, 1991.
32. L. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press, 1981.
33. D. L. Pham and J. L. Prince, Adaptive fuzzy segmentation of magnetic resonance images, IEEE Trans. Med. Imag., 18: 193–199, 1999.
34. A. O. Boudraa, S. M. Dehak, Y. M. Zhu, C. Pachai, Y. G. Bao, and J. Grimaud, Automated segmentation of multiple sclerosis lesions in multispectral MR imaging using fuzzy clustering, Comput. Biol. Med., 30: 23–40, 2000.
35. M. N. Ahmed, S. M. Yamany, N. Mohamed, A. A. Farag, and T. Morianty, A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data, IEEE Trans. Medical Imaging, 21: 193–199, 2002.
36. C. Zhu and T. Jiang, Multicontext fuzzy clustering for separation of brain tissues in magnetic resonance images, Neuroimage, 18: 685–696, 2003.
37. P. Spyridonos, P. Ravazoula, D. Cavouras, K. Berberidis, and G. Nikiforidis, Computer-based grading of haematoxylin-eosin stained tissue sections of urinary bladder carcinomas, Med. Inform. Internet Med., 26: 179–190, 2001.
38. F. Chabat, G.-Z. Yang, and D. M. Hansell, Obstructive lung diseases: texture classification for differentiation at CT, Radiology, 228: 871–877, 2003.
39. K. Jafari-Khouzani and H. Soltanian-Zadeh, Multiwavelet grading of pathological images of prostate, IEEE Trans. Biomed. Eng., 50: 697–704, 2003.
40. C. I. Christodoulou, C. S. Pattichis, M. Pantziaris, and A. Nicolaides, Texture-based classification of atherosclerotic carotid plaques, IEEE Trans. Med. Imag., 22: 902–912, 2003.
41. S. D. Pathak, P. D. Grimm, V. Chalana, and Y. Kim, Pubic arch detection in transrectal ultrasound guided prostate cancer therapy, IEEE Trans. Med. Imag., 17: 762–771, 1998.
42. T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, Active shape models: their training and application, Comput. Vision Image Understand., 61 (1): 38–59, 1995.
43. T. F. Cootes, C. Beeston, G. J. Edwards, and C. J. Taylor, A unified framework for atlas matching using active appearance models, Proc. Int. Conf. on Image Processing in Medical Imaging, 1999, pp. 322–333.
44. T. F. Cootes and C. J. Taylor, Statistical models of appearance for medical image analysis and computer vision, SPIE Med. Imag., San Diego, CA, 2001, pp. 236–248.
45. J. Xie, Y. Jiang, and H. T. Tsui, Segmentation of kidney from ultrasound images based on texture and shape priors, IEEE Trans. Med. Imag., 24 (1): 45–57, 2005.
46. P. Yan and A. A. Kassim, Medical image segmentation using minimal path deformable models with implicit shape priors, IEEE Trans. Inform. Technol. Biomed., 10 (4): 677–684, 2006.
47. M. Kass, A. Witkin, and D. Terzopoulos, Snakes: Active contour models, Int. J. Comp. Vision, 1 (4): 321–331, 1987.
48. C. Xu and J. L. Prince, Snakes, shapes, and gradient vector flow, IEEE Trans. Image Process., 7: 359–369, 1998.
49. V. Caselles, R. Kimmel, and G. Sapiro, Geodesic active contours, Int. J. Comp. Vision, 22 (1): 61–79, 1997.
50. S. Kichenassamy, A. Kumar, P. J. Olver, A. Tannenbaum, and A. J. Yezzi, Gradient flows and geometric active contour models, in IEEE Int. Conf. Computer Vision, Cambridge, MA, 1995, pp. 810–815.
51. L. D. Cohen, On active contour models and balloons, CVGIP: Image Understand., 53 (2): 211–218, 1991.
52. L. H. Staib and J. S. Duncan, Parametrically deformable contour models, Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Diego, CA, 1989, pp. 98–103.
53. J. W. Snell, M. B. Merickel, J. M. Ortega, J. C. Goble, J. R. Brookeman, and N. F. Kassell, Model-based boundary estimation of complex objects using hierarchical active surface templates, Patt. Recog., 28 (10): 1599–1609, 1995.
54. I. Cohen, L. Cohen, and N. Ayache, Using deformable surfaces to segment 3D images and infer differential structures, CVGIP: Image Understand., 56 (2): 242–263, 1992.
55. L. H. Staib and J. S. Duncan, Model-based deformable surface finding for medical images, IEEE Trans. Med. Imag., 15 (5): 720–731, 1996.
56. S. Osher and J. A. Sethian, Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations, J. Computational Physics, 79: 12–49, 1988.
57. J. A. Sethian, Level Set Methods and Fast Marching Methods, 2nd ed., New York: Cambridge University Press, 1999.
58. V. Caselles, F. Catte, T. Coll, and F. Dibos, A geometric model for active contours, Numerische Mathematik, 66: 1–31, 1993.
59. R. Malladi, J. A. Sethian, and B. C. Vermuri, Shape modeling with front propagation: A level set approach, IEEE Trans. Pattern Anal. Mach. Intell., 17 (2): 158–174, 1995.
60. J. S. Suri, K. Liu, L. Reden, and S. Laxminarayan, A review on MR vascular image processing: skeleton versus nonskeleton approaches: part II, IEEE Trans. Inform. Technol. Biomed., 6 (4): 338–350, 2002.
61. T. F. Chan and L. A. Vese, Active contours without edges, IEEE Trans. Image Proc., 10 (2): 266–277, 2001.
62. P. Yan and A. A. Kassim, MRA image segmentation with capillary active contours, Proc. Medical Image Computing and Computer-Assisted Intervention, Palm Springs, CA, 2005, pp. 51–58.
63. P. Yan and A. A. Kassim, MRA image segmentation with capillary active contours, Med. Image Anal., 10 (3): 317–329, 2006.
64. L. G. Brown, A survey of image registration techniques, ACM Comput. Surveys, 24 (4): 325–376, 1992.
65. J. B. Antoine Maintz and M. A. Viergever, A survey of medical image registration, Med. Image Anal., 2 (1): 1–36, 1998.
66. D. L. G. Hill, P. G. Batchelor, M. Holden, and D. J. Hawkes, Medical image registration, Phys. Med. Biol., 46: 1–45, 2001.
67. W. M. Wells, W. E. L. Grimson, R. Kikinis, and F. A. Jolesz, Adaptive segmentation of MRI data, IEEE Trans. Med. Imag., 15 (4): 429–442, 1996.
68. F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, Multimodality image registration by maximization of mutual information, IEEE Trans. Med. Imag., 16 (2): 187–198, 1997.
69. Y. Shimada, K. Uemura, B. A. Ardekani, T. Nagakota, K. Ishiwata, H. Toyama, K. Ono, and M. Senda, Application of PET-MRI registration techniques to cat brain imaging, J. Neurosci. Methods, 101: 1–7, 2000.
70. D. L. Hill, D. J. Hawkes, M. J. Gleason, T. C. Cox, A. J. Strang, and W. L. Wong, Accurate frameless registration of MR and CT images of the head: Applications in planning surgery and radiation therapy, Radiology, 191: 447–454, 1994.
71. B. Zitova and J. Flusser, Image registration methods: a survey, Image Vis. Comput., 21: 977–1000, 2003.
72. M. R. Kaus, S. K. Warfield, A. Nabavi, P. M. Black, F. A. Jolesz, and R. Kikinis, Automated segmentation of MR images of brain tumors, Radiology, 218: 586–591, 2001.
73. [Online]. Available: http://www.loni.ucla.edu/ICBM/.
74. B. Fischl, D. H. Salat, E. Busa, M. Albert, M. Dieterich, C. Haselgrove, et al., Whole brain segmentation: Automated labeling of neuroanatomical structures in the human brain, Neuron, 33 (3): 341–355, 2002.
75. M. B. Cuadra, L. Cammoun, T. Butz, O. Cuisenaire, and J.-P. Thiran, Validation of tissue modelization and classification techniques in T1-weighted MR brain images, IEEE Trans. Med. Imag., 24: 1548–1565, 2005.
76. K. Wong, H. Liu, A. J. Sinusas, and P. Shi, Multiframe nonrigid motion analysis with anisotropic spatial constraints: Applications to cardiac image analysis, Internat. Conf. Image Proc., 1: 131–134, 2004.
77. A. Mohamed, D. Shen, and C. Davatzikos, Deformable registration of brain tumor images via a statistical model of tumor-induced deformation, Proc. Medical Image Computing and Computer-Assisted Intervention, 2: 263–270, 2005.
78. H. Jiang, Y. Xu, N. Iftimia, L. Baron, and J. Eggert, Three-dimensional optical tomographic imaging of breast in a human subject, IEEE Trans. Med. Imag., 20 (12): 1334–1340, 2001.
79. J. Jossinet, E. Marry, and A. Montalibet, Inverse impedance tomography: Imaging tissue from inside, IEEE Trans. Med. Imag., 21 (6): 560–565, 2002.
ASHRAF KASSIM
PINGKUN YAN
National University of Singapore
Singapore
SCALE-SPACE
THE NEED FOR MULTI-SCALE REPRESENTATION
OF IMAGE DATA
An inherent property of real-world objects is that they exist as meaningful entities only over certain ranges of scale. A simple example is the concept of a branch of a tree, which makes sense only at a scale from, say, a few centimeters to at most a few meters; it is meaningless to discuss the tree concept at the nanometer or kilometer level. At those scales, it is more relevant to talk about the molecules that form the leaves of the tree or the forest in which the tree grows. When observing such real-world objects with a camera or an eye, an additional scale problem exists because of perspective effects. A nearby object will appear larger in the image space than a distant object, although the two objects may have the same size in the world. These facts, that objects in the world appear in different ways depending on the scale of observation and in addition may undergo scale changes during an imaging process, have important implications if one aims to describe them. They show that the notion of scale is fundamental to understanding both natural and artificial perception.
In computer vision and image analysis, the notion of scale is essential to designing methods for deriving information from images and multidimensional signals. To extract any information from image data, one obviously must interact with the data in some way, using some operator or measurement probe. The type of information that can be obtained is largely determined by the relationship between the size of the actual structures in the data and the size (resolution) of the operators (probes). Some very fundamental problems in computer vision and image processing concern what operators to use, where to apply them, and how large they should be. If these problems are not addressed appropriately, then the task of interpreting the operator responses can be very hard. Notably, the scale information required to view the image data at an appropriate scale may in many cases be a priori unknown.
The idea behind a scale-space representation of image data is that, in the absence of any prior information about what scales are appropriate for a given visual task, the only reasonable approach is to represent the data at multiple scales. Taken to the limit, a scale-space representation furthermore considers representations at all scales simultaneously. Thus, given any input image, this image is embedded into a one-parameter family of derived signals in which fine-scale structures are progressively suppressed. When constructing such a multiscale representation, a crucial requirement is that the coarse-scale representations should constitute simplifications of corresponding structures at finer scales; they should not be accidental phenomena created by the smoothing method intended to suppress fine-scale structures. This idea has been formalized in a variety of ways by different authors, and a noteworthy coincidence is that similar conclusions can be obtained from several different starting points. A fundamental result of scale-space theory is that if general conditions are imposed on the types of computations that are to be performed in the earliest stages of visual processing, then convolution by the Gaussian kernel and its derivatives provides a canonical class of image operators with unique properties. The requirements (scale-space axioms; see below) that specify the uniqueness essentially are linearity and spatial shift invariance, combined with different ways of formalizing the notion that new structures should not be created in the transformation from fine to coarse scales.
In summary, for any two-dimensional (2-D) signal f : \mathbb{R}^2 \to \mathbb{R}, its scale-space representation L : \mathbb{R}^2 \times \mathbb{R}_+ \to \mathbb{R} is defined by (1–6)

  L(x, y; t) = \int_{(\xi,\eta) \in \mathbb{R}^2} f(x - \xi, y - \eta)\, g(\xi, \eta; t)\, d\xi\, d\eta    (1)

where g : \mathbb{R}^2 \times \mathbb{R}_+ \to \mathbb{R} denotes the Gaussian kernel. Commonly used differential invariants, expressed in terms of the Gaussian derivatives of L, include the squared gradient magnitude, the Laplacian, the determinant of the Hessian, and the rescaled level curve curvature:

  |\nabla L|^2 = L_x^2 + L_y^2,
  \nabla^2 L = L_{xx} + L_{yy},
  \det \mathcal{H} L = L_{xx} L_{yy} - L_{xy}^2,
  \tilde{\kappa} L = L_x^2 L_{yy} + L_y^2 L_{xx} - 2 L_x L_y L_{xy}    (7)
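In practice, a sampled scale-space representation is usually computed by convolving the image with Gaussian kernels of increasing standard deviation sigma = sqrt(t), as in the following SciPy-based sketch. This is a straightforward illustration of Equation (1); the list of scale levels is chosen arbitrarily.

import numpy as np
from scipy import ndimage

def gaussian_scale_space(image, scales=(1.0, 4.0, 16.0, 64.0)):
    # One smoothed copy of the image per scale level t (variance of the Gaussian).
    return {t: ndimage.gaussian_filter(image.astype(float), sigma=np.sqrt(t))
            for t in scales}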
A theoretically well-founded approach to feature detection is to use rotationally variant descriptors such as the N-jet, directional filter banks, or rotationally invariant differential invariants as primitives for expressing visual modules. For example, with v denoting the gradient direction (L_x, L_y)^T, a differential geometric formulation of edge detection at a given scale can be expressed from the image points for which the second-order directional derivative in the gradient direction, L_{vv}, is zero and the third-order directional derivative L_{vvv} is negative:

  L_{vv} = L_x^2 L_{xx} + 2 L_x L_y L_{xy} + L_y^2 L_{yy} = 0,
  L_{vvv} = L_x^3 L_{xxx} + 3 L_x^2 L_y L_{xxy} + 3 L_x L_y^2 L_{xyy} + L_y^3 L_{yyy} < 0    (8)
A single-scale blob detector that responds to bright and dark blobs can be expressed from the minima and the maxima of the Laplacian response \nabla^2 L. An affine covariant blob detector that also responds to saddles can be expressed from the maxima and the minima of the determinant of the Hessian \det \mathcal{H} L. A straightforward and affine covariant corner detector can be expressed from the maxima and minima of the rescaled level curve curvature \tilde{\kappa} L. With p denoting the main eigendirection of the Hessian matrix, which is parallel to the vector
which is parallel to the vector
cos w; sin we
1
L
xx
L
yy
L
xx
L
yy
2
4L
2
xy
q
v
u
u
t ;
0
B
@ signL
xy
1
L
xx
L
yy
L
xx
L
yy
2
4L
2
xy
q
v
u
u
u
t
1
C
A 9
Figure 1. (top left) A gray-level image of size 560420 pixels, (top right)(bottom right) Scale-space representations computed at scale
levels t j, 8, and 64 (in pixel units).
a differential geometric ridge detector at a fixed scale can be expressed from the zero-crossings of the first derivative L_p in this direction for which the second-order directional derivative L_{pp} is negative and, in addition, |L_{pp}| \ge |L_{qq}|. Similarly, valleys can be extracted from the zero-crossings of L_q that satisfy L_{qq} \ge 0 and |L_{qq}| \ge |L_{pp}|.
FEATURE CLASSIFICATION AND IMAGE MATCHING FROM THE MULTISCALE N-JET
By combining N-jet representations at multiple scales, usually with the scale levels distributed by ratios of two when the scale parameter is measured in units of the standard deviation \sigma = \sqrt{t} of the Gaussian, we obtain a multiscale N-jet vector. This descriptor is useful for a variety of different tasks. For example, the task of texture classification can be expressed as a classification and/or clustering problem on the multiscale N-jet over regions in the image (9). Methods for stereo matching can be formulated in terms of comparisons of local N-jets (10), either in terms of explicit search or by coarse-to-fine schemes based on differential corrections within the support region of the multiscale N-jet. Moreover, straightforward (rotationally dependent) methods for image-based object recognition can be expressed in terms of vectors or (global or regional) histograms of multiscale N-jets (11–14). Methods for rotationally invariant image-based recognition can be formulated in terms of histograms of multiscale vectors of differential invariants.
WINDOWED IMAGE DESCRIPTORS WITH TWO SCALE
PARAMETERS
The image descriptors considered so far are all dependent on a single scale parameter t. For certain problems, it is useful to introduce image descriptors that depend on two scale parameters. One such descriptor is the second-moment matrix (structure tensor) defined as

  \mu(x, y; t, s) = \int_{(\xi,\eta) \in \mathbb{R}^2} \begin{pmatrix} L_x^2(\xi, \eta; t) & L_x(\xi, \eta; t)\, L_y(\xi, \eta; t) \\ L_x(\xi, \eta; t)\, L_y(\xi, \eta; t) & L_y^2(\xi, \eta; t) \end{pmatrix} g(x - \xi, y - \eta; s)\, d\xi\, d\eta    (10)

where t is a local scale parameter that describes the scale of differentiation, and s is an integration scale parameter that describes the extent over which local statistics of derivatives are accumulated. In principle, the formulation of this descriptor implies a two-parameter variation. In many practical applications, however, it is common practice to couple the two scale parameters by a constant factor C such that s = C t, where C > 1.
One common application of this descriptor is a multiscale version of the Harris corner detector (15), detecting positive spatial maxima of the entity

  H = \det \mu - k\, \operatorname{trace}^2 \mu    (11)

where k \approx 0.04 is a constant. Another common application is for affine normalization/shape adaptation and will be developed below.
The eigenvalues

  \lambda_{1,2} = \frac{1}{2}\left( \mu_{11} + \mu_{22} \pm \sqrt{(\mu_{11} - \mu_{22})^2 + 4 \mu_{12}^2} \right)

and the orientation \arg(\mu_{11} - \mu_{22},\ 2\mu_{12}) of \mu are also useful for texture segmentation and for texture classification.

Figure 3. Edge detection: (left) A gray-level image of size 180 x 180 pixels. (middle) The negative value of the gradient magnitude |\nabla L| computed at t = 1. (right) Differential geometric edges at t = 1 with a complementary low threshold \sqrt{t}\,|\nabla L| \ge 1 on the gradient magnitude.
Figure 2. The Gaussian kernel and its derivatives up to order two in the 2-D case.
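A sketch of the two-scale-parameter structure tensor and the Harris measure of Equation (11), using Gaussian derivatives at a local scale t and Gaussian integration at scale s (SciPy conventions; the scale values are illustrative and k = 0.04 as in the text):

import numpy as np
from scipy import ndimage

def harris_measure(image, t=2.0, s=8.0, k=0.04):
    img = image.astype(float)
    # First-order Gaussian derivatives at the local (differentiation) scale t.
    Lx = ndimage.gaussian_filter(img, sigma=np.sqrt(t), order=(0, 1))
    Ly = ndimage.gaussian_filter(img, sigma=np.sqrt(t), order=(1, 0))
    # Components of the second-moment matrix, integrated at scale s.
    mu11 = ndimage.gaussian_filter(Lx * Lx, sigma=np.sqrt(s))
    mu12 = ndimage.gaussian_filter(Lx * Ly, sigma=np.sqrt(s))
    mu22 = ndimage.gaussian_filter(Ly * Ly, sigma=np.sqrt(s))
    # Harris corner response: det(mu) - k * trace(mu)^2.
    return mu11 * mu22 - mu12 ** 2 - k * (mu11 + mu22) ** 2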
Other commonly used image descriptors that depend on an additional integration scale parameter include regional histograms obtained using Gaussian window functions as weights (16).
SCALE-SPACE REPRESENTATION OF COLOR IMAGES
The input images f considered so far have all been assumed to be scalar gray-level images. For a vision system, however, color images are often available, and the use of color cues can increase the robustness and discriminatory power of image operators. For the purpose of scale-space representation of color images, an initial red-green-blue-yellow color opponent transformation is often advantageous (17). Although smoothing the RGB channels independently does not necessarily constitute the best way to define a color scale-space, one can perform the following pretransformation prior to scale-space smoothing:

  I = (R + G + B)/3
  U = R - G
  V = B - (R + G)/2    (12)

Gaussian scale-space smoothing, Gaussian derivatives, and differential invariants can then be defined from the I, U, and V color channels separately. The luminance channel I will then mainly reflect the interaction between reflectance, illumination direction, and intensity, whereas the chromatic channels, U and V, will largely make the interaction between illumination color and surface pigmentation more explicit. This approach has been applied successfully to the tasks of image feature detection and image-based recognition based on the N-jet. For example, red-green and blue-yellow color opponent receptive fields of center-surround type can be obtained by applying the Laplacian \nabla^2 L to the chromatic U and V channels.
An alternative approach to handling colors in scale-space is provided by the Gaussian color model, in which the spectral energy distribution E(\lambda) over all wavelengths \lambda is approximated by the sum of a Gaussian function and first- and second-order derivatives of Gaussians (18). In this way, a 3-D color space is obtained, where the channels \hat{E}, \hat{E}_\lambda, and \hat{E}_{\lambda\lambda} correspond to a second-order Taylor expansion of a Gaussian-weighted spectral energy distribution around a specific wavelength \lambda_0, smoothed to a fixed spectral scale t_{\lambda_0}. In practice, this model can be implemented in different ways. For example, with \lambda_0 = 520 nm and t_{\lambda_0} = 55 nm, the Gaussian color model has been approximated by a linear color space transformation (19) that maps the RGB values to the channels (\hat{E}, \hat{E}_\lambda, \hat{E}_{\lambda\lambda})^T.
Figure 4. Differential descriptors for blob detection/interest point detection: (left) A gray-level image of size 210 x 280 pixels. (middle) The Laplacian \nabla^2 L computed at t = 16. (right) The determinant of the Hessian \det \mathcal{H} L computed at t = 16.
Figure 5. Fixed-scale valley detection: (left) A gray-level image of size 180 x 180 pixels. (middle) The negative value of the valley strength measure L_{qq} computed at t = 4. (right) Differential geometric valleys detected at t = 4 using a complementary low threshold on t\,|L_{qq}| \ge 1 and then overlaid on a bright copy of the original gray-level image.
The rotationally symmetric Gaussian kernel also generalizes to an affine Gaussian kernel with covariance matrix \Sigma_t:

  g(x; \Sigma_t) = \frac{1}{2\pi \sqrt{\det \Sigma_t}}\, e^{-x^T \Sigma_t^{-1} x / 2}    (14)
where x = (x, y)^T. This affine scale-space combined with directional derivatives can serve as a model for oriented elongated filter banks. Consider two input images f and f' that are related by an affine transformation x' = A x such that f'(A x) = f(x). Then, the affine scale-space representations L and L' of f and f' are related according to L'(x'; \Sigma') = L(x; \Sigma) where \Sigma' = A \Sigma A^T. A second-moment matrix [Equation (10)] defined from an affine scale-space with covariance matrices \Sigma_t and \Sigma_s transforms according to

  \mu'(A x; A \Sigma_t A^T, A \Sigma_s A^T) = A^{-T}\, \mu(x; \Sigma_t, \Sigma_s)\, A^{-1}    (15)
If we can determine covariance matrices \Sigma_t and \Sigma_s such that \mu(x; \Sigma_t, \Sigma_s) = c_1 \Sigma_t^{-1} = c_2 \Sigma_s^{-1} for some constants c_1 and c_2, we obtain a fixed point that is preserved under affine transformations. This property has been used to express affine invariant interest point operators, affine invariant stereo matching, as well as affine invariant texture segmentation and texture recognition methods (21–25). In practice, affine invariance at a given image point can (up to an unknown scale factor and a free rotation angle) be accomplished by shape adaptation, that is, by estimating the second-moment matrix \mu using a rotationally symmetric scale-space and then iteratively choosing the covariance matrices \Sigma_t and \Sigma_s proportional to \mu^{-1} until the fixed point has been reached, or, equivalently, by warping the input image by linear transformations proportional to A = \mu^{1/2} until the second-moment matrix is sufficiently close to a constant times the unit matrix.
AUTOMATIC SCALE SELECTION AND SCALE INVARIANT
IMAGE DESCRIPTORS
Although the scale-space theory presented so far provides a well-founded framework for expressing visual operations at multiple scales, it does not address the problem of how to select locally appropriate scales for further analysis. Whereas the problem of finding the best scales for handling a given data set may be regarded as intractable unless more information is available, in many situations a mechanism is required to generate hypotheses about interesting scales for further analysis. Specifically, because the size of and the distance to objects may vary in real-life applications, a need exists to define scale invariant image descriptors. A general methodology (28) for generating hypotheses about interesting scale levels is to study the evolution properties over scales of (possibly nonlinear) combinations of \gamma-normalized derivatives defined by

  \partial_\xi = t^{\gamma/2}\, \partial_x  and  \partial_\eta = t^{\gamma/2}\, \partial_y    (16)

where \gamma is a free parameter to be determined for the task at hand. Specifically, scale levels can be selected from the scales at which \gamma-normalized derivative expressions assume local extrema with respect to scale. A general rationale for this statement is that under a scaling transformation (x', y') = (s x, s y) for some scaling factor s with f'(s x, s y) = f(x, y), it follows that for matching scale levels t' = s^2 t the m:th order normalized derivatives at corresponding scales transform according to

  L'_{x'^m}(x', y'; t') = s^{m(\gamma - 1)}\, L_{x^m}(x, y; t)    (17)
Hence, for any differential expression that can be expressed as a homogeneous polynomial in terms of \gamma-normalized derivatives, it follows that local extrema over scales will be preserved under scaling transformations. In other words, if a \gamma-normalized differential entity D_{norm} L assumes an extremum over scales at the point (x_0, y_0; t_0), then the corresponding extremum in the rescaled image is assumed at (s x_0, s y_0; s^2 t_0).
One motivation for using the scale-normalized Laplacian for scale selection follows from the diffusion equation \partial_t L = \tfrac{1}{2} \nabla^2 L satisfied by the scale-space representation, which upon integration over scale gives

  \int_{t = t_0}^{\infty} \partial_t L(x, y; t)\, dt = \frac{1}{2} \int_{t = t_0}^{\infty} \nabla^2 L(x, y; t)\, dt    (19)

After a reparameterization of the scale parameter into effective scale \tau = \log t, we obtain

  L(x, y; t_0) = -\frac{1}{2} \int_{t = t_0}^{\infty} t\, \nabla^2 L(x, y; t)\, \frac{dt}{t} = -\frac{1}{2} \int_{\tau = \log t_0}^{\infty} \nabla^2_{norm} L(x, y; t)\, d\tau    (20)
By detecting the scale at which the normalized Laplacian assumes its positive maximum or negative minimum over scales, we determine the scale for which the image, in a scale-normalized bandpass sense, contains the maximum amount of information.
Another motivation for using the scale-normalized Laplacian operator for early scale selection is that it serves as an excellent blob detector. For any (possibly nonlinear) \gamma-normalized differential expression D_{norm} L, let us first define a scale-space maximum (minimum) as a point for which the \gamma-normalized differential expression assumes a maximum (minimum) over both space and scale. Then, we obtain a straightforward blob detector with automatic scale selection, which responds to dark and bright blobs, from the scale-space maxima and the scale-space minima of \nabla^2_{norm} L. Another blob detector with automatic scale selection that also responds to saddles can be defined from the scale-space maxima and minima of the scale-normalized determinant of the Hessian (26):

  \det \mathcal{H}_{norm} L = t^2 (L_{xx} L_{yy} - L_{xy}^2)    (21)
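A minimal sketch of Laplacian blob detection with automatic scale selection: the scale-normalized Laplacian t * Laplacian(L) (i.e., gamma = 1) is computed over a set of scales, and for each pixel the scale giving the strongest-magnitude normalized response is selected. The candidate scale levels and the use of scipy.ndimage.gaussian_laplace are illustrative choices.

import numpy as np
from scipy import ndimage

def scale_normalized_laplacian_stack(image, scales=(2.0, 4.0, 8.0, 16.0, 32.0)):
    img = image.astype(float)
    # Scale-normalized Laplacian t * (Lxx + Lyy) at each scale t.
    stack = np.stack([t * ndimage.gaussian_laplace(img, sigma=np.sqrt(t))
                      for t in scales])
    # For each pixel, the scale whose normalized response has the largest magnitude.
    best = np.abs(stack).argmax(axis=0)
    selected_scales = np.take(np.array(scales), best)
    return stack, selected_scales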
For both scale-invariant blob detectors, the selected scale reflects the size of the blob. For the purpose of scale invariance, it is, however, not necessary to use the same entity to determine spatial interest points as to determine interesting scales for those points. An alternative approach to scale-invariant interest point detection is to use the Harris operator in Equation (11) to determine spatial interest points and the scale-normalized Laplacian for determining the scales at these points (23). Affine covariant interest points can in turn be obtained by combining any of these three interest point operators with subsequent affine shape adaptation following Equation (15), combined with a determination of the remaining free rotation angle using, for example, the orientation of the image gradient.
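A minimal sketch of one of these blob detectors with automatic scale selection, under the assumption that SciPy is available (the helper names and the threshold value are illustrative, not taken from the original article): compute the scale-normalized determinant of the Hessian over a stack of scales and keep the points that are local maxima over both space and scale.

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def det_hessian_norm(image, t):
    """Scale-normalized determinant of the Hessian: t^2 (Lxx*Lyy - Lxy^2)."""
    s = np.sqrt(t)
    Lxx = gaussian_filter(image, s, order=(0, 2))   # second derivative along x
    Lyy = gaussian_filter(image, s, order=(2, 0))   # second derivative along y
    Lxy = gaussian_filter(image, s, order=(1, 1))   # mixed derivative
    return t ** 2 * (Lxx * Lyy - Lxy ** 2)

def detect_blobs(image, scales, threshold=1e-4):
    """Return (y, x, t) triples that are 3-D local maxima of det H_norm L."""
    stack = np.stack([det_hessian_norm(image, t) for t in scales])  # (scale, y, x)
    is_max = (stack == maximum_filter(stack, size=3)) & (stack > threshold)
    return [(int(y), int(x), float(scales[k])) for k, y, x in np.argwhere(is_max)]

# Example usage (img as in the previous snippet):
# blobs = detect_blobs(img, scales=np.geomspace(2.0, 64.0, 16))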
The free parameter γ in the γ-normalized derivative concept can be related to the dimensionality of the type of image features to be detected or to a normalization of the Gaussian derivative kernels to constant L_p norm over scales. For blob-like image descriptors, such as the interest points described above, γ = 1 is a good choice and corresponds to L₁ normalization. For other types of image structures, such as thin edges, elongated ridges, or rounded corners, other values are preferred. For example, for the problem of edge detection, γ = 1/2 is a useful choice to capture the width of diffuse edges, whereas for the purpose of ridge detection, γ = 3/4 is preferable for tuning the scale levels to the width of an elongated ridge (28–30).
SCALE-SPACE AXIOMS
Besides its practical use for computer vision problems, the scale-space representation satisfies several theoretic properties that define it as a unique form of multiscale image representation: The linear scale-space representation is obtained from linear and shift-invariant transformations. Moreover, the Gaussian kernels are positive and satisfy the semigroup property

g(·, ·; t₁) ∗ g(·, ·; t₂) = g(·, ·; t₁ + t₂)   (22)

which implies that any coarse-scale representation can be computed from any fine-scale representation using a similar transformation as in the transformation from the original image

L(·, ·; t₂) = g(·, ·; t₂ − t₁) ∗ L(·, ·; t₁)   (23)
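This cascade property can be checked numerically; the following sketch (assuming NumPy and SciPy, not part of the original text) smooths a random image with variances t₁ and then t₂ and compares the result with a single smoothing at variance t₁ + t₂:

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
f = rng.random((256, 256))

t1, t2 = 4.0, 9.0
L_cascade = gaussian_filter(gaussian_filter(f, np.sqrt(t1)), np.sqrt(t2))
L_direct = gaussian_filter(f, np.sqrt(t1 + t2))

# The two results agree up to small errors from sampling and truncating the kernels
print(np.max(np.abs(L_cascade - L_direct)))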
In one dimension, Gaussian smoothing implies that new local extrema or new zero-crossings cannot be created with increasing scales (1,3). In 2-D and higher dimensions, the scale-space representation obeys nonenhancement of local extrema (causality), which implies that the value at a local maximum is guaranteed not to increase, while the value at a local minimum is guaranteed not to decrease (2,3,31). The regular linear scale-space is closed under translations, rotations, and scaling transformations. In fact, it can be shown that Gaussian smoothing arises uniquely for different subsets of combinations of these special and highly useful properties. For partial views of the history of scale-space axiomatics, please refer to Refs. 31 and 32 and the references therein.
Concerning the topic of automatic scale selection, it can be shown that the notion of γ-normalized derivatives in Equation (16) arises by necessity from the requirement that local extrema over scales should be preserved under scaling transformations (26).
RELATIONS TO BIOLOGIC VISION
Interestingly, the results of this computationally motivated analysis of early visual operations are in qualitative agreement with current knowledge about biologic vision. Neurophysiologic studies have shown that receptive field profiles exist in the mammalian retina and the visual cortex that can be well modeled by Gaussian derivative operators (33,34).
Figure 8. Scale-invariant feature detection: (top) Original gray-level image. (bottom left) The 1000 strongest scale-space extrema of the scale-normalized Laplacian ∇²_norm L. (bottom right) The 1000 strongest scale-space extrema of the scale-normalized determinant of the Hessian det H_norm L. Each feature is displayed by a circle with its size proportional to the detection scale. In addition, the color of the circle indicates the type of image feature: red for dark features, blue for bright features, and green for saddle-like features.
Figure 9. Affine covariant image features: (left) Original gray-level image. (right) The result of applying affine shape adaptation to the 500 strongest scale-space extrema of the scale-normalized determinant of the Hessian det H_norm L (resulting in 384 features for which the iterative scheme converged). Each feature is displayed by an ellipse with its size proportional to the detection scale and the shape determined from a linear transformation A determined from a second-moment matrix μ. In addition, the color of the ellipse indicates the type of image feature: red for dark features, blue for bright features, and green for saddle-like features.
SUMMARY AND OUTLOOK
Scale-space theory provides a well-founded framework for
modeling image structures at multiple scales, and the out-
put from the scale-space representation can be used as
input to a large variety of visual modules. Visual operations
such as feature detection, feature classification, stereo matching, motion estimation, shape cues, and image-based
recognition can be expressed directly in terms of (possibly
nonlinear) combinations of Gaussian derivatives at multi-
ple scales. In this sense, scale-space representation can
serve as a basis for early vision.
The set of early uncommitted operations in a vision
system, which perform scale-space smoothing, compute
Gaussian derivatives at multiple scales, and combine these
into differential invariants or other types of general pur-
pose features to be used as input to later stage visual
processes, is often referred to as a visual front-end.
Pyramid representation is a predecessor to scale-space representation, constructed by simultaneously smoothing and subsampling a given signal (35,36). In this way, computationally highly efficient algorithms can be obtained. A problem with pyramid representations, however, is that it is algorithmically harder to relate structures at different scales, due to the discrete nature of the scale levels. In a scale-space representation, the existence of a continuous scale parameter makes it conceptually much easier to express this deep structure. For features defined as zero-crossings of differential invariants, the implicit function theorem defines trajectories directly across scales, and at those scales where bifurcations occur, the local behavior can be modeled by singularity theory. Nevertheless, pyramids are frequently used to express computationally more efficient approximations to different scale-space algorithms.
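As a brief illustration of the smooth-and-subsample construction (a minimal sketch assuming SciPy; not a reproduction of the algorithms in Refs. 35 and 36), a Gaussian pyramid can be built as follows:

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, num_levels=5, sigma=1.0):
    """Return a fine-to-coarse list of images obtained by repeated
    Gaussian smoothing (to limit aliasing) and subsampling by a factor of two."""
    levels = [image]
    for _ in range(num_levels - 1):
        smoothed = gaussian_filter(levels[-1], sigma)
        levels.append(smoothed[::2, ::2])   # keep every second row and column
    return levels

pyr = gaussian_pyramid(np.random.rand(256, 256), num_levels=4)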
Extensions of linear scale-space theory concern the formulation of nonlinear scale-space concepts more committed to specific purposes (37,38). Strong relations exist between scale-space theory and wavelets, although these two notions of multiscale representation have been developed from somewhat different premises.
BIBLIOGRAPHY
1. A. P. Witkin, Scale-space filtering, Proc. 8th Int. Joint Conf. Art. Intell., 1983, pp. 1019–1022.
2. J. J. Koenderink, The structure of images, Biological Cybernetics, 50: 363–370, 1984.
3. T. Lindeberg, Scale-Space Theory in Computer Vision, Dordrecht: Kluwer/Springer, 1994.
4. J. Sporring, M. Nielsen, L. Florack, and P. Johansen, eds., Gaussian Scale-Space Theory: Proc. PhD School on Scale-Space Theory, Dordrecht: Kluwer/Springer, 1996.
5. L. M. J. Florack, Image Structure, Dordrecht: Kluwer/Springer, 1997.
6. B. t. H. Romeny, Front-End Vision and Multi-Scale Image Analysis, Dordrecht: Kluwer/Springer, 2003.
7. J. J. Koenderink and A. J. van Doorn, Representation of local geometry in the visual system, Biological Cybernetics, 55: 367–375, 1987.
8. J. J. Koenderink and A. J. van Doorn, Generic neighborhood operations, IEEE Trans. Pattern Anal. Machine Intell., 14(6): 597–605, 1992.
9. T. Leung and J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, Int. J. of Computer Vision, 43(1): 29–44, 2001.
10. D. G. Jones and J. Malik, A computational framework for determining stereo correspondences from a set of linear spatial filters, Proc. Eur. Conf. Comp. Vis., 1992, pp. 395–410.
11. C. Schmid and R. Mohr, Local grayvalue invariants for image retrieval, IEEE Trans. Pattern Anal. Machine Intell., 19(5): 530–535, 1997.
12. B. Schiele and J. Crowley, Recognition without correspondence using multidimensional receptive field histograms, Int. J. of Computer Vision, 36(1): 31–50, 2000.
13. D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. of Computer Vision, 60(2): 91–110, 2004.
14. H. Bay, T. Tuytelaars, and L. van Gool, SURF: speeded up robust features, Proc. European Conf. on Computer Vision, Springer LNCS 3951, I: 404–417, 2006.
15. C. Harris and M. Stephens, A combined corner and edge detector, Proc. Alvey Workshop on Visual Motion, 1988, pp. 156–162.
16. J. J. Koenderink and A. J. van Doorn, The structure of locally orderless images, Int. J. of Computer Vision, 31(2): 159–168, 1999.
17. D. Hall, V. de Verdiere, and J. Crowley, Object recognition using coloured receptive fields, Proc. European Conf. on Computer Vision, Springer LNCS 1842, I: 164–177, 2000.
18. J. J. Koenderink and A. Kappers, Colour space, unpublished lecture notes, Utrecht University, The Netherlands.
19. J. M. Geusebroek, R. van den Boomgaard, A. W. M. Smeulders, and A. Dev, Color and scale: The spatial structure of color images, Proc. European Conf. on Computer Vision, Springer LNCS 1842, I: 331–341, 2000.
20. J. M. Geusebroek, R. van den Boomgaard, A. W. M. Smeulders, and H. Geerts, Color invariance, IEEE Trans. Pattern Anal. Machine Intell., 23(12): 1338–1346, 2001.
21. T. Lindeberg and J. Garding, Shape-adapted smoothing in estimation of 3-D depth cues from affine distortions of local 2-D structure, Image and Vision Computing, 15: 415–434, 1997.
22. A. Baumberg, Reliable feature matching across widely separated views, Proc. Comp. Vision Patt. Recogn., I: 1774–1781, 2000.
23. K. Mikolajczyk and C. Schmid, Scale and affine invariant interest point detectors, Int. J. Comp. Vision, 60(1): 63–86, 2004.
24. F. Schaffalitzky and A. Zisserman, Viewpoint invariant texture matching and wide baseline stereo, Proc. Int. Conf. Comp. Vis., 2: 636–644, 2001.
25. S. Lazebnik, C. Schmid, and J. Ponce, Affine-invariant local descriptors and neighbourhood statistics for texture recognition, Proc. Int. Conf. Comp. Vis., I: 649–655, 2003.
26. T. Lindeberg, Feature detection with automatic scale selection, Int. J. of Computer Vision, 30(2): 77–116, 1998.
27. D. J. Field, Relations between the statistics of natural images and the response properties of cortical cells, J. Opt. Soc. Am. A, 4: 2379–2394, 1987.
28. T. Lindeberg, Edge detection and ridge detection with automatic scale selection, Int. J. of Computer Vision, 30(2): 117–154, 1998.
29. A. F. Frangi, W. J. Niessen, R. M. Hoogeveen, T. van Walsum, and M. A. Viergever, Model-based quantitation of 3-D magnetic resonance angiographic images, IEEE Trans. Med. Imaging, 18(10): 946–956, 1999.
30. K. Krissian, G. Malandain, N. Ayache, R. Vaillant, and Y. Trousset, Model-based detection of tubular structures in 3D images, Comput. Vis. Image Underst., 80(2): 130–171, 2000.
31. T. Lindeberg, On the axiomatic foundations of linear scale-space: Combining semi-group structure with causality vs. scale invariance, in J. Sporring et al. (eds.), Gaussian Scale-Space Theory, Dordrecht: Kluwer/Springer, 1997, pp. 75–98.
32. J. Weickert, Linear scale space has first been proposed in Japan, J. Math. Imaging and Vision, 10(3): 237–252, 1999.
33. R. A. Young, The Gaussian derivative model for spatial vision, Spatial Vision, 2: 273–293, 1987.
34. G. C. DeAngelis, I. Ohzawa, and R. D. Freeman, Receptive field dynamics in the central visual pathways, Trends Neurosc., 18(10): 451–457, 1995.
35. P. J. Burt and E. H. Adelson, The Laplacian pyramid as a compact image code, IEEE Trans. Comm., 31(4): 532–540, 1983.
36. J. Crowley and A. C. Parker, A representation for shape based on peaks and ridges in the difference of low-pass transform, IEEE Trans. Pattern Anal. Machine Intell., 6(2): 156–170, 1984.
37. B. t. H. Romeny, ed., Geometry-Driven Diffusion in Computer Vision, Dordrecht: Kluwer/Springer, 1997.
38. J. Weickert, Anisotropic Diffusion in Image Processing, Germany: Teubner-Verlag, 1998.
TONY LINDEBERG
KTH (Royal Institute
of Technology)
Stockholm, Sweden
Computing Milieux
BEHAVIORAL SCIENCES AND COMPUTING
This article presents an overview of behavioral science
research on human–computer interactions. The use of
high-speed digital computers in homes, schools, and the
workplace has been the impetus for thousands of research
studies in the behavioral sciences since the 1950s. As
computers have become an increasingly important part
of daily life, more studies in the behavioral sciences have
been directed at human–computer use. Research continues to proliferate, in part, because rapid technological advances continue to lead to the development of new products and applications from which emerge new forms of human–computer interactions. Examples include engaging in
social interactions through electronic mail, chat, and dis-
cussion groups; using commercial websites for shopping
and banking; using Internet resources and multimedia
curriculum packages to learn in schools and at home; using
handheld computers for work and personal life; collaborat-
ing in computer-supported shared workspaces; tele-
commuting via the Internet; engaging in one–many or many–many synchronous and asynchronous communica-
tions; and performing in virtual environments. Given the
sheer quantity of empirical investigations in behavioral
sciences computing research, the reader should appreciate
the highly selective nature of this article. Even the reading
list of current journals and books included at the end of
this article is highly selective.
We present behavioral science computing research
according to the following three categories: (1) antece-
dent-consequence effects, (2) model building, and (3) indi-
vidual-social perspectives. The first category, antecedent-consequence effects, asks questions such as the following: How
does variability in human abilities, traits, and prior per-
formance affect computer use? How does use of computers
affect variability in human abilities, traits, and subsequent
performance? The second category, model building, con-
sists of research on the nature of human abilities and
performance using metaphors from computer science and
related fields. Here, the behavioral scientist is primarily
interested in understanding the nature of human beings
but uses computer metaphors as a basis for describing and
explaining human behavior. Model building can also start
with assumptions about the nature of human beings, for
example, limitations on human attention or types of moti-
vation that serve as the basis for the development of new
products and applications for human use. In this case, the
behavioral scientist is mainly interested in product devel-
opment but may investigate actual use. Such data may
serve to modify the original assumptions about human
performance, which in turn lead to refinements in the
product. The third category, individual-social perspective,
investigates the effects of increased access to and accep-
tance of computers in everyday life on human social rela-
tions. Questions addressed here include the following:
Do computers serve to isolate or connect persons to one
another? What are the implications of lack of either access
or acceptance of computers in modern cultures? These
three categories of work in behavioral science computing
are not mutually exclusive, as the boundaries between any two of them are not fixed and firm.
ANTECEDENT-CONSEQUENCE RESEARCH
Personality
Research conducted since the 1970s has sought to
identify what type of person is likely to use computers
and related information technologies, succeed in learning
about these technologies and pursue careers that deal with
the development and testing of computer products. Most
recently a great deal of attention has been given to human
behavior on the Internet. Personality factors have been
shown to be relevant in defining Internet behavior. Studies
indicate that extroversion-introversion, shyness, anxiety,
and neuroticism are related to computer use. Extroverts
are outer directed, sociable, enjoy stimulation, and are
generally regarded to be people oriented in contrast to
introverts who are inner directed, reflective, quiet, and
socially reserved. The degree of extroversion-introversion
is related to many aspects of everyday life, including
vocational choice, performance in work groups, and inter-
personal functioning. Early studies suggested that heavy
computer users tended to be introverts, and programming
ability, in particular, was found to be associated with
introversion. Recent studies reveal less relationship
between introversion-extroversion and degree of computer
use or related factors such as computer anxiety, positive
attitudes toward computers, and programming aptitude or
achievement. However, the decision to pursue a career in
computer-related fields still shows some association with
introversion.
Neuroticism is a tendency to worry, to be anxious and
moody, and to evidence negative emotions and outlooks.
Studies of undergraduate students and of individuals using
computers in work settings have found that neuroticism is
associated with anxiety about computers and negative
attitudes toward computers. Neurotic individuals tend to
be low users of computers, with the exception of the Inter-
net, where there is a positive correlation between neuroti-
cism and some online behavior. Neurotic people are more
likely than others to engage in chat and discussion groups
and to seek addresses of other people online. It is possible
that this use of the Internet is mediated by loneliness, so
that as neurotic people alienate others through their nega-
tive behaviors, they begin to feel lonely and then seek
relationships online (1). Anxious people are less likely,
however, to use the Internet for information searches
and may find the many hyperlinks and obscure organization disturbing. Shyness, a specific type of anxiety related to social situations, makes it difficult for individuals to
interact with others and to create social ties. Shy people
are more likely to engage in social discourse online than they are offline and can form online relationships more easily. There is a danger that as they engage in social discourse online, shy people may become even less likely to interact with people offline. There has been some indication, however, that shy people who engage in online relationships may become less shy in their offline relationships (2).
As people spend time on the Internet instead of in the
real world, behavioral scientists are concerned that they
will become more isolated and will lose crucial social sup-
port. Several studies support this view, finding that Internet use is associated with reduced social networks, loneliness, and difficulties in the family and at work. Caplan and others suggest that these consequences may result from existing psychosocial conditions such as depression, low self-efficacy, and negative self-appraisals, which make some people susceptible to feelings of loneliness, guilt, and other negative outcomes, so that significant time spent online becomes problematic for them but not for others (3). There is some evidence of positive effects of Internet use: Involvement in chat sessions can decrease loneliness and depression in individuals while increasing their self-esteem, sense of belonging, and perceived availability of people in whom they could confide or who could provide material aid (4). However, Caplan cautions that the unnatural quality of online communication, with its increased anonymity and poor social cues, makes it a poor substitute for face-to-face relationships, and that this new context for communication is one that behavioral scientists are just beginning to understand.
As computer technologies become more integral to
many aspects of life, it is increasingly important to be
able to use them effectively. This is difficult for indi-
viduals who have anxiety about technology that makes
them excessively cautious near computers. Exposure to
computers and training in computer use can decrease
this anxiety, particularly if the training takes place in a
relaxed setting in small groups with a user-friendly inter-
face, provides both demonstrations and written instruc-
tions, and includes important learning strategies that help
to integrate new understanding with what is already
known (5). However, some individuals evidence such a
high degree of anxiety about computer use that they
have been termed computerphobics. Here the fear of
computers is intense and irrational, and exposure to com-
puters may cause distinct signs of agitation including
trembling, facial expressions of distress, and physical or
communicative withdrawal. In extreme cases, a general-
ized anxiety reaction to all forms of technology termed
technophobia has been observed. Personality styles differ
when individuals with such phobias are compared with
those who are simply uncomfortable with computer use.
Individuals with great anxiety about computers have
personality characteristics of low problem-solving persis-
tence and unwillingness to seek help from others (6). The
training methods mentioned above are less likely to benefit
individuals who evidence severe computerphobia or very
high levels of neuroticism. Intensive intervention efforts
are probably necessary because the anxiety about compu-
ters is related to a personality pattern marked by anxiety
in general rather than an isolated fear of computers
exacerbated by lack of experience with technology.
Gender
Studies over many years have found that gender is an
important factor in human–computer interaction. Gender
differences occur in virtually every area of computing
including occupational tasks, games, online interaction,
and programming, with computer use and expertise gen-
erally higher in males than in females, although recent
studies indicate that the gender gap in use has closed and
in expertise is narrowing. This change is especially notice-
able in the schools and should become more apparent in
the workforce over time. In the case of the Internet, males
and females have a similar overall level of use, but they
differ in their patterns of use. Males are found to more
often engage in information gathering and entertainment
tasks, and women spend more time in communication
functions, seeking social interaction online (7). Females
use e-mail more than males and enjoy it more, and they
find the Internet more useful overall for social interaction.
These differences emerge early. Girls and boys conceptua-
lize computers differently, with boys more likely to view
computers as toys, meant for recreation and fun, and to be
interested in them as machines, whereas girls view com-
puters as tools to accomplish something they want to do,
especially in regard to social interaction (8). This may be
due, in part, to differences in gender role identity, an aspect of personality that is related to, but not completely determined by, biological sex. Gender role identity is one's sense
of self as masculine and/or feminine. Both men and women
have traits that are stereotypically viewed as masculine
(assertiveness, for example) and traits that are stereo-
typically viewed as feminine (nurturance, for example)
and often see themselves as possessing both masculine
and feminine traits. Computer use differs between people
with a high masculine gender role identity and those with
a high feminine gender role identity (9). Some narrowing
of the gender gap in computer use may be due to changing
views of gender roles.
Age
Mead et al. (10) reviewed several consequences of compu-
ter use for older adults, including increased social interac-
tion and mental stimulation, increased self-esteem, and
improvements in life satisfaction. They noted, however,
that older adults are less likely to use computers than
younger adults and are less likely to own computers;
have greater difficulty learning how to use technology;
and face particular challenges adapting to the computer-
based technologies they encounter in situations that at one
time involved personal interactions, such as automated
check-out lines and ATM machines. They also make more
mistakes and take more time to complete computer-based
tasks. In view of these factors, Mead et al. suggested
psychologists apply insights gained from studying the
effects of age on cognition toward the development of
appropriate training and computer interfaces for older
adults. Such interfaces could provide clearer indi-
cations to the user of previously followed hyperlinks, for
example, to compensate for failures in episodic memory,
and employ cursors with larger activation areas to com-
pensate for reduced motor control.
Aptitudes
Intelligence or aptitude factors are also predictors of com-
puter use. In fact, spatial ability, mathematical problem-
solving skills, and understanding of logic may be better
than personality factors as predictors. A study of learning
styles, visualization ability, and user preferences found
that high visualizers performed better than low visualizers
and thought computer systems were easier to use than did
low visualizers (11). High visualization ability is often
related to spatial and mathematical ability, which in
turn has been related to computer use, positive attitudes
about computers, and educational achievement in compu-
ter courses. Others have found that, like cognitive abilities,
the amount of prior experience using computers for activ-
ities such as game playing or writing is a better predictor of
attitudes about computers than personality characteris-
tics. This may be because people who have more positive
attitudes toward computers are therefore more likely to
use them. However, training studies with people who
have negative views of computers reveal that certain types
of exposure to computers improve attitudes and lead to
increased computer use. Several researchers have sug-
gested that attitudes may play an intermediary role in
computer use, facilitating experiences with computers,
which in turn enhance knowledge and skills and the
likelihood of increased use. Some have suggested that
attitudes are especially important in relation to user appli-
cations that require little or no special computing skills,
whereas cognitive abilities and practical skills may play a
more important role in determining computer activities
such as programming and design.
Attitudes
Attitudes about self-use of computers and attitudes about
the impact of computers on society have been investigated.
Research on attitudes about self-use and comfort level with
computers presumes that cognitive, affective, and beha-
vioral components of an attitude are each implicated in a
person's reaction to computers. That is, the person may believe that computers will hinder or enhance performance on some task or job (a cognitive component), the person may enjoy computer use or may experience anxiety (affective components), and the individual may approach or avoid computer experiences (behavioral component). In each case, a person's attitude about him- or herself in interaction with computers is the focus of the analysis. Attitudes are an
important mediator between personality factors and cog-
nitive ability factors and actual computer use.
Individuals' attitudes with respect to the impact of computers on society vary. Some people believe that computers are dehumanizing, reduce human–human interaction, and pose a threat to society. Others view computers as liberating and enhancing the development of humans within society. These attitudes about computers and society can influence the individual's own behavior with computers, but they also have potential influence on individuals' views of computer use by others and their attitudes toward technological change in a range of settings.
Numerous studies have shown that anxiety about using computers is negatively related to amount of experience with computers and level of confidence in human–computer interaction. As discussed, people who show anxiety as a general personality trait evidence more computer use anxiety. In addition, anxiety about mathematics and a belief that computers have a negative influence on society are related to computer anxiety. Thus, both types of attitudes (attitudes about one's own computer use and attitudes about the impact of computers on society) contribute to computer anxieties (12).
With training, adult students' attitudes about computers become more positive. That is, attitudes about one's own interaction with computers and attitudes about the influence of computers on society at large generally become more positive as a result of instruction through computer courses in educational settings and of specific training in a variety of work settings. Figure 1 presents a general model of individual differences in computer use. The model indicates that attitudes are affected by personality and cognitive factors, and that they in turn can affect computer use either directly or by influencing values and expectations. The model also indicates that computer use can influence attitudes toward computers and personality factors such as loneliness.
Influence of Gender on Attitudes. Gender differences in attitudes toward computer use, although becoming less pronounced, appear relatively early, during the elementary school years, and persist into adulthood. Male students have more positive attitudes than female students, express greater interest in computers and greater confidence in their own abilities, and view computers as having greater utility in society than females at nearly every age level. One study revealed a moderate difference between males and females in their personal anxiety about using computers, with women displaying greater levels than men, and holding more negative views than men about the influence of computers on society. The findings of this study suggest that gender differences in computer-related behavior are due in part to differences in anxiety. When anxiety about computers was controlled, there were few differences between males and females in computer behavior: It seems that anxiety mediates some gender differences in computer-related behavior (13). Other studies confirm that gender differences in computer behavior seem to be due to attitudinal and experiential factors. Compared with men, women report greater anxiety about computer use, lower confidence about their ability to use the computer, and lower levels of liking computer work. However, when investigators control the degree to which tasks are viewed as masculine or feminine and/or control differences in prior experiences with computers, gender differences in attitudes are no longer significant (14).
Middle-school students differ by gender in their reac-
tions to multimedia learning interfaces and may have
different sources for intrinsic satisfaction when using com-
puters. Boys particularly enjoy control over computers and
look for navigational assistance within computer games,
whereas girls prefer calmer games that include writing and
ask for assistance rather than try to find navigational
controls (15). This indicates that gender differences in
attitudes about computers may be influenced in part
through experience with gender-preferred computer inter-
faces, and that attitudes of girls toward computer use
improve when gender-specific attention is paid to the
design of the interface and the type of tasks presented.
The differences by gender in patterns of Internet use by
adults support this conclusion. When individuals are free to
determine what type of tasks they do online, gender differ-
ences in overall use disappear, although males are more
likely to gather information and seek entertainment and
women are more likely to interact socially.
Other studies suggest that gender differences in atti-
tudes toward computers may vary with the nature of the
task. In one study, college students performed simple or
more complex computer tasks. Men and women did not
differ in attitudes following the simple tasks. However, the
men reported a greater sense of self-efficacy (such as feelings of effective problem-solving and control) than the women after completing the complex tasks (16). Such findings suggest that, in addition to anxiety, a lack of confidence affects women more than men in the area of computer use. Training does not always reduce these differences: Although people generally become less anxious about computer use over the course of training, in some cases women become more anxious (17). This increase in anxiety may occur even though women report a concomitant increase in a sense of enjoyment with computers as training progressed. With training, both men and women have more
positive social attitudes toward computers and perceive
computers to be more like humans and less like machines.
To summarize the state of information on attitudes
about computer use thus far, results suggest that attitudes
about one's own computer use are related to personal anxiety about computer use as well as to math anxiety. These relationships are more likely to occur in women than in men. However, when women have more computer experiences, the relationship between anxiety and computer use is diminished and the gender difference is often not observed. Several attitudinal factors seem to be involved in computer use, including math anxiety, feelings of self-efficacy and confidence, personal enjoyment, and positive
views of the usefulness of computers for society.
Workplace
Computers are used in a variety of ways in organizations,
and computing attitudes and skills can affect both the daily
tasks that must be performed in a routine manner and the
ability of companies to remain efficient and competitive. The degree of success of computer systems in the workplace is often attributed to the attitudes of the employees who are end users of Inter- and intranet-based applications for communications and workflow, shared workspaces, financial applications, database management systems, data analysis software, and applications for producing Web resources and graphics. A study of factors influencing attitudes about computing technologies and acceptance of particular technologies showed that three factors were most important in determining user acceptance behaviors: perceived advantages to using the technology for improving job performance, perceived ease of use, and degree of enjoyment in using the technology. Anticipated organizational support, including technical support as well as higher management encouragement and resource allocation, had a significant effect on perceived advantage toward
improving job performance, so is an additional factor in
determining whether users make use of particular systems
(18). The study also showed that the perceived potential for improving job performance was influenced by the extent to which the system was observed as consistent with existing needs, values, and past experiences of potential adopters.
Figure 1. General model of individual differences in computer use, relating cognitive ability factors (analytical, mathematical, visual-spatial), personality factors (sociability, neuroticism, gender role identity, psychosocial health), attitudes (self-efficacy, affective reactions, views of computers and society), expectations and perceptions (advantages of use, ease of use, level of support, anticipated enjoyment and success), and computer use (quantity and quality, training and support, problems and successes).
Overall, however, this and other research shows that the
perception that systems will improve job performance is by
far the strongest predictor of anticipated use.
Research on the attitudes of employees toward com-
puters in the workplace reveals that, for the most part,
computers are observed as having a positive effect on jobs,
making jobs more interesting, and/or increasing employee
effectiveness. Employees who report negative attitudes cite
increased job complexity with the use of computers instead
of increased effectiveness. They also report negative atti-
tudes about the necessity for additional training and refer
to a reduction in their feelings of competence. These mixed
feelings may be related to their job satisfaction attitudes.
When confusion and frustration about computer systems
increase, job satisfaction decreases. The negative feelings
about their own ability to use computers effectively lead
employees to express greater dissatisfaction with the job as
a whole (19).
Work-related computer problems can increase stress.
Problems with computer systems (e.g., downtime, difficulties with access, lack of familiarity with software, etc.) often
result in an increase of overall work time, a perception
of increased workload and pressure, and less feeling of
control over the job. In these situations, computers can
be viewed as a detrimental force in the workplace even
when users have a generally positive attitude toward
them. There is some indication that individuals differ in
their reactions to problems with computers, and that these
differences play a role in views of the utility of computers on
the job. Older staff who feel threatened by computers
tend to complain more about time pressures and health-
related issues related to computer use, whereas same-age
peers who view computers more neutrally or positively
report few problems (20).
Computer-supported collaborative work systems can
facilitate group projects by allowing people to work together from different places and different times. Kunzer et al. (21)
discuss guidelines for their effective design and use. For a
system to be effective, it is important that its functions are
transparent so that it is highly usable, and that it provides
an information space structured for specific tasks for a
particular group. Web-based portals that include these
workspaces can greatly enhance collaborative activities.
However, systems that are not highly usable and do not
consider the unique requirements of the user are not well
accepted and therefore not used. For example, inadequate
consideration of user preferences regarding software appli-
cations can make it impossible for users to work with
familiar sets of tools. Users then are likely to find other ways to collaborate either offline or in separate, more poorly coordinated applications. Successful shared workspace systems provide basic features such as a document component that allows various file formats with revision control and audit trails; calendars with search capabilities, schedule-conflict notification, and priority settings; threaded discussions with attachments, e-mail notification, and specification of read rights; and contact, project, and workflow management components. Finally, the most
usable shared workspaces can be customized to different
cooperation scenarios.
Differences in computer anxiety and negative attitudes
about the social impact of computers are more likely to
occur in some occupations than in others. Individuals in
professional and managerial positions generally evidence
more positive attitudes toward computers. Particular
aspects of some jobs may influence individuals' attitudes and account for some of these differences. Medcof (22) found that the relative amounts of computing and noncomputing tasks, the job characteristics (such as skill variety, level of significance of assigned tasks, and autonomy), and the cognitive demand (e.g., task complexity) of the computing tasks interact with one another to influence attitudes toward computer use. When job characteristics are low and the computing components of the job also have low cognitive demand on the user (as in the case of data entry in
a clerical job), attitudes toward computer use are negative,
and the job is viewed as increasingly negative as the
proportion of time spent on the low cognitive demand
task increases. If a larger proportion of the work time is
spent on a high cognitive demand task involving computer
use, attitudes toward computer use and toward the job
will be more positive.
Medcof's findings suggest that under some conditions job quality is reduced when computers are used to fulfill
assigned tasks, although such job degradation can be mini-
mized or avoided. Specically, when jobs involve the use
of computers for tasks that have low levels of cognitive
challenge and require a narrow range of skills, little auto-
nomy, and little opportunity for interaction with others,
attitudes toward computer use, and toward the job, are
negative. But varying types of noncomputing tasks within
the job (increased autonomy or social interaction in non-
computing tasks, for example) reduces the negative impact;
inclusion of more challenging cognitive tasks as part of
the computing assignment of the job is especially effective
in reducing negative views of computer use. The attitudes
about computers in the workplace therefore depend on
the relative degree of computer use in the entire job, the
cognitive challenge involved in that use, and the type of
noncomputing activities.
Older workers tend to use computers in the workplace
less often than younger workers, and researchers have
found that attitudes may be implicated in this difference.
As older workers tend to have more negative attitudes
toward computers than younger workers or those with
less seniority, they use them less. Negative attitudes
toward computer use and computer anxiety are better
predictors of computer use than age alone (23).
MODEL BUILDING
Cognitive Processes
Modifications in theories of human behavior have been
both the cause and the effect of research in behavioral
science computing. A cognitive revolution in psychology
occurred during the 1950s and 1960s, in which the human
mind became the focus of study. A general approach called
information processing, inspired by computer science,
became dominant in the behavioral sciences during this
time period. Attempts to model the owof information from
input-stimulation through output-behavior have included
considerations of human attention, perception, cognition,
memory, and, more recently, human emotional reactions
and motivation. This general approach has become a stan-
dard model that is still in wide use.
Cognitive science's interest in computer technologies stems also from the potential to implement models and theories of human cognition as computer systems, such as Newell and Simon's General Problem Solver, Chomsky's transformational grammar, and Anderson's Atomic Com-
ponents of Thought (ACT). ACT represents many compo-
nents and activities of human cognition, including
procedural knowledge, declarative knowledge, proposi-
tions, spreading activation, problem solving, and learning.
One benet to implementing models of human thought on
computers is that the process of developing a computer
model constrains theorists to be precise about their the-
ories, making it easier to test and then refine them. As more has been learned about the human brain's ability to process
many inputs and operations simultaneously, cognitive
theorists have developed connectionist computer models
made up of large networks of interconnected computer
processors, each network comprising many interconnected
nodes. The overall arrangement of interconnected nodes
allows the system to organize concepts and relationships
among them, simulating the human mental structure of
knowledge in which single nodes may contain little mean-
ing but meaning emerges in the pattern of connections.
These and other implementations of psychological theories
show how the interactions between computer scientists and
behavioral scientists have informed understandings of
human cognition.
Other recent theoretical developments include a focus
on the social, contextual, and constructive aspects of
human cognition and behavior. From this perspective,
human cognition is viewed as socially situated, collabora-
tive, and jointly constructed. Although these developments
have coincided with shifts from stand-alone computers to
networks and Internet-based systems that feature shared
workspaces, it would be erroneous to attribute these
changes in theoretical models and explanation solely to
changes in available technology. Instead, many of today's
behavioral scientists base their theories on approaches
developed by early twentieth-century scholars such as
Piaget and Vygotsky. Here the focus shifts from examining
individual cognitive processing to evaluating how people
work within a dynamic interplay of social factors, tech-
nological factors, and individual attitudes and experi-
ences to solve problems and learn (24). This perspective
encourages the development of systems that provide
mechanisms for people to scaffold other learners with sup-
ports that can be strengthened or faded based on the
learner's understanding. The shift in views of human learn-
ing from knowledge transfer to knowledge co-construction
is evident in the evolution of products to support learning,
from early computer-assisted instruction (CAI) systems,
to intelligent tutoring systems (ITS), to learning from
hypertext, to computer-supported collaborative learning
(CSCL). An important principle in this evolution is that
individuals need the motivation and capacity to be more
actively in charge of their own learning.
Human Factors
Human factors is a branch of the behavioral sciences that
attempts to optimize human performance in the context of
a system that has been designed to achieve an objective or
purpose. A general model of human performance includes
the human, the activity being performed, and the context
(25). In the area of human–computer interactions, human
factors researchers investigate such matters as optimal
workstation design (e.g., to minimize soft tissue and joint
disorders); the perceptual and cognitive processes involved
in using software interfaces; computer access for persons
with disabilities such as visual impairments; and charac-
teristics of textual displays that influence reading com-
prehension. An important principle in human factors
research is that improvements to the system are limited
if considered apart from interaction with actual users. This
emphasis on contextual design is compatible with the
ethnographic movement in psychology that focuses on
very detailed observation of behavior in real situations
(24). A human-factors analysis of human learning from
hypermedia is presented next to illustrate this general
approach.
Hypermedia is a method of creating and accessing non-
linear text, images, video, and audio resources. Information
in hypermedia is organized as a network of electronic
documents, each a self-contained segment of text or other
interlinked media. Content is elaborated by providing
bridges to various source collections and libraries. Links
among resources can be based on a variety of relations,
such as background information, examples, graphical
representations, further explanations, and related topics.
Hypermedia is intended to allow users to actively explore
knowledge, selecting which portions of an electronic
knowledge base to examine. However, following links
through multiple resources can pose problems when users
become disoriented and anxious, not knowing where they
are and where they are going (26). Human factors research
has been applied to the hypermedia environments of digi-
tal libraries, where users search and examine large-scale
databases with hypermedia tools.
Rapp et al. (27) suggest that cognitive psychology's
understanding of human cognition should be considered
during the design of digital libraries. Hypermedia struc-
tures can be fashioned with an awareness of processes and
limitations in human text comprehension, mental repre-
sentations, spatial cognition, learning, memory, and other
aspects of cognitive functioning. Digital libraries can in
turn provide real-world environments to test and evaluate
theories of human information processing. Understandings
of both hypermedia and cognition can be informed through
an iterative process of research and evaluation, where
hypotheses about cognitive processes are developed and
experiments within the hypermedia are conducted. Results
are then evaluated, prompting revisions to hypermedia
sources and interfaces, and generating implications for
cognitive theory. Questions that arise during the process
can be used to evaluate and improve the organization and
interfaces in digital library collections. For example, how
might the multiple sources of audio, video, and textual
information in digital libraries be organized to promote
more elaborated, integrated, and better encoded mental
representations? Can the goal-directed, active exploration
and search behaviors implicit in hypermedia generate
the multiple cues and conceptual links that cognitive
science has found best enhance memory formation and
later retrieval?
The Superbook hypertext project at Bellcore was an
early example of how the iterative process of human-factor
analysis and system revision prompted modifications in
original and subsequent designs before improvements over
traditional text presentations were observed. Dillon (28)
developed a framework of reader–document interaction that hypertext designers used to ensure usability from the learner's perspective. The framework, intended to be
an approximate representation of cognition and behavior
central to reading and information processing, consists of
four interactive elements: (1) a task model that deals
with the user's needs and uses for the material; (2) an
information model that provides a model of the information
space; (3) a set of manipulation skills and facilities that
support physical use of the materials; and (4) a processor
that represents the cognitive and perceptual processing
involved in reading words and sentences. This model pre-
dicts that the user's acts of reading will vary with their
needs and knowledge of the structure of the environment
that contains textual information, in addition to their gen-
eral ability to read (i.e., acquire a representation that
approximates the author's intention via perceptual and
cognitive processes). Research comparing learning from
hypertext versus traditional linear text has not yielded a
consistent pattern of results (29). User-oriented models
such as Dillon's enable designers to increase the yield
from hypertext versus traditional text environments.
Virtual environments provide a rich setting for human–computer interaction where input and output devices are
adapted to the human senses. Individuals using virtual
reality systems are immersed into a virtual world that
provides authentic visual, acoustic, and tactile informa-
tion. The systems employ interface devices such as data
gloves that track movement and recognize gestures, stereo-
scopic visualizers that render scenes for each eye in real
time, headphones that provide all characteristics of realis-
tic sound, and head and eye tracking technologies. Users
navigate the world by walking or even flying through it,
and they can change scale so they effectively shrink to look
at smaller structures in more detail. Krapichler et al. (30)
present a virtual medical imaging system that allows phy-
sicians to interactively inspect all relevant internal and
external areas of a structure such as a tumor from any
angle. In these and similar applications, care is taken to
ensure that both movement through the virtual environ-
ment and feedback from it are natural and intuitive.
The emerging field of affective computing applies human factors research to the emotional interaction between users and their computers. As people seek meaning and patterns in their interactions, they have a tendency to respond to computers as though they were people, perceiving that they have human attributes and personalities, and experiencing appropriate emotions when flattered or ignored. For example, when a computer's voice is gendered, people respond according to gender-stereotypic roles, rating the female-voiced computer as more knowledgeable about love and the male voice as more knowledgeable about technical subjects, and conforming to the computer's suggestions if they fall within its gender-specific area of expertise (31). Picard and Klein lead a team of behavioral scientists who explore this willingness to ascribe personality to computers and to interact emotionally with them. They devise systems that can detect human emotions, better understand human intentions, and respond to signs of frustration and other negative emotions with expressions of comfort and support, so that users are better able to meet their needs and achieve their objectives (32). The development and implementation of products that make use of affective computing systems provide behavioral theorists a rich area for ongoing
study.
INDIVIDUAL-SOCIAL PERSPECTIVE
In a previous section we presented an overview of research
on gender differences, attitudes toward the impact of com-
puters on society, and the use of computers in the work-
place. Each of these issues relates to the effects of
computers on human social relations. One prevalent
perception is that as people spend time on the Internet
instead of engaging with people face-to-face they will
become more isolated, lonely, and depressed, and that as
the kinds of social discourse available online to them may be less meaningful and fulfilling, they could lose social sup-
port. As we discussed, increased Internet use may at times
be a result of loneliness, not a cause of it. The emerging
research about this concern is inconclusive, making this an
important area for further research. Kling (33) lists addi-
tional social controversies about the computerization of
society: class divisions in society; human safety and critical
computer systems; democratization; the structure of labor
markets; health; education; military security; computer
literacy; and privacy and encryption. These controversies
have yet to be resolved and are still being studied by
behavioral scientists.
Psychologists explore the influence of computers on
relationships among people, not only in terms of their
online behaviors and interactions, but also their per-
ceptions of themselves and one another. Power among
relationships is sometimes renegotiated because of dif-
fering attitudes and competencies with technology. One
researcher proposed that relatively lower computer
expertise among fathers in contrast to their sons, and a
sense of dependence on their sons for technical support,
can change the family dynamic and emasculate fathers,
reducing perceptions of their strength and of their own
sense of competence (34).
Computer-Mediated Communication
Much research on computer–human interactions is devel-
oped in the context of using computers to communicate
with other people. Computer-mediated communication
(CMC) is a broad term that covers forms of communication
including e-mail, listservs, discussion groups, chat,
instant messaging, and videoconferencing. In comparison
with face-to-face communication, CMC has fewer social
and nonverbal cues but allows people to communicate
easily from different places and different times. Compu-
ter-supported collaborative work systems we discussed in
regard to computers in the workplace are examples of
specialized CMC systems that facilitate collaborative
group work. They include many forms of communication,
allowing both synchronous and asynchronous interac-
tions among multiple participants. In the cases of chat,
instant messaging, and videoconferencing, people com-
municate at the same time but may be in different loca-
tions. E-mail, listservs, and discussion groups have the
added benefits and difficulties of asynchronous commu-
nication. The reduction in social and nonverbal cues in
these forms of communication can be problematic, as
people misinterpret messages that are not carefully con-
structed and may respond negatively. It has been found
that these problems diminish with experience. As users
adapt to the medium and create means of improving com-
munication, and become more adept using linguistic cues,
differences between CMC and face-to-face communica-
tion may be lessened. Social norms and conventions
within groups serve to reduce individual variability
across formats rendering CMC similar to face-to-face
communication, especially in established organizations.
For example, messages from superiors receive more
attention than messages from coworkers or from subordi-
nates.
Research on learning in the workplace and in educa-
tional institutions has examined CMC's ability to support
the transfer of knowledge (an instructional perspective)
and the social, co-construction of knowledge (a conver-
sational perspective) (35). Grabe and Grabe (36) discuss
how CMC can change the role of the teacher, effectively
decentering interactions so that students feel freer to
communicate and to become more involved in their own
learning. CMC results in more diverse participation and a
greater number of interactions among students when
compared with traditional classroom discussion charac-
terized by longer periods of talking by the teacher and
shorter, less complex individual responses from students.
With CMC, instructors can focus on observing and facil-
itating student learning and can intervene to help with
scaffolding or direct instruction as needed. Structure
provided by the teacher during CMC learning is impor-
tant to help create a social presence and to encourage
participation. Grabe and Grabe suggest that teachers
assume responsibility for managing the discussion by
defining the overall purpose of the discussion, specifying
roles of participants, establishing expectations, and
responding to negative or passive behaviors. In a study
of interaction in an online graduate course, increased
structure led to more interaction and increased dialogue
(37).
Another potential area for discussion, computer-sup-
ported collaborative learning, is somewhat beyond the
scope of this article but is included in our reading list.
Access
Behavioral scientists are interested in what inequities exist
in access to technology, how the inequities developed, and
what can be done to reduce them. Until recently, the
number of computers in schools was significantly lower
for poorer communities. These data are changing, and
the differences are diminishing. However, socioeconomic
factors continue to be powerful predictors for whether
people have computers or Internet access in their homes.
The National Telecommunications and Information
Administration (NTIA) (38) reported that, although Inter-
net use is increasing dramatically for Americans of all
incomes, education levels, ages, and races, many inequities
remain. Individuals in high-income households are much
more likely to be computer and Internet users than those in
low-income households (over 80% for the highest income
households; 25% for the lowest income households). Age
and level of education are also powerful predictors: Com-
puter and Internet use is higher among those who are
younger and those who are more highly educated. In addi-
tion, people with mental or physical disabilities are less
likely than others to use computers or the Internet.
Many inequities in access to computers are declining.
According to the NTIA's report, rates of use are rising much
more rapidly among poorer, less educated, and elderly
people, the very groups who have been most disadvantaged.
The report attributes this development to the lowering cost
of computer technology, and to the expanding use of the
Internet at schools, work, and libraries, which makes these
resources available to people who do not own computers.
Journals
Applied Ergonomics
Behavior and Information Technology
Behavior Research Methods, Instruments and Computers
Computers in Human Behavior
CyberPsychology and Behavior
Ergonomics Abstracts
Hypertext and Cognition
Interacting with Computers
International Journal of Human Computer Interaction
International Journal of Human Computer Studies
Journal of Educational Computing Research
Journal of Educational Multimedia and Hypermedia
Journal of Occupational and Organizational Psychology
Books
E. Barrett (ed.), Sociomedia, Multimedia, Hypermedia, and the
Social Construction of Knowledge. Cambridge, MA: The MIT Press,
1992.
C. Cook, Computers and the Collaborative Experience of Learning.
London: Routledge, 1994.
S. J. Derry, M. Siegel, J. Stampen, and the STEP Research Group,
The STEP system for collaborative case-based teacher education:
Design, evaluation and future directions, in Proceedings of Com-
puter Support for Collaborative Learning (CSCL) 2002. Mahwah,
NJ: Lawrence Erlbaum Associates, 2002, pp. 209-216.
8 BEHAVIORAL SCIENCES AND COMPUTING
D. H. Jonassen, K. Peck, and B. G. Wilson, Learning With Tech-
nology: A Constructivist Perspective. Columbus, OH: Merrill/Prentice-Hall, 1999.
D. H. Jonassen (ed.), Handbook of Research on Educational Com-
munications and Technology, 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates, 2004.
T. Koschmann (ed.), CSCL: Theory and Practice of an Emerging
Paradigm. Mahwah, NJ: Lawrence Erlbaum Associates, 1996.
J. A. Oravec, Virtual Individual, Virtual Groups: Human Dimen-
sions of Groupware and Computer Networking. Melbourne, Aus-
tralia: Cambridge University Press, 1996.
D. Reinking, M. McKenna, L. Labbo and R. Kieffer (eds.),
Handbook of Literacy and Technology: Transformations in a
Post-typographic World. Mahwah, NJ: Lawrence Erlbaum Associ-
ates, 1998.
S. Vosniadou, E. D. Corte, R. Glaser and H. Mandl, International
Perspectives on the Design of Technology-Supported Learning
Environments. Mahwah, NJ: Lawrence Erlbaum Associates, 1996.
B. B. Wasson, S. Ludvigsen, and U. Hoppe (eds.), Designing for
Change in Networked Learning Environments: Proceedings of the
International Conference on Computer Support for Collaborative
Learning 2003. Boston, MA: Kluwer Academic, 2003.
BIBLIOGRAPHY
1. Y. Amichai-Hamburger and E. Ben-Artzi, Loneliness and
Internet use, Comput. Hum. Behav., 19(1): 7180, 2000.
2. L. D. Roberts, L. M. Smith, and C. M. Polluck, U r a lot bolder
on the net: Shyness and Internet use, in W. R. Crozier(ed.),
Shyness: Development, Consolidation and Change. New York:
Routledge Farmer: 121138, 2000.
3. S. E. Caplan, Problematic Internet use and psychosocial well-
being: Development of a theory-based cognitive-behavioral
measurement instrument, Comput. Hum. Behav., 18(5):
553575, 2002.
4. E. H. Shaw and L. M. Gant, Users divided? Exploring the
gender gap in Internet use, CyberPsychol. Behav., 5(6):
517527, 2002.
5. B. Wilson, Redressing the anxiety imbalance: Computer-
phobia and educators, Behav. Inform. Technol., 18(6):
445453, 1999.
6. M. Weil, L. D. Rosen, and S. E. Wugalter, The etiology of
computerphobia, Comput. Hum. Behav., 6(4): 361379, 1990.
7. L.A. Jackson, K.S. Ervin, and P.D. Gardner, Gender and the
Internet: Women communicating and men searching, Sex
Roles, 44(5/6): 363379, 2001.
8. L. M. Miller, H. Schweingruber, and C. L. Brandenberg, Middle
school students' technology practices and preferences: Re-
examining gender differences, J. Educ. Multimedia Hyperme-
dia, 10(2): 125140, 2001.
9. A. M. Colley, M. T. Gale, and T. A. Harris, Effects of gender
role identity and experience on computer attitude compo-
nents, J. Educ. Comput. Res., 10(2): 129137, 1994.
10. S. E. Mead, P. Batsakes, A. D. Fisk, and A. Mykityshyn,
Application of cognitive theory to training and design solutions
for age-related computer use, Int. J. Behav. Develop., 23(3):
553573, 1999.
11. S. Davis and R. Bostrom, An experimental investigation of the
roles of the computer interface and individual characteristics
in the learning of computer systems, Int. J. Hum. Comput.
Interact., 4(2): 143172, 1992.
12. F. Farina et al., Predictors of anxiety towards computers,
Comput. Hum. Behav., 7(4): 263267, 1991.
13. B. E. Whitley, Gender differences in computer related atti-
tudes: It depends on what you ask, Comput. Hum. Behav.,
12(2): 275289, 1996.
14. J. L. Dyck and J. A.-A. Smither, Age differences in computer
anxiety: The role of computer experience, gender and educa-
tion, J. Educ. Comput. Res., 10(3): 239248, 1994.
15. D. Passig and H. Levin, Gender interest differences with
multimedia learning interfaces, Comput. Hum. Behav.,
15(2): 173183, 1999.
16. T. Busch, Gender differences in self-efficacy and attitudes
toward computers, J. Educ. Comput. Res., 12(2): 147158,
1995.
17. L. J. Nelson, G. M. Wiese, and J. Cooper, Getting started with
computers: Experience, anxiety and relational style, Comput.
Hum. Behav., 7(3): 185202, 1991.
18. S. Al-Gahtani and M. King, Attitudes, satisfaction and
usage: Factors contributing to each in the acceptance of infor-
mation technology, Behav. Inform. Technol., 18(4): 277297,
1999.
19. A. J. Murrell and J. Sprinkle, The impact of negative attitudes
towards computers on employees' satisfaction and commit-
ment within a small company, Comput. Hum. Behav., 9(1):
5763, 1993.
20. M. Staufer, Technological change and the older employee:
Implications for introduction and training, Behav. Inform.
Technol., 11(1): 4652, 1992.
21. A. Kunzer, K. Rose, L. Schmidt, and H. Luczak, SWOF: An
open framework for shared workspaces to support different
cooperation tasks, Behav. Inform. Technol., 21(5): 351358,
2002.
22. J. W. Medcof, The job characteristics of computing and non-
computing work activities, J. Occupat. Organizat. Psychol.,
69(2): 199212, 1996.
23. J. C. Marquie et al., Age influence on attitudes of office workers
faced with new computerized technologies: A questionnaire
analysis, Appl. Ergon., 25(3): 130142, 1994.
24. J. M. Carrol, Human-computer interaction: Psychology as a
science of design, Annu. Rev. Psychol., 48: 6183, 1997.
25. R. W. Bailey, Human Performance Engineering: Designing
High Quality, Professional User Interfaces for Computer
Products, Applications, and Systems, 3rd ed. Upper Saddle
River, NJ: Prentice Hall, 1996.
26. P. A. Chalmers, The role of cognitive theory in human-compu-
ter interface, Comput. Hum. Behav., 19(5): 593607, 2003.
27. D. N. Rapp, H. A. Taylor, and G. R. Crane, The impact of
digital libraries on cognitive processes: Psychological issues
of hypermedia, Comput. Hum. Behav., 19(5): 609628,
2003.
28. A. Dillon, Myths, misconceptions, and an alternative perspec-
tive on information usage and the electronic medium, in J. F.
Rouetet al. (eds.), Hypertext and Cognition. Mahwah, NJ:
Lawrence Erlbaum Associates, 1996.
29. A. Dillon and R. Gabbard, Hypermedia as an educational
technology: A review of the quantitative research literature
on learner expectations, control and style, Rev. Educ. Res.,
68(3): 322349, 1998.
30. C. Krapichler, M. Haubner, A. Losch, D. Schumann, M. See-
man, K. Englmeier, Physicians in virtual environments
Multimodal human-computer interaction, Interact. Comput.,
11(4): 427452, 1999.
31. E. Lee, Effects of gender of the computer on informational
social influence: The moderating role of the task type, Int. J.
Hum.-Comput. Studies, 58(4): 347362, 2003.
32. R. W. Picard and J. Klein, Computers that recognise and
respond to user emotion: Theoretical and practical implica-
tions, Interact. Comput., 14(2): 141169, 2002.
33. R. Kling, Social controversies about computerization, in
R. Kling (ed.), Computerization and Controversy: Value
Conflicts and Social Choices, 2nd ed. New York: Academic
Press, 1996.
34. R. Ribak, Like immigrants: Negotiating power in the face of
the home computer, New Media Soc., 3(2): 220-238, 2001.
35. A. J. Romiszowski and R. Mason, Computer-mediated commu-
nication, in D. H. Jonassen (ed.), Handbook of Research for
Educational Communications and Technology. New York:
Simon & Schuster Macmillan, 1996.
36. M. Grabe and C. Grabe, Integrating Technology for Meaningful
Learning. New York: Houghton Mifflin, 2004.
37. C. Vrasidas and M. S. McIsaac, Factors influencing interaction
in an online course, The Amer. J. Distance Educ., 13(3): 22-36,
1999.
38. National Telecommunications and Information Administra-
tion. (February 2002). A nation online: How Americans are
expanding their use of the Internet. [Online]. Available:
http://www.ntia.doc.gov/ntiahome/dn/index.html, Accessed
December 10, 2003.
FURTHER READING
Journal Articles
M. J. Alvarez-Torrez, P. Mishra, and Y. Zhao, Judging a book by
its cover! Cultural stereotyping of interactive media and its effect
on the recall of text information, J. Educ. Media Hypermedia,
10(2): 161183, 2001.
Y. Amichai-Hamburger, Internet and Personality, Comput. Hum.
Behav., 18(1): 110, 2002.
B. Boneva, R. Kraut, and D. Frohlich, Using email for personal
relationships, Amer. Behav. Scientist, 45(3): 530549, 2001.
J. M. Carrol, Human-computer interaction: Psychology as a science
of design, Annu. Rev. Psychol., 48: 6183, 1997.
E. C. Boling, The Transformation of Instruction through Technol-
ogy: Promoting inclusive learning communities in teacher educa-
tion courses, Action Teacher Educ., 24(4), 2003.
R.A. Davis, A cognitive-behavioral model of pathological Internet
use, Comput. Hum. Behav., 17(2): 187195, 2001.
C. E. Hmelo, A. Nagarajan, and R. S. Day, Effects of high and low
prior knowledge on construction of a joint problem space, J. Exper.
Educ., 69(1): 3656, 2000.
R. Kraut, M. Patterson, and V. Ludmark, Internet paradox: A
social technology that reduces social involvement and psychologi-
cal well-being, Amer. Psychol., 53(9), 10171031, 1998.
K.Y.A. McKenna and J.A. Bargh, Plan 9 from cyberspace: The
implications of the Internet for personality and social psychology,
Personality Social Psychol. Rev., 4: 5775, 2000.
A. G. Namlu, The effect of learning strategy on computer anxiety,
Comput. Hum. Behav., 19(5): 565578, 2003.
L. Shashaani, Gender differences in computer experiences and
its influence on computer attitudes, J. Educ. Comput. Res., 11(4):
347367, 1994.
J. F. Sigurdsson, Computer experience, attitudes towards compu-
ters andpersonalitycharacteristics inpsychologyundergraduates,
Personality Individual Differences, 12(6): 617624, 1991.
C. A. Steinkuehler, S. J. Derry, C. E. Hmelo-Silver, and M. Del-
Marcelle, Cracking the resource nut with distributed problem-
based learning in secondary teacher education, J. Distance
Educ., 23: 2339, 2002.
STEVEN M. BARNHART
RICHARD DE LISI
Rutgers, The State University of
New Jersey
New Brunswick, New Jersey
BIOLOGY COMPUTING
INTRODUCTION
The modern era of molecular biology began with the dis-
covery of the double helical structure of DNA. Today,
sequencing nucleic acids, the determination of genetic
information at the most fundamental level, is a major
tool of biological research (1). This revolution in biology
has created a huge amount of data at great speed by directly
reading DNA sequences. The growth rate of data volume is
exponential. For instance, the volume of DNA and protein
sequence data is currently doubling every 22 months (2).
One important reason for this exceptional growth rate of
biological data is the medical use of such information in the
design of diagnostics and therapeutics (3,4). For example,
identification of genetic markers in DNA sequences would provide important information regarding which portions of the DNA are significant and would allow the researchers to find many disease genes of interest (by recognizing them
from the pattern of inheritance). Naturally, the large
amount of available data poses a serious challenge in stor-
ing, retrieving, and analyzing biological information.
A rapidly developing area, computational biology, is
emerging to meet the rapidly increasing computational
need. It consists of many important areas such as informa-
tion storage, sequence analysis, evolutionary tree construc-
tion, protein structure prediction, and so on (3,4). It is
playing an important role in some biological research.
For example, sequence comparison is one of the most
important methodological issues and most active research
areas in current biological sequence analysis. Without the
help of computers, it is almost impossible to compare two or
more biological sequences (typically, at least a few hundred
characters long).
In this chapter, we survey recent results on evolutionary
tree construction and comparison, computing syntenic dis-
tances between multi-chromosome genomes, and multiple
sequence alignment problems.
Evolutionary trees model the evolutionary histories of
input data such as a set of species or molecular sequences.
Evolutionary trees are useful for a variety of reasons, for
example, in homology modeling of (DNA and protein)
sequences for diagnostic or therapeutic design, as an aid
for devising classications of organisms, and in evaluating
alternative hypotheses of adaption and ancient geographi-
cal relationships (for example, see Refs. 5 and 6 for discus-
sions on the last two applications). Quite a few methods are
known to construct evolutionary trees from the large
volume of input data. We will discuss some of these methods
in this chapter. We will also discuss methods for comparing
and contrasting evolutionary trees constructed by various
methods to find their similarities or dissimilarities, which
is of vital importance in computational biology.
Syntenic distance is a measure of distance between
multi-chromosome genomes (where each chromosome is
viewed as a set of genes). Applications of computing dis-
tances between genomes can be traced back to the well-
known Human Genome Project, whose objective is to
decode the entire DNA sequence and to find the location
and ordering of genetic markers along the length of the
chromosome. These genetic markers can be used, for exam-
ple, to trace the inheritance of chromosomes in families and thereby to find the location of disease genes. Genetic markers can be found by finding DNA polymorphisms (i.e.,
locations where two DNA sequences spell differently).
A key step in finding DNA polymorphisms is the calcu-
lation of the genetic distance, which is a measure of the
correlation (or similarity) between two genomes.
Multiple sequence alignment is an important tool for
sequence analysis. It can help in extracting and finding bio-
logically important commonalities from a set of sequences.
Many versions have been proposed and a huge number of
papers have been written on effective and efficient methods
for constructing multiple sequence alignment. We will
discuss some of the important versions such as SP-align-
ment, star alignment, tree alignment, generalized tree
alignment, and fixed topology alignment with recombina-
tion. Recent results on those versions are given.
We assume that the reader has the basic knowledge of
algorithms and computational complexity (such as NP, P,
and MAX-SNP). Otherwise, please consult, for example,
Refs. 7-9.
The rest of this chapter is organized as follows. In the
next section, we discuss construction and comparison
methods for evolutionary trees. Then we discuss briey
various distances for comparing sequences and explain in
detail the syntenic distance measure. We then discuss
multiple sequence alignment problems. We conclude
with a few open problems.
CONSTRUCTION AND COMPARISON
OF EVOLUTIONARY TREES
The evolutionary history of organisms is often conveniently
represented as trees, called phylogenetic trees or simply
phylogenies. Such a tree has uniquely labeled leaves and
unlabeled interior nodes, can be unrooted or rooted if the
evolutionary origin is known, and usually has internal
nodes of degree 3. Figure 1 shows an example of a phylo-
geny. A phylogeny may also have weights on its edges,
where an edge weight (more popularly known as branch
length in genetics) could represent the evolutionary dis-
tance along the edge. Many phylogeny reconstruction
methods, including the distance and maximum likelihood
methods, actually produce weighted phylogenies. Figure 1
also shows a weighted phylogeny (the weights are for
illustrative purposes only).
Phylogenetic Construction Methods
Phylogenetic construction methods use the knowledge of
evolution of molecules to infer the evolutionary history of
the species. The knowledge of evolution is usually in the
form of two kinds of data commonly used in phylogeny
inference, namely, character matrices (where each position
(i, j) is base j in sequence i) and distance matrices (where
each position (i, j) contains the computed distance between
sequence i and sequence j). Three major types of phyloge-
netic construction methods are the parsimony and compat-
ibility method, the distance method, and the maximum-
likelihood method. Below, we discuss each of these methods
briey. See the excellent survey in Refs. 10 and 11 for more
details.
Parsimony methods construct phylogenetic trees for the
given sequences such that, in some sense, the total number
of changes (i.e., base substitutions) or some weighted sum of the changes is minimized. See Refs. 12-14 for some of the
papers in this direction.
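To make the parsimony criterion concrete, the following sketch (ours, not taken from the cited references) applies the classic bottom-up counting rule of Fitch (13) to a single character on a fixed rooted binary tree; the tree encoding is an assumption made for illustration.

# Minimal sketch of the small-parsimony step on a fixed rooted binary tree:
# count the minimum number of character changes for one character (column).
# Assumed encoding: a leaf is a 1-tuple ('A',); an internal node is (left, right).

def fitch_count(tree):
    if len(tree) == 1 and isinstance(tree[0], str):      # leaf: observed character
        return {tree[0]}, 0
    left, right = tree
    lset, lcost = fitch_count(left)
    rset, rcost = fitch_count(right)
    if lset & rset:                                      # non-empty intersection: no change here
        return lset & rset, lcost + rcost
    return lset | rset, lcost + rcost + 1                # disjoint sets: take union, add one change

# Example: leaves A, A, C, G arranged as ((A, A), (C, G)) require 2 changes.
left = (('A',), ('A',))
right = (('C',), ('G',))
print(fitch_count((left, right)))                        # candidate root states and 2 changes

Running such a count per column and summing over all columns gives the parsimony score of a fixed tree for a set of aligned sequences.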
Distance methods (15-17) try to fit a tree to a matrix of pairwise distances between a set of n species. Entries in the
distance matrices are assumed to represent evolutionary
distance between species represented by the sequences in
the tree (i.e., the total number of mutations in both lineages since divergence from the common ancestor). If no tree fits
the distance matrix perfectly, then a measure of the dis-
crepancy of the distances in the distance matrix and those
in the tree is taken, and the tree with the minimum dis-
crepancy is selected as the best tree. An example of the
measure of the discrepancy, which has been used in the
literature (15,16), is a weighted least-square measure, for
example, of the form
$$\sum_{1 \le i, j \le n} w_{ij} \left(D_{ij} - d_{ij}\right)^2$$

where $D_{ij}$ are the given distances and $d_{ij}$ are the distances computed from the tree.
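For illustration only (the matrices below are toy values, not data from Refs. 15 and 16), the discrepancy above can be evaluated directly once the tree-induced distances d have been computed.

# Minimal sketch: weighted least-squares discrepancy between observed distances D
# and the distances d induced by a candidate tree (summed over unordered pairs).
def ls_discrepancy(D, d, w):
    n = len(D)
    return sum(w[i][j] * (D[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

D = [[0, 3, 5], [3, 0, 4], [5, 4, 0]]    # observed pairwise distances (toy)
d = [[0, 3, 6], [3, 0, 5], [6, 5, 0]]    # path lengths in a candidate weighted tree (toy)
w = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]    # weights
print(ls_discrepancy(D, d, w))           # 2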
Maximum-likelihood methods (12,18,19) rely on the
statistical method of choosing a tree that maximizes
the likelihood (i.e., maximizes the probability) that the
observed data would have occurred. Although this method
is quite general and powerful, it is computationally inten-
sive because of the complexity of the likelihood function.
All the above methods have been investigated by simu-
lation and theoretical analysis. None of the methods work
well under all evolutionary conditions, but each works well
under particular situations. Hence, one must choose the
appropriate phylogeny construction method carefully for
the best results (6).
Comparing Evolutionary Trees
As discussed in the previous section, over the past few
decades, many approaches for reconstructing evolutionary
trees have been developed, including (not exhaustively) par-
simony, compatibility, distance, and maximum-likelihood
methods. As aresult, inpractice, they oftenleadto different
trees on the same set of species (20). It is thus of interest to
compare evolutionary trees produced by different methods,
or by the same method on different data. Several distance
models for evolutionary trees have been proposed in the
literature. Among them, the best known is perhaps the
nearest-neighbor interchange (nni) distance introduced
independently in Refs. 21 and 22. Other distances include
the subtree-transfer distance introduced in Refs. 23 and 24
and the linear-cost subtree-transfer distance (25,26).
Below, we discuss very briey a few of these distances.
Nearest-Neighbor Interchange Distance
An nni operation swaps two subtrees that are separated by an internal edge (u, v), as shown in Fig. 2. The nni operation is said to operate on this internal edge. The nni distance, $D_{nni}(T_1, T_2)$, between two trees $T_1$ and $T_2$ is defined as the minimum number of nni operations required to transform one tree into the other. Culik II and Wood (27) [improved later by Li, Tromp, and Zhang (28)] proved that $n \log n + O(n)$ nni moves are sufficient to transform a tree of n leaves to any other tree with the same set of leaves. Sleator et al. (29) proved an $\Omega(n \log n)$ lower bound for most pairs of trees. Although the distance has been studied extensively in the literature (21,22,27-34), the computational complexity of computing it puzzled the research community for nearly 25 years until recently, when the authors in Ref. 25 showed this problem to be NP-hard (an erroneous proof of the NP-hardness of the nni distance between unlabeled trees was published in Ref. 34). As computing the nni distance is NP-hard, the next obvious question is: Can we get a good approximation of the distance? The authors in Ref. 28 show that the nni distance can be approximated in polynomial time within a factor of $\log n + O(1)$.
Subtree-transfer Distances
An nni operation can also be viewed as moving a subtree
past a neighboring internal node. A more general operation is to transfer a subtree from one place to another arbitrary place. Figure 3 shows such a subtree-transfer operation. The subtree-transfer distance, $D_{st}(T_1, T_2)$, between two trees $T_1$ and $T_2$ is the minimum number of subtrees we need to move to transform $T_1$ into $T_2$ (23-26,35).
It is sometimes appropriate in practice to discriminate
among subtree-transfer operations as they occur with dif-
ferent frequencies. In this case, we can charge each subtree-transfer operation a cost equal to the distance (the number of nodes passed) that the subtree has moved in the current tree. The linear-cost subtree-transfer distance, $D_{lcst}(T_1, T_2)$, between two trees $T_1$ and $T_2$ is then the minimum total cost required to transform $T_1$ into $T_2$ by subtree-
transfer operations (25,26). Clearly, both subtree-transfer
and linear-cost subtree-transfer models can also be used as
alternative measures for comparing evolutionary trees
generated by different tree reconstruction methods. In
fact, on unweighted phylogenies, the linear-cost subtree-
transfer distance is identical to the nni distance (26).
The authors in Ref. 35 show that computing the
subtree-transfer distance between two evolutionary trees
is NP-hard and give an approximation algorithm for this
distance with performance ratio 3.
Rotation Distance
Rotation distance is a variant of the nni distance for rooted,
ordered trees. A rotation is an operation that changes one
rooted binary tree into another of the same size. Figure 4 shows the general rotation rule. An easy approximation algorithm for computing this distance with a performance ratio of 2 is given in Ref. 36. However, it is not known whether computing this distance is NP-hard.
Distances on Weighted Phylogenies
Comparison of weighted evolutionary trees has recently
been studied in Ref. 20. The distance measure adopted is
based on the difference in the partitions of the leaves
induced by the edges in both trees, and it has the drawback
of being somewhat insensitive to the tree topologies. Both
the linear-cost subtree-transfer and nni models can be
naturally extended to weighted trees. The extension for
nni is straightforward: An nni is simply charged a cost
equal to the weight of the edge it operates on. In the case of
linear-cost subtree-transfer, although the idea is immedi-
ate (i.e., a moving subtree should be charged for the
weighted distance it travels), the formal definition needs
some care and can be found in Ref. 26.
As computing the nni distance on unweighted phylo-
genies is NP-hard, it is obvious that computing this
distance is NP-hard for weighted phylogenies also. The
authors in Ref. 26 give an approximation algorithm for
the linear-cost subtree-transfer distance on weighted
phylogenies with performance ratio 2. In Ref. 25, the
authors give an approximation algorithm for the nni dis-
tance on weighted phylogenies with performance ratio of
O(log n). It is open whether the linear-cost subtree-transfer
problem is NP-hard for weighted phylogenies. However, it
has been shown that the problem is NP-hard for weighted
trees with non-uniquely labeled leaves (26).
COMPUTING DISTANCES BETWEEN GENOMES
The definition and study of appropriate measures of dis-
tance between pairs of species is of great importance in
[Figure 1. Examples of unweighted and weighted phylogenies.]

[Figure 2. The two possible nni operations on an internal edge (u, v): exchange B and C, or B and D.]

[Figure 3. An example of a subtree-transfer operation on a tree.]

[Figure 4. Left and right rotation operations on a rooted binary tree.]
computational biology. Such measures of distance can be
used, for example, in phylogeny construction and in tax-
onomic analysis.
As more and more molecular data become available,
methods for defining distances between species have
focused on such data. One of the most popular distance
measures is the edit distance between homologous DNA or
amino acid sequences obtained from different species. Such measures focus on point mutations and define the distance
between two sequences as the minimum number of these
moves required to transform one sequence into another. It
has been recognized that the edit-distance may underesti-
mate the distance between two sequences because of the
possibility that multiple point mutations occurring at the
same locus will be accounted for simply as one mutation.
The problem is that the probability of a point mutation is
not low enough to rule out this possibility.
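For reference, the edit distance itself is computed by the standard dynamic program sketched below; unit costs for insertions, deletions, and substitutions are an assumption made here for simplicity.

# Minimal sketch: edit (Levenshtein) distance with unit-cost point mutations.
def edit_distance(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = cost of transforming a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match or substitution
    return dp[m][n]

print(edit_distance("ACGT", "AGT"))   # 1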
Recently, there has been a spate of new definitions of distance that try to treat rarer, macro-level mutations as the basic moves. For example, if we know the order of genes on a chromosome for two different species, we can define the reversal distance between the two species to be the number of reversals of portions of the chromosome to transform the gene order in one species to the gene order in the other species. The question of finding the reversal distance was first explored in the computer science context by Kececioglu and Sankoff and by Bafna and Pevzner, and there has been significant progress made on this question by Bafna, Hannenhalli, Kececioglu, Pevzner, Ravi, Sankoff, and others (37-41). Other moves besides reversals have
been considered as well. Breaking off a portion of the
chromosome and inserting it elsewhere in the chromosome
is referred to as a transposition, and one can similarly
dene the transposition distance (42). Similarly, allowing
two chromosomes (viewed as strings of genes) to exchange
sufxes (or sometimes a sufx with a prex) is known as a
translocation, and this move can also be used to dene an
appropriate measure of distance between two species for
which much of the genome has been mapped (43).
Ferretti et al. (44) proposed a distance measure that is at
an even higher level of abstraction. Here, even the order of
genes on a particular chromosome of a species is ignored or
presumed to be unknown. It is assumed that the genome of
a species is given as a collection of sets. Each set in the
collection corresponds to a set of genes that are on one
chromosome and different sets in the collection correspond
to different chromosomes (see Fig. 5). In this scenario, one
can define a move to be an exchange of genes between two chromosomes, the fission of one chromosome into two, or
the fusion of two chromosomes into one (see Fig. 6). The
syntenic distance between two species has been defined by
Ferretti et al. (44) to be the number of such moves required
to transform the genome of one species to the genome of the
other.
Notice that any recombination of two chromosomes is
permissible in this model. By contrast, the set of legal
translocations (in the translocation distance model) is
severely limited by the order of genes on the chromosomes
being translocated. Furthermore, the transformationof the
rst genome into the second genome does not have to
produce a specied order of genes in the second genome.
The underlying justication of this model is that the
exchange of genes between chromosomes is a much rarer
event than the movement of genes within a chromosome
and, hence, a distance function should measure the mini-
mum number of such exchanges needed.
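As a purely illustrative sketch (the set-based encoding and function names are ours, not from Ref. 44), a genome in this model can be held as a list of gene sets, and the three moves act on it as follows.

# Minimal sketch of the three syntenic moves on genomes given as lists of gene sets.
def fusion(genome, i, j):
    """Join chromosomes i and j into one."""
    merged = genome[i] | genome[j]
    return [c for k, c in enumerate(genome) if k not in (i, j)] + [merged]

def fission(genome, i, part):
    """Split chromosome i into 'part' and the rest."""
    rest = genome[i] - part
    return [c for k, c in enumerate(genome) if k != i] + [part, rest]

def translocation(genome, i, j, to_j, to_i):
    """Exchange the gene sets to_j and to_i between chromosomes i and j."""
    new_i = (genome[i] - to_j) | to_i
    new_j = (genome[j] - to_i) | to_j
    return [c for k, c in enumerate(genome) if k not in (i, j)] + [new_i, new_j]

g = [{1, 2, 3}, {4, 5}]
print(translocation(g, 0, 1, {3}, {5}))   # [{1, 2, 5}, {3, 4}]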
In Ref. 45, the authors prove various results on the
syntenic distance. For example, they show that computing
the syntenic distance exactly is NP-hard, there is a simple
polynomial time approximation algorithm for the synteny
problem with performance ratio 2, and computing the
syntenic distance is fixed-parameter tractable.
The median problem arises in connection with the phylogenetic inference problem (44) and is defined as follows: Given three genomes $G_1$, $G_2$, and $G_3$, we are required to construct a genome $G$ such that the median distance $\alpha_G = \sum_{i=1}^{3} D(G, G_i)$ is minimized (where $D$ is the syntenic distance). Without any additional constraints, this problem is trivial, as we can take $G$ to be empty (and then $\alpha_G = 0$). In the context of syntenic distance, any one of the following three constraints seems relevant: (c1) $G$ must contain all genes present in all three given genomes, (c2) $G$ must contain all genes present in at least two of the three given genomes, and (c3) $G$ must contain all genes present in at least one of the three given genomes. Computing the median genome is NP-hard under any one of the three constraints (c1), (c2), or (c3). Moreover, one can approximate the median problem in polynomial time [under any one of the constraints (c1), (c2), or (c3)] with a constant performance ratio. See Ref. 45 for details.
MULTIPLE SEQUENCE ALIGNMENT PROBLEMS
Multiple sequence alignment is the most critical cutting-
edge tool for sequence analysis. It can help in extracting, finding, and representing biologically important commonalities
[Figure 5. A genome with 12 genes and 3 chromosomes.]

[Figure 6. Different mutation operations: fission breaks a chromosome into two, fusion joins two chromosomes into one, and translocation transfers genes between chromosomes.]
from a set of sequences. These commonalities could repre-
sent some highly conserved subregions, common functions,
or common structures. Multiple sequence alignment is also
very useful in inferring the evolutionary history of a family
of sequences (46-49).
A multiple alignment $\mathcal{A}$ of $k \ge 2$ sequences is obtained as follows: Spaces are inserted into each sequence so that the resulting sequences $s^{\mathcal{A}}_i$ ($i = 1, 2, \ldots, k$) have the same length $l$, and the sequences are arranged in $k$ rows of $l$ columns each.

The value of the multiple alignment $\mathcal{A}$ is defined as

$$\sum_{i=1}^{l} m\left(s^{\mathcal{A}}_1(i), s^{\mathcal{A}}_2(i), \ldots, s^{\mathcal{A}}_k(i)\right)$$

where $s^{\mathcal{A}}_j(i)$ denotes the $i$th letter in the resulting sequence $s^{\mathcal{A}}_j$, and $m(s^{\mathcal{A}}_1(i), s^{\mathcal{A}}_2(i), \ldots, s^{\mathcal{A}}_k(i))$ denotes the score of the $i$th column. The multiple sequence alignment problem is to construct a multiple alignment minimizing its value.
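A direct rendering of this definition, with the column scoring function left as a parameter, is sketched below; the alignment rows and the toy column score are illustrative.

# Minimal sketch: value of a multiple alignment as a sum of per-column scores.
def alignment_value(rows, column_score):
    """rows: list of k aligned strings of equal length l (gaps as '-')."""
    l = len(rows[0])
    assert all(len(r) == l for r in rows)
    return sum(column_score([r[i] for r in rows]) for i in range(l))

# Toy column score: number of mismatched non-gap pairs in the column.
def toy_column_score(col):
    letters = [c for c in col if c != '-']
    return sum(1 for i in range(len(letters)) for j in range(i + 1, len(letters))
               if letters[i] != letters[j])

print(alignment_value(["AC-GT", "ACCGT", "A-TGT"], toy_column_score))   # 1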
Many versions have been proposed based on different
objective functions. We will discuss some of the important
ones.
SP-Alignment and Steiner Consensus String
For SP-score (Sum-of-the-Pairs), the score of each column is defined as

$$m\left(s^{\mathcal{A}}_1(i), s^{\mathcal{A}}_2(i), \ldots, s^{\mathcal{A}}_k(i)\right) = \sum_{1 \le j < l \le k} m\left(s^{\mathcal{A}}_j(i), s^{\mathcal{A}}_l(i)\right)$$

where $m(s^{\mathcal{A}}_j(i), s^{\mathcal{A}}_l(i))$ is the score of the two opposing letters $s^{\mathcal{A}}_j(i)$ and $s^{\mathcal{A}}_l(i)$. The SP-score is sensible and has previously been studied extensively.
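A column's SP-score follows directly from the definition; in the sketch below, the pairwise score m is a toy 0/1 mismatch score standing in for whatever scheme is assumed, and the resulting function could serve as the column score in the earlier sketch.

# Minimal sketch: Sum-of-Pairs score of one alignment column.
def pair_score(a, b):
    """Toy pairwise score: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1

def sp_column_score(col, m=pair_score):
    k = len(col)
    return sum(m(col[j], col[l]) for j in range(k) for l in range(j + 1, k))

print(sp_column_score(['A', 'A', '-', 'C']))   # 5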
The SP-alignment problem is to find an alignment with the smallest SP-score. It was first studied in Ref. 50 and subsequently used in Refs. 51-54. The SP-alignment problem can be solved exactly by using dynamic programming. However, if there are k sequences and the length of the sequences is n, it takes $O(n^k)$ time. Thus, it works only for small numbers of sequences. Some techniques to reduce the time and space have been developed in Refs. 51 and 55-57. With these techniques, it is possible to optimally align up to six sequences of 200 characters in practice.

In fact, the SP-alignment problem is NP-hard (58). Thus, it is unlikely that a polynomial time algorithm exists for this problem. In the proof of NP-hardness, it is assumed that some pairs of identical characters have a non-zero score. An interesting open problem is the complexity of the case in which each pair of identical characters is scored 0.
The first approximation algorithm was given by Gusfield (53). He introduced the center star algorithm, which is very simple and efficient. It selects a sequence (called the center string) $s_c$ from the set of $k$ given sequences $S$ such that $\sum_{i=1}^{k} dist(s_c, s_i)$ is minimized. It then optimally aligns the sequences in $S - \{s_c\}$ to $s_c$ and gets $k - 1$ pairwise alignments. These $k - 1$ pairwise alignments lead to a multiple alignment for the $k$ sequences in $S$. If the score scheme for pairs of characters satisfies the triangle inequality, the cost of the multiple alignment produced by the center star algorithm is at most twice the optimum (47,53). Some improved results were reported in Refs. 54 and 59.
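The center-selection step is easy to sketch: compute all pairwise distances and keep the sequence with the smallest total. The Hamming distance used below is only a placeholder for whatever pairwise distance scheme is assumed.

# Minimal sketch: choosing the center string of the center star algorithm.
def choose_center(seqs, dist):
    """Return the index of the sequence minimizing the sum of distances to all others."""
    totals = [sum(dist(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)

# Toy distance for equal-length strings; in practice dist would be an edit distance.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

seqs = ["ACGT", "AGGT", "ACGA", "TGGT"]
print(choose_center(seqs, hamming))   # 0 (ties are broken by the first index)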
Another score, called the consensus score, is defined as follows:

$$m\left(s^{\mathcal{A}}_1(i), s^{\mathcal{A}}_2(i), \ldots, s^{\mathcal{A}}_k(i)\right) = \min_{s \in \Sigma} \sum_{j=1}^{k} m\left(s^{\mathcal{A}}_j(i), s\right)$$

where $\Sigma$ is the set of characters that form the sequences.
Here, we reconstruct a character for each column and thus
obtain a string. This string is called a Steiner consensus
string and can be used as a representative for the set of
given sequences. The problem is called the Steiner consen-
sus string problem.
The Steiner consensus string problem was proved to be
NP-complete (60) and MAX SNP-hard (58). In the proof of
MAX SNP-hardness, it is assumed that there is a wild
card, and thus the triangle inequality does not hold.
Combining with the results in Ref. 61, it shows that there
is no polynomial time approximation scheme for this pro-
blem. Interestingly, the same center star algorithm also has
performance ratio 2 for this problem (47).
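Reconstructing the consensus character of a single column is a direct minimization over the alphabet, as sketched below with an illustrative alphabet and a toy 0/1 mismatch score.

# Minimal sketch: pick the character minimizing the total score against a column.
def consensus_char(col, alphabet, m):
    return min(alphabet, key=lambda s: sum(m(c, s) for c in col))

# Toy 0/1 mismatch score; a space '-' is treated as just another symbol here.
m = lambda a, b: 0 if a == b else 1
print(consensus_char(['A', 'A', 'C', '-'], "ACGT-", m))   # A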
Diagonal Band Alignment. The restriction of aligning
sequences within a constant diagonal band is often used
in practical situations. Methods under this assumption
have been extensively studied too. Sankoff and Kruskal
discussed the problem under the rubric of "cutting corners" in Ref. 62. Alignment within a band is used in the final stage
of the well-known FASTA program for rapid searching of
protein and DNA sequence databases (63,64). Pearson
showed that alignment within a band gives very good
results for many protein superfamilies (65).
Other references on the subject can be found in Refs. 51
and 66-69. Spouge gives a survey on this topic in Ref. 70.
Let $S = \{s_1, s_2, \ldots, s_k\}$ be a set of $k$ sequences, each of length $m$ (for simplicity), and let $\mathcal{A}$ be an alignment of the $k$ sequences. Let the length of the alignment $\mathcal{A}$ be $M$. $\mathcal{A}$ is called a c-diagonal alignment if, for any $p \le m$ and $1 \le i < j \le k$, whenever the $p$th letter of $s_i$ is in column $q$ of $\mathcal{A}$ and the $p$th letter of $s_j$ is in column $r$ of $\mathcal{A}$, then $|q - r| \le c$. In other words, the inserted spaces are evenly distributed among all sequences, and the $i$th position of a sequence is at most about $c$ positions away from the $i$th position of any other sequence.
In Ref. 71, Li et al. presented polynomial time approximation schemes for c-diagonal alignment under both the SP-score and the consensus score.
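The c-diagonal condition can be checked mechanically: record the column of each original letter and compare across sequences. The sketch below does so for rows given as gapped strings; the encoding is an assumption made for illustration.

# Minimal sketch: test whether an alignment (rows with '-' gaps) is c-diagonal.
def is_c_diagonal(rows, c):
    # columns[i][p] = column index (0-based) of the (p+1)-th letter of sequence i
    columns = [[j for j, ch in enumerate(row) if ch != '-'] for row in rows]
    m = min(len(cols) for cols in columns)       # sequences are assumed to have equal length m
    for p in range(m):
        cols_at_p = [cols[p] for cols in columns]
        if max(cols_at_p) - min(cols_at_p) > c:
            return False
    return True

print(is_c_diagonal(["AC-GT", "A-CGT", "ACG-T"], 1))   # True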
Tree Alignment
Tree score: To define the score $m(s^{\mathcal{A}}_1(i), s^{\mathcal{A}}_2(i), \ldots, s^{\mathcal{A}}_k(i))$ of the $i$th column, an evolutionary (or phylogenetic) tree $T = (V, E)$ with $k$ leaves is assumed, each leaf $j$ corresponding to a sequence $S_j$. (Here, $V$ and $E$ denote the sets of nodes and edges in $T$, respectively.) Let $k+1, k+2, \ldots, k+m$ be the internal nodes of $T$. For each internal node $j$, reconstruct a letter (possibly a space) $s^{\mathcal{A}}_j(i)$ such that $\sum_{(p,q) \in E} m(s^{\mathcal{A}}_p(i), s^{\mathcal{A}}_q(i))$ is minimized. The score $m(s^{\mathcal{A}}_1(i), s^{\mathcal{A}}_2(i), \ldots, s^{\mathcal{A}}_k(i))$ of the $i$th column is thus defined as

$$m\left(s^{\mathcal{A}}_1(i), s^{\mathcal{A}}_2(i), \ldots, s^{\mathcal{A}}_k(i)\right) = \sum_{(p,q) \in E} m\left(s^{\mathcal{A}}_p(i), s^{\mathcal{A}}_q(i)\right)$$
This measure has been discussed in Refs. 14,48,51,59
and 72. Multiple sequence alignment with tree score is
often referred to as tree alignment in the literature.
Note that a tree alignment induces a set of reconstructed
sequences, each corresponding to an internal node. Thus, it
is convenient to reformulate a tree alignment as follows:
Given a set X of k sequences and an evolutionary tree T with
k leaves, where each leaf is associated with a given
sequence, reconstruct a sequence for each internal node
to minimize the cost of T. Here, the cost of T is the sum of the
edit distance of each pair of (given or reconstructed)
sequences associated with an edge. Observe that once a
sequence for each internal node has been reconstructed, a
multiple alignment can be obtained by optimally aligning
the pair of sequences associated with each edge of the tree.
Moreover, the tree score of this induced multiple alignment
equals the cost of T. In this sense, the two formulations of
tree alignment are equivalent.
Sankoff gave an exact algorithm for tree alignment that runs in $O(n^k)$ time, where $n$ is the length of the sequences and $k$ is the number of given sequences. Tree alignment was proved to be NP-hard (58).
Therefore, it is unlikely to have a polynomial time algo-
rithm for tree alignment. Some heuristic algorithms have
also been considered in the past. Altschul and Lipman tried
to cut down the computation volume required by dynamic
programming (51). Sankoff, et al. gave an iterative
improvement method to speed up the computation
(48,72). Waterman and Perlwitz devised a heuristic method
when the sequences are related by a binary tree (73). Hein
proposed a heuristic method based on the concept of a
sequence graph (74,75). Ravi and Kececioglu designed an
approximation algorithm with performance ratio $\frac{\deg + 1}{\deg - 1}$ when the given tree is a regular deg-ary tree (i.e., each internal node has exactly deg children) (76).
The first approximation algorithm with a guaranteed performance ratio was devised by Wang et al. (77). A ratio-2 algorithm was given. The algorithm was then extended to a polynomial time approximation scheme (PTAS) (i.e., the performance ratio could arbitrarily approach 1). The PTAS requires computing exact solutions for depth-t subtrees. For a fixed t, the performance ratio was proved to be $1 + \frac{3}{t}$, and the running time was proved to be $O\left((k/\deg^{t})\,\deg^{\frac{t-1}{2}}\,M(2, t-1, n)\right)$, where deg is the degree of the given tree, n is the length of the sequences, and $M(\deg, t-1, n)$ is the time needed to optimally align a tree with $\deg^{t-1}+1$ leaves, which is upper-bounded by $O(n^{\deg^{t-1}+1})$. Based on the analysis, to obtain a performance ratio less than 2, exact solutions for depth-4 subtrees must be computed, and thus optimally aligning nine sequences at a time is required, which is impractical even for sequences of length 100.
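The ratio-2 algorithm is based on lifted labelings, in which every internal node receives the sequence of one of its children, and an optimal lifted labeling is found by a dynamic program over the tree. The sketch below is a simplified rendering of that idea (not the exact algorithm of Ref. 77), using a toy Hamming distance in place of the edit distance and an illustrative tree encoding.

# Minimal sketch of the lifted-labeling dynamic program for tree alignment.
# A tree is either a leaf sequence (str) or a tuple of subtrees.
def lifted_cost(tree, dist):
    """Return {candidate root label: minimum cost of the subtree under lifted labeling}."""
    if isinstance(tree, str):
        return {tree: 0}
    child_tables = [lifted_cost(child, dist) for child in tree]
    table = {}
    for j, tbl_j in enumerate(child_tables):
        for s, cost_s in tbl_j.items():              # lift label s from child j (that edge costs 0)
            total = cost_s
            for i, tbl_i in enumerate(child_tables):
                if i != j:                            # other children pay their best label plus edge cost
                    total += min(c + dist(t, s) for t, c in tbl_i.items())
            table[s] = min(total, table.get(s, float('inf')))
    return table

def hamming(a, b):                                    # toy stand-in for the edit distance
    return sum(x != y for x, y in zip(a, b))

tree = (("AAAA", "AAAT"), ("GGGG", "GGGT"))
print(min(lifted_cost(tree, hamming).values()))       # 5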
An improved version was given in Ref. 78. They proposed a new PTAS for the case where the given tree is a regular deg-ary tree. The algorithm is much faster than the one in Ref. 77. The algorithm also must do local optimizations for depth-t subtrees. For a fixed t, the performance ratio of the new PTAS is $1 + \frac{2}{t} - \frac{2}{t\,2^{t}}$, and the running time is $O(\min\{2^{t}, k\}\, k\, d\, M(\deg, t-1, n))$, where d is the depth of the tree. Presently, there are efficient programs (72) to do local optimizations for three sequences (t = 2). In fact, we can expect to obtain optimal solutions for five sequences (t = 3) of length 200 in practice, as there is such a program (55,56) for SP-score and similar techniques can be used to attack the tree alignment problem. Therefore, solutions with costs at most 1.583 times the optimum can be obtained in practice for strings of length 200.
For tree alignment, the given tree is typically a binary tree. Recently, Wang et al. (79) designed a PTAS for binary trees. The new approximation scheme adopts a more clever partitioning strategy and has a better time efficiency for the same performance ratio. For any fixed r, where $r = 2^{t-1} + 1 - q$ and $0 \le q \le 2^{t-2} - 1$, the new PTAS runs in time $O(k d n^{r})$ and achieves an approximation ratio of $1 + \frac{2^{t-1}}{2^{t-2}(t+1) + q}$. Here, the parameter r represents the size of the local optimization. In particular, when $r = 2^{t-1} + 1$, its approximation ratio is simply $1 + \frac{2}{t+1}$.
Generalized Tree Alignment
In practice, we often face a more difficult problem called
generalized tree alignment. Suppose we are given a set of
sequences. The problem is to construct an evolutionary
tree as well as a set of sequences (called reconstructed
sequences) such that each leaf of the evolutionary tree is
assigned a given sequence, each internal node of the tree
is assigned a reconstructed sequence, and the cost of the
tree is minimized over all possible evolutionary trees and
reconstructed sequences.
Intuitively, the problem is harder than tree alignment
because the tree is not given and we have to compute the
tree structure as well as the sequences assigned to internal
nodes. In fact, the problem was proved to be MAX SNP-hard (58), and a simplified proof was given in Ref. 80. This implies that it is impossible to have a PTAS for generalized tree alignment unless P = NP (61), which confirms this observation from an approximation point of view.
The generalized tree alignment problem is, in fact, the Steiner tree problem in sequence spaces. One might use the approximation algorithms with guaranteed performance ratios for graph Steiner trees (81), which, however, may lead to a tree structure in which a given sequence is an internal node. Sometimes, this is unacceptable. Schwikowski and Vingron give a method that combines clustering algorithms and Hein's sequence graph method (82). The pro-
duced solutions contain biologically reasonable trees and
keep the guaranteed performance ratio.
Fixed Topology History/Alignment with Recombination
Multigene families, viruses, and alleles from within popu-
lations experience recombinations (23,24,83,84). When
recombination happens, the ancestral material on the present sequence $s_1$ is located on two sequences $s_2$ and $s_3$. $s_2$ and $s_3$ can be cut at $k$ locations (break points) into $k+1$ pieces, where $s_2 = s_{2,1} s_{2,2} \ldots s_{2,k+1}$ and $s_3 = s_{3,1} s_{3,2} \ldots s_{3,k+1}$. $s_1$ can then be represented as $\hat{s}_{2,1} \hat{s}_{3,2} \hat{s}_{2,3} \ldots \hat{s}_{2,i} \hat{s}_{3,i+1} \ldots$, where each subsequence $\hat{s}_{2,i}$ or $\hat{s}_{3,i+1}$ differs from the corresponding $s_{2,i}$ or $s_{3,i+1}$ by insertions, deletions, and substitutions of letters. The number of times $s_1$ switches between $s_2$ and $s_3$, denoted $k$, is called the number of crossovers. The cost of the recombination is

$$\mathrm{dist}(s_{2,1}, \hat{s}_{2,1}) + \mathrm{dist}(s_{3,2}, \hat{s}_{3,2}) + \cdots + \mathrm{dist}(s_{2,i}, \hat{s}_{2,i}) + \mathrm{dist}(s_{3,i+1}, \hat{s}_{3,i+1}) + \cdots + kx$$

where $\mathrm{dist}(\cdot, \cdot)$ is the edit distance between two sequences, $k$ is the number of crossovers, and $x$ is the crossover penalty. The recombination distance to produce $s_1$ from $s_2$ and $s_3$ is the cost of a recombination that has the smallest cost among all possible recombinations. We use $r\text{-}dist(s_1, s_2, s_3)$ to denote the recombination distance. For more details, see Refs. 83 and 85.
When recombination occurs, the given topology is no
longer a binary tree. Instead, some nodes, called recombi-
nation nodes, in the given topology may have two parents
(23,24). In a more general case, as described in Ref. 83, the
topology may have more than one root. The set of roots is
called a proto-set. The edges incident to recombination
nodes are called recombination edges [see Fig. 7 (b)]. A
node/edge is normal if it is not a recombination node/edge.
The cost of a pair of recombination edges is the recom-
bination distance to produce the sequence on the recombi-
nation node from the two sequences on its parents. The cost
of other normal edges is the edit distance between two
sequences. A topology is fully labeled if every node in the
topology is labeled. For a fully labeled topology, the cost of
the topology is the total cost of edges in the topology. Each
node in the topology with degree greater than 1 is an
internal node. Each leaf/terminal (degree 1 node) in the
topology is labeled with a given sequence. The goal here is to construct a sequence for each internal node such that the cost of the topology is minimized. We call this problem fixed topology history with recombination (FTHR).
Obviously, this problem is a generalization of tree alignment. The difference is that the given topology is no longer a
binary tree. Instead, there are some recombination nodes
that have two parents instead of one. Moreover, there may
be more than one root in the topology.
A different version, called fixed topology alignment with recombination (FTAR), is also discussed (86). From an approximation point of view, FTHR and FTAR are much
harder than tree alignment. It is shown that FTHR and
FTAR cannot be approximated within any constant perfor-
mance ratio unless P = NP (86).
A more restricted case, where each internal node has at
most one recombination child and there are at most six
parents of recombination nodes in any path from the root to
a leaf in the given topology, is also considered. It is shown
that the restricted version for both FTHR and FTAR is
MAX-SNP-hard. That is, there is no polynomial time
approximation scheme unless P = NP (86).
The above hardness results are disappointing. However,
recombination occurs infrequently. So, it is interesting
to study some restricted cases. A merge node of recombi-
nation node v is the lowest common ancestor of vs two
parents. The two different paths from a recombination
node to its merge node are called merge paths. We then
study the case where
(C1) each internal node has at most one recombination child, and
(C2) any two merge paths for different recombination nodes do not share any common node.
Using a method similar to the lifting method for tree
alignment, one can get a ratio-3 approximation algorithm
for both FTHR and FTAR when the given topology satisfies
(C1) and (C2). The ratio-3 algorithm can be extended to a
PTAS for FTAR with bounded number of crossovers (see
Ref. 86).
Remarks: Hein might be the first to study the method to reconstruct the history of sequences subject to recombination (23,24). Hein observed that the evolution of a sequence with k recombinations could be described by k recombination points and k + 1 trees describing the evolution of the k + 1 intervals, where two neighboring trees were either identical or differed by one subtree-transfer operation (23-26,35). A heuristic method was proposed to find the most parsimonious history of the sequences in terms of mutation and recombination operations.
Another approach was taken by Kececioglu and Gusfield (83). They introduced two new problems, recombination
distance and bottleneck recombination history. They tried
to include higher-order evolutionary events such as block
insertions and deletions (68) as well as tandem repeats
(87,88).
CONCLUSION
In this article, we have discussed some important topics in
the field of computational biology, such as the phylogenetic
construction and comparison methods, syntenic distance
between genomes, and the multiple sequence alignment
problems. Given the vast number of topics in computational biology, those discussed here constitute only a part of them. Some of the important topics that were not covered in this article are as follows:
protein structure prediction,
DNA physical mapping problems,
metabolic modeling, and
string/database search problems.

[Figure 7. (a) A recombination operation. (b) The topology; the dark edges are recombination edges, and the circled node is a recombination node.]
We hope that this survey article will inspire the readers
for further study and research of these and other related
topics.
Papers on computational molecular biology have started
to appear in many different books, journals, and confer-
ences. Below, we list some sources that could serve as
excellent starting points for various problems that occur
in computational biology:
Books: See Refs. 49, 53, 62, and 89-92.
Journals: Computer Applications in the Biosciences
(recently renamed as Bioinformatics), Journal of
Computational Biology, Bulletin of Mathematical
Biology, Journal of Theoretical Biology, and so on.
Conferences: Annual Symposium on Combinatorial
Pattern Matching (CPM), Pacific Symposium on Bio-
computing (PSB), Annual International Conference
on Computational Molecular Biology (RECOMB),
Annual Conference on Intelligent Systems in Mole-
cular Biology (ISMB), and so on.
Web pages: http://www.cs.Washington.edu/education/courses/590bi, http://www.cse.ucsc.edu/research/compbio, http://www.cs.jhu.edu/salzberg/cs439.html, and so on.
ACKNOWLEDGMENTS
We thank Prof. Tao Jiang for bringing the authors together. We also thank Dr. Todd Wareham for carefully reading the draft and giving valuable suggestions. Bhaskar DasGupta's work was partly supported by NSF grants CCR-0296041, CCR-0206795, and CCR-0208749. Lusheng Wang's work was fully supported by the Research Grants
Council of the Hong Kong Special Administrative Region,
China (Project No. CityU 1070/02E).
BIBLIOGRAPHY
1. M. S. Waterman, Sequence alignments, in M. S. Waterman
(ed.), Mathematical Methods for DNA Sequences, Boca Raton,
FL: CRC Press, 1989, pp. 5392.
2. W. Miller, S. Schwartz, and R. C. Hardison, A point of contact
between computer science and molecular biology, IEEE Com-
putat. Sci. Eng., 6978, 1994.
3. K. A. Frenkel, The human genome project and informatics,
Commun. ACM, 34, (11): 4151, 1991.
4. E. S. Lander, R. Langridge, and D. M. Saccocio, Mapping and
interpreting biological information, Commun. ACM, 34, (11):
3339, 1991.
5. V. A. Funk, and D. R. Brooks, Phylogenetic Systematics as the
Basis of Comparative Biology, Washington, DC: Smithsonian
Institution Press, 1990.
6. D. M. Hillis, B. K. Mable, and C. Moritz, Applications of
molecular systematics, in D. M. Hillis, C. Moritz, and B. K.
Mable (eds.), Molecular Systematics, (2nd ed.) Sunderland,
MA: Sinauer Associates, 1996, pp. 515543.
7. M. R. Garey and D. S. Johnson, Computers and Intractability:
A Guide to the Theory of NP-Completeness, New York: W. H.
Freeman, 1979.
8. D. Hochbaum, Approximation Algorithms for NP-hard Pro-
blems, Boston, MA: PWS Publishers, 1996.
9. C. H. Papadimitriou, Computational Complexity, Reading,
MA: Addison-Wesley, 1994.
10. J. Felsenstein, Phylogenies from molecular sequences: infer-
ences and reliability, Annu. Rev. Genet., 22: 521565, 1988.
11. D. L. Swofford, G. J. Olsen, P. J. Waddell, and D. M. Hillis,
Phylogenetic inference, in D. M. Hillis, C. Moritz, and B. K.
Mable (eds.), Molecular Systematics, (2nd ed.), Sunderland,
MA: Sinauer Associates, 1996, pp. 407514.
12. A. W. F. Edwards and L. L. Cavalli-Sforza, The reconstruc-
tion of evolution, Ann. Hum. Genet, 27: 105, 1964, (also in
Heredity 18: 553, 1964).
13. W. M. Fitch, Toward defining the course of evolution: minimum
change for a specified tree topology, Syst. Zool., 20: 406-416,
1971.
14. D. Sankoff, Minimal mutation trees of sequences, SIAM J.
Appl. Math., 28: 3542, 1975.
15. L. L. Cavalli-Sforza and A. W. F. Edwards, Phylogenetic ana-
lysis: models and estimation procedures, Evolution, 32: 550
570, 1967, (also published in Am. J. Hum. Genet., 19: 233257,
1967.)
16. W. M. Fitch and E. Margoliash, Construction of phylogenetic
trees, Science, 155: 279284, 1967.
17. N. Saitou and M. Nei, The neighbor-joining method: a new
method for reconstructing phylogenetic trees, Mol. Biol. Evol.,
4: 406425, 1987.
18. D. Barry and J. A. Hartigan, Statistical analysis of hominoid
molecular evolution, Stat. Sci., 2: 191210, 1987.
19. J. Felsenstein, Evolutionary trees for DNA sequences: a max-
imum likelihood approach, J. Mol. Evol., 17: 368376, 1981.
20. M. Kuhner and J. Felsenstein, A simulation comparison of
phylogeny algorithms under equal and unequal evolutionary
rates. Mol. Biol. Evol. 11 (3): 459468, 1994.
21. D. F. Robinson, Comparison of labeled trees with valency three,
J. Combinatorial Theory, Series B, 11: 105119, 1971.
22. G. W. Moore, M. Goodman, and J. Barnabas, An iterative
approach from the standpoint of the additive hypothesis to
the dendrogram problem posed by molecular data sets, J.
Theoret. Biol., 38: 423457, 1973.
23. J. Hein, Reconstructing evolution of sequences subject to
recombination using parsimony, Math. Biosci., 98: 185200,
1990.
24. J. Hein, A heuristic method to reconstruct the history of
sequences subject to recombination, J. Mol. Evol., 36: 396
405, 1993.
25. B. DasGupta, X. He, T. Jiang, M. Li, J. Tromp, and L. Zhang, On
distances between phylogenetic trees, Proc. 8th Annual ACM-
SIAM Symposium on Discrete Algorithms, 1997, pp. 427436.
26. B. DasGupta, X. He, T. Jiang, M. Li, and J. Tromp, On the
linear-cost subtree-transfer distance, to appear in the special
issue in Algorithmica on computational biology, 1998.
27. K. Culik II and D. Wood, A note on some tree similarity
measures, Information Proc. Lett., 15: 3942, 1982.
28. M. Li, J. Tromp, and L. X. Zhang, On the nearest neighbor
interchange distance between evolutionary trees, J. Theoret.
Biol., 182: 463467, 1996.
29. D. Sleator, R. Tarjan, W. Thurston, Short encodings of evolving
structures, SIAM J. Discr. Math., 5: 428450, 1992.
30. M. S. Waterman and T. F. Smith, On the similarity of dendro-
grams, J. Theoret. Biol., 73: 789800, 1978.
31. W. H. E. Day, Properties of the nearest neighbor interchange
metric for trees of small size, J. Theoretical Biol., 101: 275288,
1983.
32. J. P. Jarvis, J. K. Luedeman, andD. R. Shier, Counterexamples
in measuring the distance between binary trees, Math. Social
Sci., 4: 271274, 1983.
33. J. P. Jarvis, J. K. Luedeman, and D. R. Shier, Comments on
computing the similarity of binary trees, J. Theoret. Biol., 100:
427433, 1983.
34. M. Krivanek, Computing the nearest neighbor interchange metric for unlabeled binary trees is NP-complete, J. Classification, 3: 55-60, 1986.
35. J. Hein, T. Jiang, L. Wang, and K. Zhang, On the complexity of
comparing evolutionary trees, Discrete Appl. Math., 7: 153
169, 1996.
36. D. Sleator, R. Tarjan, and W. Thurston, Rotation distance,
triangulations, and hyperbolic geometry, J. Amer. Math. Soc.,
1: 647681, 1988.
37. V. Bafna and P. Pevzner, Genome rearrangements and sorting
by reversals, 34th IEEE Symp. on Foundations of Computer
Science, 1993, pp. 148157.
38. V. Bafna, and P. Pevzner, Sorting by reversals: genome rear-
rangements in plant organelles and evolutionary history of X
chromosome, Mol. Biol. Evol, 12: 239246, 1995.
39. S. Hannenhalli and P. Pevzner, Transforming Cabbage into
Turnip (polynomial algorithmfor sorting signed permutations
by reversals), Proc. of 27th Ann. ACM Symp. on Theory of
Computing, 1995, pp. 178189.
40. J. Kececioglu and D. Sankoff, Exact and approximation algo-
rithms for the inversion distance between two permutations,
Proc. of 4th Ann. Symp. on Combinatorial Pattern Matching,
Lecture Notes in Comp. Sci., 684: 87105, 1993.
41. J. Kececioglu, and D. Sankoff, Efcient bounds for oriented
chromosome inversion distance, Proc. of 5th Ann. Symp. on
Combinatorial Pattern Matching, Lecture Notes in Comp. Sci.,
807: 307325, 1994.
42. V. Bafna, and P. Pevzner, Sorting by transpositions, Proc. of
6th Ann. ACM-SIAM Symp. on Discrete Algorithms, 1995,
pp. 614623.
43. J. Kececioglu and R. Ravi, Of mice and men: evolutionary
distances between genomes under translocation, Proc. of 6th
Ann. ACM-SIAMSymp. onDiscrete Algorithms, 1995, pp. 604
613.
44. V. Ferretti, J. H. Nadeau, and D. Sankoff, Original synteny, in
Proc. of 7th Ann. Symp. on Combinatorial Pattern Matching,
1996, pp. 159167.
45. B. DasGupta, T. Jiang, S. Kannan, M. Li, and E. Sweedyk, On
the complexity and approximation of syntenic distance, 1st
Annual International Conference On Computational Molecu-
lar Biology, 1997, pp. 99108 (journal version to appear in
Discrete and Applied Mathematics).
46. S. C. Chan, A. K. C. Wong, and D. K. T. Chiu, A survey of
multiple sequence comparison methods, Bull. Math. Biol., 54,
(4): 563598, 1992.
47. D. Guseld, Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology, Cambridge,
UK: Cambridge University Press, 1997.
48. D. Sankoff, and R. Cedergren, Simultaneous comparisons
of three or more sequences related by a tree, in D. Sankoff,
and J. Kruskal (eds.), Time Warps, String Edits, and Macro-
molecules: The Theory and Practice of Sequence Comparison,
Reading, MA: Addison Wesley, 1983, pp. 253264.
49. M. S. Waterman, Introduction to Computational Biology:
Maps, Sequences, and Genomes, London: Chapman and
Hall, 1995.
50. H. Carrillo and D. Lipman, The multiple sequence alignment
problemin biology, SIAMJ. Appl. Math., 48: 10731082, 1988.
51. S. Altschul, and D. Lipman, Trees, stars, and multiple
sequence alignment, SIAM J. Applied Math., 49: 197209,
1989.
52. D. Baconn, and W. Anderson, Multiple sequence alignment, J.
Molec. Biol., 191: 153161, 1986.
53. D. Guseld, Efcient methods for multiple sequence alignment
with guaranteed error bounds, Bull. Math. Biol., 55: 141154,
1993.
54. P. Pevzner, Multiple alignment, communication cost, and
graphmatching, SIAMJ. Appl. Math., 56(6): 17631779, 1992.
55. S. Gupta, J. Kececioglu, and A. Schaffer, Making the shortest-
paths approach to sum-of-pairs multiple sequence alignment
more space efcient in practice, Proc. 6th Symposium on
Combinatorial Pattern Matching, Springer LNCS937, 1995,
pp. 128143.
56. J. Lipman, S. F. Altschul, and J. D. Kececioglu, A tool for
multiple sequence alignment, Proc. Nat. Acid Sci. U.S.A.,
86: 44124415, 1989.
57. G. D. Schuler, S. F. Altschul, and D. J. Lipman, A workbench
for multiple alignment construction and analysis, in Proteins:
Structure, function and Genetics, in press.
58. L. Wang and T. Jiang, On the complexity of multiple sequence
alignment, J. Computat. Biol., 1: 337348, 1994.
59. V. Bafna, E. Lawer, and P. Pevzner, Approximate methods for
multiple sequence alignment, Proc. 5th Symp. on Combinator-
ial Pattern Matching. Springer LNCS 807, 1994, pp. 4353.
60. E. Sweedyk, and T. Warnow, The tree alignment problem is
NP-complete, Manuscript.
61. S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy, On
the intractability of approximation problems, 33rd IEEESym-
posium on Foundations of Computer Science, 1992, pp. 1423.
62. D. Sankoff andJ. Kruskal (eds.). Time Warps, StringEdits, and
Macro-Molecules: The Theory and Practice of Sequence Com-
parison, Reading, MA: Addison Wesley, 1983.
63. W. R. Pearson and D. Lipman, Improved tools for biological
sequence comparison, Proc. Natl. Acad. Sci. USA, 85: 2444
2448, 1988.
64. W. R. Pearson, Rapid and sensitive comparison with FASTP
and FASTA, Methods Enzymol., 183: 6398, 1990.
65. W. R. Pearson, Searching protein sequence libraries: compar-
ison of the sensitivity and selectivity of the Smith-Waterman
and FASTA algorithms, Genomics, 11: 635650, 1991.
66. K. Chao, W. R. Pearson, and W. Miller, Aligning two sequences
within a specied diagonal band, CABIOS, 8: 481487, 1992.
67. J. W. Fickett, Fast optimal alignment, Nucleic Acids Res., 12:
175180, 1984.
68. Z. Galil and R. Ciancarlo, Speeding up dynamic programming
withapplications to molecular biology, Theoret. Comp. Sci., 64:
107118, 1989.
69. E. Ukkonen, Algorithms for approximate string matching,
Inform. Control, 64: 100118, 1985.
70. J. L. Spouge, Fast optimal alignment, CABIOS, 7: 17, 1991.
BIOLOGY COMPUTING 9
71. M. Li, B. Ma, and L. Wang, Near optimal multiple alignment
within a band in polynomial time, 32th ACM Symp. on Theory
of Computing, 2000, pp. 425434.
72. D. Sankoff, R. J. Cedergren, and G. Lapalme, Frequency of
insertion-deletion, transversion, and transition in the evolu-
tion of 5S ribosomal RNA, J. Mol. Evol., 7: 133149, 1976.
73. M. S. Waterman and M. D. Perlwitz, Line geometries for
sequence comparisons, Bull. Math. Biol, 46: 567577, 1984.
74. J. Hein, Atree reconstruction method that is economical in the
number of pairwise comparisons used, Mol. Biol. Evol., 6, (6):
669684, 1989.
75. J. Hein, A new method that simultaneously aligns and recon-
structs ancestral sequences for any number of homologous
sequences, when the phylogeny is given, Mol. Biol. Evol., 6:
649668, 1989.
76. R. Ravi and J. Kececioglu, Approximation algorithms for
multiple sequence alignment under a xed evolutionary
tree, 5thAnnual SymposiumonCombinatorial PatternMatch-
ing, 1995, pp. 330339.
77. L. Wang, T. Jiang, andE. L. Lawler, Approximationalgorithms
for tree alignment with a given phylogeny, Algorithmica, 16:
302315, 1996.
78. L. Wang and D. Guseld, Improved approximation algorithms
for tree alignment, J. Algorithms, 25: 255173, 1997.
79. L. Wang, T. Jiang, and D. Guseld, A more efcient approx-
imation scheme for tree alignment, SIAMJ. Comput., 30: 283
299, 2000.
80. H. T. Wareham, A simplied proof of the NP-hardness and
MAX SNP-hardness of multiplel sequence tree alignment, J.
Computat. Biol., 2: 509514, 1995.
81. A. Z. Zelikovsky, The 11/6 approximation algorithm for the
Steiner problem on networks, Algorithmica, 9: 463470, 1993.
82. B. Schwikowski and M. Vingron, The deferred path heuristic
for the generalized tree alignment problem, 1st Annual Inter-
national Conference On Computational Molecular Biology,
1997, pp. 257266.
83. J. Kececioglu and D. Guseld, Reconstructing a history of
recombinations from a set of sequences, 5th Annual ACM-
SIAM Symposium on Discrete Algorithms, pp. 471480, 1994.
84. F. W. Stahl, Genetic Recombination, New York: Scientic
American, 1987, pp. 90101.
85. J. D. Watson, N. H. Hopkins, J. W. Roberts, J. A. Steitz, and A.
M. Weiner, Molecular Biology of the Gene, 4th ed.Menlo Park,
CA: Benjamin-Cummings, 1987.
86. B. Ma, L. Wang, and M. Li, Fixed topology alignment with
recombination, CPM98, to appear.
87. S. Kannan and E. W. Myers, An algorithm for locating non-
overlapping regions of maximum alignment score, 3rd
Annual Symposium on Combinatorial Pattern Matching,
1993, pp. 7486.
88. G. M. Landauand J. P. Schmidt, Analgorithmfor approximate
tandem repeats, 3rd Annual Symposium on Combinatorial
Pattern Matching, 1993, pp. 120133.
89. J. Collado-Vides, B. Magasanik, and T. F. Smith (eds.), Inte-
grative Approaches to Molecular Biology, Cambridge, MA: MIT
Press, 1996.
90. L. Hunter (ed.), Articial Intelligence in Molecular Biology,
Cambridge, MA: MIT Press, 1993.
91. J. Meidanis and J. C. Setubal, Introduction to Computational
Molecular Biology, Boston, MA: PWS Publishing Company,
1997.
92. G. A. Stephens, String Searching Algorithms, Singapore:
World Scientic Publishers, 1994.
BHASKAR DASGUPTA
University of Illinois at Chicago
Chicago, Illinois
LUSHENG WANG
City University of Hong Kong
Kowloon, Hong Kong
COMPUTATIONAL INTELLIGENCE
INTRODUCTION
Several interpretations of the notion of computational intelligence (CI) exist (1–9). Computationally intelligent systems have been characterized by Bezdek (1,2) relative to adaptivity, fault-tolerance, speed, and error rates. In its original conception, many technologies were identified to constitute the backbone of computational intelligence, namely, neural networks (10,11), genetic algorithms (10,11), fuzzy sets and fuzzy systems (10,11), evolutionary programming (10,11), and artificial life (12). More recently, rough set theory (13–33) has been considered in the context of computationally intelligent systems (3, 6–11, 13–16, 29–31, 34, 35), which naturally led to a generalization in the context of granular computing (32). Overall, CI can be regarded as a field of intelligent system design and analysis that dwells on a well-defined and clearly manifested
synergy of genetic, granular, and neural computing. A
detailed introduction to the different facets of such a
synergy along with a discussion of various realizations of
such synergistic links between CI technologies is given in
Refs. 3, 4, 10, 11, 19, 36, and 37.
GENETIC ALGORITHMS
Genetic algorithms were proposed by Holland as a search
mechanism in artificially adaptive populations (38). A genetic algorithm (GA) is a problem-solving method that simulates Darwinian evolutionary processes and naturally occurring genetic operations on chromosomes (39). In nature, a chromosome is a thread-like linear strand of DNA and associated proteins in the nucleus of animal and plant cells. A chromosome carries genes and serves as a vehicle in transmitting hereditary information. A gene is a hereditary unit that occupies a specific location on a chromosome and that determines a particular trait in an organism. Genes can undergo mutation (alteration or structural change). A consequence of the mutation of genes is the creation of a new trait in an organism. In genetic algorithms, the traits of artificial life forms are stored in bit strings that mimic
chromosome strings found in nature. The traits of indivi-
duals in a population are represented by a set of evolving
chromosomes. A GA transforms a set of chromosomes to
obtain the next generation of an evolving population. Such
transformations are the result of applying operations, such
as reproduction based on survival of the fittest, and genetic operations, such as sexual recombination (also called crossover) and mutation.
Each artificial chromosome has an associated fitness, which is measured with a fitness function. The simplest form of fitness function is known as raw fitness, which
is some form of performance score (e.g., number of pieces
of food found, amount of energy consumed, or number of
other life forms found). Each chromosome is assigned a
probability of reproduction that is proportional to its fitness. In a Darwinian system, natural selection controls evolution (10). Consider, for example, a collection of artificial life forms with behaviors resembling ants. Fitness will be quantified relative to the total number of pieces of food found and eaten (partially eaten food is counted). Reproduction consists in selecting the fittest individual x and the weakest individual y in a population and replacing y with a copy of x. After reproduction, a population will then have two copies of the fittest individual. A crossover operation
consists in exchanging genetic coding (bit values of one or
more genes) in two different chromosomes. The steps in a
crossover operation are as follows: (1) Randomly select a
location (also called the interstitial location) between two
bits in a chromosome string to form two fragments, (2)
select two parents (chromosomes to be crossed), and (3)
interchange the chromosome fragments. Because of the
complexity of traits represented by a gene, substrings of
bits in a chromosome are used to represent a trait (41). The
evolution of a population resulting from the application of
genetic operations results in changing the fitness of individual population members. A principal goal of GAs is to derive a population with optimal fitness.
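As an illustration of these operations, the following is a minimal Python sketch of a bit-string GA with raw fitness, fittest-for-weakest reproduction, single-point crossover at a randomly chosen interstitial location, and bit mutation. The chromosome length, population size, mutation rate, and fitness function are illustrative assumptions, not values taken from the text.

import random

CHROMOSOME_LENGTH = 16   # illustrative choice
POPULATION_SIZE = 8      # illustrative choice

def raw_fitness(chromosome):
    # Raw fitness as a simple performance score; here, the number of 1-bits.
    return sum(chromosome)

def reproduce(population):
    # Replace the weakest individual with a copy of the fittest one.
    fittest = max(population, key=raw_fitness)
    weakest = min(population, key=raw_fitness)
    population[population.index(weakest)] = list(fittest)

def crossover(parent_a, parent_b):
    # Single-point crossover: pick an interstitial location and swap fragments.
    point = random.randint(1, CHROMOSOME_LENGTH - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, rate=0.01):
    # Flip each bit with a small probability (mutation).
    return [1 - bit if random.random() < rate else bit for bit in chromosome]

def next_generation(population):
    # One generational step: reproduction, then crossover and mutation.
    reproduce(population)
    offspring = []
    while len(offspring) < POPULATION_SIZE:
        a, b = random.sample(population, 2)
        child1, child2 = crossover(a, b)
        offspring.extend([mutate(child1), mutate(child2)])
    return offspring[:POPULATION_SIZE]

if __name__ == "__main__":
    population = [[random.randint(0, 1) for _ in range(CHROMOSOME_LENGTH)]
                  for _ in range(POPULATION_SIZE)]
    for _ in range(50):
        population = next_generation(population)
    print(max(raw_fitness(c) for c in population))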
The pioneering works of Holland (38) and Fogel et al.
(42) gave birth to the new paradigm of population-driven
computing (evolutionary computation) resulting in struc-
tural and parametric optimization. Evolutionary program-
ming was introduced by Fogel in the 1960s (43). The
evolution of competing algorithms denes evolutionary
programming. Each algorithm operates on a sequence of
symbols to produce an output symbol that is likely to
maximize an algorithm's performance relative to a well-defined payoff function. Evolutionary programming is the precursor of genetic programming (39). In genetic program-
ming, large populations of computer programs are bred
genetically. One may also refer to biologically inspired
optimization, such as particle swarm optimization (PSO),
ant colonies, and others.
FUZZY SETS AND SYSTEMS
Fuzzy systems (models) are immediate constructs that
result from a description of real-world systems (say, social,
economic, ecological, engineering, or biological) in terms of
information granules, fuzzy sets, and the relationships
between them (44). The concept of a fuzzy set introduced
by Zadeh in 1965 (45,46) becomes of paramount relevance
when formalizing a notion of partial membership of an
element. Fuzzy sets are distinguished from the fundamen-
tal notion of a set (also called a crisp set) by the fact that
their boundaries are formed by elements whose degree of
belonging is allowed to assume numeric values in the
interval [0, 1]. Let us recall that the characteristic function
for a set X returns a Boolean value {0, 1} indicating whether an element x is in X or is excluded from it. A fuzzy set is noncrisp inasmuch as the characteristic function for a fuzzy
set returns a value in [0, 1]. Let U, X, Ã, and x be a universe of objects, a subset of U, a fuzzy set in U, and an individual object in X, respectively. For a set X, μ_Ã : X → [0, 1] is a function that determines the degree of membership of an object x in X. A fuzzy set Ã is then defined to be a set of ordered pairs, Ã = {(x, μ_Ã(x)) | x ∈ X}. The counterparts of intersection and union (crisp sets) are the t-norm and s-norm operators in fuzzy set theory. For the intersection of fuzzy sets, the min operator was suggested by Zadeh (29), and it belongs to a class of intersection operators (min, product, and bold intersection) known as triangular or t-norms. A t-norm is a mapping t : [0, 1]² → [0, 1]. The s-norm (t-conorm) is a mapping s : [0, 1]² → [0, 1] (also a triangular conorm) that is commonly used for the union of fuzzy sets. The properties of triangular norms are presented in Ref. 84.
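A minimal Python sketch of these notions follows, assuming simple illustrative membership functions: fuzzy sets are represented by membership functions with values in [0, 1], Zadeh's min operator serves as a t-norm for intersection, and max serves as an s-norm for union. The universe (heights and weights) and the membership functions are assumptions made for the example.

def mu_tall(height_cm):
    # Illustrative membership function for the fuzzy set "tall".
    return min(1.0, max(0.0, (height_cm - 160.0) / 40.0))

def mu_heavy(weight_kg):
    # Illustrative membership function for the fuzzy set "heavy".
    return min(1.0, max(0.0, (weight_kg - 60.0) / 40.0))

def t_norm_min(a, b):
    # Zadeh's min operator: a t-norm used for fuzzy intersection.
    return min(a, b)

def s_norm_max(a, b):
    # The max operator: an s-norm (t-conorm) used for fuzzy union.
    return max(a, b)

# Degree to which a 185 cm, 70 kg person is "tall AND heavy" and "tall OR heavy".
tall, heavy = mu_tall(185), mu_heavy(70)
print(t_norm_min(tall, heavy), s_norm_max(tall, heavy))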
Fuzzy sets exploit imprecision in conventional systems
in an attempt to make system complexity manageable. It
has been observed that fuzzy set theory offers a new model
of vagueness (1316). Many examples of fuzzy systems are
given in Pedrycz (47) and in Kruse et al. (48)
NEURAL COMPUTING
Neural networks offer a powerful and distributed comput-
ing architecture equipped with significant learning abil-
ities (predominantly as far as parametric learning is
concerned). They help represent highly nonlinear and
multivariable relationships between system variables.
Starting from the pioneering research of McCulloch and
Pitts (49), Rosenblatt (50) as well as Minsky and Pappert
(51), neural networks have undergone a significant metamorphosis and have become an important reservoir of various learning methods (52) as well as an extension of conventional techniques in statistical pattern recognition (53). Artificial neural networks (ANNs) were introduced to model features of the human nervous system (49). An artificial neural network is a collection of highly interconnected processing elements called neurons. In ANNs, a
neuron is a threshold device, which aggregates (sums)
its weighted inputs and applies an activation function to
each aggregation to produce a response. The summing
part of a neuron in an ANN is called an adaptive linear
combiner (ALC) in Refs. 54 and 55. For instance, a McCulloch-Pitts neuron n_i is a binary threshold unit with an ALC that computes a weighted sum net = Σ_{j=0}^{n} w_j x_j. A weight w_i associated with x_i represents the strength of connection of the input to a neuron. Input x_0 represents a bias, which can be thought of as an input with weight 1. The response of a neuron can be computed in several ways. For example, the response of neuron n_i can be computed using sgn(net), where sgn(net) = 1 for net > 0, sgn(net) = 0 for net = 0, and sgn(net) = -1 for net < 0. A
sgn(net) = 0 for net = 0, and sgn(net) = 1, if net < 0. A
neuron comes with adaptive capabilities that could be
exploited fully assuming that an effective procedure is
introduced to modify the strengths of connections so that a
correct response is obtained for a given input. A good
discussion of learning algorithms for various forms of
neural networks can be found in Freeman and Skapura
(56) and in Bishop (53). Various forms of neural networks
have been used successfully in system modeling, pattern
recognition, robotics, and process control applications
(10,11,35,57,58).
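A McCulloch-Pitts-style neuron of the kind described above can be sketched in a few lines of Python; the weights, bias weight, and inputs below are arbitrary illustrative values, not parameters from the text.

def sgn(net):
    # Threshold activation: 1 for net > 0, 0 for net = 0, -1 for net < 0.
    if net > 0:
        return 1
    if net == 0:
        return 0
    return -1

def neuron_response(inputs, weights, bias_weight):
    # Adaptive linear combiner: weighted sum with x0 = 1 serving as the bias input.
    net = bias_weight * 1.0 + sum(w * x for w, x in zip(weights, inputs))
    return sgn(net)

# Example with two inputs and illustrative weights.
print(neuron_response(inputs=[0.5, -1.0], weights=[0.8, 0.3], bias_weight=-0.1))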
ROUGH SETS
Zdzislaw Pawlak (13–16,59,60) introduced rough sets in 1981 (24,25). The rough set approach to approximation and classification was then elaborated and refined in Refs. 13–21, 24–31, 33, 61, and 62. Rough set theory offers
an approach to CI by drawing attention to the importance
of set approximation in knowledge discovery and informa-
tion granulation (32).
In particular, rough set methods provide a means of
approximating a set by other sets (17,18). For computa-
tional reasons, a syntactic representation of knowledge is
provided by rough sets in the form of data tables. In general,
an information system (IS) is represented by a pair (U, F),
where U is a nonempty set of objects and F is a nonempty,
countable set of probe functions that are a source of mea-
surements associated with object features (63). For exam-
ple, a feature of an image may be color with probe functions
that measure tristimulus values received from three pri-
mary color sensors, brightness (luminous flux), hue (domi-
nant wavelength in a mixture of light waves), and
saturation (amount of white light mixed with a hue).
Each f ∈ F maps an object to some value in a set V_f. In effect, we have f : U → V_f for every f ∈ F.
The notions of equivalence and equivalence class are
fundamental in rough set theory. A binary relation R ⊆ X × X is an equivalence relation if it is reflexive, symmetric, and transitive. A relation R is reflexive if every object x ∈ X has relation R to itself; that is, we can assert xRx. The symmetric property holds for relation R if xRy implies yRx for every x, y ∈ X. The relation R is transitive if, for every x, y, z ∈ X, xRy and yRz imply xRz. The equivalence class of an object x ∈ X consists of all objects y ∈ X such that xRy. For each set of functions B ⊆ F, an equivalence relation ∼B = {(x, x′) | ∀ a ∈ B: a(x) = a(x′)} (the indiscernibility relation) is associated with it. If (x, x′) ∈ ∼B, we say that objects x and x′ are indiscernible from each other relative to attributes from B. This concept is fundamental to rough sets. The notation [x]_B is a commonly used shorthand that denotes the equivalence class defined by x relative to a feature set B. In effect, [x]_B = {y ∈ U | x ∼B y}. Furthermore, U/∼B denotes the partition of U defined by relation ∼B. Equivalence classes [x]_B represent B-granules, elementary portions of knowledge that we can perceive relative to the available data. Such a view of knowledge has led to the study of concept approximation (64) and pattern extraction (65). For X ⊆ U, the set X can be approximated only from information contained in B by constructing a B-lower approximation B_*X = ∪ {[x]_B | [x]_B ∈ U/∼B and [x]_B ⊆ X} and a B-upper approximation B^*X = ∪ {[x]_B | [x]_B ∈ U/∼B and [x]_B ∩ X ≠ ∅}, respectively. In other words, a lower approximation B_*X of a set X is a collection of objects that can be classified with full certainty as members of X using the knowledge represented by B. By contrast, an upper approximation B^*X of a set X is a collection of objects representing both certain knowledge (i.e., classes entirely contained in X) and possible uncertain knowledge (i.e., classes partially contained in X). In the case in which B_*X is a proper subset of B^*X, the objects in X cannot be classified with certainty and the set X is rough. It has recently been observed by Pawlak (13–16) that this is exactly the idea of vagueness proposed by Frege (65). That is, the vagueness of an approximation of a set stems from its borderline region.
The size of the difference between lower and upper
approximations of a set (i.e., boundary region) provides a
basis for the roughness of an approximation, which is
important because vagueness is allocated to some regions
of what is known as the universe of discourse (space)
rather than to the whole space as encountered in fuzzy
sets. The study of what it means to be a part of provides a
basis for what is known as mereology, which was intro-
duced by Lesniewski in 1927 (66). More recently, the
study of what it means to be a part of to a degree has
led to a calculus of granules (23, 67–70). In effect, gran-
ular computing allows us to quantify uncertainty and to
take advantage of uncertainty rather than to discard it
blindly.
Approximation spaces, introduced by Pawlak (24), elaborated in Refs. 17–19, 22, 23, 29, 30, 61, 62, 70, and 71, and applied in Refs. 6–8, 35, 64, 72, and 73, serve as a formal counterpart of our perception ability or observation (61,62), and they provide a framework for perceptual reasoning at the level of classes (63). In its simplest form, an approximation space is denoted by (U, F, ∼B), where U is a nonempty set of objects (called a universe of discourse), F is a set of functions representing object features, B ⊆ F, and ∼B is an equivalence relation that defines a partition of U. Equivalence classes belonging to a partition U/∼B are called elementary sets (information granules). Given an approximation space S = (U, F, ∼B), a subset X of U is definable if it can be represented as the union of some elementary sets. Not all subsets of U are definable in space S (61,62). Given a nondefinable subset X in U, our observation restricted by ∼B causes X to be perceived relative to classes in the partition U/∼B. An upper approximation B^*X is the set of all classes in U/∼B that have elements in common with X, and the lower approximation B_*X is the set of all classes in U/∼B that are contained in X.
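The B-lower and B-upper approximations can be computed directly from these definitions. The following Python sketch does so for a small, hypothetical information system; the object names, attributes, and target set are assumptions made for the example.

# Illustrative information system: each object maps to its attribute values.
objects = {
    "o1": {"color": "red",  "size": "small"},
    "o2": {"color": "red",  "size": "small"},
    "o3": {"color": "blue", "size": "small"},
    "o4": {"color": "blue", "size": "large"},
}

def partition(universe, attributes):
    # Equivalence classes of the indiscernibility relation ~B.
    classes = {}
    for obj in universe:
        key = tuple(universe[obj][a] for a in attributes)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

def lower_approximation(universe, attributes, target):
    # Union of equivalence classes entirely contained in the target set X.
    result = set()
    for c in partition(universe, attributes):
        if c <= target:
            result |= c
    return result

def upper_approximation(universe, attributes, target):
    # Union of equivalence classes sharing at least one element with X.
    result = set()
    for c in partition(universe, attributes):
        if c & target:
            result |= c
    return result

X = {"o1", "o2", "o3"}
B = ["color"]
print(lower_approximation(objects, B, X))   # {'o1', 'o2'}
print(upper_approximation(objects, B, X))   # {'o1', 'o2', 'o3', 'o4'}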
Fuzzy set theory and rough set theory taken singly and
in combination pave the way for a variety of approximate
reasoning systems and applications representing a synergy of technologies from computational intelligence. This synergy can be found, for example, in recent work on the relation among fuzzy sets and rough sets (13–16,35,37,74,75), rough mereology (19,37,67–69), rough control (76,77), fuzzy-rough evolutionary control (36), machine learning (18,57,72,78), fuzzy neurocomputing (3), rough neurocomputing (35), data mining (6,7,13–16,31), diagnostic systems (18,79), multiagent systems (8,9,80), real-time decision making (34,81), robotics and unmanned vehicles (57,82,83), intelligent systems (8,13,29,57), signal analysis (84), perception and classification of perceptual objects (29,30,61–63), software engineering (4,84–87), a dominance-based rough set approach to multicriteria decision making and data mining (31), VPRS (33), and shadowed sets (75).
BIBLIOGRAPHY
1. J. C. Bezdek, On the relationship between neural networks, pattern recognition and intelligence, Int. J. Approx. Reasoning, 6: 85–107, 1992.
2. J. C. Bezdek, What is computational intelligence? in J. Zurada,
R. Marks, C. Robinson (eds.), Computational Intelligence:
Imitating Life, Piscataway, NJ: IEEE Press, 1994,
pp. 112.
3. W. Pedrycz, Computational Intelligence: AnIntroduction, Boca
Raton, FL: CRC Press, 1998.
4. W. Pedrycz, J. F. Peters (eds.), Computational intelligence in
software engineering, Advances in Fuzzy SystemsApplica-
tions and Theory, vol. 16. Singapore: World Scientic, 1998.
5. D. Poole, A. Mackworth, R. Goebel, Computational Intelli-
gence: A Logical Approach. Oxford: Oxford University Press,
1998.
6. N. Cercone, A. Skowron, N. Zhong (eds.), Rough sets, fuzzy sets, data mining, and granular-soft computing special issue, Comput. Intelli.: An Internat. J., 17(3): 399–603, 2001.
7. A. Skowron, S. K. Pal (eds.), Rough sets, pattern recognition
and data mining special issue, Pattern Recog. Let., 24(6):
829933, 2003.
8. A. Skowron, Toward intelligent systems: calculi of information
granules, in: T. Terano, T. Nishida, A. Namatane, S. Tsumoto,
Y. Ohsawa, T. Washio (eds.), New Frontiers in Artificial Intelligence, Lecture Notes in Artificial Intelligence 2253. Berlin: Springer-Verlag, 2001, pp. 28–39.
9. J. F. Peters, A. Skowron, J. Stepaniuk, S. Ramanna, Towards
an ontology of approximate reason, Fundamenta Informaticae, 51(1–2): 157–173, 2002.
10. IEEE World Congress on Computational Intelligence, Vancou-
ver B.C., Canada, 2006.
11. M. H. Hamaza, (ed.), Proceedings of the IASTED Int. Conf. on
Computational Intelligence. Calgary, AB, Canada, 2005.
12. R. Marks, Intelligence: computational versus artificial, IEEE Trans. on Neural Networks, 4: 737–739, 1993.
13. Z. Pawlak and A. Skowron, Rudiments of rough sets, Information Sciences, 177(1): 3–27, 2007.
14. J. F. Peters and A. Skowron, Zdzislaw Pawlak life and work (1926–2006), Information Sciences, 177(1): 1–2, 2007.
15. Z. Pawlak and A. Skowron, Rough sets: Some extensions, Information Sciences, 177(1): 28–40, 2007.
16. Z. Pawlak and A. Skowron, Rough sets and Boolean reasoning, Information Sciences, 177(1): 41–73, 2007.
17. Z. Pawlak, Rough sets, Int. J. of Informat. Comput. Sciences,
11(5): 341356, 1982.
18. Z. Pawlak, Rough Sets. Theoretical Aspects of Reasoning about
Data, Dordrecht: Kluwer Academic Publishers, 1991.
19. L. Polkowski, Rough sets, Mathematical Foundations.
Advances inSoft Computing, Heidelberg: Physica-Verlag, 2002.
20. Z. Pawlak, Some issues on rough sets, Transactions on Rough
Sets I, LNCS 3100, 2004, pp. 158.
21. Z. Pawlak, A treatise on rough sets, Transactions on Rough
Sets IV, LNCS 3700, 2005, pp. 117.
22. A. Skowron, and J. Stepaniuk, Generalized approximation
spaces, in: T. Y. Lin, A. M. Wildberger, (eds.), Soft Computing,
San Diego, CA: Simulation Councils, 1995, pp. 1821.
23. A. Skowron, J. Stepaniuk, J. F. Peters and R. Swiniarski,
Calculi of approximation spaces, Fundamenta Informaticae,
72(13): 363378, 2006.
24. Z. Pawlak, Classication of Objects by Means of Attributes.
Institute for Computer Science, Polish Academy of Sciences,
Report 429: 1981.
25. Z. Pawlak, Rough Sets. Institute for Computer Science, Polish
Academy of Sciences, Report 431: 1981.
26. Z. Pawlak, Rough classification, Int. J. of Man-Machine Studies, 20(5): 127–134, 1984.
27. Z. Pawlak, Rough sets and intelligent data analysis, Informa-
tion Sciences: An Internat, J., 147(14): 112, 2002.
28. Z. Pawlak, Rough sets, decision algorithms and Bayes' theorem, European J. Operat. Res., 136: 181–189, 2002.
29. M. Kryszkiewicz, J. F. Peters, H. Rybinski and A. Skowron
(eds.), Rough Sets and Intelligent Systems Paradigms, Lecture Notes in Artificial Intelligence 4585, Berlin: Springer,
2007.
30. J. F. Peters and A. Skowron, Transactions on Rough Sets,
volumes I-VII, Berlin: Springer, 2004–2007. Available:
http://www.springer.com/west/home/computer/
lncs?SGWID=41646996270.
31. R. Slowinski, S. Greco and B. Matarazzo, Dominance-based
rough set approach to reasoning about ordinal data, in: M.
Kryszkiewicz, J. F. Peters, H. Rybinski and A. Skowron, eds.,
Rough Sets and Intelligent Systems Paradigms, Lecture Notes
in Artificial Intelligence, Berlin: Springer, 2007, pp. 5–11.
32. L. Zadeh, Granular computing and rough set theory, in: M.
Kryszkiewicz, J. F. Peters, H. Rybinski, and A. Skowron (eds.), Rough Sets and Intelligent Systems Paradigms, Lecture Notes in Artificial Intelligence, Berlin: Springer, 2007, pp. 1–4.
33. W. Ziarko, Variable precision rough set model, J. Comp. Sys.
Sciences, 46(1): 3959, 1993.
34. J. F. Peters, Time and clock information systems: concepts
and roughly fuzzy petri net models, in J. Kacprzyk (ed.),
Knowledge Discovery and Rough Sets. Berlin: Physica Ver-
lag, 1998.
35. S. K. Pal, L. Polkowski and A. Skowron (eds.), Rough-Neuro
Computing: Techniques for Computing with Words. Berlin:
Springer-Verlag, 2003.
36. T. Y. Lin, Fuzzy controllers: An integrated approach based on
fuzzy logic, rough sets, and evolutionary computing, in T. Y. Lin and N. Cercone (eds.), Rough Sets and Data Mining: Analysis for Imprecise Data. Norwell, MA: Kluwer Academic Publishers, 1997, pp. 109–122.
37. L. Polkowski, Rough mereology as a link between rough and
fuzzy set theories: a survey, Trans. Rough Sets II, LNCS 3135,
2004, pp. 253277.
38. J. H. Holland, Adaptation in Natural and Artificial Systems,
Ann Arbor, MI: University of Michigan Press, 1975.
39. J. R. Koza, Genetic Programming: On the Programming of
Computers by Means of Natural Selection. Cambridge, MA:
MIT Press, 1993.
40. C. Darwin, On the Origin of Species by Means of Natural
Selection, or the Preservation of Favoured Races in the Struggle for Life. London: John Murray, 1859.
41. L. Chambers, Practical Handbook of Genetic Algorithms, vol. 1.
Boca Raton, FL: CRC Press, 1995.
42. L. J. Fogel, A. J. Owens, and M. J. Walsh, Artificial Intelligence
through Simulated Evolution, Chichester: J. Wiley, 1966.
43. L. J. Fogel, On the organization of the intellect, Ph.D. Dissertation, Los Angeles: University of California Los Angeles,
1964.
44. R. R. Yager and D. P. Filev, Essentials of Fuzzy Modeling and
Control. New York, John Wiley & Sons, Inc., 1994.
45. L. A. Zadeh, Fuzzy sets, Information and Control, 8: 338–353,
1965.
46. L. A. Zadeh, Outline of a new approach to the analysis of
complex systems and decision processes, IEEE Trans. on Sys-
tems, Man, and Cybernetics, 2: 2844, 1973.
47. W. Pedrycz, Fuzzy Control and Fuzzy Systems, New York: John
Wiley & Sons, Inc., 1993.
48. R. Kruse, J. Gebhardt and F. Klawonn, Foundations of Fuzzy
Systems. New York: John Wiley & Sons, Inc., 1994.
49. W. S. McCulloch and W. Pitts, A logical calculus of ideas
immanent in nervous activity, Bulletin of Mathemat. Biophy.,
5: 115133, 1943.
50. F. Rosenblatt, Principles of Neurodynamics: Perceptrons and
the Theory of Brain Mechanisms, Washington, D.C: Spartan
Press, 1961.
51. M. Minsky and S. Pappert, Perceptrons: An Introduction to
Computational Geometry, Cambridge: MIT Press, 1969.
52. E. Fiesler and R. Beale (eds.), Handbook on Neural Computa-
tion. Oxford: Institute of Physics Publishing and Oxford Uni-
versity Press, 1997.
53. C. M. Bishop, Neural Networks for Pattern Recognition.
Oxford: Oxford University Press, 1995.
54. B. Widrow and M. E. Hoff, Adaptive switching circuits, Proc.
IRE WESCON Convention Record, Part 4, 1960, pp. 96104.
55. B. Widrow, Generalization and information storage in net-
works of adaline neurons. in M. C. Yovits, G. T. Jacobi,
and G. D. Goldstein (eds.), Self-Organizing Systems. Washing-
ton, D.C.: Spartan, 1962.
56. J. A. Freeman and D. M. Skapura, Neural Networks: Algo-
rithms, Applications and Programming Techniques. Reading,
MA: Addison-Wesley, 1991.
57. D. Lockery, and J. F. Peters, Robotic target tracking with
approximation space-based feedback during reinforcement
learning, Springer best paper award, Proceedings of Eleventh
International Conference on Rough Sets, Fuzzy Sets, Data
Mining and Granular Computing (RSFDGrC 2007), Joint
Rough Set Symposium (JRS 2007), Lecture Notes in Artificial Intelligence, vol. 4482, 2007, pp. 483–490.
58. J. F. Peters, L. Han, and S. Ramanna, Rough neural computing in signal analysis, Computat. Intelli., 1(3): 493–513, 2001.
59. E. Orlowska, J. F. Peters, G. Rozenberg and A. Skowron, New
Frontiers in Scientific Discovery. Commemorating the Life and
Work of Zdzislaw Pawlak, Amsterdam: IOS Press, 2007.
60. J. F. Peters and A. Skowron, Zdzislaw Pawlak: Life and Work, 1926–2006, Transactions on Rough Sets V, LNCS 4100, Berlin: Springer, 2006, pp. 1–24.
61. E. Orlowska, Semantics of Vague Concepts. Applications of
Rough Sets. Institute for Computer Science, Polish Academy of
Sciences, Report 469: 1981.
62. E. Orlowska, Semantics of vague concepts, in: G. Dorn, and P.
Weingartner, (eds.), Foundations of Logic and Linguistics,
Problems and Solutions, London: Plenum Press, 1985, pp.
465482.
63. J. F. Peters, Classication of perceptual objects by means of
features, Internat. J. Informat. Technol. Intell. Comput., 2007, in press.
64. J. Bazan, H. S. Nguyen, A. Skowron, and M. Szczuka, A view on rough set concept approximation, in: G. Wang, Q. Liu, Y. Y. Yao, A. Skowron, Proceedings of the Ninth International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2003), Chongqing, China, 2003, pp. 181–188.
65. J. Bazan, H. S. Nguyen, J. F. Peters, A. Skowron and M.
Szczuka, Rough set approach to pattern extraction from clas-
sifiers, Proceedings of the Workshop on Rough Sets in Knowl-
edge Discovery and Soft Computing at ETAPS2003, pp. 23.
66. S. Lesniewski, O podstawach matematyki (in Polish), Przeglad Filozoficzny, 30: 164–206, 31: 261–291, 32: 60–101, 33: 142–170, 1927.
67. L. Polkowski and A. Skowron, Implementing fuzzy contain-
ment via rough inclusions: Rough mereological approach to
distributed problem solving, Proc. Fifth IEEE Int. Conf. on
Fuzzy Systems, vol. 2, New Orleans, 1996, pp. 11471153.
68. L. Polkowski and A. Skowron, Rough mereology: A new para-
digm for approximate reasoning, Internat. J. Approx. Reason-
ing, 15(4): 333365, 1996.
69. L. Polkowski and A. Skowron, Rough mereological calculi of
granules: A rough set approach to computation, Computat.
Intelli. An Internat. J., 17(3): 472492, 2001.
70. A. Skowron, R. Swiniarski, and P. Synak, Approximation
spaces and information granulation, Trans. Rough Sets III,
LNCS 3400, 2005, pp. 175189.
71. A. Skowron and J. Stepaniuk, Tolerance approximation spaces, Fundamenta Informaticae, 27(2–3): 245–253, 1996.
72. J. F. Peters and C. Henry, Reinforcement learning with approximation spaces, Fundamenta Informaticae, 71(2–3): 323–349,
2006.
73. J. F. Peters, Rough ethology: towards a biologically-inspired
study of collective behavior in intelligent systems with approx-
imation spaces, Transactions on Rough Sets III, LNCS 3400,
2005, pp. 153174.
74. W. Pedrycz, Shadowed sets: Representing and processing fuzzy sets, IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, 28: 103–108, 1998.
75. W. Pedrycz, Granular computing with shadowed sets, in: D.
Slezak, G. Wang, M. Szczuka, I. Duntsch, and Y. Yao (eds.),
Rough Sets, Fuzzy Sets, Data Mining, and Granular Comput-
ing, LNAI 3641. Berlin: Springer, 2005, pp. 2331.
76. T. Munakata and Z. Pawlak, Rough control: Application of
rough set theory to control, Proc. Eur. Congr. Fuzzy Intell. Technol. (EUFIT'96), 1996, pp. 209–218.
77. J. F. Peters, A. Skowron, Z. Suraj, An application of rough set
methods to automatic concurrent control design, Fundamenta
Informaticae, 43(14): 269290, 2000.
78. J. Grzymala-Busse, S. Y. Sedelow, and W. A. Sedelow, Machine learning & knowledge acquisition, rough sets, and the English
semantic code, in T. Y. Lin and N. Cercone (eds.), Rough Sets
and Data Mining: Analysis for Imprecise Data. Norwell, MA:
Kluwer Academic Publishers, 1997, pp. 91108.
79. R. Hashemi, B. Pearce, R. Arani, W. Hinson, and M. Paule, A
fusion of rough sets, modified rough sets, and genetic algo-
rithms for hybrid diagnostic systems, in T. Y. Lin, N. Cercone
(eds.), Rough Sets and Data Mining: Analysis for Imprecise
Data. Norwell, MA: Kluwer Academic Publishers, 1997, pp.
149176.
80. R. Ras, Resolving queries through cooperation in multi-agent
systems, in T. Y. Lin, N. Cercone (eds.), Rough Sets and Data
Mining: Analysis for Imprecise Data. Norwell, MA: Kluwer
Academic Publishers, 1997, pp. 239258.
81. A. Skowron and Z. Suraj, A parallel algorithm for real-time
decision making: a rough set approach. J. Intelligent Systems,
7: 528, 1996.
82. M. S. Szczuka and N. H. Son, Analysis of image sequences for
unmanned aerial vehicles, in: M. Inuiguchi, S. Hirano, S.
Tsumoto (eds.), Rough Set Theory and Granular Computing.
Berlin: Springer-Verlag, 2003, pp. 291300.
83. H. S. Son, A. Skowron, and M. Szczuka, Situation identification by unmanned aerial vehicle, Proc. of CS&P 2000, Informatik Berichte, Humboldt-Universität zu Berlin, 2000, pp. 177–188.
84. J. F. Peters and S. Ramanna, Towards a software change
classication system: A rough set approach, Software Quality
J., 11(2): 87120, 2003.
85. M. Reformat, W. Pedrycz, and N. J. Pizzi, Software quality
analysis with the use of computational intelligence, Informat.
Software Technol., 45: 405417, 2003.
86. J. F. Peters and S. Ramanna, A rough sets approach to asses-
sing software quality: concepts and rough Petri net models, in:
S. K. Pal and A. Skowron, (eds.), Rough-Fuzzy Hybridization:
New Trends in Decision Making. Berlin: Springer-Verlag,
1999, pp. 349380.
87. W. Pedrycz, L. Han, J. F. Peters, S. Ramanna and R. Zhai,
Calibration of software quality: fuzzy neural and rough neural
approaches. Neurocomputing, 36: 149170, 2001.
FURTHER READING
G. Frege, Grundlagen der Arithmetik, 2, Verlag von Herman Pohle, Jena, 1893.
W. Pedrycz, Granular computing in knowledge integration and reuse, in: D. Zhang, T. M. Khoshgoftaar and M.-L. Shyu (eds.), IEEE Int. Conf. on Information Reuse and Integration. Las Vegas, NV: 2005, pp. 15–17.
W. Pedrycz and G. Succi, Genetic granular classifiers in modeling software quality, J. Syst. Software, 76(3): 277–285, 2005.
W. Pedrycz and M. Reformat, Genetically optimized logic models,
Fuzzy Sets & Systems, 150(2): 351371, 2005.
A. Skowron and J. F. Peters, Rough sets: trends and challenges, in
G. Wang, Q., Liu, Y. Yao, and A. Skowron, (eds.), Proceedings 9th
Int. Conf. on Rough Sets, Fuzzy Sets, Data Mining, and Granular
Computing (RSFDGrC2003), LNAI 2639, Berlin: Springer-Verlag,
2003, pp. 2534.
J.F. Peters, Near sets: General theory about nearness of objects,
Appl. Math. Sci., 1(53): 26092629, 2007.
J.F. Peters, A. Skowron, and J. Stepaniuk, Nearness of objects: Extension of approximation space model, Fundamenta Informaticae, 79: 497–512, 2007.
JAMES F. PETERS
University of Manitoba
Winnipeg, Manitoba, Canada
WITOLD PEDRYCZ
University of Alberta
Edmonton, Alberta, Canada
COMPUTING ACCREDITATION: EVOLUTION
AND TRENDS ASSOCIATED WITH U.S.
ACCREDITING AGENCIES
INTRODUCTION
Accreditation is the primary quality assurance mechanism
for institutions and programs of higher education, which
helps a program prove that it is of a quality acceptable to
constituents (1). Its original emergence in the United
States was to ensure that federal student loans were
awarded in support of quality programs. Since then, the
need for accreditation has strengthened considerably, as
the need for quality assurance continues to be in the fore-
front of educational concerns. For example, a recent report
from the Commission on Higher Education (2) identified
several problems with the education system in the United
States, a central one being the overall quality of higher
education.
The continuing interest in accreditation emerges from
three factors. First, recognition exists that the public is
not always in a position to judge quality, certainly not for
programs or institutions for higher education. Therefore, a
need exists for an independently issued stamp of approval
for programs and institutions of higher education.
Second, a lack of oversight exists in higher education.
Unlike the public school system, colleges and universities
have considerable freedom when it comes to curriculum
design, hiring practices, and student expectations (1). This
lack of oversight requires a different mechanism to ensure
quality, andhigher educationhas optedto useaccreditation
for this purpose. Finally, many organizations have recog-
nized the importance of continuous quality improvement,
and higher education is no exception. As explained in this
article, recent developments in accreditation reflect this
trend. The objective of accreditation is not only to ensure
that educational institutions strive for excellence but also
to make certain that the process for ensuring high quality is
apposite.
Two major types of accreditation are available in the
United States: (1) institutional accreditation (also called
regional accreditation in the United States) and (2) specia-
lized accreditation. Specialized accreditation includes both
program and school-specic accreditation, with the former
applied to a unique program and the latter applied to an
administrative unit within an institution. In the United
States, institutional accreditation is generally the respon-
sibility of a regional accreditation body such as the Southern
Association of Colleges and Schools (SACS) (www.sacs.org)
or the Middle States Commission on Higher Education
(www.msche.org).
This article concentrates on accreditation in the comput-
ing discipline, focusing on the primary accreditation bodies
accrediting U.S. programs, namely ABET, Inc. (www.abet.
org) and the Association to Advance Collegiate Schools of
Business (AACSB) (www.aacsb.edu). ABET, Inc. is the
primary body responsible for specialized program-level
accreditation in computing. AACSB, on the other hand,
provides specialized accreditation at the unit level and
accredits business schools only. The latter implies that
all the programs offered within the unit are accredited,
including any computing programs that it may offer.
Both accrediting bodies have concentrated primarily on
the accreditation of programs housed within U.S. institu-
tions. ABET has evaluated programs outside the United
States for substantial equivalency, which means the
program is comparable in educational outcomes with a
U.S.-accredited program. Substantial equivalency has
been phased out, and international accreditation pilot visits are now being employed for programs outside the United States. In 2003, AACSB members approved the interna-
tional visits as relevant and applicable to all business
programs and have accredited several programs outside
of the United States.
The purpose of this article is to (1) review key concepts
associated with accreditation, (2) discuss the state of com-
puting accreditation by U.S. accrediting organizations and
how accreditation has evolved through recent years, (3)
describe the criteria for accreditation put forth by the
primary agencies that review computing programs, and
(4) review the typical process of accreditation. Although
many concepts are applicable to any accreditation process,
we refer to the AACSB and ABET, Inc., which are well-recognized agencies that ensure quality in educational units that include technology or specialized computing programs. ABET's Computing Accreditation Commission (CAC), as
the primary agency targeting computing programs, is
emphasized.
KEY CONCEPTS OF ACCREDITATION
For many years, accrediting organizations established
focused and specic guidelines to which a program had
to adhere to receive its stamp of approval. For instance, a
fixed number of credits, required in a specific area, had been
the norm, and any program that wished to be granted
accreditation had to offer the required number of credits
in relevant areas. Quality was measured through a check-
list of attributes that were expectedto be met by the various
inputs into learning processes, such as curriculum, teach-
ing faculty, laboratory, and other facilities and resources.
The definition of quality, implicit in this approach to accred-
itation, was that of meeting specic standards, to be fol-
lowed by every institution.
Some proof exists that in computer science education,
this approach is successful. In a study of accredited and
nonaccredited programs, Rozanski (3) reports that
although similarities and differences exist between these
programs, accredited programs have more potential to
increase specific quality indicators.
In the past it was straightforward to determine whether
a program or institution met the accreditation criteria. A
major drawback of this approach was that it forced uni-
formity among institutions, preventing innovation and the
consideration of specialized needs of a program's or school's
constituencies. Other controversies centered on the
expense and time necessary to navigate successfully the
process. Smaller schools especially felt the guidelines
were targeted toward the larger institution (3).
Partly in response to concerns and to the danger of a lack
of innovation and, at least in the United States, partly
under pressure from the federal government, accreditation
agencies have moved to an outcomes-based approach, and
accreditation criteria now embody a definition of quality more in line with that adopted by many quality improvement approaches, namely fitness for purpose (4). The
basis for this approach is the premise that quality is
multifaceted.
From the overall mission of an institution, units or
programs are expected to establish long-term educational
objectives or goals, which describe achievements of gradu-
ates a few years after graduation, and to derive a set of
learning outcomes, which are statements defining the skills, knowledge, and behaviors that stu-
dents are expected to acquire in their matriculation
through the program (5). Additionally, an institution or
program is expected to establish an assessment process to
determine how well its graduates are achieving its objec-
tives and outcomes and to establish a quality enhancement
program that uses the data collected through this assess-
ment process to improve the program. Assessment pro-
cesses foster program improvement by enabling visit
teams to make judgments about program effectiveness in
preparing graduates for entry into a field.
Some differences exist between the various accredita-
tion agencies concerning the range of decisions that they
can make. Clearly, each will have the option of whether to
accredit or not. However, different options are available
should it be determined that a program or institution does
not meet all criteria. For example, some accreditation
agencies may reduce the period for which the program or
institution is accredited. Alternatively, they may provisionally accredit but make a final decision contingent on an
interim report by the program or institution in which it
makes clear how it has addressed any weaknesses or con-
cerns identied during the team visit.
Programs continue to be reviewed cyclically. Again,
differences exist between the various agencies relative to
the maximum length for which a program or institution can
be accredited. The ABET CAC maximally accredits a pro-
gram for 6 years, whereas the AACSB has operated on a 5-
or 10-year cycle and is moving more toward the use of
maintenance reports.
Clearly, given the importance of the agency in the
accreditation process, the question arises concerning
the basis used to determine whether an organization can
become an accreditation agency. The answer differs from
country to country. In many countries, accreditation agen-
cies are governmental or quasi-governmental organiza-
tions established through an act or parliament. In the
United States, most accreditation agencies are essentially
private organizations. However, under the Higher Educa-
tion Act, the U.S. Secretary of Education is required to
recognize an accreditation agency before students enrolled
in programs or institutions accredited by it can receive
federal funding.
Also, several professional organizations recognize
accreditation agencies, chief among them are the Interna-
tional Network for Quality Assurance Agencies in Higher
Education (INQAAHE) and the Council of Higher Educa-
tion Accreditation (CHEA) in the United States.
ACCREDITATION AGENCIES RELEVANT TO COMPUTING
The AACSB
The AACSB International accredits both undergraduate
and graduate education for business and accounting.
Founded in 1916, the AACSB adopted its first standards in 1919 (6). It accredits computing programs, typi-
cally in Management Information Systems (MIS) or Infor-
mation Systems (IS), only as part of an evaluation of a
business program, and it does not review any one specic
program. By accrediting the unit that offers the various
business-related programs, it accredits indirectly any
specific program offered within the unit. The one exception is accounting, which can receive a program-specific
evaluation.
Both undergraduate and graduate programs must be
evaluated in making an accreditation decision. The AACSB
puts forth standards for continuous quality improvement.
Important in the process is the publication of a mission
statement, academic and financial considerations, and stu-
dent support.
AACSB International members approved mission-
linked accreditation standards and the peer review process
in 1991, and in 2003, members approved a revised set of
worldwide standards. The application of the AACSBs
accreditation standards is based on the stated mission of
each institution, and the standards thus provide enough
exibility so that they can be applied to a wide variety of
business schools with different missions. This exibility
offers the opportunity for many institutions that offer
online and other distance learning programs, as well as
the more conventional on-campus programs, to be accre-
dited.
ABET Inc.
In the early 1980s, computer science accreditation was
initiated by groups from the Association for Computing
Machinery, Inc. (ACM) and the Institute of Electrical and
Electronics Engineers, Inc. Computer Society (IEEE-CS).
Criteria for accreditation of computer science were estab-
lished, and visits started in 1984. The initial visits were
made by the Computer Science Accreditation Commission
(CSAC), which in 1985 established the Computer Science
Accreditation Board (CSAB) with the explicit purpose to
advance the development and practices of computing dis-
ciplines in the public interest through the enhancement of
quality educational degree programs in computing (http://
www.csab.org).
Eventually CSAC was incorporated into ABET with
CSAB remaining as the lead society within ABET for
accreditationof programs incomputer science, information
systems, information technology, and software engineer-
ing. It should be noted that the ABET Engineering
Accreditation Commission (EAC) is responsible for soft-
ware engineering accreditation visits. In this capacity,
the CSAB is responsible for recommending changes to
the accreditation criteria and for the recruitment, selection,
and training of program evaluators (PEVs). All other
accreditation activities, which were conducted previously
by the CSAC, are now conducted by the ABET CAC.
The CSAB is governed by a Board of Directors whose
members are appointed by the member societies. The cur-
rent member societies of the CSAB, which include the ACM
and the IEEE-CS, as well as its newest member, the Asso-
ciation for Information Systems (AIS), are the three largest
technical, educational, and scientic societies in the com-
puter and computer-related fields.
Since the incorporation of the CSAC into the ABET,
computing programs have been accredited by the ABET
CAC. The first programs accredited by the ABET CAC were
in computer science (CS). In 2004, IS criteria were com-
pleted and partly funded by a National Science Foundation
grant. With the addition of IS criteria, the scope of comput-
ing program accreditation was enlarged. This addition has
been followed by the addition of information technology
(IT), with criteria currently being piloted.
The CAC recognized a need to address the growing
number of programs in emerging computing areas, which
has resulted in even more revision of accreditation criteria,
allowing such programs to apply for accreditation under
computing general criteria. Thus, these areas can benefit from accreditation as well. We discuss these revisions in the
next section.
THE AACSB AND ABET CAC ACCREDITATION CRITERIA
The AACSB Criteria
Although the AACSB does not accredit IS programs by
themselves, the standards used support the concept of
continuous quality improvement and require the use of a
systematic process for curriculum management. Normally,
the curriculum management process will result in an
undergraduate degree program that includes learning
experiences in such general knowledge and skill areas as
follows (7):
Communication abilities
Analytic skills
Capstone experience
Flexibility
The curriculum models below include some of the ear-
liest, the most recent, and the most widely distributed in
the field of software engineering.
Software Engineering Institute MSE Model
The mission of the Software Engineering Institute (SEI) at
Carnegie Mellon University, in Pittsburgh, Pennsylvania
is to provide leadership in advancing the state of the
practice of software engineering to improve the quality of
systems that depend on software. Recognizing the impor-
tance of education in the preparation of software profes-
sionals, the institute's charter required it to influence
software engineering curricula throughout the education
community. Thus, the SEI Education Program began in
1985, only one year after the Institute's founding. This
program emerged at the right time to play a key role in
the development of software engineering education in the
United States. The program was organized with a perma-
nent staff of educators along with a rotating set of visiting
professors.
Under the direction of Norman Gibbs (1985–1990) and Nancy Mead (1991–1994), the SEI Education Program
accomplished a wide variety of tasks, including developing
a detailed Graduate Curriculum Model, several curriculum modules on various topics, an outline of an undergraduate
curriculum model; compiling a list of U.S. graduate soft-
ware engineering degree programs; creating a directory of
software engineering courses offered in U.S. institutions;
developing educational videotape series for both academia
and industry; and creating and initially sponsoring the
Conference on Software Engineering Education. Although
the Education Program was phased out at SEI in 1994, its
work is still influential today.
With regard to curriculum models, the SEI concen-
trated initially on master's-level programs for two reasons. First, it is substantially easier within a university to develop and initiate a one-year master's program than a four-year bachelor's program. Second, the primary client population at the time was software professionals, nearly all of whom already had a bachelor's degree in some discipline. In 1987, 1989, and 1991, the SEI published model curricula for university MSE programs (described below) (20–22).
Because the goal of an MSE degree program is to produce
a software engineer who can assume rapidly a position of
substantial responsibility within an organization, SEI pro-
posed a curriculum designed to give the student a body of
knowledge that includes balanced coverage of the software
engineering process activities, their aspects, and the pro-
ducts produced, along with sufficient experience to bridge
the gap between undergraduate programming and profes-
sional software engineering. Basically, the program was
broken into four parts: undergraduate prerequisites, core
curriculum, the project experience component, and elec-
tives. The minimal undergraduate prerequisites were dis-
crete mathematics, programming, data structures,
assembly language, algorithm analysis, communication
skills, and some calculus. Laboratory sciences were not
required. The six core curriculum courses were:
Specification of Software Systems
Software Verification and Validation
Software Generation and Maintenance
Principles and Applications of Software Design
Software Systems Engineering
Software Project Management
The topics and content of these courses were in many
ways an outgrowth of the common courses identified in the
first MSE programs. These courses would likely be taken in
the first year of a two-year MSE sequence. The bulk of the
report consists of detailed syllabi and lecture suggestions
for these six courses.
The project experience component took the form of the
capstone project, and might require different prerequisites
according to the project. The electives could be additional
software engineering subjects, related computer science
topics, system engineering courses, application domain
courses, or engineering management topics.
Although the final version of the SEI model for an MSE
was released almost 15 years ago, it remains today the most
influential software engineering curriculum model for
Master's degree programs.
BCS/IEE Undergraduate Model
As stated previously, the BCS and the IET have cooperated
for many years in the development and accreditation of
software engineering degree programs, dating back to the
IEE, one of the two societies that merged to form the IET.
Therefore, it is not surprising that the first major effort to
develop a curriculum model for the discipline was a joint
effort by these two societies (23).
The curriculum content defined in the report was delib-
erately designed to be non-prescriptive and to encourage
variety (p. 27). Three types of skills are defined: central
software engineering skills, supporting fundamental skills
(technical communication, discrete math, and various com-
puter science areas), and advanced skills.
At the end of the section on curriculum content, the
authors state that "the topics alone do not define a curriculum, since it is also necessary to define the depth to which they may be taught. The same topic may be taught in different disciplines with a different target result. . ." (p. 31). This
reinforces the comments concerning the variety and non-
prescriptive nature of the recommendations at the begin-
ning of that section.
It is interesting to note that the curriculum content does
not include a need for skills in areas traditionally taught to
engineers (e.g., calculus, differential equations, chemistry,
physics, and most engineering sciences). It is remarkable
that the computer scientists and electrical engineers on
the working party that produced this report were able to
agree on a curriculum that focused primarily on engineer-
ing process and computer science skills, and did not tie itself
to traditional engineering courses.
SEI Undergraduate Model
In making undergraduate curriculum recommendations,
Gary Ford (24) of SEI wanted a curriculum that would be
compatible with the general requirements for ABET and
CSAB and the core computing curriculum in the IEEE-CS/
ACM Curricula 91 (which was then in draft form). Also,
although he was complimentary of the BCS/IEE recom-
mendations, Ford was more precise in defining his model
curriculum. The breakdown (for a standard 120 semester
hour curriculum) was as follows:
Mathematics and Basic Sciences: 27 semester hours
Software Engineering Sciences and Design: 45 semester hours
Humanities and Social Sciences: 30 semester hours
Electives: 18 semester hours
The humanities & social sciences and electives were
included to allow for maximum flexibility in implementing
the curriculum. Mathematics and basic sciences consisted
of two semesters of both discrete mathematics and calculus,
and one semester of probability and statistics, numerical
methods, physics, chemistry, and biology.
In designing the software engineering science and
design component, Ford argues that the engineering
sciences for software are primarily computer science,
rather than sciences such as statics and thermodynamics
(although for particular application domains, such knowl-
edge may be useful). Four different software engineering
areas are defined: software analysis, software architec-
tures, computer systems, and software process.
Ford goes on to define 14 courses (totaling 42 semester
hours; one elective is allowed), with 3 or 4 courses in each of
these four areas, and places them (as well as the other
aspects of the curriculum) in the context of a standard four-
year curriculum. The four software process courses were
placed in the last two years of the program, which allowed
for the possibility of a senior project-type experience.
Thus, several similarities existed between the BCS/IEE
and SEI recommended undergraduate models: they
focused on similar software engineering skills, did not
require standard engineering sciences, and (interestingly)
did not require a capstone experience.
IEEE-ACM Education Task Force
The Education Task Force of an IEEE-ACM Joint Steering
Committee for the Establishment of Software Engineering
developed some recommended accreditation criteria for
undergraduate programs in software engineering (25).
Although no accreditation board has yet adopted these
precise guidelines, they have influenced several accreditation
and curriculum initiatives.
According to their accreditation guidelines, four areas exist (software engineering; computer science and engineering; supporting areas such as technical communication and mathematics; and advanced work in one or more areas) that are each about three-sixteenths of the curriculum, which amounts to 21–24 semester hours for each area in a 120-hour degree plan. (The
remaining hours were left open, to be used by, for
instance, general education requirements in the United
States). As in the SEI graduate guidelines, a capstone
project is addressed explicitly.
Guidelines for Software Engineering Education
This project was created by a team within the Working
Group for Software Engineering Education and Training
(WGSEET). The Working Group was a think tank of
about 30 members of the software engineering and training
communities, who come together twice a year to work on
major projects related to the discipline. WGSEET was
established in 1995 in part to fill a void left by the demise
of the SEI Education Program.
At the November 1997 Working Group meeting, work
began on what was originally called the Guidelines for
Software Education. A major difference in goals of the
Guidelines with the previously discussed projects is that
the latter developed computer science, computer engineer-
ing, and information systems curricula with software engi-
neering concentrations in addition to making recomm-
endations for an undergraduate software engineering
curriculum. (This article will focus only on the software
engineering model.)
Here are the recommended number of hours and courses
for each topic area in the software engineering model, from
Ref. 26.
Software Engineering Required (24 of 120 Semester
Hours)
Software Engineering Seminar (One hour course for
first-semester students)
Introduction to Software Engineering
Formal Methods
Software Quality
Software Analysis and Design I and II
Professional Ethics
Senior Design Project (capstone experience)
Computer Science Required (21 Semester Hours)
Introduction to Computer Science for Software Engi-
neers 1 and 2
(Similar to first-year computer science, but oriented to
software engineering)
Data Structures and Algorithms
Computer Organization
Programming Languages
Software Systems 1 and 2
(Covers operating systems, databases, and other appli-
cation areas)
Computer Science / Software Engineering Electives (9
Semester Hours)
Various topics
Lab Sciences (12 Semester Hours)
Chemistry 1
Physics 1 and 2
Engineering Sciences (9 Semester Hours)
Engineering Economics
Engineering Science 1 and 2
(Provides an overview of the major engineering sciences)
Mathematics (24 Semester Hours)
Discrete Mathematics
Calculus 1, 2 and 3
Probability and Statistics
Linear Algebra
Differential Equations
Communication/Humanities/Social Sciences (18 Semester
Hours)
Communications 1 and 2 (first-year writing courses)
Technical Writing
Humanities/Social Sciences electives
Open Electives (3 Semester Hours)
Any course
This model is intended to be consistent with the IEEE-ACM
Education Task Force recommendations and criteria for all
engineering programs accredited by ABET. The nine hours
of engineering sciences is less than for traditional engineer-
ing disciplines and that contained in the McMaster
University degree program, but more than that is required
in the other software engineering curriculum models.
Not much flexibility exists in this model, because only
one technical elective outside of Computer Science/Soft-
ware Engineering can be taken (using the open elective).
However, note that the model is for a minimal number of
semester hours (120), so additional hours could provide
that flexibility.
Information Processing Society of Japan (IPSJ)
IPSJ has developed a core curriculum model for computer
science and software engineering degree programs com-
monly called J97 (18). Unlike the other curriculum models
discussed above, J97 defines a common core curriculum for
every undergraduate computer science and software engi-
neering program. This core curriculum includes the follow-
ing learning units:
Computer Literacy Courses
Computer Science Fundamentals
Programming Fundamentals
Mathematics Courses
Discrete Mathematics
Computing Algorithms
Probability and Information Theory
Basic Logic
Computing Courses
Digital Logic
Formal Languages and Automata Theory
Data Structures
Computer Architecture
Programming Languages
Operating Systems
Compilers
Databases
Software Engineering
The Human-Computer Interface
Note that, unlike most computing curriculum models,
basic programming skills are considered part of basic
computer literacy, whereas algorithm analysis (comput-
ing algorithms) is treated as part of the mathematics
component. The units listed above provide for a broad
background in computer science as well as an education
in basic computer engineering and software engineering
fundamentals.
Computing Curricula Software Engineering
Over the years, the ACM/IEEE-CS joint Education Task
Force went through several changes, eventually aligning
itself with similar projects being jointly pursued by the two
societies. The Software Engineering 2004 (SE2004) project
(as it was eventually named) was created to provide
detailed undergraduate software engineering curriculum
guidelines which could serve as a model for higher educa-
tion institutions across the world. The result of this project
was an extensive and comprehensive document which has
indeed become the leading model for software engineering
undergraduate curricula (27). SE 2004 was defined using
the following principles:
1. Computing is a broad field that extends well beyond
the boundaries of any one computing discipline.
2. Software Engineering draws its foundations from a
wide variety of disciplines.
3. The rapid evolution and the professional nature of software engineering require an ongoing review of the corresponding curriculum, with the involvement of the professional associations in this discipline.
4. Development of a software engineering curriculum
must be sensitive to changes in technologies, prac-
tices, and applications, new developments in peda-
gogy, and the importance of lifelong learning. In a
field that evolves as rapidly as software engineering,
educational institutions must adopt explicit strate-
gies for responding to change.
5. SE 2004 must go beyond knowledge elements to offer
significant guidance in terms of individual curricu-
lum components.
6. SE 2004 must support the identification of the
fundamental skills and knowledge that all software
engineering graduates must possess.
7. Guidance on software engineering curricula must be
based on an appropriate definition of software engi-
neering knowledge.
8. SE 2004 must strive to be international in scope.
9. The development of SE2004 must be broadly based.
10. SE 2004 must include exposure to aspects of profes-
sional practice as an integral component of the
undergraduate curriculum.
11. SE 2004 must include discussions of strategies and
tactics for implementation, along with high-level
recommendations.
The SE 2004 document also defines student outcomes;
that is, what is expected of graduates of a software
engineering program using the curriculum guidelines con-
tained within:
1. Show mastery of the software engineering knowledge
and skills, and professional issues necessary to begin
practice as a software engineer.
2. Work as an individual and as part of a team to develop
and deliver quality software artifacts.
3. Reconcile conflicting project objectives, finding accep-
table compromises within limitations of cost, time,
knowledge, existing systems, and organizations.
4. Design appropriate solutions in one or more applica-
tion domains using software engineering approaches
that integrate ethical, social, legal, and economic
concerns.
5. Demonstrate an understanding of and apply current
theories, models, and techniques that provide a basis
for problem identification and analysis, software
design, development, implementation, verification,
and documentation.
6. Demonstrate an understanding and appreciation for
the importance of negotiation, effective work habits,
leadership, and good communication with stake-
holders in a typical software development environ-
ment.
7. Learn new models, techniques, and technologies as
they emerge and appreciate the necessity of such
continuing professional development.
The next step was to define Software Engineering Edu-
cation Knowledge (SEEK), a collection of topics considered
important in the education of software engineering stu-
dents. SEEK was created and reviewed by volunteers in the
software engineering education community. The SEEK
body is a three-level hierarchy, initially divided into knowl-
edge areas (KAs) as follows:
Computing essentials
Mathematical and engineering fundamentals
Professional practice
Software modeling and analysis
Software design
Software verification & validation
Software evolution
Software process
Software quality
Software management
Those KAs are then divided even more into units, and
finally, those units are divided into topics. For example,
DES.con.7 is a specific topic (Design Tradeoffs) in the
Design Concepts unit of the Software Design knowledge
area. Each topic in SEEK is also categorized for its impor-
tance: Essential, Desired, or Optional.
The SE 2004 document also contains recommendations
for possible curricula and courses, tailored towards models
for specific countries and specific kinds of institutions of
higher-level education. Software engineering topics con-
sisted of at least 20% of the overall curriculum in each case.
Most of the models recommended for use in the United
States included some type of introduction to software engi-
neering during the first two years of study, followed by six
SE courses, with a capstone project occurring throughout
the fourth and final (senior) year of study.
THE GAP BETWEEN EDUCATION AND INDUSTRY
Tim Lethbridge of the University of Ottawa has done a
considerable amount of work in attempting to determine
what industry thinks is important knowledge for a software
professional to receive through academic and other educa-
tional venues through a series of surveys (28). The results of
these surveys show that a wide gap still exists between
what is taught in education versus what is needed in
industry. For instance, topics required of professionals that had to be learned on the job include negotiation
skills, human-computer interaction methods, real-time
system design methods, management and leadership skills,
cost estimation methods, software metrics, software relia-
bility and fault tolerance, ethics and professionalism prac-
tice guidelines, and requirements gathering and analysis
skills. The topics taught in most educational programs but not considered important to industry included digital logic, analog systems, formal languages, automata theory, linear algebra, physics, and chemistry. Industry and
academia agreed that a few things were important,
including data structures, algorithm design, and operating
systems.
The survey results also demonstrate that it is essential
for industry and academia to work together to create future
software engineering curricula, for both groups to use their
resources more wisely and effectively in the development of
software professionals in the future.
Another excellent article outlining the industrial view
was by Tockey (29). Besides stressing the need for software
practitioners to be well-versed in computer science and
discrete mathematics, he identified software engineering economics as an important missing link that current software professionals do not have when entering the work-
force.
TRACKING SOFTWARE ENGINEERING DEGREE
PROGRAMS
Over the years, several surveys have been attempted to
determine what software engineering programs exist in a
particular country or for a particular type of degree. By
1994, 25 graduate MSE-type programs existed in the
United States, and 20 other programs with software
engineering options were in their graduate catalogs,
according to an SEI survey (30) that was revised slightly
for final release in early 1996.
Thompson and Edwards' article on software engineering education in the United Kingdom (31) provides an excellent list of 43 Bachelor's degree and 11 Master's programs in software engineering in the UK.
Doug Grant (32) of the Swinburne University of
Technology reported that Bachelor's degree programs in software engineering were offered by at least 9 of Australia's 39 universities, with more being planned.
In June 2002, Fred Otto of CCPE provided the author
with a list of 10 Canadian schools with undergraduate
software engineering degree programs.
Bagert has in recent years published a list of 32 under-
graduate (11) and 43 Master's-level (33) SE programs in the
United States, along with some information concerning
those programs.
Currently, few doctoral programs in software engineer-
ing exist. In the late 1990s, the first Doctor of Philosophy
(Ph.D.) programs in software engineering in the United
States were approved at the Naval Postgraduate School
(34) and at the Carnegie Mellon University (35). Also, in
1999, Auburn University changed its doctoral degree in
Computer Science to Computer Science and Software
Engineering.
INDUSTRY/UNIVERSITY COLLABORATIONS
Collaborations between industry and higher education are
common in engineering disciplines, where companies can
give students experience with real-world problems through
project courses. However, formal collaborations between
university and industry have become more frequent in the
last decade. Beckman et al. (36) discussed the benefits to
industry and academia of such collaborations:
Benefits to industry:
Cost-effective, customized education and training
Benefits to academia:
Placement of students
A = −a_{n−1} + Σ_{i=0}^{n−2} a_i 2^{i−n+1}    (1)
Examples of 4-bit fractional two's complement fractions are shown in Table 1. Two points are evident from the table: First, only a single representation of zero exists (specifically 0000) and second, the system is not symmetric because a negative number exists, −1 (1000), for which no positive equivalent exists. The latter property means that taking the absolute value of, negating, or squaring a valid number (−1) can produce a result that cannot be represented.
Two's complement numbers are negated by inverting all bits and adding a ONE to the least significant bit position. For example, to form −3/8:

+3/8              0011
invert all bits   1100
add 1            +0001
−3/8              1101

Check:
invert all bits   0010
add 1            +0001
+3/8              0011

Truncation of two's complement numbers never increases the value of the number. The truncated numbers have values that are either unchanged or shifted toward negative infinity. This shift may be seen from Equation (1), in which any truncated digits come from the least significant end of the word and have positive weight. Thus, to remove them will either leave the value of the number unchanged (if the bits that are removed are ZEROs) or reduce the value of the number. On average, a shift toward −∞ of one-half the value of the least significant bit exists. Summing many truncated numbers (which may occur in scientific, matrix, and signal processing applications) can cause a significant accumulated error.
Table 1. 4-bit fractional two's complement numbers

Decimal Fraction    Binary Representation
 7/8                0111
 3/4                0110
 5/8                0101
 1/2                0100
 3/8                0011
 1/4                0010
 1/8                0001
 0                  0000
−1/8                1111
−1/4                1110
−3/8                1101
−1/2                1100
−5/8                1011
−3/4                1010
−7/8                1001
−1                  1000
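As a rough illustration of these representation rules (not part of the original article), the following Python sketch encodes 4-bit fractional two's complement values, negates by inverting and adding one, and shows that dropping a low-order bit never increases the value; the helper names are illustrative only.

# Illustrative sketch: 4-bit fractional two's complement encode/decode,
# negation (invert all bits, add one), and truncation toward minus infinity.

def encode(value, n=4):
    """Encode a fraction in [-1, 1) as an n-bit two's complement pattern."""
    return round(value * 2 ** (n - 1)) & ((1 << n) - 1)

def decode(bits, n=4):
    """Decode an n-bit two's complement pattern back to a fraction."""
    if bits & (1 << (n - 1)):            # sign bit set -> negative value
        bits -= 1 << n
    return bits / 2 ** (n - 1)

def negate(bits, n=4):
    """Negate by inverting all bits and adding a ONE (modulo 2**n)."""
    return (~bits + 1) & ((1 << n) - 1)

x = encode(3 / 8)                        # 0011
print(format(x, "04b"), decode(x))       # 0011 0.375
y = negate(x)
print(format(y, "04b"), decode(y))       # 1101 -0.375
# Truncating the least significant bit of 1101 (-3/8) gives 1100 (-1/2):
print(decode(0b1101 & 0b1110))           # -0.5, i.e., shifted toward -infinity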
ARITHMETIC ALGORITHMS
This section presents a reasonable assortment of typical
fixed-point algorithms for addition, subtraction, multipli-
cation, and division.
Addition
Addition is performed by summing the corresponding bits
of the two n-bit numbers, which includes the sign bit.
Subtraction is performed by summing the corresponding
bits of the minuend and the two's complement of the
subtrahend.
Overflow is detected in a two's complement adder by comparing the carry signals into and out of the most significant adder stage (i.e., the stage which computes the sign bit). If the carries differ, the addition has overflowed and the result is invalid. Alternatively, if the sign of the sum differs from the signs of the two operands, the addition has overflowed.
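The carry-comparison rule can be demonstrated with a few lines of code. The sketch below is illustrative only (it is not from the article); it adds two n-bit two's complement patterns and reports overflow when the carry into the sign stage differs from the carry out of it.

# Illustrative sketch: n-bit two's complement addition with overflow detected
# by comparing the carries into and out of the sign (most significant) stage.
def add_with_overflow(a, b, n):
    """a, b: n-bit two's complement patterns given as integers 0 .. 2**n - 1."""
    low_mask = (1 << (n - 1)) - 1
    carry_into_sign = ((a & low_mask) + (b & low_mask)) >> (n - 1)
    total = a + b
    carry_out_of_sign = total >> n
    s = total & ((1 << n) - 1)               # n-bit sum, carry out discarded
    return s, carry_into_sign != carry_out_of_sign

# 5/8 + 1/2 overflows in 4 bits: 0101 + 0100 = 1001 (reads as -7/8).
print(add_with_overflow(0b0101, 0b0100, 4))  # (9, True)
# 3/8 + (-1/2) = -1/8 does not overflow: 0011 + 1100 = 1111.
print(add_with_overflow(0b0011, 0b1100, 4))  # (15, False)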
ADDITION IMPLEMENTATION
A wide range of implementations exist for fixed-point
addition, which include ripple carry adders, carry looka-
head adders, and carry select adders. All start with a full
adder as the basic building block.
Full Adder
The full adder is the fundamental building block of most
arithmetic circuits. The operation of a full adder is defined
by the truth table shown in Table 2. The sum and carry
outputs are described by the following equations:
s_k = a_k ⊕ b_k ⊕ c_k    (2)

c_{k+1} = a_k b_k + a_k c_k + b_k c_k    (3)

where a_k, b_k, and c_k are the inputs to the k-th full adder stage, s_k and c_{k+1} are the sum and carry outputs, respectively, and ⊕ denotes the Exclusive-OR logic operation.
In evaluating the relative complexity of implementa-
tions, often it is convenient to assume a nine-gate realiza-
tion of the full adder, as shown in Fig. 1. For this
implementation, the delay from either a_k or b_k to s_k is six gate delays and the delay from c_k to c_{k+1} is two gate delays.
Some technologies, such as CMOS, form inverting gates
(e.g., NAND and NOR gates) more efficiently than the
noninverting gates that are assumed in this article. Cir-
cuits with equivalent speed and complexity can be con-
structed with inverting gates.
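A behavioral model of one full adder stage follows directly from Equations (2) and (3); the short Python sketch below is illustrative rather than part of the article and can be checked against Table 2.

# Illustrative sketch of one full adder stage, per Equations (2) and (3):
#   s_k = a_k XOR b_k XOR c_k,   c_{k+1} = a_k b_k + a_k c_k + b_k c_k
def full_adder(a, b, c):
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

# Reproduce the full adder truth table (compare with Table 2):
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, carry = full_adder(a, b, c)
            print(a, b, c, "->", carry, s)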
Ripple Carry Adder
A ripple carry adder for n-bit numbers is implemented by concatenating n full adders, as shown in Fig. 2. At the k-th bit position, bits a_k and b_k of operands A and B and the carry signal from the preceding adder stage, c_k, are used to generate the kth bit of the sum, s_k, and the carry, c_{k+1}, to
Table 2. Full adder truth table

Inputs            Outputs
a_k  b_k  c_k     c_{k+1}  s_k
0    0    0       0        0
0    0    1       0        1
0    1    0       0        1
0    1    1       1        0
1    0    0       0        1
1    0    1       1        0
1    1    0       1        0
1    1    1       1        1
Examples of 4-bit two's complement addition (the last two examples overflow; the carry into the sign position differs from the carry out):

  3/8     0011
+ 1/2     0100
  7/8     0111

 −3/8     1101
+ −1/2    1100
 −7/8     (1) 1001    ignore carry out

  3/8     0011
+ −1/2    1100
 −1/8     1111

 −3/8     1101
+ 1/2     0100
  1/8     (1) 0001    ignore carry out

  5/8     0101
+ 1/2     0100
 −7/8     1001        MSB C_in ≠ C_out (overflow)

 −5/8     1011
+ −1/2    1100
  7/8     (1) 0111    MSB C_in ≠ C_out (overflow)
the next adder stage. This adder is called a ripple carry adder because the carry signals ripple from the least significant bit position to the most significant.
If the ripple carry adder is implemented by concatenating n of the nine-gate full adders, which were shown in Fig. 1, an n-bit ripple carry adder requires 2n + 4 gate delays to produce the most significant sum bit and 2n + 3 gate delays to produce the carry output. A total of 9n logic gates are required to implement the n-bit ripple carry adder. In comparing the delay and complexity of adders, the delay from data input to most significant sum output (denoted by DELAY) and the gate count (denoted by GATES) will be used. These DELAY and GATES are subscripted by RCA to indicate the ripple carry adder. Although these simple metrics are suitable for first-order comparisons, more accurate comparisons require more exact modeling because the implementations may be realized with transistor networks (as opposed to gates), which will have different delay and complexity characteristics.

DELAY_RCA = 2n + 4    (4)

GATES_RCA = 9n    (5)
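Chaining the full adder stage from the least significant position upward gives a behavioral model of the ripple carry adder; the following sketch (illustrative only) mirrors the structure of Fig. 2.

# Illustrative sketch: n-bit ripple carry adder built by chaining full adders.
def full_adder(a, b, c):
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def ripple_carry_add(a_bits, b_bits):
    """a_bits, b_bits: lists of bits, least significant bit first."""
    carry, sum_bits = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)   # carry ripples to the next stage
        sum_bits.append(s)
    return sum_bits, carry

# 3/8 + 1/2 = 7/8 in 4-bit fractional two's complement: 0011 + 0100 = 0111.
print(ripple_carry_add([1, 1, 0, 0], [0, 0, 1, 0]))   # ([1, 1, 1, 0], 0)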
Carry Lookahead Adder
Another popular adder approach is the carry lookahead adder (1,2). Here, specialized logic computes the carries in parallel. The carry lookahead adder uses eight-gate modified full adders that do not form a carry output for each bit position and lookahead modules, which form the carry outputs. The carry lookahead concept is best understood by rewriting Equation (3) with g_k = a_k b_k and p_k = a_k + b_k:

c_{k+1} = g_k + p_k c_k    (6)

This equation helps to explain the concept of carry generation and propagation: A bit position generates a carry regardless of whether there is a carry into that bit position if g_k is true (i.e., both a_k and b_k are ONEs), and a stage propagates an incoming carry to its output if p_k is true (i.e., either a_k or b_k is a ONE). The eight-gate modified full adder is based on the nine-gate full adder shown in Fig. 1. It has AND and OR gates that produce g_k and p_k with no additional complexity.
Extending Equation (6) to a second stage:

c_{k+2} = g_{k+1} + p_{k+1} c_{k+1}
        = g_{k+1} + p_{k+1} (g_k + p_k c_k)
        = g_{k+1} + p_{k+1} g_k + p_{k+1} p_k c_k    (7)

Equation (7) results from evaluating Equation (6) for the (k+1)th stage and substituting c_{k+1} from Equation (6). Carry c_{k+2} exits from stage k+1 if: (1) a carry is generated there, (2) a carry is generated in stage k and propagates across stage k+1, or (3) a carry enters stage k and propagates across both stages k and k+1, etc. Extending to a third stage:
c_{k+3} = g_{k+2} + p_{k+2} c_{k+2}
        = g_{k+2} + p_{k+2} (g_{k+1} + p_{k+1} g_k + p_{k+1} p_k c_k)
        = g_{k+2} + p_{k+2} g_{k+1} + p_{k+2} p_{k+1} g_k + p_{k+2} p_{k+1} p_k c_k    (8)
Although it would be possible to continue this process
indefinitely, each additional stage increases the fan-in
(i.e., the number of inputs) of the logic gates. Four inputs
[as required to implement Equation (8)] frequently are
the maximum number of inputs per gate for current
technologies. To continue the process, block generate
and block propagate signals are defined over 4-bit blocks
Figure 2. Ripple carry adder.

Figure 1. Nine gate full adder.
(stages k to k + 3), g_{k:k+3} and p_{k:k+3}, respectively:
g_{k:k+3} = g_{k+3} + p_{k+3} g_{k+2} + p_{k+3} p_{k+2} g_{k+1} + p_{k+3} p_{k+2} p_{k+1} g_k    (9)

and

p_{k:k+3} = p_{k+3} p_{k+2} p_{k+1} p_k    (10)

Equation (6) can be expressed in terms of the 4-bit block generate and propagate signals:

c_{k+4} = g_{k:k+3} + p_{k:k+3} c_k    (11)

Thus, the carry out from a 4-bit wide block can be computed in only four gate delays [the first to compute p_i and g_i for i = k through k + 3, the second to evaluate p_{k:k+3}, the second and third to evaluate g_{k:k+3}, and the third and fourth to evaluate c_{k+4} using Equation (11)].
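The block signals of Equations (9)–(11) can be checked with a small simulation; the sketch below (illustrative only) computes g_{k:k+3} and p_{k:k+3} for a 4-bit group and then the block carry-out.

# Illustrative sketch: 4-bit block generate/propagate, per Equations (9)-(11).
def block_signals(a_bits, b_bits):
    """a_bits, b_bits: four bits each, least significant (position k) first."""
    g = [a & b for a, b in zip(a_bits, b_bits)]      # g_i = a_i AND b_i
    p = [a | b for a, b in zip(a_bits, b_bits)]      # p_i = a_i OR  b_i
    g_block = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))        # Equation (9)
    p_block = p[3] & p[2] & p[1] & p[0]              # Equation (10)
    return g_block, p_block

def block_carry_out(a_bits, b_bits, c_in):
    g_block, p_block = block_signals(a_bits, b_bits)
    return g_block | (p_block & c_in)                # Equation (11)

# 0110 + 1010 = 6 + 10 = 16, so the carry out of the 4-bit block is 1.
print(block_carry_out([0, 1, 1, 0], [0, 1, 0, 1], 0))   # 1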
An n-bit carry lookahead adder requires ⌈(n − 1)/(r − 1)⌉ lookahead logic blocks, where r is the width of the block. A 4-bit lookahead logic block is a direct implementation of Equations (6)–(10), with 14 logic gates. In general, an r-bit lookahead logic block requires ½ (3r + r²) logic gates.
Figure 7. Flowchart of radix-4 modified Booth multiplication.

Figure 8. 5-bit by 5-bit multiplier for two's complement operands.
standard full adders are used in the five-cell strip at the
bottom. The special half adder takes care of the extra 1 in
the bit product matrix.
The delay of the array multiplier is evaluated by following the pathways from the inputs to the outputs. The longest path starts at the upper left corner, progresses to the lower right corner, and then progresses across the bottom to the lower left corner. If it is assumed that the delay from any adder input (for either half or full adders) to any adder output is k gate delays, then the total delay of an n-bit by n-bit array multiplier is:

DELAY_ARRAY MPY = k (2n − 2) + 1    (19)

The complexity of the array multiplier is n² AND and NAND gates, n half adders (one of which is a special half adder), and n² − 2n full adders. If a half adder is realized with four gates and a full adder with nine gates, the total complexity of an n-bit by n-bit array multiplier is

GATES_ARRAY MPY = 10n² − 14n    (20)
Array multipliers are laid out easily in a cellular fashion
which makes them attractive for VLSI implementation,
where minimizing the design effort may be more important
than maximizing the speed.
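The row-by-row structure of the array multiplier can be mimicked in software; the sketch below is a simplified, unsigned illustration only (the sign-handling NAND terms and the extra 1 of the two's complement array are omitted).

# Simplified, unsigned sketch of the array multiplier idea: form the
# bit-product matrix row by row (AND gates) and accumulate shifted rows.
def array_multiply_unsigned(a, b, n):
    product = 0
    for i in range(n):                    # row i of the bit-product matrix
        a_i = (a >> i) & 1
        row = (b if a_i else 0) << i      # a_i AND b_j for all j, weight 2**i
        product += row                    # the adder row accumulates this row
    return product

print(array_multiply_unsigned(0b10110, 0b01101, 5))   # 22 * 13 = 286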
Wallace/Dadda Fast Multiplier
A method for fast multiplication was developed by Wallace
(7) and was refined by Dadda (8). With this method, a three-
step process is used to multiply two numbers: (1) The bit
products are formed, (2) the bit product matrix is reduced
to a two row matrix whose sum equals the sum of the bit
products, and (3) the two numbers are summed with a fast
adder to produce the product. Although this may seem to be
a complex process, it yields multipliers with delay propor-
tional to the logarithm of the operand word size, which is
faster than the array multiplier, which has delay propor-
tional to the word size.
The second step in the fast multiplication process is
shown for a 6-bit by 6-bit Dadda multiplier on Fig. 10.
An input 6 by 6 matrix of dots (each dot represents a bit
product) is shown as matrix 0. Regular dots are formed
with an AND gate, and dots with an over bar are formed
with a NAND gate. Columns that have more than four dots (or that will grow to more than four dots because of carries) are reduced by the use of half adders (each half adder takes in two dots and outputs one in the same column and one in the next more significant column) and full adders (each full adder takes in three dots from a column and outputs one in the same column and one in the next more significant column) so that no column in matrix 1 will have more than four dots. Half adders are shown by two dots connected
by a crossed line in the succeeding matrix and full adders
are shown by two dots connected by a line in the succeeding
matrix. In each case, the right-most dot of the pair that are
connected by a line is in the column from which the inputs
were taken in the preceding matrix for the adder. A special
half adder (that forms the sum of its two inputs and 1) is
shown with a doubly crossed line. In the succeeding steps, reduction to matrix 2, with no more than three dots per column, and finally matrix 3, with no more than two dots per column, is performed. The reduction shown on Fig. 10 (which requires three full adder delays) is followed by a 10-bit carry propagating adder. Traditionally, the carry propagating adder is realized with a carry lookahead adder.
The height of the matrices is determined by working back from the final (two row) matrix and limiting the height of each matrix to the largest integer that is no more than 1.5
Figure 9. 6-bit by 6-bit two's complement array multiplier.
times the height of its successor. Each matrix is produced
from its predecessor in one adder delay. Because the num-
ber of matrices is related logarithmically to the number of
rows in matrix 0, which is equal to the number of bits in the
words to be multiplied, the delay of the matrix reduction
process is proportional to log n. Because the adder that
reduces the final two row matrix can be implemented as a
carry lookahead adder (which also has logarithmic delay),
the total delay for this multiplier is proportional to the
logarithm of the word size.
The delay of a Dadda multiplier is evaluated by following the pathways from the inputs to the outputs. The longest path starts at the center column of bit products (which require one gate delay to be formed), progresses through the successive reduction matrices (which requires approximately Log_1.44(n) full adder delays), and finally through the (2n − 2)-bit carry propagate adder. If the delay from any adder input (for either half or full adders) to any adder output is k gate delays, and if the carry propagate adder is realized with a carry lookahead adder implemented with 4-bit lookahead logic blocks [with delay given by Equation (12)], the total delay (in gate delays) of an n-bit by n-bit Dadda multiplier is:

DELAY_DADDA MPY = 1 + k (Log_1.44(n) − 2) + 4 ⌈Log_r(2n − 2)⌉    (21)
The complexity of a Dadda multiplier is determined by evaluating the complexity of its parts. There are n² gates (2n − 2 are NAND gates, the rest are AND gates) to form the bit product matrix, (n − 2)² full adders, n − 1 half adders, and one special half adder for the matrix reduction, and a (2n − 2)-bit carry propagate adder for the addition of the final two row matrix. If the carry propagate adder is realized with a carry lookahead adder (implemented with 4-bit lookahead logic blocks), and if the complexity of a full adder is nine gates and the complexity of a half adder (either regular or special) is four gates, then the total complexity is:

GATES_DADDA MPY = 10n² − 6⅔ n + 26    (22)
The Wallace tree multiplier is very similar to the Dadda
multiplier, except that it does more reduction in the first stages of the reduction process, it uses more half adders, and it uses a slightly smaller carry propagating adder. A dot diagram for a 6-bit by 6-bit Wallace tree multiplier for two's
complement operands is shown on Fig. 11. This reduction
(which requires three full adder delays) is followed by an 8-
bit carry propagating adder. The total complexity of the
Wallace tree multiplier is a bit greater than the total
complexity of the Dadda multiplier. In most cases, the
Wallace and Dadda multipliers have about the same delay.
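The logarithmic number of reduction stages quoted above can be reproduced by generating the Dadda target heights (2, 3, 4, 6, 9, 13, ...) and counting how many are passed on the way from n rows down to two; the sketch below is illustrative only.

# Illustrative sketch: number of Dadda reduction stages for n rows of bit
# products. Target heights: d_1 = 2, d_{j+1} = floor(1.5 * d_j).
def dadda_stages(n):
    heights = [2]
    while heights[-1] < n:
        heights.append(int(1.5 * heights[-1]))
    # One stage is needed for every target height smaller than n.
    return sum(1 for h in heights if h < n)

for n in (6, 16, 32):
    print(n, "rows ->", dadda_stages(n), "stages")   # 6->3, 16->6, 32->8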
DIVISION
Two types of division algorithms are in common use: digit
recurrence and convergence methods. The digit recurrence
approach computes the quotient on a digit-by-digit basis,
Figure 10. 6-bit by 6-bit two's complement Dadda multiplier.
hence they have a delay proportional to the precision of the
quotient. In contrast, the convergence methods compute an
approximation that converges to the value of the quotient.
For the common algorithms the convergence is quadratic,
which means that the number of accurate bits approxi-
mately doubles on each iteration.
The digit recurrence methods that use a sequence of
shift, add or subtract, and compare operations are rela-
tively simple to implement. On the other hand, the con-
vergence methods use multiplication on each cycle. This
fact means higher hardware complexity, but if a fast
multiplier is available, potentially a higher speed may
result.
Digit Recurrent Division
The digit recurrent algorithms (9) are based on selecting
digits of the quotient Q (where Q = N/D) to satisfy the following equation:

P_{k+1} = r P_k − q_{n−k−1} D    for k = 1, 2, . . . , n − 1    (23)

where P_k is the partial remainder after the selection of the kth quotient digit, P_0 = N (subject to the constraint |P_0| < |D|), r is the radix, q_{n−k−1} is the kth quotient digit to the right of the binary point, and D is the divisor. In this subsection, it is assumed that both N and D are positive; see Ref. 10 for details on handling the general case.
Binary SRT Divider
The binary SRT division process (also known as radix-2 SRT division) selects the quotient from three candidate quotient digits {−1, 0, +1}. The divisor is restricted to 0.5 ≤ D < 1. A flowchart of the basic binary SRT scheme is shown in Fig. 12. Block 1 initializes the algorithm. In step 3, 2 P_k and the divisor are used to select the quotient digit. In step 4, P_{k+1} = 2 P_k − q D. Step 5 tests whether all bits of the quotient have been formed and goes to step 2 if more need to be computed. Each pass through steps 2–5 forms one digit of the quotient. The result upon exiting from step 5 is a collection of n signed binary digits.

Step 6 converts the n-digit signed digit number into an n-bit two's complement number by subtracting N, which has a 1 for each bit position where q_i = −1 and 0 elsewhere, from P, which has a 1 for each bit position where q_i = +1 and 0 elsewhere. For example:
Figure 11. 6-bit by 6-bit two's complement Wallace multiplier.
P = 0.11001
N = 0.00100
Q = P − N = 0.11001 − 0.00100
          = 0.11001 + 1.11100 (add the two's complement of N)
          = 0.10101 = 21/32
The selection of the quotient digit can be visualized with a P-D plot such as the one shown in Fig. 13. The plot shows the divisor along the x axis and the shifted partial remainder (in this case 2 P_k) along the y axis. In the area where 0.5 ≤ D < 1, values of the quotient digit are shown as a function of the value of the shifted partial remainder. In this case, the relations are especially simple. The digit selection and resulting partial remainder are given for the k-th iteration by the following relations:

If P_k ≥ 0.5:           q_{n−k−1} = 1    and P_{k+1} = 2 P_k − D    (24)
If −0.5 ≤ P_k < 0.5:    q_{n−k−1} = 0    and P_{k+1} = 2 P_k        (25)
If P_k < −0.5:          q_{n−k−1} = −1   and P_{k+1} = 2 P_k + D    (26)

Computing an n-bit quotient will involve selecting n quotient digits and up to n + 1 additions.
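Relations (24)–(26) translate directly into a small simulation; the sketch below (illustrative only) generates n signed quotient digits and then collapses them into a conventional quotient, which corresponds to the P − N conversion of step 6.

# Illustrative sketch: radix-2 SRT division following relations (24)-(26),
# assuming 0.5 <= D < 1 and |N| < D. Digits are drawn from {-1, 0, +1}.
def srt_divide(N, D, n):
    p = N
    digits = []
    for _ in range(n):
        if p >= 0.5:
            q, p = 1, 2 * p - D          # relation (24)
        elif p < -0.5:
            q, p = -1, 2 * p + D         # relation (26)
        else:
            q, p = 0, 2 * p              # relation (25)
        digits.append(q)
    # Collapse the signed digits (the P - N conversion of step 6).
    return sum(q * 2.0 ** -(i + 1) for i, q in enumerate(digits))

print(srt_divide(0.5, 0.75, 16), 0.5 / 0.75)   # both approximately 0.6666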
Figure 12. Flowchart of binary SRT division.
Figure 13. P-D plot for binary SRT division.
Radix-4 SRT Divider
The higher radix SRT division process is similar to the binary SRT algorithms. Radix 4 is the most common higher radix SRT division algorithm, with either a minimally redundant digit set of {−2, −1, 0, +1, +2} or the maximally redundant digit set of {−3, −2, −1, 0, +1, +2, +3}. The operation of the algorithm is similar to the binary SRT algorithm shown on Fig. 12, except that in step 3, 4 P_k and D are used to determine the quotient digit. A P-D plot is shown on Fig. 14 for the maximum redundancy version of radix-4 SRT division. Seven values are possible for the quotient digit at each stage. The test for completion in step 5 becomes k : n/2. Also, the conversion to two's complement in step 6 is modified slightly because each quotient digit provides two bits of the P and N numbers that are used to form the two's complement number.
Newton-Raphson Divider
The second category of division techniques uses a multi-
plication-based iteration to compute a quadratically con-
vergent approximation to the quotient. In systems that
include a fast multiplier, this process may be faster than
the digit recurrent methods. One popular approach is the
Newton-Raphson algorithm that computes an approxima-
tion to the reciprocal of the divisor that is then multiplied by the dividend to produce the quotient. The process to compute Q = N/D consists of three steps:

1. Calculate a starting estimate of the reciprocal of the divisor, R(0). If the divisor, D, is normalized (i.e., 0.5 ≤ D < 1), then R(0) = 3 − 2D exactly computes 1/D at D = 0.5 and D = 1 and exhibits maximum error (of approximately 0.17) at D = 2^−0.5. Adjusting R(0) downward by half the maximum error gives

R(0) = 2.915 − 2D    (27)

This produces an initial estimate that is within about 0.087 of the correct value for all points in the interval 0.5 ≤ D < 1.

2. Compute successively more accurate estimates of the reciprocal by the following iterative procedure:

R(i+1) = R(i) (2 − D R(i))    for i = 0, 1, . . . , k    (28)

3. Compute the quotient by multiplying the dividend times the reciprocal of the divisor:

Q = N R(k)    (29)

where i is the iteration count and N is the numerator.
Figure 15 illustrates the operation of the Newton-Raphson algorithm. For this example, three iterations (which involve a total of four subtractions and
Figure 14. P-D plot for radix-4 maximally redundant SRT division.
seven multiplications) produce an answer accurate to nine decimal digits (approximately 30 bits).
With this algorithm, the error decreases quadratically
so that the number of correct bits in each approximation is
roughly twice the number of correct bits on the previous
iteration. Thus, from a 3.5-bit initial approximation, two
iterations produce a reciprocal estimate accurate to 14-bits,
four iterations produce a reciprocal estimate accurate to
56-bits, and so on.
The efficiency of this process is dependent on the availa-
bility of a fast multiplier, because each iteration of
Equation (28) requires two multiplications and a subtrac-
tion. The complete process for the initial estimate, three
iterations, and the nal quotient determination requires
four subtraction operations and seven multiplication opera-
tions to produce a 16-bit quotient. This process is faster
than a conventional nonrestoring divider if multipli-
cation is roughly as fast as addition, which is a condition
that is satised for some systems that include a hardware
multiplier.
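The three steps above fit in a few lines of code; the sketch below is illustrative only and assumes the divisor has already been normalized to 0.5 ≤ D < 1.

# Illustrative sketch of Newton-Raphson division, Equations (27)-(29).
def newton_raphson_divide(N, D, iterations=3):
    R = 2.915 - 2 * D                 # starting estimate, Equation (27)
    for _ in range(iterations):
        R = R * (2 - D * R)           # Equation (28): 2 multiplies, 1 subtract
    return N * R                      # Equation (29): final multiply

print(newton_raphson_divide(0.625, 0.75))   # ~0.8333333, as in Fig. 15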
CONCLUSIONS
This article has presented an overview of the two's complement number system and algorithms for the basic fixed-point arithmetic operations of addition, subtraction, multiplication, and division. When implementing arithmetic units, often an opportunity exists to optimize the performance and the complexity to match the requirements of the specific application. In general, faster algorithms require more area and power; often it is desirable to use the fastest algorithm that will fit the available area and power budgets.
Figure 15. Example of Newton-Raphson division (A = .625, B = .75):

R(0) = 2.915 − 2B = 2.915 − 2(.75) = 1.415                          (1 subtract)
R(1) = R(0) (2 − B R(0)) = 1.415 (2 − .75 × 1.415)
     = 1.415 × .93875 = 1.32833125                                  (2 multiplies, 1 subtract)
R(2) = R(1) (2 − B R(1)) = 1.32833125 (2 − .75 × 1.32833125)
     = 1.32833125 × 1.00375156 = 1.3333145677                       (2 multiplies, 1 subtract)
R(3) = R(2) (2 − B R(2)) = 1.3333145677 (2 − .75 × 1.3333145677)
     = 1.3333145677 × 1.00001407 = 1.3333333331                     (2 multiplies, 1 subtract)
Q    = A R(3) = .625 × 1.3333333331 = .83333333319                  (1 multiply)

BIBLIOGRAPHY

1. A. Weinberger and J. L. Smith, A logic for high-speed addition, National Bureau of Standards Circular, 591: 3–12, 1958.
2. O. L. MacSorley, High-speed arithmetic in binary computers, Proceedings of the IRE, 49: 67–91, 1961.
3. T. Kilburn, D. B. G. Edwards, and D. Aspinall, A parallel arithmetic unit using a saturated transistor fast-carry circuit, Proceedings of the IEEE, Part B, 107: 573–584, 1960.
4. A. D. Booth, A signed binary multiplication technique, Quarterly J. Mechanics Appl. Mathemat., 4: 236–240, 1951.
5. H. Sam and A. Gupta, A generalized multibit recoding of two's complement binary numbers and its proof with application in multiplier implementations, IEEE Trans. Comput., 39: 1006–1015, 1990.
6. B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, New York: Oxford University Press, 2000.
7. C. S. Wallace, A suggestion for a fast multiplier, IEEE Trans. Electron. Comput., 13: 14–17, 1964.
8. L. Dadda, Some schemes for parallel multipliers, Alta Frequenza, 34: 349–356, 1965.
9. J. E. Robertson, A new class of digital division methods, IEEE Trans. Electr. Comput., 7: 218–222, 1958.
10. M. D. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Their Implementations, Boston, MA: Kluwer Academic Publishers, 1994.
11. IEEE Standard for Binary Floating-Point Arithmetic, IEEE Std 754-1985, Reaffirmed 1990.
12. M. D. Ercegovac and T. Lang, Digital Arithmetic, San Francisco, CA: Morgan Kaufmann Publishers, 2004.
13. I. Koren, Computer Arithmetic Algorithms, 2nd Edition, Natick, MA: A. K. Peters, 2002.
EARL E. SWARTZLANDER, JR.
University of Texas at Austin
Austin, Texas
FLUENCY WITH INFORMATION TECHNOLOGY
Fluency with Information Technology specifies a degree of competency with computers and information in which users possess the skills, concepts, and capabilities needed to apply Information Technology (IT) confidently and effec-
tively and can acquire new knowledge independently. Flu-
ency with Information Technology transcends computer
literacy and prepares students for lifelong learning of IT.
The concept of Fluency with Information Technology
derives from a National Research Council (NRC) study
initiated in 1997 and funded by the National Science
Foundation (NSF). The study focused on defining what everyone should know about Information Technology. For the purposes of the study, "everyone" meant the population at large, and IT was defined broadly to include computers, networking, software and applications, as well as information resources, virtually anything one would encounter
using a network-connected personal computer.
The NSF motivation for requesting the study was driven by the belief that much of the United States population is already computer literate, but that literacy is not enough. With more knowledge, people would make greater use of IT, and doing so would generally be beneficial. Specifically, the NSF noted these points:
Employee.Salary
This seemingly straightforward paradigm has generated a large body of research, both academic and industrial, which resulted in several prototype systems as well as its acceptance as a part of the SQL99 (11) standard that, in turn, has made triggers part of the commercially available DBMS. In the rest of this article, we will present some of the important aspects of the management of reactive behavior in ADBS and will discuss their distinct features. In particular, in the section on formalizing and reasoning, we motivate the need to formalize the active database behavior. In the section on semantic dimensions, we discuss the various semantic dimensions that have been identified in the literature. In the overview section we present the main features of some prototype ADBS briefly, along with the discussion of some commercially available DBMS that provide the triggers capability. Finally, in the last section, we outline some of the current research trends related to the reactive behavior in novel application domains for data management, such as workflows (12), data streams (8,9), moving objects databases (13,14), and sensor networks (6).

¹ Observe that to fully capture the behavior described in this scenario, other triggers are needed, ones that would react to the UPDATE and DELETE of tuples in the Employee relation.
FORMALIZING AND REASONING ABOUT THE ACTIVE
BEHAVIOR
Historically, the reactive behavior expressed as a set of condition → action rules (IF condition holds, THEN execute action) was introduced in the Expert Systems literature [e.g., OPS5 (15)]. Basically, the inference engine of the system would cycle through the set of such rules and, whenever a left-hand side of a rule is encountered that matches the current status of the knowledge base (KB), the action of the right-hand side of that rule would be executed. From the perspective of the ECA paradigm of ADBS, this system can be viewed as one extreme point: CA rules, without an explicit event. Clearly, some kind of implicit event, along with a corresponding formalism, is needed so that the C-part (condition) can properly reflect and monitor/evaluate the desired behavior along the evolution of the database. Observe that, in general, the very concept of an evolution of the database must be defined clearly; for example, the state of the data in a given instance together with the activities log (e.g., an SQL query will not change the data; however, the administrator may need to know which user queried which dataset). A particular approach to specify such conditions in database triggers, assuming that the clock-tick is the elementary implicit event, was presented by Sistla and Wolfson (16) and is based on temporal logic as an underlying mechanism to evaluate and to detect the satisfiability of the condition.
As another extreme, one may consider the EA type of rules, with a missing condition part. In this case, the detection of events must be empowered with the evaluation of a particular set of facts in a given state of the database [i.e., the evaluation of the C-part must be embedded within the detection of the events (5)]. A noteworthy observation is that even outside the context of the ADBS, the event management has spurred a large amount of research. An example is the field known as event notification systems, in which various users can, in a sense, subscribe for notifications that, in turn, are generated by entities that have a role of publishers, all in distributed settings (10). Researchers have proposed various algebras to specify a set of composite events, based on the operators that are applied to the basic/primitive events (5,17). For example, the expression E = E1 ∧ E2 specifies that an instance of the event E should be detected in a state of the ADBS in which both E1 and E2 are present. On the other hand, E = E1; E2 specifies that an instance of the event E should be detected in a state in which the prior detection of E1 is followed by a detection of E2 (in that order). Clearly, one also needs an underlying detection mechanism for the expressions, for example, Petri Nets (17) or tree-like structures (5). Philosophically, the reason to incorporate both E and C parts of the ECA rules in ADBS is twofold: (1) It is intuitive to state that certain conditions should not always be checked but only upon the detection of certain events and (2) it is more cost-effective in actual implementations, as opposed to constant cycling through the set of rules.² Incorporating both events and conditions in the triggers has generated a plethora of different problems, such as the management of database state(s) during the execution of the triggers (18) and the binding of the detected events with the state(s) of the ADBS for the purpose of condition evaluation (19).
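The two composite-event operators just mentioned can be prototyped with simple bookkeeping. The sketch below is illustrative only and is not one of the cited detection mechanisms (such as Petri nets or tree-like structures); it scans a stream of primitive event occurrences and reports when E1 ∧ E2 and E1; E2 are first detected.

# Sketch: detecting the composite events E = E1 AND E2 (both have occurred)
# and E = E1 ; E2 (E1 occurs, then E2) over a stream of primitive events.
def detect_composites(stream):
    seen_e1 = seen_e2 = False
    e1_before_e2 = False
    detections = []
    for event in stream:
        if event == "E1":
            seen_e1 = True
        elif event == "E2":
            seen_e2 = True
            if seen_e1:
                e1_before_e2 = True
        if seen_e1 and seen_e2 and "E1 AND E2" not in detections:
            detections.append("E1 AND E2")
        if e1_before_e2 and "E1 ; E2" not in detections:
            detections.append("E1 ; E2")
    return detections

print(detect_composites(["E2", "E1", "E2"]))   # ['E1 AND E2', 'E1 ; E2']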
The need for formal characterization of the active rules
(triggers) was recognized by the research community in the
early 1990s. One motivation was caused by the observation
that in different prototype systems [e.g., Postgres (4) vs.
Starburst (2)], triggers with very similar syntactic struc-
ture would yield different executional behavior. Along with
this was the need to perform some type of reasoning about
the evolution of an active database system and to predict
(certain aspects of) their behavior. As a simple example,
given a set of triggers and a particular state of the DBMS, a
database/application designer may wish to know whether
a certain fact will hold in the database after a sequence of
modifications (e.g., insertions, deletions, updates) have been performed. In the context of our example, one may be interested in the query "will the average salary of the employees in the Shipping department exceed 55K in any valid state which results via salary updates?" A translation of the active database specification into a logic program was
proposed as a foundation for this type of reasoning in
Ref. (20).
Two global properties that have been identified as desirable for any application of an ADBS are the termination and the confluence of a given set of triggers (21,22). The termination property ensures that for a given set of triggers, in any initial state of the database and for any initial modification, the firing of the triggers cannot proceed indefinitely. On the other hand, the confluence property ensures that for a given set of triggers, in any initial state of the database and for any initial modification, the final state of the database is the same, regardless of the order of executing the (enabled) triggers. The main question is, given the specifications of a set of triggers, can one statically (i.e., by applying some algorithmic techniques only to the triggers specification) determine whether the properties of termination and/or confluence hold? To give a simple motivation, in many systems, the number of cascaded/recursive invocations of the triggers is bounded by a predefined constant to avoid infinite sequences of firing the triggers because of a particular event. Clearly, this behavior is undesirable if the termination could have been achieved in a few more recursive executions of the triggers. Although run-time termination analysis is a possible option, it is preferable
² A noteworthy observation here is that the occurrence of a particular event is, strictly speaking, different from its detection, which is associated with a run-time processing cost.
to have static tools. In the earlier draft of the SQL3 stan-
dard, compile-time syntactic restrictions were placed on the triggers specifications to ensure termination/confluence. However, it was observed that these specifications may put excessive limitations on the expressive power of the triggers language, which is undesirable for many applications, and they were removed from the subsequent SQL99
draft.
For the most part, the techniques to analyze the termi-
nation and the confluence properties are based on labeled graph-based techniques, such as the triggering hypergraphs (22). For a simplified example, Fig. 1a illustrates a triggering graph in which the nodes denote the particular triggers, and the edge between two nodes indicates that the modifications generated by the action part of a given trigger node may generate the event that enables the trigger represented by the other node. If the graph contains a cycle, then it is possible for the set of triggers along that cycle to enable each other indefinitely through a cascaded sequence of invocations. In the example, the cycle is formed among Trigger1, Trigger3, Trigger4, and Trigger5. Hence, should Trigger1 ever become enabled because of the occurrence of its event, these four triggers could loop perpetually in a sequence of cascading firings. On the other hand, Figure 1b illustrates a simple example of a confluent behavior of a set of triggers. When Trigger1 executes its action, both Trigger2 and Trigger3 are enabled. However, regardless of which one is selected for an execution, Trigger4 will be the next one that is enabled. Algorithms for static analysis of the ECA rules are presented in Ref. (21), which addresses
their application to the triggers that conform to the SQL99
standard.
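A static termination check of the kind sketched above amounts to cycle detection on the triggering graph. The following fragment is illustrative only; the edges echo the cycle described for Fig. 1a, and the graph contents are assumptions rather than part of the source.

# Sketch: possible non-termination = a cycle in the triggering graph,
# where an edge T1 -> T2 means T1's action may raise T2's event.
def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph.get(node, []):
            if color[succ] == GRAY:          # back edge -> cycle
                return True
            if color[succ] == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

triggering_graph = {                          # echoes Fig. 1a (illustrative)
    "Trigger1": ["Trigger3"], "Trigger3": ["Trigger4"],
    "Trigger4": ["Trigger5"], "Trigger5": ["Trigger1"], "Trigger2": [],
}
print(has_cycle(triggering_graph))            # True: termination not guaranteed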
SEMANTIC DIMENSIONS OF ACTIVE DATABASES
Many of the distinctions among the various systems stem
from the differences in the values chosen for a particular
parameter (23). In some cases, the choice of that value is an
integral part of the implementation (hard wired),
whereas in other cases the ADBS provide a declarative
syntax for the users to select a desired value. To better
understand this concept, recall again our average salary
maintenance scenario from the introduction. One of the
possible sources that can cause the database to arrive at an
undesired state is an update that increases the salary of a
set of employees. We already illustrated the case of an
insertion; now assume that the trigger that would correspond to the second case is specified as follows:
CREATE TRIGGER Update-Salary-Check
ON UPDATE OF Employee.Salary
IF (SELECT AVG(Employee.Salary)) > 65,000
UPDATE Employee
SET Employee.Salary = 0.95 * Employee.Salary
Assume that it was decided to increase the salary of
every employee in the Maintenance department by 10%,
which would correspond to the following SQL statement:
UPDATE Employee
SET Employee.Salary = 1.10 * Employee.Salary
WHERE Employee.Department = 'Maintenance'
For the sake of illustration, assume that three employees are in
the Maintenance department, Bob, Sam, and Tom, whose salaries
need to be updated. Strictly speaking, an update is essentially a
sequence of a deletion of an old tuple followed by an insertion
of that tuple with the updated values for the respective
attribute(s); however, for the purpose of this illustration, we
can assume that the updates execute atomically. Now, some obvious
behavioral options for this simple scenario are:

An individual instance of the trigger Update-Salary-Check may be
fired immediately, for every single update of a particular
employee, as shown in Fig. 2a.

The DBMS may wait until all the updates are completed, and then
execute the Update-Salary-Check, as illustrated in Fig. 2b.
The DBMS waits for the completion of all the updates and the
evaluation of the condition. If satisfied, the execution of the
action for the instances of the Update-Salary-Check trigger may
be performed either within the same transaction as the UPDATE
Employee statement or in a different/subsequent transaction.

Figure 1. Triggering graphs for termination and confluence.
These issues illustrate some aspects that have motivated the
researchers to identify various semantic dimensions, a term used
to collectively identify the various parameters whose values may
influence the executional behavior of an ADBS. Strictly speaking,
the model of the triggers processing is coupled closely with the
properties of the underlying DBMS, such as the data model, the
transaction manager, and the query optimizer. However, some
identifiable stages exist for every underlying DBMS:

Detection of events: Events can be either internal, caused by a
DBMS-invoked modification, a transaction command, or, more
generally, by any server-based action (e.g., clicking a mouse
button on the display); or external, which report an occurrence
of something outside the DBMS server (e.g., a humidity reading of
a particular sensor, a location of a particular user detected by
an RFID-based sensor, etc.). Recall that (c.f. formalizing and
reasoning) the events can be primitive or composite, defined in
terms of the primitive events.
Detection of affected triggers: This stage identifies the subset
of the specified triggers whose enabling events are among the
detected events. Typically, this stage is also called the
instantiation of the triggers.

Conditions evaluation: Given the set of instantiated triggers,
the DBMS evaluates their respective conditions and decides which
ones are eligible for execution. Observe that the evaluation of
the conditions may sometimes require a comparison between the
values in the OLD (e.g., pretransaction state, or the state just
before the occurrence of the instantiating event) with the NEW
(or current) state of the database.

Scheduling and execution: Given the set of instantiated triggers
whose condition part is satisfied, this stage carries out their
respective action parts. In some systems [e.g., Starburst (2)]
the users are allowed to specify a priority-based ordering among
the triggers explicitly for this purpose. However, in the SQL
standard (11), and in many commercially available DBMSs, the
ordering is based on the timestamp of the creation of the
trigger, and even this may not be enforced strictly at run-time.
Recall (c.f. formalizing and reasoning) that the execution of the
actions of the triggers may generate events that enable some
other triggers, which causes a cascaded firing of triggers.
In the rest of this section, we present a detailed discus-
sion of some of the semantic dimensions of the triggers.
Granularity of the Modications
In relational database systems, a particular modification
(insertion, deletion, update) may be applied to a single tuple or
to a set of tuples. Similarly, in an object-oriented database
system, the modifications may be applied to a single object or to
a collection of objects (instances of a given class).
Based on this distinction, the active rules can be made to
react in a tuple/instance manner or in a set-oriented man-
ner. An important observation is that this type of granu-
larity is applicable in two different places in the active
rules: (1) the events to which a particular rule reacts (is
awoken by) and (2) the modications executed by the
action part of the rules. Typically, in the DBMS that
complies with the SQL standard, this distinction is speci-
ed by using FOR EACH ROW (tuple-oriented) or FOR
EACH STATEMENT (set-oriented) specication in the
respective triggers. In our motivational scenario, if one would
like the trigger Update-Salary-Check to react to the
modifications of the individual tuples, which corresponds to the
behavior illustrated in Fig. 2a, its specification should be:
CREATE TRIGGER Update-Salary-Check
ON UPDATE OF Employee.Salary
FOR EACH ROW
IF SELECT(...)
...
Coupling Among Triggers Components
Because each trigger that conforms to the ECA paradigm has three
distinct parts (the Event, the Condition, and the Action), one of
the important questions is how they are synchronized. This
synchronization is often called the coupling among the triggers
components.
E-C coupling: This dimension describes the temporal relationship
between the events that enable certain triggers and the time of
evaluating their condition parts, with respect to the transaction
in which the events were generated. With immediate coupling, the
conditions are evaluated as soon as the basic modification that
produced the events is completed. Under the delayed coupling
mode, the evaluation of the conditions is delayed until a
specific point (e.g., a special event takes place, such as an
explicit rule-processing point in the transaction). Specifically,
if this special event is the attempt to commit the transaction,
then the coupling is also called deferred.

C-A coupling: Similarly, for a particular trigger, a temporal
relationship exists between the evaluation of its condition and
(if satisfied) the instant its action is executed. The options
are the same as for the E-C coupling: immediate, in which case
the action is executed as soon as the condition evaluation is
completed (in case it evaluates to true); delayed, which executes
the action at some special point/event; and deferred, which is
the case when the actions are executed at the end (just before
commit) of the transaction in which the condition is evaluated
(a minimal sketch of these coupling modes follows this list).
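To illustrate how these modes differ operationally, here is a minimal, hypothetical sketch (plain Python, not the mechanism of any particular DBMS) in which a single trigger's condition is either checked immediately after each modification or deferred to a rule-processing point at commit; the table contents, threshold, and names are illustrative only:

# Minimal sketch: immediate vs. deferred E-C coupling for one trigger that
# re-checks an average-salary constraint (all names and values are illustrative).
salaries = {"Ann": 60000, "Bob": 62000}
pending_events = []                                # events awaiting deferred processing

def condition():
    return sum(salaries.values()) / len(salaries) > 65000

def action():
    for name in salaries:                          # scale every salary back by 5%
        salaries[name] = round(salaries[name] * 0.95)

def on_update(name, value, coupling="immediate"):
    salaries[name] = value                         # the triggering modification
    if coupling == "immediate":
        if condition():                            # condition evaluated right away
            action()
    else:
        pending_events.append(name)                # deferred: just remember the event

def commit():
    if pending_events and condition():             # rule-processing point at commit
        action()
    pending_events.clear()

# Deferred coupling: both updates are applied first, the condition is checked once.
on_update("Ann", 70000, coupling="deferred")
on_update("Bob", 70000, coupling="deferred")
commit()
print(salaries)                                    # {'Ann': 66500, 'Bob': 66500}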
A noteworthy observation at this point is that the
semantic dimensions should not be understood as isolated
completely from each other but, to the contrary, their
values may be correlated. Among the other reasons, this
correlation exists because the triggers manager cannot be
implemented in isolation from the query optimizer and the
transaction manager. In particular, the coupling modes
discussed above are not independent from the transaction
processing model and its relationship with the individual
parts of the triggers. As another semantic dimensioninthis
context, one may consider whether the conditions evalua-
tion and the actions executions should be executed in the
same transaction in which the triggering events have
occurred (note that the particular transaction may be
aborted because of the effects of the triggers processing).
In a typical DBMS setting, in which the ACID (atomicity,
consistency, isolation, and durability) properties of the
transactions must be ensured, one would like to maintain
the conditions evaluations and actions executions within
the same transaction in which the triggering event origi-
nated. However, if a more sophisticated transaction man-
agement is available [e.g., nested transactions (24)] they
may be processed in a separate subtransaction(s), in which
the failure of a subtransaction may cause the failure of a
parent transactioninwhichthe events originated, or intwo
different transactions. This transaction is known com-
monly as a detached coupling mode.
Events Consumption and Composition
These dimensions describe how a particular event is trea-
tedwhenprocessing a particular trigger that is enableddue
to its occurrence, as well as howthe impact of the net effect
of a set of events is considered.
One of the differences in the behavior of a particular
ADBSis caused by the selectionof the scope(23) of the event
consumptions:
NOconsumption: the evaluation and the execution of
the conditions part of the enabled/instantiated trig-
gers have no impact on the triggering event. In
essence, this means that the same event can enable
a particular trigger over and over again. Typically,
such behavior is found in the production rule systems
used in expert systems (15).
Local consumption: once an instantiated trigger has
proceeded with its condition part evaluation, that
trigger can no longer be enabled by the same event.
However, that particular event remains eligible for
evaluation of the condition the other triggers that it
has enabled. This feature is the most common in the
existing active database systems. Inthe setting of our
motivational scenario, assume that we have another
trigger, for example, Maintain-Statistics, which also
reacts to an insertion of newemployees by increasing
properly the total number of the hired employees in
the respective departmental relations. Upon inser-
tion of a set of new employees, both New-Employee-
Salary-Check and Maintain-Statistics triggers will
be enabled. Under the local consumption mode, in
case the New-Employee-Salary-Check trigger exe-
cutes rst, it is no longer enabled by the same inser-
tion. The Maintain-Statistics trigger, however, is left
enabledandwill checkits conditionand/or execute its
action.
Global consumption: Essentially, global consumption
means that once the rst trigger has been selected for
its processing, a particular event can no longer be
used to enable any other triggers. In the settings of
our motivational scenario, once the given trigger
New-Employee-Salary-Check has been selected for
evaluation of its condition, it would also disable the
Maintain-Statistics despite that it never had its con-
ditionchecked. Ingeneral, this type of consumptionis
appropriate for the settings in which one can distin-
guish among regular rules and exception rules
that are enabled by the same event. The exception
not only has a higher priority, but it also disables the
processing of the regular rule.
A particular kind of event composition, which is frequently
encountered in practice, is the event net effect.
The basic distinction is whether the system should con-
sider the impact of the occurrence of a particular event,
regardless of what are the subsequent events in the
transaction, or consider the possibility of invalidating
some of the events that have occurred earlier. As a parti-
cular example, the following intuitive policyfor computing
the net effects has been formalized and implemented in
the Starburst system (2):
If a particular tuple is created (and possibly updated)
in a transaction, and subsequently deleted within
that same transaction, the net effect is null.
If a particular tuple is created (respectively, updated)
in a transaction, and that tuple is updated subse-
quently several times, the net effect is the creation of
the final version of that tuple (respectively, the single
update equivalent to the final value).
If a particular tuple is updated and subsequently deleted in a
given transaction, then the net effect is the deletion of the
original tuple (a minimal sketch of this policy is given below).
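As a hedged illustration of the policy above (a minimal sketch, not Starburst's actual implementation), the per-tuple modification log of a transaction can be collapsed into net effects as follows; the log format is hypothetical:

# Minimal sketch of the net-effect policy described above: collapse the
# per-tuple event log of a transaction into at most one net event per tuple.
def net_effects(log):
    """log: list of (op, tuple_id, value); op in {'insert', 'update', 'delete'}."""
    net = {}                                        # tuple_id -> (op, value) or None
    for op, tid, value in log:
        prev = net.get(tid)
        if op == "insert":
            net[tid] = ("insert", value)
        elif op == "update":
            if prev and prev[0] == "insert":
                net[tid] = ("insert", value)        # created then updated -> create final version
            else:
                net[tid] = ("update", value)        # update(s) -> single final update
        elif op == "delete":
            if prev and prev[0] == "insert":
                net[tid] = None                     # created then deleted -> no net effect
            else:
                net[tid] = ("delete", None)         # updated then deleted -> delete original
    return {tid: e for tid, e in net.items() if e is not None}

log = [("insert", 1, "a"), ("update", 1, "b"),      # net: insert final version of tuple 1
       ("insert", 2, "x"), ("delete", 2, None),     # net: nothing for tuple 2
       ("update", 3, "y"), ("delete", 3, None)]     # net: delete tuple 3
print(net_effects(log))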
Combining the computation of the net effects in systems that
allow specification of composite events via an event algebra
(5,17) is a very complex problem. The main reason is that in a
given algebra, the detection of a particular composite event may
be in a state in which several different instances of one of its
constituent events have occurred. Now, the question becomes what
is the policy for consuming the primitive events upon a detection
of a composite one. An illustrative example is provided in Fig. 3.
Assume that the elementary (primitive) events correspond to
tickers from the stock market and the user is interested in the
composite event: CE = (two consecutive increases of the IBM
stock) AND (two consecutive increases of the General Electric
[GE] stock). Given the timeline for the sequence of events
illustrated in Fig. 3, upon the second occurrence of the GE stock
increase (GE2), the desired composite event CE can be detected.
However, now the question becomes which of the primitive events
should be used for the detection of CE (6 ways exist to couple
the IBM-based events), and how should the rest of the events from
the history be consumed in the future (e.g., if GE2 is not
consumed upon the detection of CE, then when GE3 occurs, the
system will be able to detect another instance of CE).
Chakravarthy et al. (5) have identified four different contexts
(recent, chronicle, continuous, and cumulative) of consuming the
earlier occurrences of the primitive constituent events which
enabled the detection of a given composite event.
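The following is a minimal, illustrative sketch of one possible consumption policy in the spirit of the "recent" context (it is not the detection algorithm of Ref. (5)); the event labels follow the IBM/GE example above:

# Minimal sketch: detect "two consecutive IBM increases AND two consecutive
# GE increases" keeping only the most recent constituent occurrences, which
# are consumed once the composite event is reported.
from collections import defaultdict, deque

recent = defaultdict(lambda: deque(maxlen=2))   # two most recent increases per stock

def on_increase(stock, label):
    recent[stock].append(label)
    if len(recent["IBM"]) == 2 and len(recent["GE"]) == 2:
        print("CE detected from", list(recent["IBM"]) + list(recent["GE"]))
        recent["IBM"].clear()                   # consume the used occurrences
        recent["GE"].clear()

for e in ["IBM1", "IBM2", "GE1", "IBM3", "GE2", "GE3", "IBM4"]:
    on_increase("IBM" if e.startswith("IBM") else "GE", e)
# CE fires once at GE2 (using IBM2, IBM3, GE1, GE2); GE3 alone cannot
# re-trigger it because the earlier occurrences were already consumed.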
Data Evolution
In many ADBSs, it is important to query the history concerning
the execution of the transaction(s). For example, in our
motivational scenario, one may envision a modified constraint
that states that the average salary increase in the enterprise
should not exceed 5% of its previous value when new employees are
inserted and/or when the salaries of the existing employees are
updated. Clearly, in such settings, the condition parts of the
respective triggers should compare the current state of the
database with the older state.
When it comes to past database states, a special syntax is
required to specify properly the queries that will retrieve the
correct information that pertains to the prior database states.
It can be speculated that every single state starting from the
begin point of a particular transaction should be available for
inspection; however, in practice, only a few such states are
available (c.f. Ref. (23)):
Pretransaction state: the state of the database just
before the executionof the transactionthat generated
the enabling event.
Last consideration state: given a particular trigger,
the state of the database after the last time that
trigger has been considered (i.e., for its condition
evaluation).
Pre-event state: given a particular trigger, the state of
the database just before the occurrence of the event
that enabled that trigger.
Typically, in the existing commercially available DBMS
that offers active capabilities, the ability to query the past
states refers to the pretransaction state. The users are
given the keywords OLD and NEW to specify declaratively
which part needs to be queried when specifying the condi-
tion part of the triggers (11).
Another option for inspecting the history of the active
database system is to query explicitly the set of occurred
events. The main benet of this option is the increased
exibility to specify the desired behavioral aspects of a
given application. For example, one may wish to query
not all the items affected by a particular transaction, but
only the ones that participated in the generation of the
given composite event that enabled a particular trigger (5).
Some prototype systems [e.g., Chimera (25)] offer this extended
functionality; however, the triggers in the commercially
available DBMSs that conform to the SQL standard are restricted
to querying the database states only (c.f. the OLD and NEW
above).
Recent works (26) have addressed the issues of extending the
capabilities of the commercially available ORDBMS Oracle 10g (27)
with features that add flexibility for accessing various portions
(states) of interest throughout the evolution of the ADBS, which
enables sophisticated management of events for a wide variety of
application domains.
Effects Ordering
We assumed that the execution of the action part of a
particular trigger occurs not only after the occurrence of
the event, but also after the effects of executing the mod-
ications that generated that event have been incorpo-
rated. In other words, the effects of executing a
particular trigger were adding to the effects of the mod-
ications that were performed by its enabling event.
Although this seems to be the most intuitive approach,
in some applications, such as alerting or security monitor-
ing, it may be desirable to have the action part of the
corresponding trigger execute before the modications of
the events take place, or even instead of the modications.
A typical example is a trigger that detects when an unauthorized
user has attempted to update the value of a particular tuple in a
given relation. Before executing the user's request, the
respective log file needs to be updated properly. Subsequently,
the user-initiated transaction must be aborted; instead, an alert
must be issued to the database administrator. Commercially
available DBMSs offer the flexibility of stating the BEFORE,
AFTER, and INSTEAD preferences in the specification of the
triggers.
Figure 3. Composite events and consumption.
Conict Resolution and Atomicity of Actions
We already mentioned that if more than one trigger is
enabled by the occurrence of a particular event, some
selection must be performed to evaluate the respective
conditions and/or execute the actions part. From the
most global perspective, one may distinguish between
the serial execution, which selects a single rule according
to a predened policy, and a parallel execution of all the
enabled triggers. The latter was envisioned in the HiPAC
active database systems (c.f. Ref. (28)) and requires sophis-
ticated techniques for concurrency management. The
former one can vary from specifying the total priority ordering
completely by the designer, as done in the Postgres system (4),
to partial ordering, which specifies an incomplete precedence
relationship among the triggers, as is the case in the Starburst
system (20). Although the total ordering among the triggers may
enable a deterministic behavior of the active database, it may be
too demanding on the designer, who always is expected to know
exactly the intended behavior of all the available rules (23).
Commercial systems that conform with the SQL99 standard do not
offer the flexibility of specifying an ordering among the
triggers. Instead, the default ordering is by the timestamp of
their creation.
When executing the action part of a given trigger, a
particular modication may constitute an enabling event
for some other trigger, or even for a new instance of the
same trigger whose actions execution generated that
event. One option is to interrupt the action of the currently
executing trigger and process the triggers that were awo-
ken by it, which could result in cascaded invocation where
the execution of the trigger that produced the event is
suspended temporarily. Another option is to ignore the
occurrence of the generated event temporarily, until the
action part of the currently executing trigger is completed
(atomic execution). This illustrates that the values in different
semantic dimensions are indeed correlated. Namely, the choice of
the atomicity of the execution will impact the value of the
E-C/C-A coupling modes: one cannot expect an immediate coupling
if the execution of the actions is to be atomic.
Expressiveness Issues
As we illustrated, the choice of values for a particular
semantic dimension, especially when it comes to the rela-
tionship with the transaction model, may yield different
outcomes of the execution of a particular transaction by
the DBMS (e.g., deferred coupling will yield different
behavior from the immediate coupling). However, another subtle
aspect of the active database systems depends strongly on their
chosen semantic dimensions: the expressive power. Picouet and
Vianu (29) introduced a
broad model for active databases based on the unied
framework of relational Turing machines. By restricting
some of the values of the subset of the semantic dimen-
sions and thus capturing the interactions between the
sequence of the modications and the triggers, one can
establish a yardstick to compare the expressive powers of
the various ADBSs. For example, it can be demonstrated
that:
The A-RDL system (30) under the immediate cou-
pling mode is equivalent to the Postgres system(4) on
ordered databases.
The Starburst system (2) is incomparable to the
Postgres system (4).
The HiPAC system (28) subsumes strictly the Star-
burst (2) and the Postgres (4) systems.
Althoughthis type of analysis is extremely theoretical in
nature, it is important because it provides some insights
that may have an impact on the overall application design.
Namely, when the requirements of a given application of
interest are formalized, the knowledge of the expressive
power of the set of available systems may be a crucial factor
to decide which particular platform should be used in the
implementation of that particular application.
OVERVIEW OF ACTIVE DATABASE SYSTEMS
In this section, we outline briefly some of the distinct features
of the existing ADBSs, both prototypes as well as commercially
available systems. A detailed discussion of the properties of
some systems, which will also provide an insight into the
historic development of the research in the field of ADBS, can be
found in Refs. (1) and (7).
Relational Systems
A number of systems have been proposed to extend the
functionality of relational DBMS with active rules. Typi-
cally, the events in such systems are mostly database
modications (insert, delete, update) and the language to
specify the triggers is based on the SQL.
Ariel (31): The Ariel system closely resembles the traditional
Condition → Action rules from the expert systems literature (15),
because the specification of the Event part is optional.
Therefore, in general, NO event consumption exists, and the
coupling modes are immediate.
Starburst (2): This system has been used extensively
for database-internal applications, such as integrity
constraints and views maintenance. Its most notable
features include the set-based execution model and
the introduction of the net effects when considering
the modications that have led to the occurrence of a
particular event. Another particular feature intro-
duced by the Starburst system is the concept of rule
processing points, which may be specied to occur
during the execution of a particular transaction or at
its end. The execution of the action part is atomic.
Postgres (4): The key distinction of Postgres is that
the granularity of the modications to which the
triggers react is tuple (row) oriented. The coupling
modes between the E-C and the C-A parts of the
triggers are immediate and the execution of
the actions part is interruptable, which means that the recursive
enabling of the triggers is an option. Another notable feature of
the Postgres system is that it allows for INSTEAD OF
specification in its active rules.
Object-Oriented Systems
One of the distinct features of object-oriented DBMS
(OODBMS) is that it has methods that are coupled with
the denition of the classes that specify the structure of the
data objects stored in the database. This feature justies
the preference for usingOODBMSfor advancedapplication
domains that include extended behavior management.
Thus, the implementation of active behavior in these sys-
tems is coupled tightly with a richer source of events for the
triggers (e.g., the execution of any method).
ODE (32): The ODE system was envisioned as an
extension of the C++ language with database cap-
abilities. The active rules are of the C-A type and are
divided into constraints andtriggers for the efciency
of the implementations. Constraints and triggers are
both dened at a class level and are considered a
property of a given class. Consequently, they can be
inherited. One restriction is that the updates of the
individual objects, caused by private member func-
tions, cannot be monitored by constraints and trig-
gers. The system allows for both immediate coupling
(called hard constraints) and deferred coupling
(called soft constraints), and the triggers can be
declared as executing once-only or perpetually (reac-
tivated).
HiPAC (28): The HiPAC project has pioneered many
of the ideas that were used subsequently in various
research results on active database systems. Some of
the most important contributions were the introduc-
tion of the coupling modes and the concept of compo-
site events. Another important feature of the HiPAC
system was the extension that provided the so called
delta-relation, whichmonitors the net effect of a set of
modications and made it available as a part of the
querying language. HiPAC also introduced the
visionary features of parallel execution of multiple
triggers as subtransactions of the original transac-
tion that generated their enabling events.
Sentinel (5): The Sentinel project provided an active
extension of the OODBMS, which represented the
active rules as database objects and focused on
the efcient integration of the rule processing mod-
ule within the transaction manager. One of the main novelties of
this particular research project was the introduction of a rich
mechanism to specify and to detect composite events.
SAMOS (18): The SAMOS active database prototype
introduced the concept of an interval as part of the
functionality needed to manage composite events. A
particular novelty was the ability to include the
monitoring intervals of interest as part of the speci-
cation of the triggers. The underlying mechanismto
detect the composite events was based on Colored
Petri-Nets.
Chimera(25): The Chimera systemwas envisioned as
a tool that would seamlessly integrate the aspects of
object orientation, deductive rules, and active rules
into a unied paradigm. Its model has strict under-
lying logical semantics (xpoint based) and very
intuitive syntax to specify the active rules. It is based
on the EECA (Extended-ECA) paradigm, specied in
Ref. (23), and it provides the exibility to specify a
wide spectrum of behavioral aspects (e.g., semantic
dimensions). The language consists of two main com-
ponents: (1) declarative, which is used to specify
queries, deductive rules, and conditions of the active
rules; and (2) procedural, which is used to specify the
nonelementary operations to the database, as well as
the action parts of the triggers.
Commercially Available Systems
One of the earliest commercially available active database
systems was DB2 (3), which integrated trigger processing
with the evaluation and maintenance of declarative con-
straints in a manner fully compatible with the SQL92
standard. At the time it served as a foundation model for
the draft of the SQL3standard. Subsequently, the standard
has migrated to the SQL99 version (11), in which the
specication of the triggers is as follows:
<trigger definition> ::=
    CREATE TRIGGER <trigger name>
    {BEFORE | AFTER} <trigger event> ON <table name>
    [REFERENCING <old or new values alias list>]
    [FOR EACH {ROW | STATEMENT}]
    [<trigger condition>]
    <trigger action>
<trigger event> ::= INSERT | DELETE | UPDATE [OF <column name list>]
<old or new values alias list> ::=
    {OLD | NEW} [AS] <identifier> | {OLD_TABLE | NEW_TABLE} [AS] <identifier>
The condition part in the SQL99 triggers is optional and, if
omitted, it is considered to be true; otherwise, it can be any
arbitrarily complex SQL query. The action part, on the other
hand, is any sequence of SQL statements, which includes the
invocation of stored procedures, embedded within a single
BEGIN-END block. The only statements that are excluded from the
available actions pertain to connections, sessions, and
transactions processing. Commercially available DBMSs, with minor
variations, follow the guidelines of the SQL99 standard.
In particular, the Oracle 10g (27), an object-relational
DBMS (ORDBMS), not only adheres to the syntax speci-
cations of the SQL standard for triggers (28), but also
provides some additions: The triggering event can be spe-
cied as a logical disjunction (ON INSERT OR UPDATE)
and the INSTEAD OF option is provided for the actions
execution. Also, some system events (startup/shutdown,
8 ACTIVE DATABASE SYSTEMS
server error messages), as well as user events (logon/logoff
and DDL/DML commands), can be used as enabling events
in the triggers specication. Just like in the SQL standard,
if more than one trigger is enabled by the same event, the
Oracle server will attempt to assign a priority for their
execution based on the timestamps of their creation. However, it
is not guaranteed that this ordering will actually be enforced at
run time. When it comes to dependency management,
Oracle 10g server treats triggers in a similar manner to
the stored procedures: they are inserted automatically into
the data dictionary and linked with the referenced objects
(e.g., the ones which are referenced by the action part of the
trigger). Inthe presence of integrity constraints, the typical
executional behavior of the Oracle 10g server is as follows:
1. Run all BEFOREstatement triggers that apply to the
statement.
2. Loop for each row affected by the SQL statement.
a. Run all BEFORE row triggers that apply to the
statement.
b. Lock and change row, and perform integrity con-
straint checking. (The lock is not released until the
transaction is committed.)
c. Run all AFTER row triggers that apply to the
statement.
3. Complete deferred integrity constraint checking.
4. Run all AFTER statement triggers that apply to the
statement.
The Microsoft Server MS-SQL also follows closely the
syntax prescribed by the SQL99 standard. However, it has
its own additions; for example, it provides the INSTEAD
OFoptionfor triggers execution, as well as a specicationof
a restricted form of composite events to enable the parti-
cular trigger. Typically, the statements execute in a tuple-
oriented manner for each row. A particular trigger is asso-
ciated with a single table and, upon its definition, the server
generates a virtual table automatically to access the old data
items. For example, if a particular trigger is supposed
to react to INSERT on the table Employee, then upon
insertion to Employee, a virtual relation called Inserted
is maintained for that trigger.
NOVEL CHALLENGES FOR ACTIVE RULES
We conclude this article with a brief description of some
challenges for the ADBSs innovel applicationdomains, and
with a look at an extended paradigm for declaratively
specifying reactive behavior.
Application Domains
Workow management systems (WfMS) provide tools to
manage (modeling, executing, and monitoring) workows,
which are viewed commonly as processes that coordinate
various cooperative activities to achieve a desired goal.
Workow systems often combine the data centric view of
the applications, which is typical for information systems,
with their process-centric behavioral view. It has already been
indicated (12) that WfMS could benefit greatly from a full use of
the tools and techniques available in the DBMS when managing
large volumes of data. In particular, Shankar et al. (12) have
applied active rules to the WfMS settings, which demonstrates
that data-intensive scientific workflows can benefit from the
concept of active tables associated with the programs. One
typical feature of workflows is that
many of the activities may need to be executed by distrib-
uted agents (actors of particular roles), which need to be
synchronized to optimize the concurrent execution. A par-
ticular challenge, from the perspective of triggers manage-
ment in such distributed settings, is to establish a common
(e.g., transaction-like) context for their maincomponents
events, conditions, and actions. As a consequence, the
corresponding triggers must execute in a detached mode,
which poses problems related not only to the consistency,
but also to their efcient scheduling and execution (33).
Unlike traditional database applications, many novel
domains that require the management of large quantities of
information are characterized by the high volumes of data
that arrive very fast in a stream-like fashion (8). One of the
main features of such systems is that the queries are no
longer instantaneous; they become continuous/persistent
in the sense that users expect the answers to be updated
properly to reect the current state of the streamed-in
values. Clearly, one of the main aspects of the continuous
queries (CQ) management systems is the ability to react
quickly to the changes caused by the variation of the
streams and process efciently the modication of the
answers. As such, the implementation of CQ systems
may benet from the usage of the triggers as was demon-
strated in the Niagara project (9). One issue related to the
scalability of the CQ systems is the very scalability of the
triggers management (i.e., many instances of various trig-
gers may be enabled). Although it is arguable that the
problem of the scalable execution of a large number of
triggers may be coupled closely with the nature of the
particular application domain, it has been observed that
some general aspects of the scalability are applicable uni-
versally. Namely, one can identify similar predicates (e.g., in
the conditions) across many triggers and group them into
equivalence classes that can be indexed on those predicates. This
may require a more involved system catalog (34), but the payoff
is a much more efficient execution of a
set of triggers. Recent researchhas also demonstrated that,
to capture the intended semantics of the application
domain in dynamic environments, the events may have
to be assigned an interval-based semantics (i.e., duration
may need to be associated with their detection). In parti-
cular, in Ref. (35), the authors have demonstrated that if
the commonly accepted instantaneous semantics for events
occurrence is used in trafc management settings, one may
obtain an unintended meaning for the composite events.
Moving objects databases (MODs) are concerned with
the management of large volumes of datathat pertainto the
location-in-time information of the moving entities, as well
as efcient processing of the spatio-temporal queries that
pertain to that information (13). By nature, MOD queries
are continuous and the answers to the pending queries
change because of the changes in the location of the mobile
objects, which is another natural setting for exploiting an
efficient form of a reactive behavior. In particular, Ref. (14)
proposed a framework based on the existing triggers in
commercially available systems to maintain the correct-
ness of the continuous queries for trajectories. The problem
of the scalable execution of the triggers in these settings
occurs whena trafc abnormality in a geographically small
region may cause changes to the trajectories that pass
through that region and, in turn, invalidate the answers
to spatio-temporal queries that pertain to a much larger
geographic area. The nature of the continuous queries
maintenance is dependent largely on the model adopted
for the mobility representation, and the MOD-eld is still
very active in devising efcient approaches for the queries
management which, inone way or another, do require some
form of active rules management.
Recently, the wireless sensor networks (WSNs) have
opened a wide range of possibilities for novel applications
domains inwhich the whole process of gathering and mana-
ging the information of interest requires new ways of per-
ceiving the data management problems (36). WSNconsist of
hundreds, possibly thousands, of low-cost devices (sensors)
that are capable of measuring the values of a particular
physical phenomenon (e.g., temperature, humidity) and of
performing some elementary calculations. In addition, the
WSNs are also capable of communicating and self-organiz-
inginto anetworkinwhichthe informationcanbegathered,
processed, and disseminated to a desired location. As an
illustrative example of the benets of the ECA-like rules in
WSN settings, consider the following scenario (c.f. Ref. (6)):
whenever the sensors deployed in a given geographic area of
interest have detected that the average level of carbon monoxide
in the air over any region larger than 1200 ft² exceeds 22%, an
alarm should be activated. Observe that
here the event corresponds to the updates of the (readings
of the) individual sensors; the condition is a continuous
query evaluated over the entire geographic zone of interest,
and with a nested sub-query of identifying the potentially-
dangerous regions. At intuitive level, this seems like a
straightforward application of the ECA paradigm. Numer-
ous factors insensor networks affect the efcient implemen-
tation of this type of behavior: the energy resource of the
individual nodes is very limited, the communication
between nodes drains more current from the battery than
thesensingandlocal calculations, andunlikethetraditional
systems where there are fewvantage points to generate new
events, in WSN settings, any sensor node can be an event-
generator. The detection of composite events, as well as the
evaluation of the conditions, must to be integrated in a fully
distributed environment under severe constraints (e.g.,
energy-efcient routing is a paramount). Efcient imple-
mentation of the reactive behavior in a WSN-based
databases is an ongoing research effort.
The (ECA)² Paradigm
Given the constantly evolving nature of the streaming or
moving objects data, along with the consideration that it
maybe managedbydistributedandheterogeneous sources,
it is important to offer a declarative tool in which the users
can actually specify how the triggers themselves should
evolve. Users can adjust the events that they monitor, the
conditions that they need to evaluate, and the action that
they execute. Consider, for example, a scenario in which a
set of motion sensors deployed around a regionof interest is
supposed to monitor whether an object is moving continu-
ously toward that region for a given time interval. Aside
from the issues of efcient detection of such an event, the
application may require an alert to be issued when the
status of the closest air eld is such that fewer than a
certain number of ghter jets are available. In this setting,
both the event detection and the condition evaluation are
done in distributed manner and are continuous in nature.
Aside from the need of their efcient synchronization, the
applicationdemands that whenaparticular object ceases to
move continuously toward the region, the condition should
not be monitored any further for that object. However, if the
object in question is closer than a certain distance (after
moving continuously toward the region of interest for a
given time), in turn, another trigger may be enabled, which
will notify the infantry personnel. An approach for declara-
tive specification of triggers for such behavior was presented in
Ref. (37), where the (ECA)² paradigm (evolving and context-aware
event-condition-action) was introduced.
Under this paradigm, for a given trigger, the users can
embed children triggers in the specications, which will
become enabled upon the occurrences of certain events in
the environment, and only when their respective parent
triggers are no longer of interest. The childrentriggers may
consume their parents either completely, by eliminating
them from any consideration in the future or partially, by
eliminating only the particular instance from the future
consideration, but allowing a creation of subsequent
instances of the parent trigger. Obviously, in these
settings, the coupling modes among the E-C and C-A com-
ponents of the triggers must to be detached, and for
the purpose of their synchronization the concept of meta-
triggers was proposed in Ref. (37). The efcient processing
of such triggers is still an open challenge.
BIBLIOGRAPHY
1. J. Widom and S. Ceri, Active Database Systems: Triggers and
Rules for Advanced Database Processing, San Francisco:
Morgan Kauffman, 1996.
2. J. Widom, The Starburst Active Database Rule System, IEEE
Trans. Knowl. Data Enginee., 8(4): 583595, 1996.
3. R. Cochrane, H. Pirahesh and N. M. Mattos, Integrating Trig-
gers and Declarative Constraints in SQL Database Systems,
International Conference on Very Large Databases, 1996.
4. M. Stonebraker, The integration of rule systems and database
systems, IEEE Trans. Knowl. Data Enginee., 4(5): 416432,
1992.
5. S. Chakravarthy, V. Krishnaprasad, E. Answar, andS. K. Kim,
Composite Events for Active Databases: Semantics, Contexts
and Detection, International Conference on Very Large Data-
bases (VLDB), 1994.
6. M. Zoumboulakis, G. Roussos, and A. Poulovassilis, Active
Rules for Wireless Networks of Sensors & Actuators, Interna-
tional Conference on Embedded Networked Sensor Systems,
2003.
7. N. W. Paton, Active Rules in Database Systems, New York:
Springer Verlag, 1999.
8. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom,
Models and Issues in Data Stream Systems, International
Conference on Principles of Database Systems, 2002.
9. J. Chen, D.J. DeWitt, F. Tian, and Y. Wang, NiagaraCQ: A
Scalable Continuous Query System for Internet Databases,
ACM SIGMOD International Conference on Management of
Data, 2000.
10. A. Carzaniga, D. Rosenblum, and A. Wolf, Achieving Scalabil-
ity and Expressiveness in an Internet-scale Event Notication
Service, ACM Symposium on Principles of Distributed Com-
puting, 2000.
11. ANSI/ISO International Standard: Database Language SQL.
Available: http://webstore.ansi.org.
12. S. Shankar, A. Kini, D. J. DeWitt, and J. F. Naughton, Inte-
grating databases and workow systems, SIGMOD Record,
34(3): 511, 2005.
13. R.H. Guting and M. Schneider, Moving Objects Databases, San
Francisco: Morgan Kaufman, 2005.
14. G. Trajcevski and P. Scheuermann, Reactive maintenance of
continuousqueries, Mobile Comput. Commun. Rev., 8(3): 2233,
2004.
15. L. Brownston, K. Farrel, E. Kant, andN. Martin, Programming
Expert Systems in OPS5: An Introduction to Rule-Base Pro-
gramming, Boston: Addison-Wesley, 2005.
16. A.P. Sistla and O. Wolfson, Temporal Conditions and Integrity
Constraints inActive Database Systems ACMSIGMOD, Inter-
national Conference on Management of Data, 1995.
17. S. GatziuandK. R. Ditrich, Events inanActive Object-Oriented
Database System, International Workshop on Rules in Data-
base Systems, 1993.
18. K. Dittrich, H. Fritschi, S. Gatziu, A. Geppert, and A. Vaduva,
SAMOS in hindsight: Experiences in building an active object-
oriented DBMS, Informat. Sys., 30(5): 369392, 2003.
19. S.D. Urban, T. Ben-Abdellatif, S. Dietrich, and A. Sundermier,
Delta abstractions: a technique for managing database states
in runtime debugging of active database rules, IEEE-TKDE,
15(3): 597612, 2003.
20. C. Baral, J. Lobo, and G. Trajcevski, Formal Characterization
of Active Databases: Part II, International Conference on
Deductive and Object-Oriented Databases (DOOD), 1997.
21. E. Baralis and J. Widom, An algebraic approach to static
analysis of active database rules, ACM Trans. Database
Sys., 27(3): 289332, 2000.
22. S.D. Urban, M.K. Tschudi, S.W. Dietrich, and A.P. Karadimce,
Active rule termination analysis: an implementation and eva-
luation of the rened triggering graph method, J. Intell. Infor-
mat. Sys., 14(1): 2960, 1999.
23. P. Fraternali and L. Tanca, A structured approach for the
denition of the semantics of active databases, ACM Trans.
Database Sys., 22(4): 416471, 1995.
24. G. Weikum and G. Vossen, Transactional Information Sys-
tems: Theory, Algorithms and the Practice of Concurrency
Control, San Francisco: Morgan Kauffman, 2001.
25. P. Fraternali and S. Paraboschi, Chimera: a language for
designing rule applications, in N. W. Paton (ed.), Active Rules
in Database Systems, Berlin: Springer-Verlag, 1999.
26. M. Thome, D. Gawlick, and M. Pratt, Event Processing with an
Oracle Database, SIGMOD International Conference on Man-
agement of Data, 2005.
27. K. Owens, Programming Oracle Triggers and Stored Procedures,
3rd ed., O'Reilly Publishers, 2003.
28. U. Dayal, A.P. Buchmann, and S. Chakravarthy, The HiPAC
project, in J. Widom and S. Ceri, Active Database Systems.
Triggers and Rules for Advanced Database Processing, San
Francisco: Morgan Kauffman, 1996.
29. P. Picouet and V. Vianu, Semantics and expressiveness issues
in active databases, J. Comp. Sys. Sci., 57(3): 327355, 1998.
30. E. Simon and J. Kiernan, The A-RDL system, in J. Widomand
S. Ceri, Active Database Systems: Triggers and Rules for
Advanced Database Processing, San Francisco: Morgan Kauff-
man, 1996.
31. E. N. Hanson, Rule Condition Testing and Action Execution in
Ariel, ACM SIGMOD International Conference on Manage-
ment of Data, 1992.
32. N. Gehani and H.V. Jagadish, ODE as an Active Database:
Constraints and Triggers, International Conference on Very
Large Databases, 1992.
33. S. Ceri, C. Gennaro, X. Paraboschi, and G. Serazzi, Effective
scheduling of detached rules in active databases, IEEE Trans.
Knowl. Data Enginee., 16(1): 215, 2003.
34. E.N. Hanson, S. Bodagala, and U. Chadaga, Trigger condition
testing and view maintenance using optimized discrimination
networks, IEEE Trans. Know. Data Enginee., 16(2): 281300,
2002.
35. R. Adaikkalvan and S. Chakravarthy, Formalization and
Detection of Events Using Interval-Based Semantics, Interna-
tional Conference on Advances in Data Management, 2005.
36. F. Zhao and L. Guibas, Wireless Sensor Networks: An Informa-
tion Processing Approach, San Francisco: Morgan Kaufmann,
2004.
37. G. Trajcevski, P. Scheuermann, O. Ghica, A. Hinze, and A.
Voisard, Evolving Triggers for Dynamic Environments, Inter-
national Conference on Extending the Database Technology,
2006.
PETER SCHEUERMANN
GOCE TRAJCEVSKI
Northwestern University
Evanston, Illinois
ALGEBRAIC CODING THEORY
INTRODUCTION
In computers and digital communication systems, information
almost always is represented in a binary form as a sequence of
bits, each having the value 0 or 1. This sequence of bits is
transmitted over a channel from a sender to a receiver. In some
applications the channel is a storage medium like a DVD, where
the information is written to the medium at a certain time and
retrieved at a later time. Because of the physical limitations of
the channel, some transmitted bits may be corrupted (the channel
is noisy) and thus make it difficult for the receiver to
reconstruct the information correctly.
In algebraic coding theory, we are concerned mainly with
developing methods to detect and correct errors that typically
occur during transmission of information over a noisy channel.
The basic technique to detect and correct errors is by
introducing redundancy in the data that is to be transmitted.
This technique is similar to communicating in a natural language
in daily life. One can understand the information while listening
to a noisy radio or talking on a bad telephone line, because of
the redundancy in the language.
For example, suppose the sender wants to communicate one of 16
different messages to a receiver. Each message m can then be
represented as a binary quadruple (c_0, c_1, c_2, c_3). If the
message (0101) is transmitted and the first position is corrupted
such that (1101) is received, this leads to an uncorrectable
error because this quadruple represents a different valid message
than the message that was sent across the channel. The receiver
will have no way to detect and to correct a corrupted message in
general, because any quadruple represents a valid message.
Therefore, to combat errors the sender encodes the data by
introducing redundancy into the transmitted information. If M
messages are to be transmitted, the sender selects a subset of M
binary n-tuples, where M < 2^n. Each of the M messages is encoded
into an n-tuple. The set consisting of the M n-tuples obtained
after encoding is called a binary (n, M) code, and the elements
are called codewords. The codewords are sent over the channel.
It is customary for many applications to let M = 2^k, such that
each message can be represented uniquely by a k-tuple of
information bits. To encode each message the sender can append
n - k parity bits depending on the message bits and use the
resulting n-bit codeword to represent the corresponding message.
A binary code C is called a linear code if the sum (modulo 2) of
two codewords is again a codeword. This is always the case when
the parity bits are linear combinations of the information bits.
In this case, the code C is a vector space of dimension k over
the binary field of two elements, containing M = 2^k codewords,
and is called an [n, k] code. The main reason for using linear
codes is that these codes have more algebraic structure and are
therefore often easier to analyze and to decode in practical
applications.
The simplest example of a linear code is the [n, n-1] even-weight
code (or parity-check code). The encoding consists of appending a
single parity bit to the n-1 information bits so that the
codeword has an even number of ones. Thus, the code consists of
all 2^(n-1) possible n-tuples of even weight, where the weight of
a vector is the total number of ones in its components. This code
can detect all errors in an odd number of positions, because if
such an error occurs, the received vector also will have odd
weight. The even-weight code, however, can only detect errors.
For example, if (000...0) is sent and the first bit is corrupted,
then (100...0) is received. Also, if (110...0) was sent and the
second bit was corrupted, then (100...0) is received. Hence, the
receiver cannot correct this single error or, in fact, any other
error.
An illustration of a code that can correct any single error is
shown in Fig. 1. The three circles intersect and divide the plane
into seven finite areas and one infinite area. Each finite area
contains a bit c_i for i = 0, 1, ..., 6. Each of the 16 possible
messages, denoted by (c_0, c_1, c_2, c_3), is encoded into a
codeword (c_0, c_1, c_2, c_3, c_4, c_5, c_6), in such a way that
the sum of the bits in each circle has an even parity.
In Fig. 2, an example is shown of encoding the message (0011)
into the codeword (0011110). Because the sum of two codewords
also obeys the parity checks and thus is a codeword, the code is
a linear [7, 4] code.
Suppose, for example, that the transmitted codeword is corrupted
in the bit c_1 such that the received word is (0111110). Then,
calculating the parity of each of the three circles, we see that
the parity fails for the upper circle as well as for the leftmost
circle, whereas the parity of the rightmost circle is correct.
Hence, from the received vector, indeed we can conclude that bit
c_1 is in error and should be corrected. In the same way, any
single error can be corrected by this code.
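As a small illustration of this circle-based decoding (a minimal sketch under the bit/circle assignment implied by the parity-check equations given in the Example of the next section, not part of the original text), encoding and single-error correction can be carried out as follows:

# Minimal sketch of the [7,4] code from Figs. 1 and 2: encode a 4-bit
# message and correct a single error via the three circle parity checks
#   c0+c1+c2+c4 = 0,  c0+c1+c3+c5 = 0,  c0+c2+c3+c6 = 0   (mod 2)
CHECKS = [(0, 1, 2, 4), (0, 1, 3, 5), (0, 2, 3, 6)]

def encode(msg):                       # msg = (c0, c1, c2, c3)
    c0, c1, c2, c3 = msg
    return [c0, c1, c2, c3,
            (c0 + c1 + c2) % 2,        # c4: parity bit of the first circle
            (c0 + c1 + c3) % 2,        # c5: parity bit of the second circle
            (c0 + c2 + c3) % 2]        # c6: parity bit of the third circle

def correct(word):
    """Flip the unique bit (if any) that belongs exactly to the failing checks."""
    failing = {i for i, chk in enumerate(CHECKS)
               if sum(word[j] for j in chk) % 2 == 1}
    if failing:
        for bit in range(7):
            if {i for i, chk in enumerate(CHECKS) if bit in chk} == failing:
                word[bit] ^= 1
                break
    return word

codeword = encode((0, 0, 1, 1))        # -> [0, 0, 1, 1, 1, 1, 0]
received = codeword[:]
received[1] ^= 1                       # corrupt bit c1, as in the text
print(correct(received) == codeword)   # True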
LINEAR CODES
An (n, M) code is simply a set of M vectors of length n with
components from a finite field F_2 = {0, 1}, where addition and
multiplication are done modulo 2. For practical applications it
is desirable to provide the code with more structure. Therefore,
linear codes often are preferred. A linear [n, k] code C is a
k-dimensional subspace C of F_2^n, where F_2^n is the vector
space of n-tuples with coefficients from the finite field F_2.
A linear code C usually is described in terms of a generator
matrix or a parity-check matrix. A generator matrix G of C is a
k × n matrix whose row space is the code C; i.e.,

    C = { xG | x ∈ F_2^k }
A parity-check matrix H is an (n - k) × n matrix such that

    C = { c ∈ F_2^n | cH^tr = 0 }

where H^tr denotes the transpose of H.
Example. The codewords in the code in the previous section are
the vectors (c_0, c_1, c_2, c_3, c_4, c_5, c_6) that satisfy the
following system of parity-check equations:

    c_0 + c_1 + c_2 + c_4 = 0
    c_0 + c_1 + c_3 + c_5 = 0
    c_0 + c_2 + c_3 + c_6 = 0

where all additions are modulo 2. Each of the three parity-check
equations corresponds to one of the three circles.
The coefficient matrix of the parity-check equations is the
parity-check matrix

    H = | 1 1 1 0 1 0 0 |
        | 1 1 0 1 0 1 0 |        (1)
        | 1 0 1 1 0 0 1 |
The code C is therefore given by

    C = { c = (c_0, c_1, ..., c_6) | cH^tr = 0 }
A generator matrix for the code in the previous example is given
by

    G = | 1 0 0 0 1 1 1 |
        | 0 1 0 0 1 1 0 |
        | 0 0 1 0 1 0 1 |
        | 0 0 0 1 0 1 1 |
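For a quick sanity check of these matrices (a minimal sketch, not part of the original article), one can verify that encoding with G produces the codeword from Fig. 2 and that every codeword satisfies cH^tr = 0:

# Minimal sketch: encode with the generator matrix G above and verify the
# parity checks given by H (i.e., the syndrome of a codeword is zero).
G = [[1, 0, 0, 0, 1, 1, 1],
     [0, 1, 0, 0, 1, 1, 0],
     [0, 0, 1, 0, 1, 0, 1],
     [0, 0, 0, 1, 0, 1, 1]]

H = [[1, 1, 1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]

def encode(x, G):
    """Codeword c = xG over F_2."""
    return [sum(x[i] * G[i][j] for i in range(len(x))) % 2
            for j in range(len(G[0]))]

def syndrome(c, H):
    """c * H^tr over F_2; the all-zero vector means c is a codeword."""
    return [sum(c[j] * row[j] for j in range(len(c))) % 2 for row in H]

c = encode([0, 0, 1, 1], G)
print(c)                 # [0, 0, 1, 1, 1, 1, 0] -- the codeword from Fig. 2
print(syndrome(c, H))    # [0, 0, 0]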
Two codes are equivalent if the codewords in one of the codes can
be obtained by a fixed permutation of the positions in the
codewords in the other code. If G (respectively, H) is a
generator (respectively, parity-check) matrix of a code, then the
matrices obtained by permuting the columns of these matrices in
the same way give the generator (respectively, parity-check)
matrix of the permuted code.
The Hamming distance between x = (x_0, x_1, ..., x_{n-1}) and
y = (y_0, y_1, ..., y_{n-1}) in F_2^n is the number of positions
in which they differ. That is,

    d(x, y) = |{ i | x_i ≠ y_i, 0 ≤ i ≤ n-1 }|
The Hamming distance has the properties required to be a metric:

1. d(x, y) ≥ 0 for all x, y ∈ F_2^n, and equality holds if and only if x = y.
2. d(x, y) = d(y, x) for all x, y ∈ F_2^n.
3. d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ F_2^n.
For any code C, one of the most important parameters is its
minimum distance, defined by

    d = min{ d(x, y) | x ≠ y, x, y ∈ C }

The Hamming weight of a vector x in F_2^n is the number of
nonzero components in x = (x_0, x_1, ..., x_{n-1}). That is,

    w(x) = |{ i | x_i ≠ 0, 0 ≤ i ≤ n-1 }| = d(x, 0)

Note that because d(x, y) = d(x + y, 0) = w(x + y) for a linear
code C, it follows that

    d = min{ w(z) | z ∈ C, z ≠ 0 }

Therefore, finding the minimum distance of a linear code is
equivalent to finding the minimum nonzero weight among all
codewords in the code.
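As a hedged illustration of this equivalence (a brute-force sketch, practical only for small codes), the minimum distance of the [7, 4] code above can be computed as the minimum nonzero codeword weight:

# Minimal sketch: the minimum distance of a linear code equals the minimum
# nonzero codeword weight, found here by enumerating all 2^k codewords of
# the [7,4] code generated by G above.
from itertools import product

G = [[1, 0, 0, 0, 1, 1, 1],
     [0, 1, 0, 0, 1, 1, 0],
     [0, 0, 1, 0, 1, 0, 1],
     [0, 0, 0, 1, 0, 1, 1]]

def codewords(G):
    k, n = len(G), len(G[0])
    for x in product((0, 1), repeat=k):           # all 2^k messages
        yield tuple(sum(x[i] * G[i][j] for i in range(k)) % 2
                    for j in range(n))

d = min(sum(c) for c in codewords(G) if any(c))   # minimum nonzero weight
print(d)                                          # 3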
If w(c) = i, then cH^tr is the sum of i columns of H. Hence, an
alternative description of the minimum distance of a linear code
is as follows: the smallest d such that d linearly dependent
columns exist in the parity-check matrix. In particular, to
obtain a binary linear code of minimum distance of at least
three, it is sufficient to select the columns of a parity-check
matrix to be distinct and nonzero.

Figure 1. The message (c_0, c_1, c_2, c_3) is encoded into the
codeword (c_0, c_1, c_2, c_3, c_4, c_5, c_6), where c_4, c_5, c_6
are chosen such that there is an even number of ones within each
circle.

Figure 2. Example of the encoding procedure given in Fig. 1. The
message (0011) is encoded into (0011110). Note that there is an
even number of ones within each circle.
Sometimes we include d in the notation and refer to an [n, k]
code with minimum distance d as an [n, k, d] code. If t
components are corrupted during transmission of a codeword, we
say that t errors have occurred or that an error e of weight t
has occurred [where e = (e_0, e_1, ..., e_{n-1}) ∈ F_2^n, where
e_i = 1 if and only if the ith component was corrupted, that is,
if c was sent, c + e was received].
The error-correcting capability of a code is defined as

    t = ⌊(d - 1)/2⌋

where ⌊x⌋ denotes the largest integer ≤ x.
A code with minimum distance d can correct all errors of weight t
or less, because if a codeword c is transmitted and an error e of
weight ≤ t occurs, the received vector r = c + e is closer in
Hamming distance to the transmitted codeword c than to any other
codeword. Therefore, decoding any received vector to the closest
codeword corrects all errors of weight ≤ t.
The code can be used for error detection only. The code can
detect all errors of weight < d, because if a codeword is
transmitted and the error has weight < d, then the received
vector is not another codeword.
The code also can be used for a combination of error correction
and error detection. For a given e ≤ t, the code can correct all
errors of weight ≤ e and in addition can detect all errors of
weight at most d - e - 1. This is caused by the fact that no
vector in F_2^n can be at distance ≤ e from one codeword and at
the same time at a distance ≤ d - e - 1 from another codeword.
Hence, the algorithm in this case is to decode a received vector
to a codeword at distance ≤ e if such a codeword exists and
otherwise detect an error.
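The following minimal sketch (brute-force nearest-codeword decoding, feasible only for very small codes and not part of the original text) illustrates that the [7, 4, 3] code corrects any single error by decoding to the closest codeword:

# Minimal sketch: nearest-codeword (minimum-distance) decoding for the
# [7,4,3] code generated by G above; with d = 3 it corrects t = 1 error.
from itertools import product

G = [[1, 0, 0, 0, 1, 1, 1],
     [0, 1, 0, 0, 1, 1, 0],
     [0, 0, 1, 0, 1, 0, 1],
     [0, 0, 0, 1, 0, 1, 1]]

CODE = [tuple(sum(x[i] * G[i][j] for i in range(4)) % 2 for j in range(7))
        for x in product((0, 1), repeat=4)]

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def decode(r):
    """Return the codeword closest to the received word r."""
    return min(CODE, key=lambda c: hamming(c, r))

sent     = (0, 0, 1, 1, 1, 1, 0)
received = (0, 1, 1, 1, 1, 1, 0)      # one bit flipped
print(decode(received) == sent)       # True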
If C is an [n, k] code, the extended code C^ext is the [n+1, k]
code defined by

    C^ext = { (c_ext, c_0, c_1, ..., c_{n-1}) |
              (c_0, c_1, ..., c_{n-1}) ∈ C,
              c_ext = c_0 + c_1 + ... + c_{n-1} }

That is, each codeword in C is extended by one parity bit such
that the Hamming weight of each codeword becomes even. In
particular, if C has odd minimum distance d, then the minimum
distance of C^ext is d + 1. If H is a parity-check matrix for C,
then a parity-check matrix for C^ext is

    | 1     1 |
    | 0^tr  H |

where 1 = (11...1).
For any linear [n, k] code C, the dual code C^⊥ is the [n, n − k] code defined by

    C^⊥ = {x ∈ F_2^n : (x, c) = 0 for all c ∈ C},

where (x, c) = Σ_{i=0}^{n−1} x_i c_i. We say that x and c are orthogonal if (x, c) = 0. Therefore C^⊥ consists of all n-tuples that are orthogonal to all codewords in C and vice versa; that is, (C^⊥)^⊥ = C. It follows that C^⊥ has dimension n − k because it consists of all vectors that are solutions of a system of equations with coefficient matrix G of rank k. Hence, the parity-check matrix of C^⊥ is a generator matrix of C, and similarly the generator matrix of C^⊥ is a parity-check matrix of C. In particular, GH^tr = O [the k × (n − k) matrix of all zeros].
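The relation GH^tr = O gives a quick machine check that a claimed generator/parity-check pair describes the same code. A small Python sketch (illustrative only; the systematic [7, 4] pair below is an assumed example, not fixed by the article):

    G = [[1, 0, 0, 0, 0, 1, 1],
         [0, 1, 0, 0, 1, 0, 1],
         [0, 0, 1, 0, 1, 1, 0],
         [0, 0, 0, 1, 1, 1, 1]]
    # Parity-check matrix H = [A^tr | I] matching the systematic G = [I | A] above.
    H = [[0, 1, 1, 1, 1, 0, 0],
         [1, 0, 1, 1, 0, 1, 0],
         [1, 1, 0, 1, 0, 0, 1]]

    def is_zero_product(G, H):
        """Check G * H^tr = O over GF(2): every row of G orthogonal to every row of H."""
        return all(sum(g[j] * h[j] for j in range(len(g))) % 2 == 0
                   for g in G for h in H)

    print(is_zero_product(G, H))   # True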
Example. Let C be the [n, n − 1, 2] even-weight code where

    G = ( 1 0 ... 0 1 )
        ( 0 1 ... 0 1 )
        ( .  .      . )
        ( 0 0 ... 1 1 )

and

    H = ( 1 1 ... 1 1 ).

Then C^⊥ has H and G as its generator and parity-check matrices, respectively. It follows that C^⊥ is the [n, 1, n] repetition code consisting of the two codewords (00...0) and (11...1).
Example. Let C be the [2^m − 1, 2^m − 1 − m, 3] code where H contains all nonzero m-tuples as its columns. This is known as the Hamming code. In the case m = 3, a parity-check matrix has already been described in Equation (1). Because all columns of the parity-check matrix are distinct and nonzero, the code has minimum distance at least 3. The minimum distance is indeed 3 because three columns exist whose sum is zero; in fact, the sum of any two columns of H equals another column in H for this particular code.

The dual code C^⊥ is the [2^m − 1, m, 2^{m−1}] simplex code, all of whose nonzero codewords have weight 2^{m−1}. This follows because the generator matrix has all nonzero vectors as its columns. In particular, taking any linear combination of rows, the number of columns with odd parity in the corresponding subset of rows equals 2^{m−1} (and the number with even parity is 2^{m−1} − 1).

The extended code of the Hamming code is a [2^m, 2^m − 1 − m, 4] code. Its dual code is a [2^m, m + 1, 2^{m−1}] code that is known as the first-order Reed-Muller code.
SOME BOUNDS ON CODES
The Hamming bound states that for any (n, M, d) code, we have

    M Σ_{i=0}^{e} (n choose i) ≤ 2^n,

where e = ⌊(d − 1)/2⌋. This follows from the fact that the M spheres

    S_c = {x : d(x, c) ≤ e}

centered at the codewords c ∈ C are disjoint and that each sphere contains Σ_{i=0}^{e} (n choose i) vectors.

If the spheres fill the whole space, that is,

    ∪_{c ∈ C} S_c = F_2^n,

then C is called perfect. The binary linear perfect codes are as follows:

- the [n, 1, n] repetition codes for all odd n;
- the [2^m − 1, 2^m − 1 − m, 3] Hamming codes H_m for all m ≥ 2;
- the [23, 12, 7] Golay code G_23.

We will return to the Golay code later.
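A code is perfect exactly when the Hamming bound holds with equality. The following Python sketch (an illustration only; the parameters are the ones listed above) verifies this for the Hamming codes and the [23, 12, 7] Golay code:

    from math import comb

    def sphere_size(n, e):
        """Number of vectors within Hamming distance e of a fixed word of length n."""
        return sum(comb(n, i) for i in range(e + 1))

    def is_perfect(n, k, d):
        """Check whether an [n, k, d] binary code meets the Hamming bound with equality."""
        e = (d - 1) // 2
        return 2**k * sphere_size(n, e) == 2**n

    print(is_perfect(23, 12, 7))                                             # Golay code: True
    print(all(is_perfect(2**m - 1, 2**m - 1 - m, 3) for m in range(2, 8)))   # Hamming codes: True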
GALOIS FIELDS
Finite fields, also known as Galois fields, with p^m elements exist for any prime p and any positive integer m. A Galois field of a given order p^m is unique (up to isomorphism) and is denoted by F_{p^m}.

For a prime p, let F_p = {0, 1, ..., p − 1} denote the integers modulo p with the two operations addition and multiplication modulo p.

To construct a Galois field with p^m elements, select a polynomial f(x) with coefficients in F_p that is irreducible over F_p; that is, f(x) cannot be written as a product of two polynomials with coefficients from F_p of degree ≥ 1 (irreducible polynomials of any degree m over F_p exist). Let

    F_{p^m} = {a_{m−1} x^{m−1} + a_{m−2} x^{m−2} + ... + a_0 : a_0, ..., a_{m−1} ∈ F_p}.

Then F_{p^m} is a finite field when addition and multiplication of the elements (polynomials) are done modulo f(x) and modulo p. To simplify the notation, let α denote a zero of f(x), that is, f(α) = 0. If such an α exists, formally it can be defined as the equivalence class of x modulo f(x).

For coding theory, p = 2 is by far the most important case, and we assume this from now on. Note that for any a, b ∈ F_{2^m},

    (a + b)^2 = a^2 + b^2.
Example. The Galois field F_{2^4} can be constructed as follows. Let f(x) = x^4 + x + 1, which is an irreducible polynomial over F_2. Then α^4 = α + 1 and

    F_{2^4} = {a_3 α^3 + a_2 α^2 + a_1 α + a_0 : a_0, a_1, a_2, a_3 ∈ F_2}.

If a polynomial m(x) = Σ_{i=0}^{k} m_i x^i has coefficients in F_2 and β as a zero, then

    m(β^2) = Σ_{i=0}^{k} m_i β^{2i} = Σ_{i=0}^{k} m_i^2 β^{2i} = (Σ_{i=0}^{k} m_i β^i)^2 = (m(β))^2 = 0.

Hence, m(x) has β, β^2, ..., β^{2^{k−1}} as zeros, where k is the smallest integer such that β^{2^k} = β. Conversely, the polynomial with exactly these zeros can be shown to be a binary irreducible polynomial.
Example. We will find the minimal polynomials of all the elements in F_{2^4}. Let α be a root of x^4 + x + 1 = 0; that is, α^4 = α + 1. The nonzero elements of the field are the powers α^i, written as 4-bit vectors (a_3 a_2 a_1 a_0) in the following table:

    i    α^i      i    α^i      i     α^i
    0    0001     5    0110     10    0111
    1    0010     6    1100     11    1110
    2    0100     7    1011     12    1111
    3    1000     8    0101     13    1101
    4    0011     9    1010     14    1001

The minimal polynomials over F_2 of α^i for 0 ≤ i ≤ 14 are denoted m_i(x). Observe by the above argument that m_{2i}(x) = m_i(x), where the indices are taken modulo 15. It follows that
    m_0(x) = (x + α^0) = x + 1
    m_1(x) = (x + α)(x + α^2)(x + α^4)(x + α^8) = x^4 + x + 1
    m_3(x) = (x + α^3)(x + α^6)(x + α^12)(x + α^9) = x^4 + x^3 + x^2 + x + 1
    m_5(x) = (x + α^5)(x + α^10) = x^2 + x + 1
    m_7(x) = (x + α^7)(x + α^14)(x + α^13)(x + α^11) = x^4 + x^3 + 1
    m_9(x) = m_3(x)
    m_11(x) = m_7(x)
    m_13(x) = m_7(x)

To verify this, one simply computes the coefficients and uses the preceding table of F_{2^4} in the computations. For example,

    m_5(x) = (x + α^5)(x + α^10) = x^2 + (α^5 + α^10)x + α^5 α^10 = x^2 + x + 1,

since α^5 + α^10 = 0110 + 0111 = 0001 = 1 and α^5 α^10 = α^15 = 1.

This also leads to a factorization into irreducible polynomials,

    x^{2^4} + x = x Π_{j=0}^{14} (x + α^j)
                = x(x + 1)(x^2 + x + 1)(x^4 + x + 1)(x^4 + x^3 + x^2 + x + 1)(x^4 + x^3 + 1)
                = x m_0(x) m_1(x) m_3(x) m_5(x) m_7(x).

In fact, in general it holds that x^{2^m} + x is the product of all irreducible polynomials over F_2 whose degree divides m.
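These computations are easy to check by machine. The Python sketch below (an illustration, not part of the article) implements F_{2^4} arithmetic in the table representation above, with field elements stored as 4-bit integers and multiplication reduced by x^4 + x + 1, and recomputes two of the minimal polynomials:

    MODULUS = 0b10011          # x^4 + x + 1

    def gf_mul(a, b):
        """Multiply two elements of F_16 (bit i of an element = coefficient of alpha^i)."""
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0b10000:
                a ^= MODULUS
        return result

    def poly_mul(p, q):
        """Multiply polynomials with coefficients in F_16 (lists, p[i] = coeff of x^i)."""
        out = [0] * (len(p) + len(q) - 1)
        for i, pi in enumerate(p):
            for j, qj in enumerate(q):
                out[i + j] ^= gf_mul(pi, qj)
        return out

    def minimal_polynomial(beta):
        """Product of (x + c) over the conjugates c = beta^(2^j); coefficients end up in {0, 1}."""
        conjugates, c = [], beta
        while c not in conjugates:
            conjugates.append(c)
            c = gf_mul(c, c)
        m = [1]
        for c in conjugates:
            m = poly_mul(m, [c, 1])        # multiply by (x + c)
        return m

    alpha = 0b0010
    alpha3 = gf_mul(alpha, gf_mul(alpha, alpha))
    print(minimal_polynomial(alpha))       # [1, 1, 0, 0, 1]  ->  x^4 + x + 1
    print(minimal_polynomial(alpha3))      # [1, 1, 1, 1, 1]  ->  x^4 + x^3 + x^2 + x + 1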
Let C_i = {i·2^j mod n : j = 0, 1, ...}, which is called the cyclotomic coset of i (mod n). Then the elements of the cyclotomic coset C_i (mod 2^m − 1) correspond to the exponents of the zeros of m_i(x). That is,

    m_i(x) = Π_{j ∈ C_i} (x + α^j).

The cyclotomic cosets (mod n) are important in the next section, where cyclic codes of length n are discussed.
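For example, the cyclotomic cosets modulo 15 are {0}, {1, 2, 4, 8}, {3, 6, 12, 9}, {5, 10}, and {7, 14, 13, 11}, matching the degrees of m_0, m_1, m_3, m_5, m_7 above. A short Python sketch (illustrative) computes them for any odd n:

    def cyclotomic_cosets(n):
        """Partition {0, 1, ..., n-1} into cyclotomic cosets C_i = {i * 2^j mod n}."""
        remaining, cosets = set(range(n)), []
        while remaining:
            i = min(remaining)
            coset, x = [], i
            while x not in coset:
                coset.append(x)
                x = (2 * x) % n
            cosets.append(coset)
            remaining -= set(coset)
        return cosets

    print(cyclotomic_cosets(15))
    # [[0], [1, 2, 4, 8], [3, 6, 12, 9], [5, 10], [7, 14, 13, 11]]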
CYCLIC CODES
Many good linear codes that have practical and efficient decoding algorithms have the property that a cyclic shift of a codeword is again a codeword. Such codes are called cyclic codes.

We can represent the set of n-tuples over F_2 as polynomials of degree < n in a natural way. The vector c = (c_0, c_1, ..., c_{n−1}) is represented as the polynomial

    c(x) = c_0 + c_1 x + c_2 x^2 + ... + c_{n−1} x^{n−1}.

A cyclic shift

    s(c) = (c_{n−1}, c_0, c_1, ..., c_{n−2})

of c is then represented by the polynomial

    s(c(x)) = c_{n−1} + c_0 x + c_1 x^2 + ... + c_{n−2} x^{n−1}
            = x(c_0 + c_1 x + ... + c_{n−2} x^{n−2} + c_{n−1} x^{n−1}) + c_{n−1}(x^n − 1)
            = xc(x)  (mod x^n − 1).
Example. Rearranging the columns in the parity-check matrix of the [7, 4] Hamming code in Equation (1), an equivalent code is obtained with parity-check matrix

    H = ( 1 0 0 1 0 1 1 )
        ( 0 1 0 1 1 1 0 )      (2)
        ( 0 0 1 0 1 1 1 )

This code contains 16 codewords, which are represented next in polynomial form:

    1000110   x^5 + x^4 + 1                            = (x^2 + x + 1)g(x)
    0100011   x^6 + x^5 + x                            = (x^3 + x^2 + x)g(x)
    1010001   x^6 + x^2 + 1                            = (x^3 + x + 1)g(x)
    1101000   x^3 + x + 1                              = g(x)
    0110100   x^4 + x^2 + x                            = xg(x)
    0011010   x^5 + x^3 + x^2                          = x^2 g(x)
    0001101   x^6 + x^4 + x^3                          = x^3 g(x)
    0010111   x^6 + x^5 + x^4 + x^2                    = (x^3 + x^2)g(x)
    1001011   x^6 + x^5 + x^3 + 1                      = (x^3 + x^2 + x + 1)g(x)
    1100101   x^6 + x^4 + x + 1                        = (x^3 + 1)g(x)
    1110010   x^5 + x^2 + x + 1                        = (x^2 + 1)g(x)
    0111001   x^6 + x^3 + x^2 + x                      = (x^3 + x)g(x)
    1011100   x^4 + x^3 + x^2 + 1                      = (x + 1)g(x)
    0101110   x^5 + x^4 + x^3 + x                      = (x^2 + x)g(x)
    0000000   0                                        = 0
    1111111   x^6 + x^5 + x^4 + x^3 + x^2 + x + 1      = (x^3 + x^2 + 1)g(x)

By inspection it is easy to verify that any cyclic shift of a codeword is again a codeword. Indeed, the 16 codewords in the code are the all-zero word, the all-one word, and all cyclic shifts of (1000110) and (0010111). The unique nonzero polynomial in the code of lowest possible degree is g(x) = x^3 + x + 1, and g(x) is called the generator polynomial of the cyclic code. The code consists of all polynomials c(x) that are multiples of g(x). Note that the degree of g(x) is n − k = 3 and that g(x) divides x^7 − 1 because

    x^7 − 1 = (x + 1)(x^3 + x + 1)(x^3 + x^2 + 1).

Therefore the code has a simple description in terms of the set of code polynomials as

    C = {c(x) : c(x) = u(x)(x^3 + x + 1), deg(u(x)) < 4}.
This situation holds in general for any cyclic code.
For any cyclic [n, k] code C, we have

    C = {c(x) : c(x) = u(x)g(x), deg(u(x)) < k}

for a polynomial g(x) of degree n − k that divides x^n − 1.

We can show this as follows: Let g(x) be the generator polynomial of C, which is the nonzero polynomial of smallest degree r in the code C. Then the cyclic shifts g(x), xg(x), ..., x^{n−r−1}g(x) are codewords, as well as any linear combination u(x)g(x), where deg(u(x)) < n − r. These are the only 2^{n−r} codewords in the code C, because if c(x) is a codeword, then

    c(x) = u(x)g(x) + s(x), where deg(s(x)) < deg(g(x)).

By linearity, s(x) is a codeword, and therefore s(x) = 0 because deg(s(x)) < deg(g(x)) and g(x) is the nonzero polynomial of smallest degree in the code. It follows that C is as described previously. Since C has 2^{n−r} codewords, it follows that n − r = k; that is, deg(g(x)) = n − k.

Finally, we show that g(x) divides x^n − 1. Let c(x) = c_0 + c_1 x + ... + c_{n−1}x^{n−1} be a nonzero codeword shifted such that c_{n−1} = 1. Then the cyclic shift of c(x) given by s(c(x)) = c_{n−1} + c_0 x + c_1 x^2 + ... + c_{n−2}x^{n−1} also is a codeword, and

    s(c(x)) = xc(x) + (x^n − 1).

Because both codewords c(x) and s(c(x)) are divisible by g(x), it follows that g(x) divides x^n − 1.
Because the generator polynomial of a cyclic code divides x^n − 1, it is important to know how to factor x^n − 1 into irreducible polynomials. Let n be odd. Then an integer m exists such that 2^m ≡ 1 (mod n), and an element α ∈ F_{2^m} exists of order n [if v is a primitive element of F_{2^m}, then α can be taken to be α = v^{(2^m − 1)/n}]. We have

    x^n − 1 = Π_{i=0}^{n−1} (x + α^i).

Let m_i(x) denote the minimal polynomial of α^i; that is, the polynomial of smallest degree with coefficients in F_2 having α^i as a zero. The generator polynomial g(x) can be written as

    g(x) = Π_{i ∈ I} (x + α^i),

where I is a subset of {0, 1, ..., n − 1}, called the defining set of C with respect to α. Then m_i(x) divides g(x) for all i ∈ I. Furthermore, g(x) = Π_{j=1}^{l} m_{i_j}(x) for some i_1, i_2, ..., i_l.
Therefore we can describe the cyclic code in alternative equivalent ways as

    C = {c(x) : m_i(x) divides c(x) for all i ∈ I}
    C = {c(x) : c(α^i) = 0 for all i ∈ I}
    C = {c ∈ F_2^n : cH^tr = 0},

where

    H = ( 1  α^{i_1}  α^{2i_1}  ...  α^{(n−1)i_1} )
        ( 1  α^{i_2}  α^{2i_2}  ...  α^{(n−1)i_2} )
        ( .     .        .              .         )
        ( 1  α^{i_l}  α^{2i_l}  ...  α^{(n−1)i_l} )
The encoding for cyclic codes usually is done in one of two ways. Let u(x) denote the information polynomial of degree < k. The two ways are as follows:

1. Encode into u(x)g(x).
2. Encode into c(x) = x^{n−k}u(x) + s(x), where s(x) is the polynomial such that
   - s(x) ≡ x^{n−k}u(x) (mod g(x)) [thus g(x) divides c(x)],
   - deg(s(x)) < deg(g(x)).

The second of these two methods is systematic; that is, the last k bits of the codeword are the information bits.
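Both encodings reduce to polynomial arithmetic over GF(2). The following Python sketch (an illustration only; it uses the [7, 4] code with g(x) = x^3 + x + 1 from the example above) implements the systematic method by computing the remainder s(x):

    def poly_mod(a, b):
        """Remainder of binary polynomial a divided by b (lists, a[i] = coeff of x^i)."""
        a = a[:]
        while len(a) >= len(b):
            if a[-1]:
                shift = len(a) - len(b)
                for i, bi in enumerate(b):
                    a[shift + i] ^= bi
            a.pop()
        return a

    def encode_systematic(u, g):
        """Encode message u(x) as c(x) = x^(n-k) u(x) + (x^(n-k) u(x) mod g(x))."""
        r = len(g) - 1                         # n - k = deg g
        shifted = [0] * r + u                  # x^(n-k) * u(x)
        s = poly_mod(shifted, g)
        s = (s + [0] * r)[:r]                  # pad remainder to exactly r coefficients
        return s + u                           # low-degree part = parity, high-degree part = message

    g = [1, 1, 0, 1]                           # g(x) = x^3 + x + 1
    print(encode_systematic([1, 1, 0, 0], g))  # [1, 0, 1, 1, 1, 0, 0]: parity 101, message 1100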
BCH CODES
An important task in coding theory is to design codes with a guaranteed minimum distance d that correct all errors of Hamming weight ⌊(d − 1)/2⌋. Such codes were designed independently by Bose and Ray-Chaudhuri (1) and by Hocquenghem (2) and are known as BCH codes. To construct a BCH code of designed distance d, the generator polynomial is chosen to have d − 1 consecutive powers of α as zeros:

    α^b, α^{b+1}, ..., α^{b+d−2}.

That is, the defining set I with respect to α contains a set of d − 1 consecutive integers (mod n). The parity-check matrix of the BCH code is

    H = ( 1  α^b        α^{2b}        ...  α^{(n−1)b}       )
        ( 1  α^{b+1}    α^{2(b+1)}    ...  α^{(n−1)(b+1)}   )
        ( .     .          .                  .             )
        ( 1  α^{b+d−2}  α^{2(b+d−2)}  ...  α^{(n−1)(b+d−2)} )
To show that this code has a minimum distance of at least d, it is sufficient to show that any d − 1 columns are linearly independent. Consider the d − 1 columns of H corresponding to α^{i_1 b}, α^{i_2 b}, ..., α^{i_{d−1} b}. The (d − 1) × (d − 1) submatrix obtained by retaining these columns in H has determinant

    det( α^{i_j (b+r)} )_{r=0,...,d−2; j=1,...,d−1}
      = α^{b(i_1 + i_2 + ... + i_{d−1})} · det( α^{r i_j} )_{r=0,...,d−2; j=1,...,d−1}
      = α^{b(i_1 + i_2 + ... + i_{d−1})} Π_{k<r} (α^{i_k} − α^{i_r}) ≠ 0,

because the elements α^{i_1}, α^{i_2}, ..., α^{i_{d−1}} are distinct (the last equality follows from the fact that the second determinant is a Vandermonde determinant). Hence no d − 1 columns can be linearly dependent, and it follows that the BCH code has a minimum Hamming distance of at least d.
If b = 1, which is often the case, the code is called a narrow-sense BCH code. If n = 2^m − 1, the BCH code is called a primitive BCH code. A binary single-error-correcting primitive BCH code is generated by g(x) = m_1(x). The zeros of g(x) are α^{2^i}, i = 0, 1, ..., m − 1. The parity-check matrix is

    H = ( 1  α  α^2  ...  α^{2^m − 2} ).

This code is equivalent to the Hamming code because α is a primitive element of F_{2^m}.

To construct a binary double-error-correcting primitive BCH code, we let g(x) have α, α^2, α^3, α^4 as zeros. Therefore, g(x) = m_1(x)m_3(x) is a generator polynomial of this code. The parity-check matrix of a double-error-correcting BCH code is

    H = ( 1  α    α^2  ...  α^{2^m − 2}    )
        ( 1  α^3  α^6  ...  α^{3(2^m − 2)} )

In particular, a binary double-error-correcting BCH code of length n = 2^4 − 1 = 15 is obtained by selecting

    g(x) = m_1(x)m_3(x)
         = (x^4 + x + 1)(x^4 + x^3 + x^2 + x + 1)
         = x^8 + x^7 + x^6 + x^4 + 1.

Similarly, a binary triple-error-correcting BCH code of the same length is obtained by choosing the generator polynomial

    g(x) = m_1(x)m_3(x)m_5(x)
         = (x^4 + x + 1)(x^4 + x^3 + x^2 + x + 1)(x^2 + x + 1)
         = x^10 + x^8 + x^5 + x^4 + x^2 + x + 1.

The main interest in BCH codes is due to the fact that they have a very fast and efficient decoding algorithm. We will describe this later.
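These generator polynomials are straightforward to recompute. A short Python sketch (illustrative only) multiplies the minimal polynomials found earlier over GF(2):

    def gf2_poly_mul(p, q):
        """Multiply binary polynomials given as coefficient lists (p[i] = coefficient of x^i)."""
        out = [0] * (len(p) + len(q) - 1)
        for i, pi in enumerate(p):
            if pi:
                for j, qj in enumerate(q):
                    out[i + j] ^= qj
        return out

    def poly_str(p):
        terms = [f"x^{i}" if i > 1 else ("x" if i == 1 else "1")
                 for i, c in enumerate(p) if c]
        return " + ".join(reversed(terms))

    m1 = [1, 1, 0, 0, 1]          # x^4 + x + 1
    m3 = [1, 1, 1, 1, 1]          # x^4 + x^3 + x^2 + x + 1
    m5 = [1, 1, 1]                # x^2 + x + 1

    print(poly_str(gf2_poly_mul(m1, m3)))                    # x^8 + x^7 + x^6 + x^4 + 1
    print(poly_str(gf2_poly_mul(gf2_poly_mul(m1, m3), m5)))  # x^10 + x^8 + x^5 + x^4 + x^2 + x + 1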
AUTOMORPHISMS
Let C be a binary code of length n. Consider a permutation π of the set {0, 1, ..., n − 1}; that is, π is a one-to-one function of the set of coordinate positions onto itself. For a codeword c ∈ C, let

    π(c) = (c_{π(0)}, c_{π(1)}, ..., c_{π(n−1)}).

That is, the coordinates are permuted by the permutation π. If

    {π(c) : c ∈ C} = C,

then π is called an automorphism of the code C.

Example. Consider the following (nonlinear) code:

    C = {101, 011}.

The actions of the six possible permutations on three elements are given in the following table; the permutations that are automorphisms are marked by a star.

    π(0) π(1) π(2)   π((101)) π((011))
     0    1    2       101      011      *
     0    2    1       110      011
     1    0    2       011      101      *
     1    2    0       011      110
     2    0    1       110      101
     2    1    0       101      110
In general, the set of automorphisms of a code C forms a group, the automorphism group Aut(C). We note that

    Σ_{i=0}^{n−1} x_i y_i = Σ_{i=0}^{n−1} x_{π(i)} y_{π(i)},

and so (x, y) = 0 if and only if (π(x), π(y)) = 0. In particular, this implies that

    Aut(C) = Aut(C^⊥).

That is, C and C^⊥ have the same automorphism group.

For a cyclic code C of length n, we have by definition s(c) ∈ C for all c ∈ C, where s(i) = i − 1 (mod n). In particular, s ∈ Aut(C). For n odd, the permutation d defined by d(j) = 2j (mod n) also is contained in the automorphism group. To show this, it is easier to show that d^{−1} ∈ Aut(C). We have

    d^{−1}(2j) = j                    for j = 0, 1, ..., (n − 1)/2,
    d^{−1}(2j + 1) = (n + 1)/2 + j    for j = 0, 1, ..., (n − 1)/2 − 1.
Let g(x) be a generator polynomial for C, and let Σ_{i=0}^{n−1} c_i x^i = a(x)g(x). Because x^n ≡ 1 (mod x^n − 1), we have

    Σ_{i=0}^{n−1} c_{d^{−1}(i)} x^i = Σ_{j=0}^{(n−1)/2} c_j x^{2j} + Σ_{j=0}^{(n−1)/2 − 1} c_{(n+1)/2 + j} x^{2j+1}
                                    = Σ_{j=0}^{(n−1)/2} c_j x^{2j} + Σ_{j=(n+1)/2}^{n−1} c_j x^{2j}
                                    = a(x^2)g(x^2) = (a(x^2)g(x))g(x)   (mod x^n − 1),

and so d^{−1}(c) ∈ C; that is, d^{−1} ∈ Aut(C) and so d ∈ Aut(C).
The automorphism group Aut(C) is transitive if for each pair (i, j) there exists a π ∈ Aut(C) such that π(i) = j. More generally, Aut(C) is t-fold transitive if, for distinct i_1, i_2, ..., i_t and distinct j_1, j_2, ..., j_t, a π ∈ Aut(C) exists such that π(i_1) = j_1, π(i_2) = j_2, ..., π(i_t) = j_t.
Example. Any cyclic [n, k] code has a transitive automorphism group, because s repeated i − j (mod n) times maps i to j.

Example. The (nonlinear) code C = {101, 011} was considered previously. Its automorphism group is not transitive because there is no automorphism π such that π(0) = 2.
Example. Let C be the [9, 3] code generated by the matrix

    ( 0 0 1 0 0 1 0 0 1 )
    ( 0 1 0 0 1 0 0 1 0 )
    ( 1 0 0 1 0 0 1 0 0 )

This is a cyclic code, and we will determine its automorphism group. The all-zero and the all-one vectors in C are transformed into themselves by any permutation. The vectors of weight 3 are the rows of the generator matrix, and the vectors of weight 6 are the complements of these vectors. Hence, we see that π is an automorphism if and only if it leaves the set of the three rows of the generator matrix invariant, that is, if and only if the following conditions are satisfied:

    π(0) ≡ π(3) ≡ π(6) (mod 3)
    π(1) ≡ π(4) ≡ π(7) (mod 3)
    π(2) ≡ π(5) ≡ π(8) (mod 3)

Note that the two permutations s and d defined previously satisfy these conditions, as they should. They are listed explicitly in the following table:

    i      0 1 2 3 4 5 6 7 8
    s(i)   8 0 1 2 3 4 5 6 7
    d(i)   0 2 4 6 8 1 3 5 7

The automorphism group is transitive because the code is cyclic, but it is not doubly transitive. For example, no automorphism π exists such that π(0) = 0 and π(3) = 1, because 0 and 1 are not equivalent modulo 3. A simple counting argument shows that Aut(C) has order 1296: First choose π(0); this can be done in nine ways. Then two ways exist to choose π(3) and π(6). Next choose π(1); this can be done in six ways. There are again two ways to choose π(4) and π(7). Finally, there are 3·2 ways to choose π(2), π(5), π(8). Hence, the order is 9 · 2 · 6 · 2 · 3 · 2 = 1296.

Example. Consider the extended Hamming code H_m^ext. The positions of the codewords correspond to the elements of F_{2^m} and are permuted by the affine group

    AG = {π : π(x) = ax + b, a, b ∈ F_{2^m}, a ≠ 0}.

This is the automorphism group of H_m^ext. It is doubly transitive.
THE WEIGHT DISTRIBUTION OF A CODE
Let C be a binary linear [n, k] code. As we noted,

    d(x, y) = d(x − y, 0) = w(x − y).

If x, y ∈ C, then x − y ∈ C by the linearity of C. In particular, this means that the set of distances from a fixed codeword to all the other codewords is independent of which codeword we fix; that is, the code looks the same from any codeword. In particular, the set of distances from the codeword 0 is the set of Hamming weights of the codewords. For i = 0, 1, ..., n, let A_i denote the number of codewords of weight i. The sequence

    A_0, A_1, A_2, ..., A_n

is called the weight distribution of the code C. The corresponding polynomial

    A_C(z) = A_0 + A_1 z + A_2 z^2 + ... + A_n z^n

is known as the weight enumerator polynomial of C. The polynomials A_C(z) and A_{C^⊥}(z) are related by the fundamental MacWilliams identity:

    A_{C^⊥}(z) = 2^{−k} (1 + z)^n A_C((1 − z)/(1 + z)).
Example. The [2^m − 1, m] simplex code has the weight distribution polynomial 1 + (2^m − 1)z^{2^{m−1}}. The dual code is the [2^m − 1, 2^m − 1 − m] Hamming code with weight enumerator polynomial

    2^{−m} (1 + z)^{2^m − 1} [ 1 + (2^m − 1) ((1 − z)/(1 + z))^{2^{m−1}} ]
      = 2^{−m} [ (1 + z)^{2^m − 1} + (2^m − 1)(1 − z)^{2^{m−1}} (1 + z)^{2^{m−1} − 1} ].

For example, for m = 4, we get the weight distribution of the [15, 11] Hamming code:

    1 + 35z^3 + 105z^4 + 168z^5 + 280z^6 + 435z^7 + 435z^8 + 280z^9 + 168z^10 + 105z^11 + 35z^12 + z^15.
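The identity is easy to check numerically. A Python sketch (an illustration only) expands the right-hand side for m = 4 with integer polynomial arithmetic and reproduces the weight distribution above:

    def poly_mul(p, q):
        out = [0] * (len(p) + len(q) - 1)
        for i, a in enumerate(p):
            for j, b in enumerate(q):
                out[i + j] += a * b
        return out

    def poly_pow(p, e):
        out = [1]
        for _ in range(e):
            out = poly_mul(out, p)
        return out

    m = 4
    n = 2**m - 1
    # Hamming enumerator = 2^-m [ (1+z)^n + (2^m - 1) (1-z)^(2^(m-1)) (1+z)^(2^(m-1)-1) ]
    term1 = poly_pow([1, 1], n)
    term2 = poly_mul(poly_pow([1, 1], 2**(m - 1) - 1), poly_pow([1, -1], 2**(m - 1)))
    coeffs = [(a + (2**m - 1) * b) // 2**m for a, b in zip(term1, term2)]
    print(coeffs)
    # [1, 0, 0, 35, 105, 168, 280, 435, 435, 280, 168, 105, 35, 0, 0, 1]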
Consider a binary linear code C that is used purely for error detection. Suppose a codeword c is transmitted over a binary symmetric channel with bit error probability p. The probability of receiving a vector r at distance i from c is p^i(1 − p)^{n−i}, because i positions are changed (each with probability p) and n − i are unchanged (each with probability 1 − p). If r is not a codeword, then this will be discovered by the receiver. If r = c, then no errors have occurred. However, if r is another codeword, then an undetectable error has occurred. Hence, the probability of undetected error is given by

    P_ue(C, p) = Σ_{c′ ≠ c} p^{d(c′, c)} (1 − p)^{n − d(c′, c)}
               = Σ_{c′ ≠ 0} p^{w(c′)} (1 − p)^{n − w(c′)}
               = Σ_{i=1}^{n} A_i p^i (1 − p)^{n−i}
               = (1 − p)^n A_C(p/(1 − p)) − (1 − p)^n.

From the MacWilliams identity, we also get

    P_ue(C^⊥, p) = 2^{−k} A_C(1 − 2p) − (1 − p)^n.
Example. For the [2^m − 1, 2^m − 1 − m] Hamming code H_m, we get

    P_ue(H_m, p) = 2^{−m} (1 + (2^m − 1)(1 − 2p)^{2^{m−1}}) − (1 − p)^{2^m − 1}.

More information on the use of codes for error detection can be found in the books by Kløve and Korzhik (see Further Reading).
THE BINARY GOLAY CODE
The Golay code G_23 has received much attention. It is practically useful and has several interesting properties. The code can be defined in various ways. One definition is that G_23 is the cyclic code generated by the irreducible polynomial

    x^11 + x^9 + x^7 + x^6 + x^5 + x + 1,

which is a factor of x^23 − 1 over F_2. Another definition is the following: Let H denote the [7, 4] Hamming code, and let H^+ be the code whose codewords are the reversals of the codewords of H. Let

    C = {(u + x, v + x, u + v + x) : u, v ∈ H^ext, x ∈ (H^+)^ext},

where H^ext is the [8, 4] extended Hamming code and (H^+)^ext is the [8, 4] extended H^+. The code C is a [24, 12, 8] code. Puncturing the last position, we get a [23, 12, 7] code that is (equivalent to) the Golay code.

The weight distribution of G_23 is given by the following table:

    i         A_i
    0, 23     1
    7, 16     253
    8, 15     506
    11, 12    1288

The automorphism group Aut(G_23) of the Golay code is the Mathieu group M_23, a simple group of order 10200960 = 2^7 · 3^2 · 5 · 7 · 11 · 23, which is fourfold transitive.

Much information about G_23 can be found in the book by MacWilliams and Sloane and in the Handbook of Coding Theory (see Further Reading).
DECODING
Suppose that a codeword c from the [n, k] code C was sent and that an error e occurred during the transmission over the noisy channel. Based on the received vector r = c + e, the receiver has to make an estimate of what was the transmitted codeword. Because error patterns of lower weight are more probable than error patterns of higher weight, the problem is to estimate an error ê such that the weight of ê is as small as possible. The receiver will then decode the received vector r into ĉ = r + ê.

If H is a parity-check matrix for C, then cH^tr = 0 for all codewords c. Hence,

    rH^tr = (c + e)H^tr = cH^tr + eH^tr = eH^tr.      (3)

The vector

    s = eH^tr

is known as the syndrome of the error e; Equation (3) shows that s can be computed from r. We now have the following outline of a decoding strategy:

1. Compute the syndrome s = rH^tr.
2. Estimate an error ê of smallest weight corresponding to the syndrome s.
3. Decode to ĉ = r + ê.

The hard part is, of course, step 2.

For any vector x ∈ F_2^n, the set {x + c : c ∈ C} is a coset of C. All the elements of the coset have the same syndrome, namely, xH^tr. There are 2^{n−k} cosets, one for each syndrome in F_2^{n−k}, and the set of cosets is a partition of F_2^n. We can rephrase step 2 as follows: Find a vector ê of smallest weight in the coset with syndrome s.
Example. Let C be the [6, 3, 3] code with parity-check matrix

    H = ( 1 1 0 1 0 0 )
        ( 1 0 1 0 1 0 )
        ( 0 1 1 0 0 1 )

A standard array for C is the following array (the eight columns to the right):

    000   000000 001011 010101 011110 100110 101101 110011 111000
    110   100000 101011 110101 111110 000110 001101 010011 011000
    101   010000 011011 000101 001110 110110 111101 100011 101000
    011   001000 000011 011101 010110 101110 100101 111011 110000
    100   000100 001111 010001 011010 100010 101001 110111 111100
    010   000010 001001 010111 011100 100100 101111 110001 111010
    001   000001 001010 010100 011111 100111 101100 110010 111001
    111   100001 101010 110100 111111 000111 001100 010010 011001

Each row in the array is a listing of a coset of C; the first row is a listing of the code itself. The vectors in the first column have minimal weight in their cosets and are known as coset leaders. The choice of coset leader may not be unique; for example, in the last coset there are three vectors of minimal weight. Any entry in the array is the sum of the codeword at the top of the column and the coset leader (at the left in the row). Each vector of F_2^6 is listed exactly once in the array.

The standard array can be used to decode: Locate r in the array and decode to the codeword at the top of the corresponding column (that is, the coset leader is assumed to be the error pattern). However, this method is not practical; except for small n, the standard array of 2^n entries is too large to store (also, locating r may be a problem). A step toward simplifying the method is to store a table of coset leaders corresponding to the 2^{n−k} syndromes. In the array above, this method is illustrated by listing the syndromes at the left. Again, this alternative is possible only if n − k is small. For carefully designed codes, it is possible to compute ê from the syndrome. The simplest case is single errors: If e is an error pattern of weight 1, where the 1 is in the ith position, then the corresponding syndrome is the ith column of H; hence, from H and the syndrome, we can determine i.
Example. Let H be the m × (2^m − 1) parity-check matrix where the ith column is the binary expansion of the integer i for i = 1, 2, ..., 2^m − 1. The corresponding [2^m − 1, 2^m − 1 − m, 3] Hamming code corrects all single errors. Decoding is done as follows: Compute the syndrome s = (s_0, s_1, ..., s_{m−1}). If s ≠ 0, then correct position i = Σ_{j=0}^{m−1} s_j 2^j.
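A direct implementation of this decoder takes only a few lines of Python (an illustration only; positions are numbered 1 through 2^m − 1 as in the example, and the received word is indexed accordingly):

    def hamming_decode(r, m):
        """Correct a single error in a received word r of length 2^m - 1.

        r[i - 1] is the bit in position i; column i of H is the binary
        expansion of i, so the syndrome read as an integer points at the error."""
        n = 2**m - 1
        syndrome = 0
        for i in range(1, n + 1):
            if r[i - 1]:
                syndrome ^= i            # add column i of H over GF(2)
        if syndrome:                     # nonzero syndrome: flip the indicated position
            r = r[:]
            r[syndrome - 1] ^= 1
        return r

    # Example with m = 3: positions 1, 2, 3 XOR to 0, so this is a codeword.
    codeword = [1, 1, 1, 0, 0, 0, 0]
    received = codeword[:]
    received[5] ^= 1                     # corrupt position 6
    print(hamming_decode(received, 3) == codeword)   # True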
Example. Let

    H = ( 1  α    α^2  ...  α^{n−1}    )
        ( 1  α^3  α^6  ...  α^{3(n−1)} )

where α ∈ F_{2^m} and n = 2^m − 1. This is the parity-check matrix for the double-error-correcting BCH code. It is convenient to have a similar representation of the syndromes:

    s = (S_1, S_3), where S_1, S_3 ∈ F_{2^m}.

Depending on the syndrome, there are several cases:

1. If no errors have occurred, then clearly S_1 = S_3 = 0.
2. If a single error has occurred in the ith position (that is, the position corresponding to α^i), then S_1 = α^i and S_3 = α^{3i}. In particular, S_3 = S_1^3.
3. If two errors have occurred in positions i and j, then

       S_1 = α^i + α^j,   S_3 = α^{3i} + α^{3j}.

   This implies that S_1^3 = S_3 + α^i α^j S_1 ≠ S_3. Furthermore, x_1 = α^{−i} and x_2 = α^{−j} are roots of the equation

       1 + S_1 x + ((S_1^3 + S_3)/S_1) x^2 = 0.      (4)
This gives the following procedure to correct two errors:

- Compute S_1 and S_3.
- If S_1 = S_3 = 0, then assume that no errors have occurred.
- Else, if S_3 = S_1^3 ≠ 0, then one error has occurred in the ith position determined by S_1 = α^i.
- Else (if S_3 ≠ S_1^3), consider the equation

      1 + S_1 x + ((S_1^3 + S_3)/S_1) x^2 = 0.

  If the equation has two roots α^{−i} and α^{−j}, then errors have occurred in positions i and j. Else (if the equation has no roots in F_{2^m}), then more than two errors have occurred.

Similar explicit expressions (in terms of the syndrome) for the coefficients of an equation with the error positions as roots can be found for t-error-correcting BCH codes when t = 3, t = 4, etc., but they become increasingly complicated. However, an efficient algorithm to determine the equation exists, and we describe it in some detail next.
Let α be a primitive element in F_{2^m}. A parity-check matrix for the primitive t-error-correcting BCH code is

    H = ( 1  α         α^2           ...  α^{n−1}          )
        ( 1  α^3       α^6           ...  α^{3(n−1)}       )
        ( .     .         .                  .             )
        ( 1  α^{2t−1}  α^{2(2t−1)}   ...  α^{(2t−1)(n−1)}  )

where n = 2^m − 1. Suppose errors have occurred in positions i_1, i_2, ..., i_τ, where τ ≤ t. Let X_j = α^{i_j} for j = 1, 2, ..., τ. The error locator polynomial Λ(x) is defined by

    Λ(x) = Π_{j=1}^{τ} (1 + X_j x) = Σ_{l=0}^{τ} λ_l x^l.

The roots of Λ(x) = 0 are the X_j^{−1}. Therefore, if we can determine Λ(x), then we can determine the locations of the errors. Expanding the expression for Λ(x), we get

    λ_0 = 1
    λ_1 = X_1 + X_2 + ... + X_τ
    λ_2 = X_1 X_2 + X_1 X_3 + ... + X_{τ−1} X_τ
    λ_3 = X_1 X_2 X_3 + X_1 X_2 X_4 + ... + X_{τ−2} X_{τ−1} X_τ
      ...
    λ_τ = X_1 X_2 ··· X_τ
    λ_l = 0 for l > τ.

Hence λ_l is the lth elementary symmetric function of X_1, X_2, ..., X_τ.
From the syndrome, we get S_1, S_3, ..., S_{2t−1}, where

    S_1 = X_1 + X_2 + ... + X_τ
    S_2 = X_1^2 + X_2^2 + ... + X_τ^2
    S_3 = X_1^3 + X_2^3 + ... + X_τ^3
      ...
    S_{2t} = X_1^{2t} + X_2^{2t} + ... + X_τ^{2t}.

Furthermore,

    S_{2r} = X_1^{2r} + X_2^{2r} + ... + X_τ^{2r} = (X_1^r + X_2^r + ... + X_τ^r)^2 = S_r^2

for all r. Hence, from the syndrome we can determine the polynomial

    S(x) = 1 + S_1 x + S_2 x^2 + ... + S_{2t} x^{2t}.

The Newton equations are a set of relations between the power sums S_r and the elementary symmetric functions λ_l, namely,

    Σ_{j=0}^{l−1} S_{l−j} λ_j + l·λ_l = 0  for l ≥ 1.

Let

    V(x) = S(x)Λ(x) = Σ_{l ≥ 0} v_l x^l.      (5)
Because v_l = Σ_{j=0}^{l−1} S_{l−j} λ_j + λ_l, the Newton equations imply that

    v_l = 0 for all odd l, 1 ≤ l ≤ 2t − 1.      (6)

The Berlekamp-Massey algorithm is an algorithm that, given S(x), determines the polynomial Λ(x) of smallest degree such that Equation (6) is satisfied, where the v_l are defined by Equation (5). The idea is, for r = 0, 1, ..., t, to determine polynomials Λ^(r) of lowest degree such that

    v_l^(r) = 0 for all odd l, 1 ≤ l ≤ 2r − 1,

where

    Σ_{l ≥ 0} v_l^(r) x^l = S(x)Λ^(r)(x).

For r = 0, clearly we can let Λ^(0)(x) = 1. We proceed by induction. Let 0 ≤ r < t, and suppose that polynomials Λ^(ρ)(x) have been constructed for 0 ≤ ρ ≤ r. If v_{2r+1}^(r) = 0, then we can choose

    Λ^(r+1)(x) = Λ^(r)(x).

If, on the other hand, v_{2r+1}^(r) ≠ 0, then we modify Λ^(r)(x) by adding another suitable polynomial. Two cases are to be considered. First, if Λ^(r)(x) = 1 [in which case Λ^(ρ)(x) = 1 for 0 ≤ ρ ≤ r], then

    Λ^(r+1)(x) = 1 + v_{2r+1}^(r) x^{2r+1}

will have the required property. If Λ^(r)(x) ≠ 1, then a maximal positive integer ρ < r such that v_{2ρ+1}^(ρ) ≠ 0 exists, and we add a suitable multiple of Λ^(ρ):

    Λ^(r+1)(x) = Λ^(r)(x) + v_{2r+1}^(r) (v_{2ρ+1}^(ρ))^{−1} x^{2r−2ρ} Λ^(ρ)(x).

We note that this implies that

    Λ^(r+1)(x)S(x) = Σ_{l ≥ 0} v_l^(r) x^l + v_{2r+1}^(r) (v_{2ρ+1}^(ρ))^{−1} Σ_{l ≥ 0} v_l^(ρ) x^{l+2r−2ρ}.

Hence for odd l we get

    v_l^(r+1) = v_l^(r) = 0                                                       for 1 ≤ l ≤ 2r − 2ρ − 1,
    v_l^(r+1) = v_l^(r) + v_{2r+1}^(r) (v_{2ρ+1}^(ρ))^{−1} v_{l−2r+2ρ}^(ρ) = 0 + 0 = 0   for 2r − 2ρ + 1 ≤ l ≤ 2r − 1,
    v_l^(r+1) = v_{2r+1}^(r) + v_{2r+1}^(r) (v_{2ρ+1}^(ρ))^{−1} v_{2ρ+1}^(ρ) = v_{2r+1}^(r) + v_{2r+1}^(r) = 0   for l = 2r + 1.
We now formulate these ideas as an algorithm (in a Pascal-like syntax). In each step we keep the present Λ(x) [the superscript (r) is dropped] and the modifying polynomial [x^{2r−2ρ−1} or (v_{2ρ+1}^(ρ))^{−1} x^{2r−2ρ−1} Λ^(ρ)(x)], which we denote by B(x).

Berlekamp-Massey Algorithm in the Binary Case

Input: t and S(x).

    Λ(x) := 1; B(x) := 1;
    for r := 1 to t do
    begin
        v := coefficient of x^{2r−1} in S(x)Λ(x);
        if v = 0 then B(x) := x^2 B(x)
        else
            [Λ(x), B(x)] := [Λ(x) + v·x·B(x), x·Λ(x)/v];
    end;

The assignment following the else consists of two assignments to be done in parallel; the new Λ(x) and B(x) are computed from the old ones.

The Berlekamp-Massey algorithm determines the polynomial Λ(x). To find the roots of Λ(x) = 0, we try all possible elements of F_{2^m}. In practical applications, this can be efficiently implemented using shift registers (usually called the Chien search).
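A direct Python transcription of the binary Berlekamp-Massey loop above is short. The sketch below is only an illustration; it reuses the F_16 arithmetic from the earlier sketches, represents polynomials over F_16 as coefficient lists, and recovers Λ(x) = 1 + α^4 x + α^10 x^2 from the syndrome polynomial used in the example that follows.

    from itertools import zip_longest

    MODULUS = 0b10011                                  # x^4 + x + 1

    def gf_mul(a, b):
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0b10000:
                a ^= MODULUS
        return result

    def gf_inv(a):
        result = 1
        for _ in range(14):                            # a^14 = a^(-1) in F_16
            result = gf_mul(result, a)
        return result

    def poly_mul(p, q):
        out = [0] * (len(p) + len(q) - 1)
        for i, pi in enumerate(p):
            for j, qj in enumerate(q):
                out[i + j] ^= gf_mul(pi, qj)
        return out

    def berlekamp_massey_binary(S, t):
        """Given S(x) = 1 + S1 x + ... + S_2t x^2t as a coefficient list, return Lambda(x)."""
        Lam, B = [1], [1]
        for r in range(1, t + 1):
            prod = poly_mul(S, Lam)
            v = prod[2 * r - 1] if len(prod) > 2 * r - 1 else 0
            if v == 0:
                B = [0, 0] + B                         # B(x) := x^2 B(x)
            else:                                      # parallel assignment, as in the text
                Lam, B = ([a ^ gf_mul(v, b) for a, b in zip_longest(Lam, [0] + B, fillvalue=0)],
                          [gf_mul(gf_inv(v), c) for c in [0] + Lam])
        return Lam

    # S(x) = 1 + a^4 x + a^8 x^2 + a^5 x^3 + a x^4 in the table representation used earlier.
    S = [0b0001, 0b0011, 0b0101, 0b0110, 0b0010]
    print(berlekamp_massey_binary(S, 2))               # [1, 3, 7], i.e. 1 + a^4 x + a^10 x^2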
Example. We consider the [15, 7, 5] double-error-correcting BCH code; that is, m = 4 and t = 2. As a primitive root, we choose α such that α^4 = α + 1. Suppose that we have received a vector with syndrome (S_1, S_3) = (α^4, α^5). Since S_3 ≠ S_1^3, at least two errors have occurred. Equation (4) becomes

    1 + α^4 x + α^{10} x^2 = 0,

which has the zeros α^{−3} and α^{−7}. We conclude that the received vector has two errors (namely, in positions 3 and 7).

Now consider the Berlekamp-Massey algorithm for the same example. First we compute S_2 = S_1^2 = α^8 and S_4 = S_2^2 = α. Hence

    S(x) = 1 + α^4 x + α^8 x^2 + α^5 x^3 + α x^4.

The values of r, v, Λ(x), and B(x) after each iteration of the for-loop in the Berlekamp-Massey algorithm are shown in the following table:

    r    v        Λ(x)                        B(x)
    -    -        1                           1
    1    α^4      1 + α^4 x                   α^{11} x
    2    α^{14}   1 + α^4 x + α^{10} x^2      αx + α^5 x^2

Hence, Λ(x) = 1 + α^4 x + α^{10} x^2 (as before).
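The two-error procedure given earlier can be run directly once F_16 arithmetic is available. The sketch below (an illustration only, for m = 4 in the table representation used before; the root search simply tries all nonzero field elements) reproduces the result just obtained: (S_1, S_3) = (α^4, α^5) yields error positions 3 and 7.

    MODULUS = 0b10011                                    # x^4 + x + 1

    def gf_mul(a, b):
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0b10000:
                a ^= MODULUS
        return result

    def gf_pow(a, e):
        result = 1
        for _ in range(e):
            result = gf_mul(result, a)
        return result

    ALPHA = 0b0010
    LOG = {gf_pow(ALPHA, i): i for i in range(15)}       # discrete logarithm table

    def decode_two_errors(s1, s3):
        """Error positions for the double-error-correcting BCH code of length 15."""
        if s1 == 0 and s3 == 0:
            return []
        if s3 == gf_pow(s1, 3):
            return [LOG[s1]]                             # single error at position i, S1 = alpha^i
        q = gf_mul(gf_pow(s1, 3) ^ s3, gf_pow(s1, 14))   # (S1^3 + S3) / S1
        roots = [x for x in range(1, 16)
                 if (1 ^ gf_mul(s1, x) ^ gf_mul(q, gf_mul(x, x))) == 0]
        if len(roots) != 2:
            return None                                  # more than two errors: not correctable
        return sorted((15 - LOG[x]) % 15 for x in roots) # root alpha^(-i) gives position i

    print(decode_two_errors(gf_pow(ALPHA, 4), gf_pow(ALPHA, 5)))   # [3, 7]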
Now consider the same code with a received vector whose syndrome is (S_1, S_3) = (α, α^9). Because S_3 ≠ S_1^3, at least two errors have occurred. We get

    Λ(x) = 1 + αx + x^2.

However, the equation 1 + αx + x^2 = 0 does not have any roots in F_{2^4}. Hence, at least three errors have occurred, and the code cannot correct them.
REED-SOLOMON CODES

In the previous sections we have considered binary codes, where the components of the codewords belong to the finite field F_2 = {0, 1}. In a similar way we can consider codes with components from any finite field F_q.

The Singleton bound states that for any [n, k, d] code with components from F_q, we have

    d ≤ n − k + 1.

A code for which d = n − k + 1 is called maximum distance separable (MDS). The only binary MDS codes are the trivial [n, 1, n] repetition codes and [n, n − 1, 2] even-weight codes. However, important nonbinary MDS codes exist (in particular, the Reed-Solomon codes, which we now describe).

Reed-Solomon codes are t-error-correcting cyclic codes with symbols from a finite field F_q, and they can be constructed in many different ways. They can be considered as the simplest generalization of BCH codes. Because the most important case for applications is q = 2^m, we consider this case here. Each symbol is an element in F_{2^m} and can be considered as an m-bit symbol.

One construction of a cyclic Reed-Solomon code is as follows: Let α be a primitive element of F_{2^m}, and let x_i = α^i. Because x_i ∈ F_{2^m} for all i, the minimal polynomial of α^i over F_{2^m} is just x − x_i. The generator polynomial of a (primitive) t-error-correcting Reed-Solomon code of length 2^m − 1 has 2t consecutive powers of α as zeros:

    g(x) = Π_{i=0}^{2t−1} (x − α^{b+i}) = g_0 + g_1 x + ... + g_{2t−1} x^{2t−1} + x^{2t}.

The code has the following parameters:

    Block length: n = 2^m − 1
    Number of parity-check symbols: n − k = 2t
    Minimum distance: d = 2t + 1

Thus, the Reed-Solomon codes satisfy the Singleton bound with equality, n − k = d − 1. That is, they are MDS codes.
MDS codes.
An alternative description of the ReedSolomon code is
( f (x
1
); f (x
2
); . . . ; f (x
n
))[
f polynomial of degree less thank:
The weight distributionof the ReedSolomoncode is (for
i _ d)
A
i
n
i
_ _
id
j=0
1 ( )
j
i
j
_ _
2
m(idj1)
1
_ _
The encoding of ReedSolomon codes is similar to the
encoding of binary cyclic codes. One decoding is similar to
the decoding of binary BCHcodes with one added complica-
tion. Using a generalization of the BerlekampMassey
algorithm, we determine the polynomials L(x) and V(x).
FromL(x) we can determine the locations of the errors. In
addition, we must determine the value of the errors (in the
binary case, the values are always 1). The value of the error
at location X
j
easily can be determined using V(x) and L(x);
we omit further details.
An alternative decoding algorithm can sometimes decode errors of weight more than half the minimum distance. We sketch this algorithm, first giving the simplest version, which works if the errors have weight less than half the minimum distance; that is, we assume a codeword c = (f(x_1), f(x_2), ..., f(x_n)) was sent and r = (r_1, r_2, ..., r_n) was received, with w(r − c) < (n − k + 1)/2. It is easy to show that if Q(x, y) is a nonzero polynomial in two variables of the form

    Q(x, y) = Q_0(x) + Q_1(x)y,

where

    Q(x_i, r_i) = 0 for i = 1, 2, ..., n,
    Q_0(x) has degree at most n − 1 − t,
    Q_1(x) has degree at most n − k − t,

then Q_0(x) + Q_1(x)f(x) = 0, and so Q(x, y) = Q_1(x)(y − f(x)). Moreover, such a polynomial Q(x, y) does exist.
An algorithm to find Q(x, y) is as follows:

    Input: r = (r_1, r_2, ..., r_n).
    Solve the equations
        Σ_{j=0}^{n−1−t} a_j x_i^j + Σ_{j=0}^{n−k−t} b_j r_i x_i^j = 0,  for i = 1, 2, ..., n,
    for the unknowns a_j and b_j;
    Q_0(x) := Σ_{j=0}^{n−1−t} a_j x^j;
    Q_1(x) := Σ_{j=0}^{n−k−t} b_j x^j;

We then recover f(x) from the relation Q_0(x) + Q_1(x)f(x) = 0.
To correct (some) errors of weight more than half the minimum distance, the method above must be generalized. The idea is to find a nonzero polynomial Q(x, y) of the form

    Q(x, y) = Q_0(x) + Q_1(x)y + Q_2(x)y^2 + ... + Q_l(x)y^l,

where now, for some integers t and s,

    (x_i, r_i), for i = 1, 2, ..., n, are zeros of Q(x, y) of multiplicity s,
    Q_m(x) has degree at most s(n − t) − 1 − m(k − 1), for m = 0, 1, ..., l.

For such a polynomial one can show that if the weight of the error is at most t, then y − f(x) divides Q(x, y). Therefore, we can find all codewords within distance t from r by finding all factors of Q(x, y) of the form y − h(x), where h(x) is a polynomial of degree less than k. In general, there may be more than one such polynomial h(x), of course, but in some cases it is unique even if t is larger than half the minimum distance. This idea is the basis for the Guruswami-Sudan algorithm.
Guruswami-Sudan Algorithm.

    Input: r = (r_1, r_2, ..., r_n), t, s.
    Solve the n·(s+1 choose 2) equations
        Σ_{m=u}^{l} Σ_{j=v}^{s(n−t)−1−m(k−1)} (m choose u)(j choose v) a_{m,j} r_i^{m−u} x_i^{j−v} = 0,
    where i = 1, 2, ..., n and 0 ≤ u + v < s, for the unknowns a_{m,j};
    for m := 0 to l do
        Q_m(x) := Σ_{j=0}^{s(n−t)−1−m(k−1)} a_{m,j} x^j;
    Q(x, y) := Σ_{m=0}^{l} Q_m(x) y^m;
    for each polynomial h(x) of degree less than k do
        if y − h(x) divides Q(x, y), then output h(x).

We remark that the last loop of the algorithm is formulated in this simple form to explain the idea. In actual implementations, this part can be made more efficient. If

    t < n(2l − s + 1)/(2(l + 1)) − l(k − 1)/(2s),

then the polynomial f(x) of the sent codeword is among the output. The Guruswami-Sudan algorithm is a recent invention. A textbook covering it is the book by Justesen and Høholdt given in Further Reading.
ALGEBRAIC GEOMETRY CODES
Algebraic geometry codes can be considered as generalizations of Reed-Solomon codes, but they offer a wider range of code parameters. The Reed-Solomon codes over F_q have maximum length n = q − 1, whereas algebraic geometry codes can be longer. One common method to construct algebraic geometry codes is to select an algebraic curve and a set of functions that are evaluated at the points on the curve. In this way, by selecting different curves and sets of functions, one obtains different algebraic geometry codes. To treat algebraic geometry codes in full generality is outside the scope of this article. We will only give a small flavor of the basic techniques involved by restricting ourselves to the class of Hermitian codes, which is the most studied class of algebraic geometry codes. This class of codes can serve as a good illustration of the methods involved.

The codewords in a Hermitian code have symbols from an alphabet F_{q^2}. The Hermitian curve consists of all points (x, y) ∈ F_{q^2}^2 given by the following equation in two variables:

    x^{q+1} − y^q − y = 0.

It is a straightforward argument to show that the number of different points on the curve is n = q^3. To select the functions to evaluate at the points on the curve, one needs to define an order function ρ. The function ρ is the mapping from the set of polynomials in two variables over F_{q^2} to the set of integers such that

    ρ(x^i y^j) = iq + j(q + 1),

and ρ(f(x, y)) is the maximum of ρ over all nonzero terms in f(x, y).

Let P_1 = (x_1, y_1), P_2 = (x_2, y_2), ..., P_n = (x_n, y_n) denote all the points on the Hermitian curve. The Hermitian code C_s is defined by

    C_s = {(f(P_1), f(P_2), ..., f(P_n)) : ρ(f(x, y)) ≤ s and deg_x(f(x, y)) ≤ q},

where deg_x(f(x, y)) is the maximum degree of x in f(x, y).

Using methods from algebraic geometry, one obtains the following parameters of the Hermitian codes. The length of the code is n = q^3, and its dimension is

    k = m_s                      if 0 ≤ s ≤ q^2 − q − 2,
        s + 1 − (q^2 − q)/2      if q^2 − q − 2 < s < n − q^2 + q,

where m_s is the number of monomials x^i y^j with ρ(x^i y^j) ≤ s and i ≤ q, and the minimum distance is

    d ≥ n − s   if q^2 − q − 2 < s < n − q^2 + q.

It is known that the class of algebraic geometry codes contains some very good codes with efficient decoding algorithms. It is possible to modify the Guruswami-Sudan algorithm to decode these codes.
NONLINEAR CODES FROM CODES OVER Z_4

In the previous sections we have mainly considered binary linear codes, that is, codes where the sum of two codewords is again a codeword. The main reason has been that the linearity greatly simplifies construction and decoding of the codes.

A binary nonlinear (n, M, d) code C is simply a set of M binary n-tuples with pairwise distance at least d, but without any further imposed structure. In general, to find the minimum distance of a nonlinear code, one must compute the distance between all pairs of codewords. This is, of course, more complicated than for linear codes, where it suffices to find the minimum weight among all the nonzero codewords. The lack of structure in a nonlinear code also makes it difficult to decode in an efficient manner.

However, nonlinear codes have some advantages. For given values of length n and minimum distance d, it is sometimes possible to construct nonlinear codes with more codewords than is possible for linear codes. For example, for n = 16 and d = 6, the best linear code has dimension k = 7 (i.e., it contains 128 codewords). The code of length 16 obtained by extending the double-error-correcting primitive BCH code has these parameters.

In 1967, Nordstrom and Robinson (3) found a nonlinear code with parameters n = 16 and d = 6 containing M = 256 codewords, which is twice as many codewords as the best linear code for the same values of n and d.

In 1968, Preparata (4) generalized this construction to an infinite family of codes having parameters

    (2^{m+1}, 2^{2^{m+1} − 2m − 2}, 6),  m odd, m ≥ 3.

A few years later, in 1972, Kerdock (5) gave another generalization of the Nordstrom-Robinson code and constructed another infinite class of codes with parameters

    (2^{m+1}, 2^{2m+2}, 2^m − 2^{(m−1)/2}),  m odd, m ≥ 3.

The Preparata code contains twice as many codewords as the extended double-error-correcting BCH code and is optimal in the sense of having the largest possible size for the given length and minimum distance. The Kerdock code has twice as many codewords as the best known linear code. In the case m = 3, the Preparata code and the Kerdock code both coincide with the Nordstrom-Robinson code.

The Preparata and Kerdock codes are distance invariant, which means that the distance distribution from a given codeword to all the other codewords is independent of the given codeword. In particular, because they contain the all-zero codeword, their weight distribution equals their distance distribution.

In general, there is no natural way to define the dual code of a nonlinear code, and thus the MacWilliams identities have no meaning for nonlinear codes. However, one can define the weight enumerator polynomial A(z) of a nonlinear code in the same way as for linear codes and compute its formal dual B(z) from the MacWilliams identities:

    B(z) = (1/M) (1 + z)^n A((1 − z)/(1 + z)).

The polynomial B(z) obtained in this way has no simple interpretation. In particular, it may have coefficients that are nonintegers or even negative. For example, if C = {(110), (101), (111)}, then A(z) = 2z^2 + z^3 and B(z) = (3 − 5z + z^2 + z^3)/3.

An observation that puzzled the coding theory community for a long time was that the weight enumerator A(z) of the Preparata code and the weight enumerator B(z) of the Kerdock code satisfied the MacWilliams identities, and in this sense these nonlinear codes behaved like dual linear codes.

Hammons et al. (6) gave a significantly simpler description of the family of Kerdock codes. They constructed a linear code over Z_4 = {0, 1, 2, 3} that is an analog of the binary first-order Reed-Muller code. This code is combined with a mapping called the Gray map that maps the elements in Z_4 into binary pairs. The Gray map φ is defined by

    φ(0) = 00,  φ(1) = 01,  φ(2) = 11,  φ(3) = 10.

The Lee weight of an element in Z_4 is defined by

    w_L(0) = 0,  w_L(1) = 1,  w_L(2) = 2,  w_L(3) = 1.

Extending φ in a natural way to a map φ: Z_4^n → Z_2^{2n}, one observes that φ is a distance-preserving map from Z_4^n (under the Lee metric) to Z_2^{2n} (under the Hamming metric).

A linear code over Z_4 is a subset of Z_4^n such that any linear combination of two codewords is again a codeword. From a linear code C of length n over Z_4, one obtains a binary code φ(C) of length 2n by replacing each component in a codeword in C by its image under the Gray map. This binary code usually is nonlinear.

The minimum Hamming distance of φ(C) equals the minimum Lee distance of C and is equal to the minimum Lee weight of C, because C is linear over Z_4.
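The Gray map and the Lee weight are easy to work with directly. A small Python sketch (an illustration only) applies φ componentwise and checks on random vectors that Lee distance in Z_4^n equals Hamming distance of the images, which is the distance-preserving property used above:

    import random

    GRAY = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0)}
    LEE = {0: 0, 1: 1, 2: 2, 3: 1}

    def gray_map(v):
        """Componentwise Gray map from Z_4^n to Z_2^(2n)."""
        return [bit for symbol in v for bit in GRAY[symbol]]

    def lee_distance(u, v):
        return sum(LEE[(a - b) % 4] for a, b in zip(u, v))

    def hamming_distance(u, v):
        return sum(a != b for a, b in zip(u, v))

    random.seed(1)
    for _ in range(1000):
        u = [random.randrange(4) for _ in range(8)]
        v = [random.randrange(4) for _ in range(8)]
        assert lee_distance(u, v) == hamming_distance(gray_map(u), gray_map(v))
    print("Gray map preserved distance on all sampled pairs")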
Example. To obtain the Nordstrom-Robinson code, we will construct a code over Z_4 of length 8 and then apply the Gray map. Let f(x) = x^3 + 2x^2 + x + 3 ∈ Z_4[x]. Let β be a zero of f(x); that is, β^3 + 2β^2 + β + 3 = 0. Then we can express all powers of β in terms of 1, β, and β^2, as follows:

    β^3 = 2β^2 + 3β + 1
    β^4 = 3β^2 + 3β + 2
    β^5 = β^2 + 3β + 3
    β^6 = β^2 + 2β + 1
    β^7 = 1

Consider the code C over Z_4 with generator matrix given by

    G = ( 1 1 1 1    1    1    1    1   )
        ( 0 1 β β^2  β^3  β^4  β^5  β^6 )

      = ( 1 1 1 1 1 1 1 1 )
        ( 0 1 0 0 1 2 3 1 )
        ( 0 0 1 0 3 3 3 2 )
        ( 0 0 0 1 2 3 1 1 )

where the column corresponding to β^i is replaced by the coefficients in its expression in terms of 1, β, and β^2. Then the Nordstrom-Robinson code is the Gray map of C.
The dual code C^⊥ of a code C over Z_4 is defined similarly as for binary linear codes, except that the inner product of the vectors x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) with components in Z_4 is defined by

    (x, y) = Σ_{i=1}^{n} x_i y_i (mod 4).

The dual code C^⊥ of C is then

    C^⊥ = {x ∈ Z_4^n : (x, c) = 0 for all c ∈ C}.
    for h = 1 to n − 1 do                      /* evaluate each cut */
        Let P be the partition with clusters C_1 = {x_1, ..., x_h} and C_2 = {x_{h+1}, ..., x_n}
        compute RE_1(P)
        if RE_1(P) < MinRE then
            MinRE = RE_1(P), cut = h           /* the best cut */
    Return cut as the best cut
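Read on its own, this loop scans every prefix/suffix split of the ordered attribute values and keeps the split with the smallest relaxation error. The following Python sketch only illustrates that control flow; the relaxation-error measure RE_1 is defined elsewhere in the article, so the variance-like placeholder used here (mean squared deviation from each cluster's mean) is an assumption, not the article's definition.

    def relaxation_error(cluster):
        """Placeholder for RE_1 of one cluster: mean squared deviation from the cluster mean.
        (Assumed stand-in; the article defines its own relaxation error measure.)"""
        if not cluster:
            return 0.0
        mu = sum(cluster) / len(cluster)
        return sum((x - mu) ** 2 for x in cluster) / len(cluster)

    def best_cut(values):
        """Return the index h that splits the sorted values into two clusters with minimal total RE."""
        xs = sorted(values)
        best_h, min_re = None, float("inf")
        for h in range(1, len(xs)):                 # evaluate each cut
            c1, c2 = xs[:h], xs[h:]
            re = relaxation_error(c1) + relaxation_error(c2)
            if re < min_re:
                min_re, best_h = re, h              # the best cut so far
        return best_h

    print(best_cut([1, 2, 3, 10, 11, 12]))          # 3: splits {1, 2, 3} from {10, 11, 12}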
attributes are dependent. (Dependency here means that all the attributes as a whole define a coherent concept. For example, the length and width of a rectangle are said to be semantically dependent. This kind of dependency should be distinguished from the functional dependency in database theory.) In addition, using multiple single-attribute TAHs is inefficient because it may need many iterations of query modification and database access before approximate answers are found. Furthermore, relaxation control for multiple TAHs is more complex because there is a large number of possible orders for relaxing attributes. In general, we can rely only on simple heuristics such as best first or minimal coverage first to guide the relaxation (see the subsection entitled Relaxation Control). These heuristics cannot guarantee the best approximate answers because they are rules of thumb and not necessarily accurate.

Most of these difficulties can be overcome by using a Multiattribute TAH (MTAH) for the relaxation of multiple attributes. Because MTAHs are generated from semantically dependent attributes, these attributes are relaxed together in a single relaxation step, thus greatly reducing the number of query modifications and database accesses. Approximate answers derived by using an MTAH have better quality than those derived by using multiple single-attribute TAHs. MTAHs are context- and user-sensitive because a user may generate several MTAHs with different attribute sets from a table. Should a user need to create an MTAH containing semantically dependent attributes from different tables, these tables can be joined into a single view for MTAH generation.

To cluster objects with multiple attributes, DISC can be extended to Multiple-attribute DISC, or M-DISC (28), from which MTAHs are generated. The algorithm DISC is a special case of M-DISC, and TAH is a special case of MTAH. Let us now consider the time complexity of M-DISC. Let m be the number of attributes and n be the number of distinct attribute values. The computation of relaxation error for a single attribute takes O(n log n) to complete (27). Because the computation of RE involves computation of relaxation error for m attributes, its complexity is O(mn log n). The nested loop in M-DISC is executed mn times, so the time complexity of M-DISC is O(m^2 n^2 log n). To generate an MTAH, it takes no more than n calls of M-DISC; therefore, the worst-case time complexity of generating an MTAH is O(m^2 n^3 log n). The average-case time complexity is O[m^2 n^2 (log n)^2] because M-DISC needs only to be called log n times on the average.
Nonnumerical TAHs
Previous knowledge discovery techniques are inadequate for clustering nonnumerical attribute values when generating TAHs for Cooperative Query Answering. For example, Attribute-Oriented Induction (36) provides summary information and characterizes tuples in the database, but it is inappropriate because attribute values are focused too closely on a specific target. Conceptual Clustering (37,38) is a top-down method to provide approximate query answers, iteratively subdividing the tuple space into smaller sets. The top-down approach does not yield clusters that provide the best correlation near the bottom of the hierarchy. Cooperative query answering operates from the bottom of the hierarchy, so better clustering near the bottom is desirable. To remedy these shortcomings, a bottom-up approach for constructing attribute abstraction hierarchies, called Pattern-Based Knowledge Induction (PKI), was developed to include a nearness measure for the clusters (29).

PKI determines clusters by deriving rules from the instance of the current database. The rules are not 100% certain; instead, they are rules of thumb about the database, such as

    If the car is a sports car, then the color is red.

Each rule has a coverage that measures how often the rule applies, and a confidence that measures the validity of the rule in the database. In certain cases, combining simpler rules can derive a more sophisticated rule with high confidence.

The PKI approach generates a set of useful rules that can then be used to construct the TAH by clustering the premises of rules sharing a similar consequence. For example, if the following two rules

    If the car is a sports car, then the color is red.
    If the car is a sports car, then the color is black.

have high confidence, then this indicates that for sports cars, the colors red and black should be clustered together. Supporting and contradicting evidence from rules for other attributes is gathered, and PKI builds an initial set of clusters. Each invocation of the clustering algorithm adds a layer of abstraction to the hierarchy. Thus, attribute values are clustered if they are used as the premise for rules with the same consequence. By iteratively applying the algorithm, a hierarchy of clusters (TAH) can be found. PKI can cluster attribute values with or without expert direction. The algorithm can be improved by allowing domain expert supervision during the clustering process. PKI also works well when there are NULL values in the data. Our experimental results confirm that the method is scalable to large systems. For a more detailed discussion, see (29).
Maintenance of TAHs
Because the quality of a TAH affects the quality of derived approximate answers, TAHs should be kept up to date. One simple way to maintain TAHs is to regenerate them whenever an update occurs. This approach is not desirable because it causes overhead for the database system. Although each update changes the distribution of data (thus changing the quality of the corresponding TAHs), the change may not be significant enough to warrant a TAH regeneration. TAH regeneration is necessary only when the cumulative effect of updates has greatly degraded the TAHs. The quality of a TAH can be monitored by comparing the derived approximate answers to the expected relaxation error (e.g., see Fig. 7), which is computed at TAH generation time and recorded at each node of the TAH. When the derived approximate answers significantly deviate from the expected quality, then the quality of the TAH is deemed to be inadequate and a regeneration is necessary. The following incremental TAH regeneration procedure can be used. First, identify the node within the TAH that has the worst query relaxations. Apply partial TAH regeneration for all the database instances covered by that node. After several such partial regenerations, we then initiate a complete TAH regeneration.

The generated TAHs are stored in UNIX files, and a TAH Manager (described in the subsection entitled TAH Facility) is responsible for parsing the files, creating the internal representation of TAHs, and providing operations such as generalization and specialization to traverse TAHs. The TAH Manager also provides a directory that describes the characteristics of TAHs (e.g., attributes, names, user type, context, TAH size, location) for the users/systems to select the appropriate TAH to be used for relaxation.

Our experience in using DISC/M-DISC and PKI for the ARPA Rome Laboratory Planning Initiative (ARPI) transportation databases (94 relations, the biggest of which has 12 attributes and 195,598 tuples) shows that TAHs for both numerical and nonnumerical attributes can be generated in a few seconds to a few minutes, depending on the table size, on a Sun SPARC 20 workstation.
COOPERATIVE OPERATIONS
The cooperative operations consist of the following
four types: context-free, context-sensitive, control, and
interactive.
Context-Free Operators
- Approximate operator. ≈v relaxes the specified value v within the approximate range predefined by the user. For example, ≈9am transforms into the interval (8am, 10am).
- Between (v_1, v_2) specifies the interval for an attribute. For example, time between (7am, ≈9am) transforms into (7am, 10am). The transformed interval is prespecified by either the user or the system.
Context-Sensitive Operators
- Near-to X is used for specification of spatial nearness of object X. The near-to measure is context- and user-sensitive. Nearness can be specified by the user. For example, near-to BIZERTE requests the list of cities located within a certain Euclidean distance (depending on the context) from the city Bizerte.
- Similar-to X based-on [(a_1 w_1)(a_2 w_2) ... (a_n w_n)] is used to specify a set of objects semantically similar to the target object X based on a set of attributes (a_1, a_2, ..., a_n) specified by the user. Weights (w_1, w_2, ..., w_n) may be assigned to each of the attributes to reflect their relative importance in considering the similarity measure. The set of similar objects can be ranked by the similarity. The similarity measure is computed from the nearness (e.g., weighted mean square error) of the prespecified attributes to those of the target object. The set size is bounded by a prespecified nearness threshold.
Control Operators
- Relaxation-order (a_1, a_2, ..., a_n) specifies the order of the relaxation among the attributes (a_1, a_2, ..., a_n) (i.e., a_i precedes a_{i+1}). For example, relaxation-order (runway_length, runway_width) indicates that if no exact answer is found, then runway_length should be relaxed first. If still no answer is found, then relax runway_width. If no relaxation-order control is specified, the system relaxes according to its default relaxation strategy.
- Not-relaxable (a_1, a_2, ..., a_n) specifies the attributes (a_1, a_2, ..., a_n) that should not be relaxed. For example, not-relaxable location_name indicates that the condition clause containing location_name must not be relaxed.
- Preference-list (v_1, v_2, ..., v_n) specifies the preferred values (v_1, v_2, ..., v_n) of a given attribute, where v_i is preferred over v_{i+1}. As a result, the given attribute is relaxed according to the order of preference that the user specifies in the preference list. Consider the attribute food style; a user may prefer Italian food to Mexican food. If there are no such restaurants within the specified area, the query can be relaxed to include foods similar to Italian food first and then foods similar to Mexican food.
- Unacceptable-list (v_1, v_2, ..., v_n) allows users to inform the system not to provide certain answers. This control can be accomplished by trimming parts of the TAH from searching. For example, "avoid airlines X and Y" tells the system that airlines X and Y should not be considered during relaxation. It not only provides more satisfactory answers to users but also reduces search time.
- Alternative-TAH (TAH-name) allows users to use the TAHs of their choice. For example, a vacation traveler may want to find an airline based on its fare, whereas a business traveler is more concerned with his schedule. To satisfy the different needs of the users, several TAHs of airlines can be generated, emphasizing different attributes (e.g., price and nonstop flight).
- Relaxation-level (v) specifies the maximum allowable range of the relaxation on an attribute, i.e., [0, v].
- Answer-set (s) specifies the minimum number of answers required by the user. CoBase relaxes query conditions until enough approximate answers (i.e., ≥ s) are obtained.
- Rank-by ((a_1, w_1), (a_2, w_2), ..., (a_n, w_n)) METHOD (method name) specifies a method to rank the answers returned by CoBase.
User/System Interaction Operators
- Nearer, Further provide users with the ability to control the near-to relaxation scope interactively. Nearer reduces the distance by a prespecified percentage, whereas Further increases the distance by a prespecified percentage.
Editing Relaxation Control Parameters
Users can browse and edit relaxation control parameters to
better suit their applications (see Fig. 2). The parameters
include the relaxation range for the approximately-equal
operator, the default distance for the near-to operator, and
the number of returned tuples for the similar-to operator.
Cooperative SQL (CoSQL)
The cooperative operations can be extended to the relational database query language, SQL, as follows: The context-free and context-sensitive cooperative operators can be used in conjunction with attribute values specified in the WHERE clause. The relaxation control operators can be used only on attributes specified in the WHERE clause, and the control operators must be specified in the WITH clause after the WHERE clause. The interactive operators can be used alone as command inputs.
Examples. In this section, we present a few selected examples that illustrate the capabilities of the cooperative operators. The corresponding TAHs used for query modification are shown in Fig. 1, and the relaxable ranges are shown in Fig. 2.

Query 1. List all the airports with the runway length greater than 7500 ft and runway width greater than 100 ft. If there is no answer, relax the runway length condition first. The following is the corresponding CoSQL query:
SELECT aport_name, runway_length_ft,
runway_width_ft
FROM aports
WHERE runway_length_ft > 7500 AND
runway_width_ft > 100
WITH RELAXATION-ORDER (runway_length_ft,
runway_width_ft)
Based on the TAH on runway length and the relaxation
order, the query is relaxed to
SELECT aport_name, runway_length_ft,
runway_width_ft
FROM aports
WHERE runway_length_ft >= 7000 AND
runway_width_ft > 100
If this query yields no answer, then we proceed to relax the runway width range.
Query 2. Find all the cities with their geographical coordinates near the city Bizerte in the country Tunisia. If there is no answer, the restriction on the country should not be relaxed. The near-to range in this case is prespecified at 100 miles. The corresponding CoSQL query is as follows:
SELECT location_name, latitude, longitude
FROM GEOLOC
WHERE location_name NEAR-TO Bizerte
AND country_state_name = Tunisia
WITH NOT-RELAXABLE country_state_name
Based on the TAH on location Tunisia, the relaxed
version of the query is
SELECT location_name, latitude, longitude
FROM GEOLOC
WHERE location_name IN {Bizerte, Djedeida, Gafsa,
      Gabes, Sfax, Sousse, Tabarqa, Tunis}
AND country_state_name = Tunisia
Query 3. Find all airports in Tunisia similar to the
Bizerte airport. Use the attributes runway_length_ft and
runway_width_ft as criteria for similarity. Place more
similarity emphasis on runway length than runway width;
their corresponding weight assignments are 2 and 1,
respectively. The following is the CoSQL version of the
query:
SELECT aport_name
FROM aports, GEOLOC
WHERE aport_name SIMILAR-TO Bizerte
BASED-ON ((runway_length_ft 2.0)
(runway_width_ft 1.0))
AND country_state_name = TUNISIA
AND GEOLOC.geo_code = aports.geo_code
To select the set of the airport names that have the runway length and runway width similar to the ones for the airport in Bizerte, we shall first find all the airports in Tunisia and, therefore, transform the query to
SELECT aport_name
FROM aports, GEOLOC
WHERE country_state_name = TUNISIA
AND GEOLOC.geo_code = aports.geo_code
After retrieving all the airports in Tunisia, based on the runway length, runway width, and their corresponding weights, the similarity of these airports to Bizerte can be computed by the prespecified nearness formula (e.g., weighted mean square error). The order in the similarity set is ranked according to the nearness measure, and the size of the similarity set is determined by the prespecified nearness threshold.

Figure 2. Relaxation range for the approximate and near-to operators.

Approximate operator relaxation range:
  Relation name   Attribute name     Range
  Aports          Runway_length_ft   500
  Aports          Runway_width_ft    10
  Aports          Parking_sq_ft      100000
  GEOLOC          Latitude           0.001
  GEOLOC          Longitude          0.001

Near-to operator relaxation range:
  Relation name   Attribute name   Near-to range   Nearer/further
  Aports          Aport_name       100 miles       50%
  GEOLOC          Location_name    200 miles       50%
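To make the nearness computation concrete, the following is a minimal Python sketch, not CoBase's actual implementation, of ranking candidates by a weighted mean square error over prespecified attributes and bounding the result set by a nearness threshold; the attribute names, weights, and threshold values are illustrative assumptions.

def weighted_mse(candidate, target, weights):
    """Nearness as a weighted mean square error over the chosen attributes."""
    total = sum(w * (candidate[a] - target[a]) ** 2 for a, w in weights.items())
    return total / sum(weights.values())

def similar_to(candidates, target, weights, threshold):
    """Return candidates within the nearness threshold, ranked by nearness."""
    scored = [(weighted_mse(c, target, weights), c) for c in candidates]
    scored = [(d, c) for d, c in scored if d <= threshold]
    return [c for d, c in sorted(scored, key=lambda x: x[0])]

# Illustrative use: airports ranked by similarity to Bizerte on runway attributes.
bizerte = {"runway_length_ft": 8000, "runway_width_ft": 100}
airports = [
    {"name": "Monastir", "runway_length_ft": 6500, "runway_width_ft": 90},
    {"name": "Tunis",    "runway_length_ft": 8500, "runway_width_ft": 110},
]
print(similar_to(airports, bizerte,
                 {"runway_length_ft": 2.0, "runway_width_ft": 1.0},
                 threshold=2_000_000))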
A SCALABLE AND EXTENSIBLE ARCHITECTURE

Figure 3 shows an overview of the CoBase system. Type abstraction hierarchies and relaxation ranges for the explicit operators are stored in a knowledge base (KB). There is a TAH directory storing the characteristics of all the TAHs in the system. When CoBase processes a query, it queries the underlying database systems (DBMS). When an approximate answer is returned, context-based semantic nearness will be provided to rank the approximate answers (in order of nearness) against the specified query. A graphical user interface (GUI) displays the query, results, TAHs, and relaxation processes. Based on user type and query context, associative information is derived from past query cases. A user can construct TAHs from one or more attributes and modify the existing TAH in the KB.
Figure 4 displays the various cooperative modules: Relaxation, Association, and Directory. These agents are connected selectively to meet applications' needs. An application that requires relaxation and association capabilities, for example, will entail a linking of Relaxation and Association agents. Our architecture allows incremental growth with application. When the demand for certain modules increases, additional copies of the modules can be added to reduce the loading; thus, the system is scalable. For example, there are multiple copies of relaxation agents and association agents in Fig. 4. Furthermore, different types of agents can be interconnected and communicate with each other via a common communication protocol [e.g., FIPA (http://www.fipa.org) or Knowledge Query Manipulation Language (KQML) (39)] to perform a joint task. Thus, the architecture is extensible.
Relaxation Module

Query relaxation is the process of understanding the semantic context and intent of a user query and modifying the query constraints, with the guidance of the customized knowledge structure (TAH), into near values that provide best-fit answers. The flow of the relaxation process is depicted in Fig. 5. When a CoSQL query is presented to the Relaxation Agent, the system first goes through a preprocessing phase. During preprocessing, the system first relaxes any context-free and/or context-sensitive cooperative operators in the query. All relaxation control operations specified in the query will be processed. The information will be stored in the relaxation manager and be ready to be used if the query requires relaxation. The modified SQL query is then presented to the underlying database system for execution. If no answers are returned, then the cooperative query system, under the direction of the Relaxation Manager, relaxes the queries by query modification. This is accomplished by traversing along the TAH node for performing generalization and specialization and rewriting the query to include a larger search scope. The relaxed query is then executed, and if there is no answer, we repeat the relaxation process until we obtain one or more approximate answers. If the system fails to produce an answer due to overtrimmed TAHs, the relaxation manager will deactivate certain relaxation rules to restore part of a trimmed TAH to broaden the search scope until answers are found. Finally, the answers are postprocessed (e.g., ranking and filtering).
Figure 4. A scalable and extensible cooperative information system (R: Relaxation module, A: Association module, D: Directory module, mediating between applications and the underlying information sources).

Figure 3. CoBase functional architecture (GUI, knowledge editor, relaxation and association modules, and knowledge bases of TAHs, user profiles, and query cases over the underlying data sources and geographical information system).
Relaxation Control. Relaxation without control may generate more approximations than the user can handle. The policy for relaxation control depends on many factors, including user profile, query context, and relaxation control operators as defined previously. The Relaxation Manager combines those factors via certain policies (e.g., minimizing search time or nearness) to restrict the search for approximate answers. We allow the input query to be annotated with control operators to help guide the agent in query relaxation operations.

If control operators are used, the Relaxation Manager selects the condition to relax in accordance with the requirements specified by the operators. For example, a relaxation-order operator will dictate relaxing location first, then runway length. Without such user-specified requirements, the Relaxation Manager uses a default relaxation strategy by selecting the relaxation order based on the minimum coverage rule. Coverage is defined as the ratio of the cardinality of the set of instances covered by a TAH node to the cardinality of the set of instances covered by the entire TAH. Thus, the coverage of a TAH node is the percentage of all tuples in the TAH covered by that TAH node. The minimum coverage rule always relaxes the condition that causes the minimum increase in the scope of the query, which is measured by the coverage of its TAH node. This default relaxation strategy attempts to add the smallest number of tuples possible at each step, based on the rationale that the smallest increase in scope is likely to generate the closest approximate answers. The strategy for choosing which condition to relax first is only one of many possible relaxation strategies; the Relaxation Manager can support other relaxation strategies as well.
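The following is a minimal sketch, under the assumption that each relaxable condition knows its parent TAH node's tuple count, of the minimum coverage rule described above; it is illustrative rather than CoBase's actual code.

def coverage(tah_node, total_tuples):
    """Coverage of a TAH node: fraction of all tuples covered by that node."""
    return tah_node["tuple_count"] / total_tuples

def pick_condition_to_relax(conditions, total_tuples):
    """Default strategy: relax the condition whose parent TAH node (the next
    generalization step) has the minimum coverage, i.e., the smallest scope increase."""
    relaxable = [c for c in conditions if not c.get("not_relaxable")]
    return min(relaxable, key=lambda c: coverage(c["parent_tah_node"], total_tuples))

# Illustrative conditions: relaxing runway_length here widens the scope the least.
conditions = [
    {"attr": "runway_length_ft", "parent_tah_node": {"tuple_count": 120}},
    {"attr": "location",         "parent_tah_node": {"tuple_count": 800}},
]
print(pick_condition_to_relax(conditions, total_tuples=2000)["attr"])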
Let us consider the following example of using control operators to improve the relaxation process. Suppose a pilot is searching for an airport with an 8000-ft runway in Bizerte, but there is no airport in Bizerte that meets the specifications. There are many ways to relax the query in terms of location and runway length. If the pilot specifies the relaxation order to relax the location attribute first, then the query modification generalizes the location Bizerte to NW Tunisia (as shown in Fig. 1) and specifies the locations Bizerte, Djedeida, Tunis, and Saminjah, thus broadening the search scope of the original query. If, in addition, we know that the user is interested only in the airports in West Tunisia and does not wish to shorten the required runway length, the system can eliminate the search in East Tunisia and also avoid airports with short and medium runways, as shown in Fig. 6. As a result, we can limit the query relaxation to a narrower scope by trimming the TAHs, thus improving both the system performance and the answer relevance.
Spatial Relaxation and Approximation. In geographical queries, spatial operators such as located, within, contain, intersect, union, and difference are used. When there are no exact answers for a geographical query, both its spatial and nonspatial conditions can be relaxed to obtain the approximate answers. CoBase operators also can be used for describing approximate spatial relationships; for example, an aircraft carrier is near seaport Sfax. Approximate spatial operators, such as near-to and between, are developed for the approximate spatial relationships. Spatial approximation depends on contexts and domains (40,41). For example, a hospital near to LAX is different from an airport near to LAX. Likewise, the nearness of a hospital in a metropolitan area is different from the one in a rural area. Thus, spatial conditions should be relaxed differently in different circumstances. A common approach to this problem is the use of prespecified ranges. This approach requires experts to provide such information for all possible situations, which is difficult to scale up to larger applications or to extend to different domains. Because TAHs are user- and context-sensitive, they can be used to provide context-sensitive approximation. More specifically, we can generate TAHs based on multidimensional spatial attributes (MTAHs).

Furthermore, an MTAH (based on latitude and longitude) is generated based on the distribution of the object locations. The distance between nearby objects is context-sensitive: the denser the location distribution, the smaller the distance among the objects. In Fig. 7, for example, the default neighborhood distance in Area 3 is smaller than the one in Area 1. Thus, when a set of airports is clustered based on the locations of the airports, the ones in the same cluster of the MTAH are much closer to each other than to those outside the cluster. Thus, they can be considered near-to each other. We can apply the same approach to other approximate spatial operators, such as between (i.e., a cluster near-to the center of two objects).
Figure 5. Flow chart for processing CoBase queries.
MTAHs can also be used to provide context-sensitive query relaxation. For example, consider the query: Find an airfield at the city Sousse. Because there is no airfield located exactly at Sousse, this query can be relaxed to obtain approximate answers. First, we locate the city Sousse with latitude 35.83 and longitude 10.63. Using the MTAH in Fig. 7, we find that Sousse is covered by Area 4. Thus, the airport Monastir is returned. Unfortunately, it is not an airfield. So the query is further relaxed to the neighboring cluster; the four airports in Area 3 are returned: Bizerte, Djedeida, Tunis, and Saminjah. Because only Djedeida and Saminjah are airfields, these two will be returned as the approximate answers.
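A minimal Python sketch of this MTAH-based near-to relaxation follows; the coordinates and cluster memberships are adapted loosely from Fig. 7 and are illustrative assumptions, not the actual CoBase data.

def containing_cluster(clusters, lat, lon):
    """Find the leaf MTAH cluster whose latitude/longitude box covers the point."""
    for c in clusters:
        (lat_lo, lat_hi), (lon_lo, lon_hi) = c["lat_range"], c["lon_range"]
        if lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi:
            return c
    return None

def relax_near_to(clusters, lat, lon, accept):
    """Return acceptable objects from the covering cluster; if none qualify,
    widen the scope to the neighboring clusters (one relaxation step)."""
    cluster = containing_cluster(clusters, lat, lon)
    if cluster is None:
        return []
    hits = [o for o in cluster["objects"] if accept(o)]
    if hits:
        return hits
    neighbors = (o for n in cluster["neighbors"] for o in n["objects"])
    return [o for o in neighbors if accept(o)]

area3 = {"lat_range": (34.72, 37.28), "lon_range": (8.1, 10.23),
         "objects": [{"name": "Djedeida", "type": "airfield"},
                     {"name": "Saminjah", "type": "airfield"},
                     {"name": "Bizerte", "type": "airport"},
                     {"name": "Tunis", "type": "airport"}],
         "neighbors": []}
area4 = {"lat_range": (34.72, 37.28), "lon_range": (10.23, 11.4),
         "objects": [{"name": "Monastir", "type": "airport"}],
         "neighbors": [area3]}
# Sousse (35.83, 10.63) falls in Area 4; no airfield there, so relax to Area 3.
print([o["name"] for o in relax_near_to([area3, area4], 35.83, 10.63,
                                        lambda o: o["type"] == "airfield")])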
MTAHs are automatically generated from databases by using our clustering method that minimizes relaxation error (27). They can be constructed for different contexts and user types. For example, it is critical to distinguish a friendly airport from an enemy airport. Using an MTAH for friendly airports restricts the relaxation only within the set of friendly airports, even though some enemy airports are geographically nearby.
Figure 6. TAH trimming based on relaxation control operators (the constraints "do not relax to short or medium runway" and "limit location to NW_Tun and SW_Tun" trim the runway-length and location TAHs for Tunisia).
Figure 7. An MTAH for the airports in Tunisia and its corresponding two-dimensional space (clusters 1-4 partition the latitude range [31.72, 37.28] and longitude range [8.1, 11.4], with the relaxation error of each cluster shown).
This restriction significantly improves the accuracy and flexibility of spatial query answering. The integration of spatial and cooperative operators provides more expressiveness and context-sensitive answers. For example, the user is able to pose such queries as, find the airports similar-to LAX and near-to City X. When no answers are available, both near-to and similar-to can be relaxed based on the user's preference (i.e., a set of attributes). To relax near-to, airports from neighboring clusters in the MTAH are returned. To relax similar-to, the multiple-attribute criteria are relaxed by their respective TAHs.

Cooperativeness in geographic databases was studied in Ref. 42. A rule-based approach is used in their system for approximate spatial operators as well as query relaxation. For example, they define that P is near-to Q iff the distance from P to Q is less than n · length_unit, where length_unit is a context-dependent scalar parameter, and n is a scalar parameter that can be either unique for the application and thus defined in the domain model, or specific for each class of users and therefore defined in the user models. This approach requires n and length_unit to be set by domain experts. Thus, it is difficult to scale up. Our system uses MTAHs as a representation of the domain knowledge. The MTAHs can be generated automatically from databases based on contexts and provide a structured and context-sensitive way to relax queries. As a result, it is scalable to large applications. Further, the relaxation error at each node is computed during the construction of TAHs and MTAHs. It can be used to evaluate the quality of relaxations and to rank the nearness of the approximate answers to the exact answer.
Associative Query Answering via Case-Based Reasoning

Often it is desirable to provide additional information relevant to, though not explicitly stated in, a user's query. For example, in finding the location of an airport satisfying the runway length and width specifications, the association module (Fig. 8) can provide additional information about the runway quality and weather condition, so that this additional information may help the pilot select a suitable airport to land his aircraft. On the other hand, the useful relevant information for the same query if posed by a transportation planner may be information regarding railway facilities and storage facilities near the airport. Therefore, associative information is user- and context-sensitive.

Association in CoBase is executed as a multistep postprocess. After the query is executed, the answer set is gathered together with the query conditions, user profile, and application constraints. This combined information is matched against query cases from the case base to identify relevant associative information (15,33). The query cases can take the form of a CoBase query, which can include any CoBase construct, such as conceptual conditions (e.g., runway_length_ft = short) or explicitly cooperative operations (city near-to BIZERTE).
For example, consider the query
SELECT name, runway_length_ft
FROM airports
WHERE runway_length_ft > 6000
Based on the combined information, associative attributes such as runway conditions and weather are derived. The associated information for the corresponding airports is retrieved from the database and then appended to the query answer, as shown in Fig. 9.
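A minimal sketch of this case-based association step follows; the case and profile structures are hypothetical, the data mirror Fig. 9, and this is not the actual CoBase association module.

def top_associated_attributes(query_profile, cases, k=2):
    """Score each past case by condition overlap with the combined query
    information and collect the associative attributes of the best matches."""
    def score(case):
        return len(set(case["conditions"]) & set(query_profile["conditions"]))
    attrs = []
    for case in sorted(cases, key=score, reverse=True):
        for a in case["associative_attrs"]:
            if a not in attrs:
                attrs.append(a)
    return attrs[:k]

def append_association(answers, assoc_lookup, attrs):
    """Append the associated attribute values (e.g., runway condition, weather)
    to each returned tuple."""
    return [dict(ans, **{a: assoc_lookup[ans["name"]][a] for a in attrs})
            for ans in answers]

profile = {"conditions": ["runway_length_ft", "user=pilot"]}
cases = [{"conditions": ["runway_length_ft", "user=pilot"],
          "associative_attrs": ["runway_condition", "weather"]},
         {"conditions": ["seaport", "user=planner"],
          "associative_attrs": ["railroad_facility"]}]
answers = [{"name": "Jerba", "runway_length_ft": 9500}]
assoc = {"Jerba": {"runway_condition": "Damaged", "weather": "Sunny"}}
print(append_association(answers, assoc, top_associated_attributes(profile, cases)))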
Our current case base, consisting of about 1500 past queries, serves as the knowledge server for the association module. The size of the case base is around 2 Mb. For association purposes, we use the 300-case set, which is composed of past queries used in the transportation domain.
Figure 8. Associative query answering facility (the association module matches the query and answer against the case base and user profile, adapts and ranks the associative attributes, and generates an extended query via the source and TAH mediators, with learning feedback).
Figure 9. Query answer and associative information for the selected airports.

    Query answer               Associative information
    Name       Runway_length   Runway_condition   Weather
    Jerba      9500            Damaged            Sunny
    Monastir   6500            Good               Foggy
    Tunis      8500            Good               Good
For testing the performance and scalability of the system, we use a 1500-case set, which consists of randomly generated queries based on user profile and query template over the transportation domain. Users can also browse and edit association control parameters such as the number of association subjects, the associated links and weights of a given case, and the threshold for association relevance.
PERFORMANCE EVALUATION

In this section, we present the CoBase performance based on measuring the execution of a set of queries on the CoBase testbed developed at UCLA for the ARPI transportation domain. The performance measure includes the response time for query relaxation and association, and the quality of answers. The response time depends on the type of queries (e.g., size of joins, number of joins) as well as the amount of relaxation and association required to produce an answer. The quality of the answer depends on the amount of relaxation and association involved. The user is able to specify the relaxation and association control to reduce the response time and also to specify the requirement of answer accuracy. In the following, we shall show four example queries and their performances. The first query illustrates the relaxation cost. The second query shows the additional translation cost for the similar-to cooperative operator, whereas the third query shows the additional association cost. The fourth query shows the processing cost for returned query answers as well as the quality of answers by using TAH versus MTAH for a very large database table (about 200,000 tuples).
Query 4. Find nearby airports that can land a C-5.

Based on the airplane location, the relaxation module translates nearby into a prespecified or user-specified latitude and longitude range. Based on the domain knowledge of the C-5, the mediator also translates land into the required runway length and width for landing the aircraft. The system executes the translated query. If no airport is found, the system relaxes the distance (by a predefined amount) until an answer is returned. In this query, an airport is found after one relaxation. Thus, two database retrievals (i.e., one for the original query and one for the relaxed query) are performed. Three tables are involved: table GEOLOC (50,000 tuples), table RUNWAYS (10 tuples), and table AIRCRAFT_AIRFIELD_CHARS (29 tuples). The query answers provide airport locations and their characteristics.

Elapsed time: 5 seconds processing time for relaxation; 40 seconds database retrieval time.
Query 5. Find at least three airports similar-to Bizerte based on runway length and runway width.

The relaxation module retrieves the runway characteristics of the Bizerte airport and translates the similar-to condition into the corresponding query conditions (runway length and runway width). The system executes the translated query and relaxes the runway length and runway width according to the TAHs until at least three answers are returned. Note that the TAH used for this query is a Runway-TAH based on runway length and runway width, which is different from the Location-TAH based on latitude and longitude (shown in Fig. 7). The nearness measure is calculated based on weighted mean square error. The system computes the similarity measure for each answer obtained, ranks the list of answers, and presents it to the user. The system obtains five answers after two relaxations. The best three are selected and presented to the user. Two tables are involved: table GEOLOC (50,000 tuples) and table RUNWAYS (10 tuples).

Elapsed time: 2 seconds processing time for relaxation; 10 seconds database retrieval time.
Query 6. Find seaports in Tunisia with a refrigerated storage capacity of over 50 tons.

The relaxation module executes the query. The query is not relaxed, so one database retrieval is performed. Two tables are used: table SEAPORTS (11 tuples) and table GEOLOC (about 50,000 tuples).

Elapsed time: 2 seconds processing time for relaxation; 5 seconds database retrieval time.

The association module returns relevant information about the seaports. It compares the user query to previous similar cases and selects a set of attributes relevant to the query. The two top-associated attributes are selected and appended to the query. CoBase executes the appended query and returns the answers to the user, together with the additional information. The two additional attributes associated are location name and the availability of a railroad facility near the seaports.

Elapsed time: 10 seconds for association computation time.
Query 7. Find at least 100 cargos of code 3FKAK with the given volume (length, width, height); code is nonrelaxable.

The relaxation module executes the query and relaxes the height, width, and length according to the MTAH until at least 100 answers are returned. The query is relaxed four times. Thus, five database retrievals are performed. Among the tables accessed is table CARGO_DETAILS (200,000 tuples), a very large table.

Elapsed time: 3 seconds processing time for relaxation using MTAH; 2 minutes database retrieval time for 5 retrievals.

By using single TAHs (i.e., single TAHs for height, width, and length, respectively), the query is relaxed 12 times. Thus, 13 database retrievals are performed.

Elapsed time: 4 seconds for relaxation by single TAHs; 5 minutes database retrieval time for 13 retrievals.
For queries involving multiple attributes in the same relation, using an MTAH that covers multiple attributes would provide better relaxation control than using a combination of single-attribute TAHs. The MTAH compares favorably with multiple single-attribute TAHs in both quality and efficiency. We have shown that an MTAH yields a better relaxation strategy than multiple single-attribute TAHs. The primary reason is that MTAHs capture attribute-dependent relationships that cannot be captured when using multiple single-attribute TAHs.

Using MTAHs to control relaxation is more efficient than using multiple single-attribute TAHs. For this example, relaxation using MTAHs requires an average of 2.5 relaxation steps, whereas single-attribute TAHs require 8.4 steps. Because a database query is posed after each relaxation step, using MTAHs saves around six database accesses on average. Depending on the size of the tables and joins involved, each database access may take from 1 s to about 30 s. As a result, using MTAHs to control relaxation saves a significant amount of user time.
With the aid of domain experts, these queries can be answered by conventional databases. Such an approach takes a few minutes to a few hours. However, without the aid of the domain experts, it may take hours to days to answer these queries. CoBase incorporates domain knowledge as well as relaxation techniques to enlarge the search scope to generate the query answers. Relaxation control plays an important role in enabling the user to control the relaxation process via relaxation control operators such as relaxation order, nonrelaxable attributes, preference list, etc., to restrict the search scope. As a result, CoBase is able to derive the desired answers for the user in significantly less time.
TECHNOLOGY TRANSFER OF COBASE

CoBase stemmed from the transportation planning application for relaxing query conditions. CoBase was linked with SIMS (43) and LIM (44) as a knowledge server for the planning system. SIMS performs query optimizations for distributed databases, and LIM provides high-level language query input to the database. A Technical Integration Experiment was performed to demonstrate the feasibility of this integrated approach. CoBase technology was implemented for the ARPI transportation application (45). Recently, CoBase has also been integrated into a logistical planning tool called Geographical Logistics Anchor Desk (GLAD) developed by GTE/BBN. GLAD is used in locating desired assets for logistical planning, which has a very large database (some of the tables exceed one million rows). CoBase has been successfully inserted into GLAD (called CoGLAD), generating the TAHs from the databases, providing similarity search when exact matches of the desired assets are not available, and also locating the required amount of these assets with spatial relaxation techniques. The spatial relaxation avoids searching and filtering the entire set of available assets, which greatly reduces the computation time.
In addition, CoBase has also been successfully applied to the following domains. In electronic warfare, one of the key problems is to identify and locate the emitter of radiated electromagnetic energy based on the operating parameters of observed signals. The signal parameters are radio frequency, pulse repetition frequency, pulse duration, scan period, and the like. In a noisy environment, these parameters often cannot be matched exactly against the emitter specifications. CoBase can be used to provide approximate matching of these emitter signals. A knowledge base (TAH) can be constructed from the parameter values of previously identified signals and also from the peak (typical, unique) parameter values. The TAH provides guidance on the parameter relaxation. The matched emitters from relaxation can be ranked according to relaxation errors. Our preliminary results have shown that CoBase can significantly improve emitter identification as compared to conventional database techniques, particularly in a noisy environment. From the line of bearing of the emitter signal, CoBase can locate the platform that generates the emitter signal by using the near-to relaxation operator.
In medical databases that store x rays and magnetic resonance images, the images are evolution- and temporal-based. Furthermore, these images need to be retrieved by object features or contents rather than patient identification (46). The queries asked are often conceptual and not precisely defined. We need to use knowledge about the application (e.g., age class, ethnic class, disease class, bone age), user profile, and query context to derive such queries (47). Further, to match the feature exactly is very difficult if not impossible. For example, if the query "Find the treatment methods used for tumors similar to X_i (location x_i, size x_i) on 12-year-old Korean males" cannot be answered, then, based on the TAH shown in Fig. 10, we can relax tumor X_i to tumor Class X, and 12-year-old Korean male to pre-teen Asian, which results in the following relaxed query: "Find the treatment methods used for tumor Class X on pre-teen Asians." Further, we can obtain such relevant information as the success rate, side effects, and cost of the treatment from the association operations. As a result, query relaxation and modification are essential
Figure 10. TAHs on patient age (preteens, teens, adult), ethnic group (Asian, African, European), and tumor classes (location, size) used for relaxing the medical query.
For a query twig with m applicable relaxation operations, any nonempty subset of them may be applied, giving C(m,1) + C(m,2) + ... + C(m,m) = 2^m - 1 relaxation operation combinations. Thus, there are at most 2^m relaxed twigs.
XML QUERY RELAXATION BASED ON SCHEMA CONVERSION

One approach to XML query relaxation is to convert the XML schema, transform XML documents into relational tables with the converted schema, and then apply relational query relaxation techniques. A schema conversion tool, called XPRESS (Xml Processing and Relaxation in rElational Storage System), has been developed for these purposes: XML documents are mapped into relational formats so that queries can be processed and relaxed using existing relational technologies. Figure 4 illustrates the query relaxation flow via the schema conversion approach. This process first begins by extracting the schema information, such as the DTD, from XML documents via tools such as XML Spy (see http://www.xmlspy.com). Second, the XML schema is transformed to a relational schema via schema conversion (e.g., XPRESS). Third, XML documents are parsed, mapped into tuples, and inserted into the relational databases. Then, relational query relaxation techniques [e.g., CoBase (3,6)] can be used to relax query conditions. Further, semi-structured queries over XML documents are translated into SQL queries. These SQL queries are processed and relaxed if there is no answer or there are insufficient answers available. Finally, results in the relational format are converted back into XML [e.g., via the Nesting-based Translation Algorithm (NeT) and the Constraints-based Translation Algorithm (CoT) (11) in XPRESS]. The entire process can be done automatically and is transparent to users. In the following sections, we shall briefly describe the mapping between XML and relational schema.
Mapping XML Schema to Relational Schema

Transforming a hierarchical XML model to a flat relational model is a nontrivial task because of the following inherent difficulties: the nontrivial 1-to-1 mapping, the existence of set values, complicated recursion, and/or fragmentation issues.
Figure 3. Examples of structure relaxations for Fig. 2: (a) node relabel, (b) edge generalization, and (c) node delete, applied to the article twig (title "data mining", body/section about "frequent itemset, +algorithms", year 2000).
Figure 4. The processing flow of XML query relaxation via schema conversion (extract the DTD from XML documents, map the schema and data into a relational database, generate TAHs, translate XML queries into SQL, process and relax them, and convert the relaxed answers back into XML).
Several research works have been reported in these areas. Shanmugasundaram et al. (12) mainly focus on the issues of structural conversion. The Constraints Preserving Inline (CPI) algorithm (13) considers the semantics existing in the original XML schema during the transformation. CPI inlines as many descendants of an element as possible into a single relation. It maps an XML element to a separate table when there is a 1-to-{0..*} or a 1-to-{1..*} cardinality between its parent and itself. The first cardinality has the semantics of "any," denoted by * in XML. The second means "at least one," denoted by +. For example, consider the following DTD fragment:

<!ELEMENT author (name, address)>
<!ELEMENT name (firstname?, lastname)>

A naive algorithm will map every element into a separate table, leading to excessive fragmentation of the document, as follows:

author (address, name_id)
name (id, firstname, lastname)

The CPI algorithm converts the DTD fragment above into a single relational table as author (firstname, lastname, address).

In addition, semantics such as #REQUIRED in XML can be enforced in SQL with NOT NULL. Parent-to-child relationships are captured with KEYS in SQL to allow join operations. Figure 5 overviews the CPI algorithm, which uses a structure-based conversion algorithm (i.e., a hybrid algorithm) (13) as a basis and identifies various semantic constraints in the XML model. The CPI algorithm has been implemented in XPRESS, which reduces the number of tables generated while preserving most constraints.
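The following toy Python sketch illustrates the inlining idea in the spirit of CPI on the DTD fragment above; it handles only this simple case and is not the actual CPI algorithm.

import re

def inline_dtd(dtd_text):
    """Toy inlining: fold every child that occurs at most once (no '*' or '+')
    into its parent's table; repeating children get their own table with a
    foreign key back to the parent."""
    defs = dict(re.findall(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)>", dtd_text))
    tables = {}

    def columns(elem):
        cols = []
        for child in (c.strip() for c in defs.get(elem, "").split(",") if c.strip()):
            name, repeat = child.rstrip("?*+"), child[-1] in "*+"
            if repeat:                      # set-valued child -> separate table + foreign key
                tables[name] = [f"{elem}_id"] + columns(name)
            elif name in defs:              # single nested element -> inline its columns
                cols += columns(name)
            else:                           # leaf -> plain column
                cols.append(name)
        return cols

    root = next(iter(defs))
    tables[root] = columns(root)
    return tables

dtd = """<!ELEMENT author (name, address)>
<!ELEMENT name (firstname?, lastname)>"""
print(inline_dtd(dtd))   # {'author': ['firstname', 'lastname', 'address']}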
Mapping Relational Schema to XML Schema

After obtaining the results in the relational format, we may need to represent them in the XML format before returning them back to users. XPRESS developed a Flat Translation (FT) algorithm (13), which translates tables in a relational schema to elements in an XML schema and columns in a relational schema to attributes in an XML schema. As FT translates the flat relational model to a flat XML model in a one-to-one manner, it does not use the basic non-flat features provided by the XML model, such as representing subelements through regular expression operators (e.g., * and +). As a result, the NeT algorithm (11) is proposed to decrease data redundancy and obtain a more intuitive schema by: (1) removing redundancies caused by multivalued dependencies; and (2) performing grouping on attributes. The NeT algorithm, however, considering tables one at a time, cannot obtain an overall picture of the relational schema where many tables are interconnected with each other through various other dependencies. The CoT algorithm (11) uses inclusion dependencies (INDs) of the relational schema, such as foreign key constraints, to capture the interconnections between relational tables and represent them via parent-to-child hierarchical relationships in the XML model.
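The grouping idea behind NeT can be illustrated with a small sketch that nests a repeating column to remove the redundancy introduced by a multivalued dependency; the relation and column names are hypothetical, and this is not the actual NeT algorithm.

from collections import defaultdict

def nest(rows, group_cols, nested_col):
    """Group tuples that agree on group_cols and collect nested_col values;
    the collected list maps naturally to a repeating (*) XML subelement."""
    grouped = defaultdict(list)
    for r in rows:
        key = tuple(r[c] for c in group_cols)
        grouped[key].append(r[nested_col])
    return [dict(zip(group_cols, key), **{nested_col: vals})
            for key, vals in grouped.items()]

# A flat relation where each author row repeats once per article title.
rows = [{"author": "a1", "title": "t1"},
        {"author": "a1", "title": "t2"},
        {"author": "a2", "title": "t3"}]
print(nest(rows, ["author"], "title"))
# [{'author': 'a1', 'title': ['t1', 't2']}, {'author': 'a2', 'title': ['t3']}]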
Query relaxation via schema transformation (e.g., XPRESS) has the advantage of leveraging the well-developed relational databases and relational query relaxation techniques. Information, however, may be lost during the decomposition of hierarchical XML data into flat relational tables. For example, by transforming the following XML schema into the relational schema author (firstname, lastname, address), we lose the hierarchical relationship between element author and element name, as well as the information that element firstname is optional:

<!ELEMENT author (name, address)>
<!ELEMENT name (firstname?, lastname)>

Further, this approach does not support structure relaxations in the XML data model. To remedy these shortcomings, we shall perform query relaxation on the XML model directly, which will provide both value relaxation and structure relaxation.
Figure 5. Overview of the CPI algorithm (a hybrid structure-based conversion produces the relational schema, and FindConstraints identifies the integrity constraints derived from the DTD).

A COOPERATIVE APPROACH FOR XML QUERY RELAXATION

Query relaxation is often user-specific. For a given query, different users may have different specifications about which conditions to relax and how to relax them. Most existing approaches on XML query relaxation [e.g., (10)] do not provide control during relaxation, which may yield undesired approximate answers. To provide user-specific approximate query answering, it is essential for an XML system to have a relaxation language that allows users to specify their relaxation control requirements and to have the capability to control the query relaxation process.

Furthermore, query relaxation usually returns a set of approximate answers. These answers should be ranked based on their relevancy to both the structure and the content conditions of the posed query. Most existing ranking models [e.g., (14,15)] only measure the content similarities between queries and answers and thus are inadequate for ranking approximate answers that use structure relaxations. Recently, in Ref. 16, the authors proposed a family of structure scoring functions based on the occurrence frequencies of query structures among data without considering data semantics. Clearly, using the rich semantics provided in XML data in designing scoring functions can improve ranking accuracy.
To remedy these shortcomings, we propose a new paradigm for XML approximate query answering that places users and their demands in the center of the design approach. Based on this paradigm, we develop a cooperative XML system that provides user-specific approximate query answering. More specifically, we first develop a relaxation language that allows users to specify approximate conditions and control requirements in queries (e.g., preferred or unacceptable relaxations, nonrelaxable conditions, and relaxation orders).

Second, we introduce a relaxation index structure that clusters twigs into multilevel groups based on relaxation types and their distances. Thus, it enables the system to control the relaxation process based on users' specifications in queries.

Third, we propose a semantic-based tree editing distance to evaluate XML structure similarities, which is based on not only the number of operations but also the operation semantics. Furthermore, we combine structure and content similarities in evaluating the overall relevancy.

In Fig. 6, we present the architecture of our CoXML query answering system. The system contains two major parts: offline components for building relaxation indexes and online components for processing and relaxing queries and ranking results.
- Building relaxation indexes. The Relaxation Index Builder constructs the relaxation indexes, XML Type Abstraction Hierarchies (XTAHs), for a set of document collections.
- Processing, relaxing queries, and ranking results. When a user posts a query, the Relaxation Engine first sends the query to an XML Database Engine to search for answers that exactly match the structure conditions and approximately satisfy the content conditions in the query. If enough answers are found, the Ranking Module ranks the results based on their relevancy to the content conditions and returns the ranked results to the user. If there are no answers or insufficient results, then the Relaxation Engine, based on the user-specified relaxation constructs and controls, consults the relaxation indexes for the best relaxed query. The relaxed query is then resubmitted to the XML Database Engine to search for approximate answers. The Ranking Module ranks the returned approximate answers based on their relevancies to both structure and content conditions in the query. This process is repeated until either there are enough approximate answers returned or the query is no longer relaxable.

The CoXML system can run on top of any existing XML database engine (e.g., BerkeleyDB, Tamino, DB2 XML) that retrieves exactly matched answers.
XML QUERY RELAXATION LANGUAGE

A number of XML approximate search languages have been proposed. Most extend standard query languages with constructs for approximate text search [e.g., XIRQL (15), TeXQuery (17), NEXI (18)]. For example, TeXQuery extends XQuery with a rich set of full-text search primitives, such as proximity distances, stemming, and thesauri. NEXI introduces about functions for users to specify approximate content conditions. XXL (19) is a flexible XML search language with constructs for users to specify both approximate structure and content conditions. It, however, does not allow users to control the relaxation process. Users may often want to specify their preferred or rejected relaxations, nonrelaxable query conditions, or to control the relaxation orders among multiple relaxable conditions.

To remedy these shortcomings, we propose an XML relaxation language that allows users both to specify approximate conditions and to control the relaxation process. A relaxation-enabled query Q is a tuple (T, R, C, S), where:

- T is a twig as described earlier;
- R is a set of relaxation constructs specifying which conditions in T may be approximated when needed;
- C is a boolean combination of relaxation controls stating how the query shall be relaxed; and
- S is a stop condition indicating when to terminate the relaxation process.

The execution semantics for a relaxation-enabled query are as follows: We first search for answers that exactly match the query; we then test the stop condition to check whether relaxation is needed. If the stop condition is not satisfied, we repeatedly relax the twig based on the relaxation constructs and control until either the stop condition is met or the twig cannot be further relaxed.
Figure 6. The CoXML system architecture (the Relaxation Engine, Ranking Module, and Relaxation Index Builder work with the relaxation indexes on top of an XML database engine over the XML documents).
(BerkeleyDB: see http://www.sleepycat.com/; Tamino: see http://www.softwareag.com/tamino; DB2 XML: see http://www.ibm.com/software/data/db2/.)
Given a relaxation-enabled query Q, we use Q.T, Q.R, Q.C, and Q.S to represent its twig, relaxation constructs, control, and stop condition, respectively. Note that a twig is required to specify a query, whereas relaxation constructs, control, and stop condition are optional. When only a twig is present, we iteratively relax the query based on similarity metrics until the query cannot be further relaxed.
A relaxation construct for a query Q is either a specific or a generic relaxation operation in any of the following forms:

- rel(u, -), where u ∈ Q.T.V, specifies that node u may be relabeled when needed;
- del(u), where u ∈ Q.T.V, specifies that node u may be deleted if necessary; and
- gen(e_u,v), where e_u,v ∈ Q.T.E, specifies that edge e_u,v may be generalized when needed.

The relaxation control for a query Q is a conjunction of any of the following forms:

- Nonrelaxable condition !r, where r ∈ {rel(u, -), del(u), gen(e_u,v) | u, v ∈ Q.T.V, e_u,v ∈ Q.T.E}, specifies that node u cannot be relabeled or deleted, or that edge e_u,v cannot be generalized;
- Prefer(u, l_1, ..., l_n), where u ∈ Q.T.V and l_i is a label (1 <= i <= n), specifies that node u is preferred to be relabeled to the labels in the order of (l_1, ..., l_n);
- Reject(u, l_1, ..., l_n), where u ∈ Q.T.V, specifies a set of unacceptable labels for node u;
- RelaxOrder(r_1, ..., r_n), where r_i ∈ Q.R (1 <= i <= n), specifies the relaxation order for the constructs in R to be (r_1, ..., r_n); and
- UseRType(rt_1, ..., rt_k), where rt_i ∈ {node relabel, node delete, edge generalize} (1 <= i <= k <= 3), specifies the set of relaxation types allowed to be used. By default, all three relaxation types may be used.

A stop condition S is either:

- AtLeast(n), where n is a positive integer, specifying the minimum number of answers to be returned; or
- d(Q.T, T') <= t, where T' stands for a relaxed twig and t is a distance threshold, specifying that the relaxation should be terminated when the distance between the original twig and a relaxed twig exceeds the threshold.

(A small data-structure sketch of these four query parts follows this list.)
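As an illustration of how the four parts of a relaxation-enabled query might be represented (the class and field names are assumptions, not CoXML's API), consider the following Python sketch, which restates the sample query of Fig. 7.

from dataclasses import dataclass, field

@dataclass
class RelaxQuery:
    """A relaxation-enabled query Q = (T, R, C, S)."""
    twig: dict                                        # T: query twig (nodes, edges, content)
    constructs: list = field(default_factory=list)    # R: rel/del/gen operations allowed
    control: list = field(default_factory=list)       # C: conjunction of control terms
    stop: dict = field(default_factory=dict)          # S: e.g., {"at_least": n} or {"max_dist": t}

q = RelaxQuery(
    twig={"$1": "article", "$2": "title", "$3": "year", "$4": "body", "$5": "section"},
    constructs=[("gen", ("$4", "$5")), ("del", "$3")],
    control=[("not_relaxable", ("del", "$4")),
             ("not_relaxable", ("gen", ("$1", "$2"))),
             ("not_relaxable", ("gen", ("$1", "$4"))),
             ("use_rtype", ("node_delete", "edge_generalize"))],
    stop={"at_least": 20},
)
print(q.stop)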
Figure 7 presents a sample relaxation-enabled query. The minimum number of answers to be returned is 20. When relaxation is needed, the edge between body and section may be generalized and node year may be deleted. The relaxation control specifies that node body cannot be deleted during relaxation. For instance, a section about frequent itemset in an article's appendix part is irrelevant. Also, the edge between nodes article and title and the edge between nodes article and body cannot be generalized. For instance, an article with a reference to another article that possesses a title on data mining is irrelevant. Finally, only edge generalization and node deletion can be used.
We now present an example of using the relaxation language to represent query topics in INEX 05. Figure 8 presents Topic 267 with three parts: castitle (i.e., the query formulated in an XPath-like syntax), description, and narrative. The narrative part describes a user's detailed information needs and is used for judging result relevancy. The user considers an article's title (atl) non-relaxable and regards titles about digital libraries under the bibliography part (bb) irrelevant. Based on this narrative, we formulate this topic using the relaxation language as shown in Fig. 9. The query specifies that node atl cannot be relaxed (either deleted or relabeled) and node fm cannot be relabeled to bb.
Figure 8. Topic 267 in INEX 05.
<inex_topic topic_id="267" query_type="CAS" ct_no="113" >
<castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
<description>Articles containing "digital libraries" in their title.</description>
<narrative>I'm interested in articles discussing Digital Libraries as their main subject.
Therefore I require that the title of any relevant article mentions "digital library" explicitly.
Documents that mention digital libraries only under the bibliography are not relevant, as well
as documents that do not have the phrase "digital library" in their title.</narrative>
</inex_topic>
Figure 7. A sample relaxation-enabled query. The twig asks for a 2000 article whose title is about "data mining" and whose body has a section about "frequent itemset, +algorithms" (nodes $1 = article, $2 = title, $3 = year, $4 = body, $5 = section), with relaxation specifications
R = {gen(e_$4,$5), del($3)},
C = !del($4) AND !gen(e_$1,$2) AND !gen(e_$1,$4) AND UseRType(node_delete, edge_generalize),
S = AtLeast(20).
(INEX: Initiative for the Evaluation of XML Retrieval; see http://inex.is.informatik.uni-duisburg.de/.)
Figure 9. Relaxation specifications for Topic 267: the twig $1 = article, $2 = fm, $3 = atl (about "digital libraries"), with control C = !rel($3, -) AND !del($3) AND Reject($2, bb).
XML RELAXATION INDEX

Several approaches for relaxing XML or graph queries have been proposed (8,10,16,20,21). Most focus on efficient algorithms for deriving top-k approximate answers without relaxation control. For example, Amer-Yahia et al. (16) proposed a DAG structure that organizes relaxed twigs based on their consumption relationships. Each node in a DAG represents a twig. There is an edge from twig T_A to twig T_B if the answers for T_B are a superset of those for T_A. Thus, the twig represented by an ancestor DAG node is always less relaxed and thus closer to the original twig than the twig represented by a descendant node. Therefore, the DAG structure enables efficient top-k searching when there are no relaxation specifications. When there are relaxation specifications, the approach in Ref. 16 can also be adapted to top-k searching by adding a postprocessing part that checks whether a relaxed query satisfies the specifications. Such an approach, however, may not be efficient when relaxed queries do not satisfy the relaxation specifications.

To remedy this condition, we propose an XML relaxation index structure, XTAH, that clusters relaxed twigs into multilevel groups based on the relaxation types used by the twigs and the distances between them. Each group consists of twigs using similar types of relaxations. Thus, XTAH enables systematic relaxation control based on users' specifications in queries. For example, Reject can be implemented by pruning groups of twigs using unacceptable relaxations. RelaxOrder can be implemented by scheduling relaxed twigs from groups based on the specified order.

In the following, we first introduce XTAH and then present the algorithm for building an XTAH.
XML Type Abstraction Hierarchy (XTAH)

Query relaxation is a process that enlarges the search scope for finding more answers. Enlarging a query scope can be accomplished by viewing the queried object at different conceptual levels.

In the relational database, a tree-like knowledge representation called Type Abstraction Hierarchy (TAH) (3) is introduced to provide systematic query relaxation guidance. A TAH is a hierarchical cluster that represents data objects at multiple levels of abstraction, where objects at higher levels are more general than objects at lower levels. For example, Fig. 10 presents a TAH for brain tumor sizes, in which a medium tumor size (i.e., 3-10 mm) is a more abstract representation than a specific tumor size (e.g., 10 mm). By such multilevel abstractions, a query can be relaxed by modifying its conditions via generalization (moving up the TAH) and specialization (moving down the TAH). In addition, relaxation can be easily controlled via the TAH. For example, REJECT of a relaxation can be implemented by pruning the corresponding node from a TAH.

To support query relaxation in the XML model, we propose a relaxation index structure similar to TAH, called XML Type Abstraction Hierarchy (XTAH). An XTAH for a twig structure T, denoted as XT_T, is a hierarchical cluster that represents relaxed twigs of T at different levels of relaxation, based on the types of operations used by the twigs and the distances between them. More specifically, an XTAH is a multilevel labeled cluster with two types of nodes: internal and leaf nodes. A leaf node is a relaxed twig of T. An internal node represents a cluster of relaxed twigs that use similar operations and are closer to each other by distance. The label of an internal node is the set of common relaxation operations (or types) used by the twigs in the cluster. The higher the level of an internal node in the XTAH, the more general the label of the node and the less relaxed the twigs in the internal node.

XTAH provides several significant advantages: (1) We can efficiently relax a query based on relaxation constructs by fetching relaxed twigs from internal nodes whose labels satisfy the constructs; (2) we can relax a query at different granularities by traversing up and down an XTAH; and (3) we can control and schedule query relaxation based on users' relaxation control requirements. For example, relaxation controls such as nonrelaxable conditions, Reject, or UseRType can be implemented by pruning XTAH internal nodes corresponding to unacceptable operations or types.
Figure 11 shows an XTAH for the sample twig in Fig. 3(a) (due to space limitations, only part of the XTAH is shown). For ease of reference, we associate each node in the XTAH with a unique ID, where the IDs of internal nodes are prefixed with I and the IDs of leaf nodes are prefixed with T.

Given a relaxation operation r, let I_r be an internal node with a label r. That is, I_r represents a cluster of relaxed twigs whose common relaxation operation is r. As a result of the tree-like organization of clusters, each relaxed twig belongs to only one cluster, whereas the twig may use multiple relaxation operations. Thus, it may be the case that not all the relaxed twigs that use the relaxation operation r are within the group I_r. For example, the relaxed twig T'_2, which uses the two operations gen(e_$1,$2) and gen(e_$4,$5), is not included in the internal node that represents gen(e_$4,$5), I_7, because T'_2 may belong to either group I_4 or group I_7 but is closer to the twigs in group I_4.
To support efficient searching or pruning of relaxed twigs in an XTAH that use an operation r, we add a virtual link from internal node I_r to internal node I_k, where I_k is not a descendant of I_r but all the twigs within I_k use operation r. By doing so, relaxed twigs that use operation r are either within group I_r or within the groups connected to I_r by virtual links. For example, internal node I_7 is connected to internal nodes I_16 and I_35 via virtual links.
Figure 10. A TAH for brain tumor size (small, medium, and large size ranges).
Thus, all the relaxed twigs using the operation gen(e_$4,$5) are within the groups I_7, I_16, and I_35.
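A minimal Python sketch of an XTAH node with virtual links, and of collecting all relaxed twigs that use a given operation, might look as follows; the group memberships are taken loosely from Fig. 11 and the structure names are assumptions.

from dataclasses import dataclass, field

@dataclass
class XTAHNode:
    label: frozenset                                      # common relaxation operations of the cluster
    children: list = field(default_factory=list)
    virtual_links: list = field(default_factory=list)     # other groups whose twigs also use the label
    twigs: list = field(default_factory=list)             # leaf-level relaxed twigs

def twigs_using(node):
    """All relaxed twigs in this cluster's subtree or in the subtrees of the
    clusters reached via its virtual links."""
    found, stack = [], [node] + node.virtual_links
    while stack:
        n = stack.pop()
        found.extend(n.twigs)
        stack.extend(n.children)
    return found

I16 = XTAHNode(label=frozenset({"gen(e$1,$2)", "gen(e$4,$5)"}), twigs=["T'2"])
I35 = XTAHNode(label=frozenset({"del($3)", "gen(e$4,$5)"}), twigs=["T'15", "T'16"])
I7  = XTAHNode(label=frozenset({"gen(e$4,$5)"}), twigs=["T'8"], virtual_links=[I16, I35])
print(twigs_using(I7))   # all twigs using gen(e$4,$5): from I7, I16, and I35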
Building an XTAH

With the introduction of XTAH, we now present the algorithm for building the XTAH for a given twig T.

Algorithm 1 Building the XTAH for a given twig T
Input: T: a twig
       K: domain knowledge about similar node labels
Output: XT_T: an XTAH for T
 1: RO_T <- GetRelaxOperations(T, K)  {GetRelaxOperations(T, K) returns the set of relaxation operations applicable to the twig T based on the domain knowledge K}
 2: let XT_T be a rooted tree with four nodes: a root node relax with three child nodes node_relabel, node_delete, and edge_generalization
 3: for each relaxation operation r in RO_T do
 4:   rtype <- the relaxation type of r
 5:   InsertXTNode(/relax/rtype, {r})  {InsertXTNode(p, n) inserts node n into XT_T under path p}
 6:   T' <- the relaxed twig using operation r
 7:   InsertXTNode(/relax/rtype/{r}, T')
 8: end for
 9: for k = 2 to |RO_T| do
10:   S_k <- all possible combinations of k relaxation operations in RO_T
11:   for each combination s in S_k do
12:     let s = {r_1, ..., r_k}
Figure 11. An example of the XML relaxation index structure (XTAH) for the twig T, with internal groups labeled by relaxation operations (e.g., {gen(e_$1,$2)}, {gen(e_$4,$5)}, {del($3)}) under the node_relabel, node_delete, and edge_generalization subtrees, relaxed twigs at the leaves, and virtual links connecting groups that share an operation.
QUERY RELAXATION PROCESS

Query Relaxation Algorithm

Algorithm 2 Query Relaxation Process
Input: XT_T: an XTAH
       Q = (T, R, C, S): a relaxation-enabled query
Output: A: a list of answers for the query Q
 1: A <- SearchAnswer(Q.T)  {searching for exactly matched answers for Q.T}
 2: if (the stop condition Q.S is met) then
 3:   return A
 4: end if
 5: if (the relaxation controls Q.C are non-empty) then
 6:   PruneXTAH(XT_T, Q.C)  {pruning nodes in XT_T that contain relaxed twigs using unacceptable relaxation operations based on Q.C}
 7: end if
 8: if the relaxation constructs Q.R are non-empty then
 9:   while (Q.S is not met) && (not all the constructs in Q.R have been processed) do
10:     T' <- the relaxed twig from XT_T that best satisfies the relaxation specifications Q.R and Q.C
11:     insert SearchAnswer(T') into A
12:   end while
13: end if
14: while (Q.T is relaxable) && (Q.S is not met) do
15:   T' <- the relaxed twig from XT_T that is closest to Q.T based on distance
16:   insert SearchAnswer(T') into A
17: end while
18: return A
Figure 12 presents the control ow of a relaxation pro-
cess based on XTAH and relaxation specications in a
query. The Relaxation Control module prunes irrelevant
XTAH groups corresponding to unacceptable relaxation
operations or types and schedules relaxation operations
based on Prefer and RelaxOrder as specied in the query.
Algorithm 2 presents the detailed steps of the relaxation
process:
1. Given a relaxation-enabled query Q = {T, R, C, S} and an XTAH for Q.T, the algorithm first searches for exactly matched answers. If enough answers are available, there is no need for relaxation and the answers are returned (Lines 1–4).
2. If relaxation is needed, then based on the relaxation controls Q.C (Lines 5–7), the algorithm prunes XTAH internal nodes that correspond to unacceptable operations such as nonrelaxable twig nodes (or edges), unacceptable node relabels, and rejected relaxation types. This step can be carried out efficiently by using internal node labels and virtual links. For example, the relaxation control in the sample query (Figure 7) specifies that only node_delete and edge_generalization may be used. Thus, any XTAH node that uses node_relabel, either within group I_2 or connected to I_2 by virtual links, is disqualified from searching. Similarly, the internal nodes I_15 and I_4, representing the operations del($4) and gen(e_{$1,$2}), respectively, are pruned from the XTAH by the Relaxation Control module.
3. After pruning disqualified internal groups, based on relaxation constructs and controls, such as RelaxOrder and Prefer, the Relaxation Control module schedules and searches for the relaxed query that best satisfies the user's specifications from the XTAH. This step terminates when either the stop condition is met or all the constructs have been processed. For example, the sample query contains two relaxation constructs, gen(e_{$4,$5}) and del($3). Thus, this step selects the best relaxed query from the internal groups I_7 and I_11, which represent the two constructs, respectively.
4. If further relaxation is needed, the algorithm then iteratively searches for the relaxed query that is closest to the original query by distance, which may use relaxation operations in addition to those specified in the query. This process terminates when either the stop condition holds or the query cannot be further relaxed.
5. Finally, the algorithm outputs the approximate answers.
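As a concrete illustration of this control flow, here is a small Python sketch of Algorithm 2. The callables passed in (search_answer, prune_xtah, best_for_construct, closest_relaxed_twig) are hypothetical stand-ins for the XTAH operations described above, and the stop condition is simplified to a minimum answer count; this is a sketch under those assumptions, not the CoXML implementation.

from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class RelaxQuery:
    twig: Any                                            # Q.T: the query twig
    constructs: List[Any] = field(default_factory=list)  # Q.R: relaxation constructs
    controls: dict = field(default_factory=dict)         # Q.C: relaxation controls
    at_least: int = 1                                     # Q.S: stop once this many answers exist

def relax_and_answer(query: RelaxQuery,
                     search_answer: Callable[[Any], list],
                     prune_xtah: Callable[[dict], None],
                     best_for_construct: Callable[[Any, dict], Any],
                     closest_relaxed_twig: Callable[[Any], Any]) -> list:
    answers = list(search_answer(query.twig))     # Line 1: exact matches for Q.T
    if len(answers) >= query.at_least:            # Lines 2-4: stop condition already met
        return answers
    if query.controls:                            # Lines 5-7: prune unacceptable relaxations
        prune_xtah(query.controls)
    for construct in query.constructs:            # Lines 8-13: user-specified constructs first
        if len(answers) >= query.at_least:
            break
        answers += search_answer(best_for_construct(construct, query.controls))
    while len(answers) < query.at_least:          # Lines 14-17: distance-guided relaxation
        twig = closest_relaxed_twig(query.twig)
        if twig is None:                          # the query is no longer relaxable
            break
        answers += search_answer(twig)
    return answers                                # Line 18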
Searching for Relaxed Queries in an XTAH
We shall now discuss how to efficiently search the XTAH for the best relaxed twig, that is, the one with the least distance to the query twig, as required in Algorithm 2.
A brute-force approach is to select the best twig by checking all the relaxed twigs at the leaf level. For a twig T with m relaxation operations, the number of relaxed twigs can be up to 2^m. Thus, the worst-case time complexity of this approach is O(2^m), which is expensive.
Figure 12. Query relaxation control flow. (A relaxation-enabled query enters query processing; if the answers are satisfactory, they are ranked and returned; otherwise the Relaxation Control module, guided by the XTAH, prunes and schedules relaxations to produce relaxed queries, which are processed again.)

To remedy this condition, we propose to assign representatives to internal nodes, where a representative summarizes the distance characteristics of all the relaxed twigs covered by a node. The representatives facilitate the search for the best relaxed twig by traversing the XTAH in a top-down fashion, where the path is determined by the distance properties of the representatives. By doing so, the worst-case time complexity of finding the best relaxed query is O(d·h), where d is the maximal degree of an XTAH node and h is the height of the XTAH. Given an XTAH for a twig T with m relaxation operations, the maximal degree of any XTAH node and the depth of the XTAH are both O(m). Thus, the time complexity of the approach is O(m²), which is far more efficient than the brute-force approach (O(2^m)).
In this article, we use the M-tree (22) for assigning representatives to XTAH internal nodes. The M-tree provides an efficient access method for similarity search in metric spaces, where object similarities are defined by a distance function. Given a tree organization of data objects in which all the data objects are at the leaf level, the M-tree assigns a data object covered by an internal node I to be the representative object of I. Each representative object stores the covering radius of the internal node (i.e., the maximal distance between the representative object and any data object covered by the internal node). These covering radii are then used to determine the path to the data object at the leaf level that is closest to a query object during similarity searches.
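The following Python sketch shows how representatives with covering radii can guide a top-down search for the closest relaxed twig, in the spirit of the M-tree assignment described above. The node structure, the toy numeric "twigs", and the use of dist(query, representative) − radius as a lower bound are illustrative assumptions, not the exact CoXML data structures.

from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class XTNode:
    representative: Any            # a relaxed twig chosen from the node's leaves
    radius: float = 0.0            # max distance from representative to any covered twig
    children: List["XTNode"] = field(default_factory=list)   # empty => leaf

def closest_twig(root: XTNode, query: Any, dist: Callable[[Any, Any], float]) -> Any:
    """Descend the XTAH top-down, always entering the child whose covering ball
    has the smallest lower-bound distance to the query twig (O(d*h) node visits)."""
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda c: max(dist(query, c.representative) - c.radius, 0.0))
    return node.representative     # a leaf: an actual relaxed twig

# Toy usage with twigs encoded as numbers and |a - b| as the distance function.
leaf = lambda t: XTNode(representative=t)
tree = XTNode(5.0, 4.0, [XTNode(2.0, 1.0, [leaf(1.0), leaf(3.0)]),
                         XTNode(8.0, 1.0, [leaf(7.0), leaf(9.0)])])
print(closest_twig(tree, 6.5, lambda a, b: abs(a - b)))   # 7.0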
XML RANKING

Query relaxation usually generates a set of approximate answers, which need to be ranked before being returned to users. A query contains both structure and content conditions. Thus, we shall rank an approximate answer based on its relevancy to both the structure and content conditions of the posed query. In this section, we first present how to compute XML content similarity, then describe how to measure XML structure relevancy, and finally discuss how to combine structure relevancy with content similarity to produce the overall XML ranking.
XML Content Similarity

Given an answer A and a query Q, the content similarity between the answer and the query, denoted cont_sim(A, Q), is the sum of the content similarities between the data nodes and their corresponding matched query nodes. That is,

cont_sim(A, Q) = Σ_{v ∈ A, $u ∈ Q.T, v matches $u} cont_sim(v, $u)    (1)

For example, given the sample twig in Fig. 2, the set of nodes {1, 2, 6, 7, 8} in the sample data tree is an answer. The content similarity between the answer and the twig equals cont_sim(2, $2) + cont_sim(6, $3) + cont_sim(8, $5).
We now present how to evaluate the content similarity between a data node and a query node. Ranking models in traditional IR evaluate the content similarity between a document and a query and thus need to be extended to evaluate the content similarity between an XML data node and a query node. Therefore, we proposed an extended vector space model (14) for measuring XML content similarity, which is based on two concepts: weighted term frequency and inverse element frequency.
Weighted Term Frequency. Due to the hierarchical structure of the XML data model, the text of a node is also considered to be part of its ancestor nodes' text, which introduces the challenge of how to calculate the content relevancy of an XML data node v to a query term t, where t could occur in the text of any node nested within the node v. For example, all three section nodes (i.e., nodes 8, 10, and 12) in the XML data tree (Fig. 1) contain the phrase "frequent itemsets" in their text parts. The phrase "frequent itemsets" occurs in the title part of node 8, the paragraph part of node 10, and the reference part of node 12. The same term occurring in different text parts of a node may carry different weights. For example, "frequent itemsets" in the title part of a section node has a higher weight than "frequent itemsets" in the paragraph part of a section node, which, in turn, is more important than "frequent itemsets" in the reference part of a section node. As a result, it may be inaccurate to measure the weight of a term t in the text of a data node v by simply counting the occurrence frequency of the term t in the text of the node v without distinguishing the term's occurrence paths within the node v.
To remedy this condition, we introduce the concept of weighted term frequency, which assigns the weight of a term t in a data node v based on the term's occurrence frequency and the weight of the occurrence path. Given a data node v and a term t, let p = v_1.v_2. ... .v_k be an occurrence path for the term t in the node v, where v_k is a descendant node of v, v_k directly contains the term t, and v.v_1. ... .v_k represents the path from the node v to the node v_k. Let w(p) and w(v_i) denote the weights of the path p and the node v_i, respectively. Intuitively, the weight of the path p = v_1.v_2. ... .v_k is a function of the weights of the nodes on the path (i.e., w(p) = f(w(v_1), ..., w(v_k))), with the following two properties:

1. f(w(v_1), w(v_2), ..., w(v_k)) is a monotonically increasing function with respect to each w(v_i) (1 ≤ i ≤ k); and
2. f(w(v_1), w(v_2), ..., w(v_k)) = 0 if any w(v_i) = 0 (1 ≤ i ≤ k).
The first property states that the path weight function is monotonically increasing; that is, the weight of a path increases if the weight of any node on the path increases. The second property states that if the weight of any node on the path is zero, then the weight of the path is zero. For any node v_i (1 ≤ i ≤ k) on the path p, if the weight of the node v_i is zero, then users are not interested in the terms occurring under the node v_i. Therefore, any term in the text of either the node v_i or a descendant node of v_i is irrelevant.
A simple implementation of the path weight function f(w(v_1), w(v_2), ..., w(v_k)) that satisfies the properties stated above is to let the weight of a path equal the product of the weights of all nodes on the path:

w(p) = ∏_{i=1}^{k} w(v_i)    (2)
With the introduction of the weight of a path, we now define the weighted term frequency of a term t in a data node v, denoted tf_w(v, t), as follows:

tf_w(v, t) = Σ_{j=1}^{m} w(p_j) · tf(v, p_j, t)    (3)

where m is the number of paths in the data node v containing the term t and tf(v, p_j, t) is the frequency with which the term t occurs in the node v via the path p_j.
For example, Fig. 13 illustrates an XML data tree with the weight of each node shown beside the node. The weight of the keyword node is 5 (i.e., w(keyword) = 5). From Equation (2), we have w(front_matter.keyword) = 5·1 = 5, w(body.section.paragraph) = 2·1·1 = 2, and w(back_matter.reference) = 0·1 = 0, respectively. The frequencies of the term "XML" in the paths front_matter.keyword, body.section.paragraph, and back_matter.reference are 1, 2, and 1, respectively. Therefore, from Equation (3), the weighted term frequency of the term "XML" in the data node article is 5·1 + 2·2 + 0·1 = 9.
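To make Equations (2) and (3) concrete, the following Python sketch computes a weighted term frequency from (path node weights, raw frequency) pairs, where a path's weight is the product of the weights of its nodes. The totals reproduce the Fig. 13 example for the term "XML" (path-weight products 5, 2, and 0); the exact assignment of individual node weights along each path is an assumption, and the function names are illustrative.

from math import prod

def path_weight(node_weights):
    """Equation (2): the weight of a path is the product of its node weights."""
    return prod(node_weights)

def weighted_tf(paths):
    """Equation (3): sum of path_weight * raw term frequency over all occurrence paths."""
    return sum(path_weight(ws) * tf for ws, tf in paths)

# Occurrence paths of "XML" under the article node in Fig. 13:
#   front_matter.keyword   (node weights with product 5), frequency 1
#   body.section.paragraph (node weights with product 2), frequency 2
#   back_matter.reference  (node weights with product 0), frequency 1
xml_paths = [((1, 5), 1), ((1, 1, 2), 2), ((1, 0), 1)]
print(weighted_tf(xml_paths))   # 5*1 + 2*2 + 0*1 = 9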
Inverse Element Frequency. Terms with different popularity in XML data have different degrees of discriminative power. It is well known that a term frequency (tf) needs to be adjusted by the inverse document frequency (idf) (23). A very popular term (with a small idf) is less discriminative than a rare term (with a large idf). Therefore, the second component in our content ranking model is the concept of inverse element frequency (ief), which distinguishes terms with different discriminative powers in XML data. Given a query Q and a term t, let $u be the node in the twig Q.T whose content condition contains the term t (i.e., t ∈ $u.cont). Let DN be the set of data nodes such that each node in DN matches the structure condition related to the query node $u. Intuitively, the more frequently the term t occurs in the text of the data nodes in DN, the less discriminative power the term t has. Thus, the inverse element frequency of the query term t can be measured as follows:

ief($u, t) = log(N_1 / N_2 + 1)    (4)

where N_1 denotes the number of nodes in the set DN and N_2 represents the number of nodes in the set DN that contain the term t in their text parts.
For example, given the sample XML data tree (Fig. 1) and the query twig (Fig. 2), the inverse element frequency of the term "frequent itemset" can be calculated as follows: First, the content condition of the query node $5 contains the term "frequent itemset"; second, there are three data nodes (i.e., nodes 8, 10, and 12) that match the query node $5; and third, all three nodes contain the term in their text. Therefore, the inverse element frequency of the term "frequent itemset" is log(3/3 + 1) = log 2. Similarly, as only two nodes (i.e., nodes 8 and 12) contain the term "algorithms", the inverse element frequency of the term "algorithms" is log(3/2 + 1) = log(5/2).
Extended Vector Space Model. With the introduction of weighted term frequency and inverse element frequency, we now first present how we compute the content similarity between a data node and a query node, and then present how we calculate the content similarity between an answer and a query.

Given a query node $u and a data node v, where the node v matches the structure condition related to the query node $u, the content similarity between the nodes v and $u can be measured as follows:

cont_sim(v, $u) = Σ_{t ∈ $u.cont} w(m(t)) · tf_w(v, t) · ief($u, t)    (5)

where t is a term in the content condition of the node $u, m(t) stands for the modifier prefixed to the term t, and w(m(t)) is the weight of the term modifier as specified by users.
For example, given the section node $5 in the sample twig (Fig. 2), the data node 8 in Fig. 1 is a match for the twig node $5. Suppose that the weight of the term modifier (on the term "algorithms") is 2 and the weight of the title node is 5. The content similarity between the data node 8 and the twig node $5 then equals tf_w(8, "frequent itemset") · ief($5, "frequent itemset") + w(m("algorithms")) · tf_w(8, "algorithms") · ief($5, "algorithms"), which is 5·log 2 + 2·5·log(5/2) = 18.22. Similarly, the data node 2 is a match for the twig node title (i.e., $2), and the content similarity between them is tf_w(2, "data mining") · ief($2, "data mining") = 1.
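The following sketch combines Equations (4) and (5) and reproduces the similarity value 18.22 computed above for data node 8 and query node $5. Base-2 logarithms are assumed (they make log 2 = 1 and match the 18.22 figure), and the dictionary-based inputs and function names are illustrative.

from math import log2

def ief(n_matching, n_containing):
    """Equation (4): inverse element frequency, log(N1/N2 + 1)."""
    return log2(n_matching / n_containing + 1)

def cont_sim(query_terms, tf_w):
    """Equation (5): sum over query terms of modifier_weight * tf_w * ief."""
    return sum(w_mod * tf_w[t] * ief(n1, n2)
               for t, (w_mod, n1, n2) in query_terms.items())

# Query node $5: term -> (modifier weight, N1, N2); data node 8: term -> tf_w.
terms = {"frequent itemset": (1, 3, 3),   # no modifier; all 3 matching nodes contain it
         "algorithms":       (2, 3, 2)}   # modifier weight 2; 2 of 3 nodes contain it
tfw_node8 = {"frequent itemset": 5, "algorithms": 5}   # each occurs once in the title (weight 5)
print(round(cont_sim(terms, tfw_node8), 2))   # 5*log2(2) + 2*5*log2(2.5) = 18.22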
Figure 13. An example of weighted term frequency (an XML data tree in which each node is annotated with its node id and its node weight).

Discussions. The extended vector space model has been shown to be very effective in ranking the content similarities of SCAS retrieval results (14). (In a SCAS retrieval task, structure conditions must be matched exactly, whereas content conditions are to be matched approximately.) SCAS retrieval results are
usually of relatively similar sizes. For example, for the twig in Fig. 2, suppose that the node section is the target node (i.e., the node whose matches are to be returned as answers). All the SCAS retrieval results for the twig will be sections inside article bodies. Results that approximately match the twig, however, could be nodes other than section nodes, such as paragraph, body, or article nodes, which are of varying sizes. Thus, to apply the extended vector space model for evaluating the content similarities of approximate answers under this condition, we introduce the factor of weighted size into the model to normalize the bias caused by the varying sizes of the approximate answers (24):

cont_sim(A, Q) = Σ_{v ∈ A, $u ∈ Q.T, v matches $u} [ cont_sim(v, $u) / log₂(wsize(v)) ]    (6)
where wsize(v) denotes the weighted size of a data node v. Given an XML data node v, wsize(v) is the sum of the number of terms directly contained in the node v's text, size(v.text), and the weighted sizes of all its child nodes adjusted by their corresponding weights, as shown in the following equation:

wsize(v) = size(v.text) + Σ_{v_i s.t. v/v_i} wsize(v_i) · w(v_i)    (7)

For example, the weighted size of the paragraph node equals the number of terms in its text part, because the paragraph node does not have any child node.
Our normalization approach is similar to the scoring formula proposed in Ref. 25, which uses the log of the document size to adjust the product of tf and idf.
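A minimal sketch of Equations (6) and (7): the weighted size of a node is computed recursively and then used to normalize a per-node content similarity. The nested-tuple tree encoding and the function names are assumptions made for illustration.

from math import log2

def wsize(node):
    """Equation (7): own text length plus children's weighted sizes scaled by their weights.
    A node is (text_term_count, [(child_weight, child_node), ...])."""
    text_size, children = node
    return text_size + sum(w * wsize(child) for w, child in children)

def normalized_cont_sim(matches):
    """Equation (6): sum of cont_sim(v, $u) / log2(wsize(v)) over matched data nodes."""
    return sum(sim / log2(wsize(node)) for sim, node in matches)

paragraph = (40, [])                 # 40 terms, no children => wsize = 40
section = (10, [(1, paragraph)])     # wsize = 10 + 1*40 = 50
print(wsize(section))                                      # 50
print(round(normalized_cont_sim([(18.22, section)]), 2))   # 18.22 / log2(50) ~ 3.23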
Semantic-based Structure Distance

The structure similarity between two twigs can be measured using tree editing distance (e.g., (26)), which is frequently used for evaluating tree-to-tree similarities. Thus, we measure the structure distance between an answer A and a query Q, struct_dist(A, Q), as the editing distance between the twig Q.T and the least relaxed twig T', d(Q.T, T'), which is the total cost of the operations that relax Q.T to T':

struct_dist(A, Q) = d(Q.T, T') = Σ_{i=1}^{k} cost(r_i)    (8)

where {r_1, ..., r_k} is the set of operations that relaxes Q.T to T' and cost(r_i) (0 ≤ cost(r_i) ≤ 1) is the cost of the relaxation operation r_i (1 ≤ i ≤ k).
Existing edit distance algorithms do not consider operation costs. Assigning an equal cost to each operation is simple, but it does not distinguish the semantics of different operations. To remedy this condition, we propose a semantic-based relaxation operation cost model.

We shall first present how we model the semantics of XML nodes. Given an XML dataset D, we represent each data node v_i as a vector {w_i1, w_i2, ..., w_iN}, where N is the total number of distinct terms in D and w_ij is the weight of the jth term in the text of v_i. The weight of a term may be computed using tf*idf (23) by considering each node as a document. With this representation, the similarity between two nodes can be computed as the cosine of their corresponding vectors. The greater the cosine of the two vectors, the semantically closer the two nodes.

We now present how to model the cost of an operation based on the semantics of the nodes affected by the operation with regard to a twig T, as follows:
• Node Relabel rel(u, l)
A node relabel operation, rel(u, l), changes the label of a node u from u.label to a new label l. The more semantically similar the two labels are, the less the relabel operation costs. The similarity between the two labels, u.label and l, denoted sim(u.label, l), can be measured as the cosine of their corresponding vector representations in the XML data. Thus, the cost of a relabel operation is:

cost(rel(u, l)) = 1 − sim(u.label, l)    (9)

For example, using the INEX 05 data, the cosine of the vector representing section nodes and the vector representing paragraph nodes is 0.99, whereas the cosine of the vector for section nodes and the vector for figure nodes is 0.38. Thus, it is more expensive to relabel the node section to figure than to paragraph.
• Node Deletion del(u)
Deleting a node u from the twig approximates u to its parent node in the twig, say v. The more semantically similar the node u is to its parent node v, the less the deletion costs. Let V_{v/u} and V_v be the two vectors representing the data nodes satisfying v/u and v, respectively. The similarity between v/u and v, denoted sim(v/u, v), can be measured as the cosine of the two vectors V_{v/u} and V_v. Thus, a node deletion cost is:

cost(del(u)) = 1 − sim(v/u, v)    (10)

For example, using the INEX 05 data, the cosine of the vector for section nodes inside body nodes and the vector for body nodes is 0.99, whereas the cosine of the vector for keyword nodes inside article nodes and the vector for article nodes is 0.2714. Thus, deleting the keyword node in Fig. 3(a) costs more than deleting the section node.
• Edge Generalization gen(e_{v,u})
Generalizing the edge between the nodes $v and $u approximates a child node v/u to a descendant node v//u. The closer v/u is to v//u in semantics, the less the edge generalization costs. Let V_{v/u} and V_{v//u} be the two vectors representing the data nodes satisfying v/u and v//u, respectively. The similarity between v/u and v//u, denoted sim(v/u, v//u), can be measured as the cosine of the two vectors V_{v/u} and V_{v//u}. Thus, the cost of an edge generalization can be measured as:

cost(gen(e_{v,u})) = 1 − sim(v/u, v//u)    (11)

For example, relaxing article/title in Fig. 3(a) to article//title makes the title of an article's author (i.e., /article/author/title) an approximate match. As the similarity between an article's title and an author's title is low, the cost of generalizing article/title to article//title may be high.

Note that our cost model differs from that of Amer-Yahia et al. (16), who apply idf to twig structures without considering node semantics, whereas we apply tf*idf to nodes with regard to their corresponding data content.
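The cost model above reduces to cosine similarities between term-weight vectors. Here is a small self-contained Python sketch using dictionaries as sparse tf*idf vectors; the vectors themselves are made up for illustration, and the helper names are not from the CoXML code.

from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def relabel_cost(label_vec_a, label_vec_b):
    return 1.0 - cosine(label_vec_a, label_vec_b)             # Equation (9)

def delete_cost(vec_child_under_parent, vec_parent):
    return 1.0 - cosine(vec_child_under_parent, vec_parent)   # Equation (10)

def edge_gen_cost(vec_child, vec_descendant):
    return 1.0 - cosine(vec_child, vec_descendant)            # Equation (11)

# Made-up vectors: section text closely resembles paragraph text, less so figure text.
section = {"xml": 3.0, "query": 2.0, "index": 1.0}
paragraph = {"xml": 2.5, "query": 2.0, "index": 0.8}
figure = {"precision": 2.0, "recall": 1.5, "xml": 0.3}
print(round(relabel_cost(section, paragraph), 2))  # small cost (similar labels)
print(round(relabel_cost(section, figure), 2))     # large cost (dissimilar labels)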
The Overall Relevancy Ranking Model

We now discuss how to combine structure distance and content similarity for evaluating the overall relevancy. Given a query Q, the relevancy of an answer A to the query Q, denoted sim(A, Q), is a function of two factors: the structure distance between A and Q (i.e., struct_dist(A, Q)) and the content similarity between A and Q, denoted cont_sim(A, Q). We use our extended vector space model for measuring content similarity (14). Intuitively, the larger the structure distance, the lower the relevancy; the larger the content similarity, the greater the relevancy. When the structure distance is zero (i.e., an exact structure match), the relevancy of the answer to the query should be determined by the content similarity only. Thus, we combine the two factors in a way similar to the one used in XRank (27) for combining element rank with distance:

sim(A, Q) = α^{struct_dist(A, Q)} · cont_sim(A, Q)    (12)

where α is a constant between 0 and 1.
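Equation (12) combines the two factors multiplicatively with a decay constant. A one-line Python sketch follows; alpha = 0.8 is an arbitrary illustrative choice, not a value prescribed by the article.

def overall_sim(struct_dist, cont_sim, alpha=0.8):
    """Equation (12): relevancy decays exponentially with structure distance."""
    return (alpha ** struct_dist) * cont_sim

print(overall_sim(0.0, 18.22))              # exact structure match: content similarity alone
print(round(overall_sim(1.5, 18.22), 2))    # relaxed structure: 0.8**1.5 * 18.22 ~ 13.04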
A SCALABLE AND EXTENSIBLE ARCHITECTURE

Figure 14 illustrates a mediator architecture framework for a cooperative XML system. The architecture consists of an application layer, a mediation layer, and an information source layer. The information source layer includes a set of heterogeneous data sources (e.g., relational databases, XML databases, and unstructured data), knowledge bases, and knowledge base dictionaries or directories. The knowledge base dictionary (or directory) stores the characteristics of all the knowledge bases, including the XTAHs and the domain knowledge in the system. Non-XML data can be converted into the XML format by wrappers. The mediation layer consists of data source mediators, query parser mediators, relaxation mediators, XTAH mediators, and directory mediators. These mediators are selectively interconnected to meet specific application requirements. When the demand for certain mediators increases, additional copies of the mediators can be added to reduce the loading. The mediator architecture allows incremental growth with application, and thus the system is scalable. Further,
Figure 14. A scalable and extensible cooperative XML query answering system. (The architecture has three layers: an application layer of users and applications; a mediation layer containing query parser mediators (QPM), relaxation mediators (RM), XTAH mediators (XTM), data source mediators (DSM), and directories of mediators (DM); and an information source layer with relational databases, XML databases, unstructured data behind wrappers, knowledge bases KB_1 ... KB_n, and a knowledge base dictionary/directory.)
different types of mediators can be interconnected and can communicate with each other via a common communication protocol (e.g., KQML (28) or FIPA; see http://www.fipa.org) to perform a joint task. Thus, the architecture is extensible.

For query relaxation, based on the set of frequently used query tree structures, the XTAHs for each query tree structure can be generated accordingly. During the query relaxation process, the XTAH manager selects the appropriate XTAH for relaxation. If no XTAH is available, the system generates the corresponding XTAH on the fly.

We shall now describe the functionalities of the various mediators as follows:
• Data Source Mediator (DSM)
The data source mediator provides a virtual database interface to query different data sources that usually have different schemas. The data source mediator maintains the characteristics of the underlying data sources and provides a unified description of these data sources. As a result, XML data can be accessed from the data sources without knowing the differences among them.

• Query Parser Mediator (QPM)
The query parser mediator parses the queries from the application layer and transforms the queries into query representation objects.
• Relaxation Mediator (RM)
Figure 15 illustrates the functional components of the relaxation mediator, which consists of a preprocessor, a relaxation manager, and a post-processor. The flow of the relaxation process is depicted in Fig. 16. When a relaxation-enabled query is presented to the relaxation mediator, the system first goes through a preprocessing phase. During preprocessing, the system transforms the relaxation constructs into standard XML query constructs. All relaxation control operations specified in the query are processed and forwarded to the relaxation manager, ready for use if the query requires relaxation. The modified query is then presented to the underlying databases for execution. If no answers are returned, the relaxation manager relaxes the query conditions, guided by the relaxation index (XTAH). We repeat the relaxation process until either the stop condition is met or the query is no longer relaxable. Finally, the returned answers are forwarded to the post-processing module for ranking.
• XTAH Mediator (XTM)
The XTAH mediator provides three conceptually separate, yet interlinked, functions to peer mediators: the XTAH Directory, XTAH Management, and XTAH Editing facilities, as illustrated in Fig. 17. Usually, a system contains a large number of XTAHs. To allow other mediators to determine which XTAHs exist within the system and what their characteristics are, the XTAH mediator contains a directory. This directory is searchable by XML query tree structures.
The XTAH management facility provides client mediators with traversal functions and data extraction functions (for reading information out of XTAH nodes). These capabilities present a common interface so that peer mediators can traverse and extract data from an XTAH. Further, the XTAH mediator has an editor that allows users to edit XTAHs to suit their specific needs. The editor handles recalculation of all information contained within XTAH nodes during the editing process and supports exportation and importation of entire XTAHs if a peer mediator wishes to modify them.

Figure 15. The relaxation mediator. (Components: a preprocessor, a relaxation manager, and a post-processor, interacting with the data source mediator and the XTAH mediator and returning ranked answers.)

Figure 16. The flow chart of XML query relaxation processing. (A parsed query is preprocessed and executed; if the answers are unsatisfactory, the relaxation manager relaxes the query using the XTAH and the query is processed again; satisfactory approximate answers are post-processed and presented.)

Figure 17. The XTAH mediator. (It comprises an XTAH directory, XTAH management, and an XTAH editor; its capabilities include generating, browsing, editing and reformatting XTAHs, and traversing XTAH nodes; it requires access to the data sources.)
• Directory Mediator (DM)
The directory mediator provides the locations, characteristics, and functionalities of all the mediators in the system and is used by peer mediators to locate a mediator that performs a specific function.
A COOPERATIVE XML (CoXML) QUERY ANSWERING TESTBED

A CoXML query answering testbed has been developed at UCLA to evaluate the effectiveness of XML query relaxation through XTAH. Figure 18 illustrates the architecture of the CoXML testbed, which consists of a query parser, a preprocessor, a relaxation manager, a database manager, an XTAH manager, an XTAH builder, and a post-processor. We describe the functionality provided by each module as follows:
• XTAH Builder. Given a set of XML documents and the domain knowledge, the XTAH builder constructs a set of XTAHs that summarize the structure characteristics of the data.
• Query Parser. The query parser checks the syntax of the query. If the syntax is correct, it extracts information from the parsed query and creates a query representation object.
• Preprocessor. The preprocessor transforms relaxation constructs (if any) in the query into standard XML query constructs.
• Relaxation Manager. The relaxation manager performs the following services: (1) building a relaxation structure based on the specified relaxation constructs and controls; (2) obtaining the relaxed query conditions from the XTAH manager; (3) modifying the query accordingly; and (4) retrieving the exactly matched answers.
• Database Manager. The database manager interacts with an XML database engine and returns exactly matched answers for a standard XML query.
• XTAH Manager. Based on the structure of the query tree, the XTAH manager selects an appropriate XTAH to guide the query relaxation process.
• Post-processor. The post-processor takes unsorted answers as input, ranks them based on both structure and content similarities, and outputs a ranked list of results.

Figure 18. The architecture of the CoXML testbed. (A relaxation-enabled query enters the query parser; the preprocessor, relaxation manager, XTAH manager, XTAH builder with domain knowledge, database manager, and post-processor operate over XML databases DB_1, ..., DB_n and return ranked approximate answers.)
EVALUATION OF XML QUERY RELAXATION

INEX is a DELOS working group (see http://www.iei.pi.cnr.it/DELOS) that aims to provide a means of evaluating XML retrieval systems in the form of a large heterogeneous XML test collection and appropriate scoring methods. The INEX test collection is a large set of scientific articles, represented in XML format, from publications of the IEEE Computer Society covering a range of computer science topics. The collection, approximately 500 megabytes, contains over 12,000 articles from 18 magazines/transactions from the period 1995 to 2004, where an article (on average) consists of 1500 XML nodes. Different magazines/transactions have different data organizations, although they use the same ontology for representing similar content.

There are three types of queries in the INEX query sets: content-only (CO), strict content and structure (SCAS), and vague content and structure (VCAS). CO queries are traditional information retrieval (IR) queries that are written in natural language and constrain the content of the desired results. Content and structure queries not only restrict the content of interest but also contain either explicit or implicit references to the XML structure. The difference between a SCAS and a VCAS query is that the structure conditions in a SCAS query must be interpreted exactly, whereas the structure conditions in a VCAS query may be interpreted loosely.

To evaluate the relaxation quality of the CoXML system, we perform the VCAS retrieval runs on the CoXML testbed and compare the results against INEX's relevance assessments for the VCAS task, which can be viewed as the gold standard. The evaluation studies reveal the expressiveness of the relaxation language and the effectiveness of using XTAH in providing user-desired relaxation control. The evaluation results demonstrate that our content similarity model has significantly high precision at low recall regions. The model achieves the highest average precision as compared with all 38 official submissions in
INEX 03 (14). Furthermore, the evaluation results also demonstrate that using the semantic-based distance function yields results with greater relevancy than using the uniform-cost distance function. Compared with other systems in INEX 05, our user-centric relaxation approach retrieves approximate answers with greater relevancy (29).
SUMMARY

Approximate matching of query conditions plays an important role in XML query answering. There are two approaches to XML query relaxation: through schema conversion or directly through the XML model. Converting the XML model to the relational model by schema conversion can leverage mature relational model techniques, but information may be lost during such conversions; furthermore, this approach does not support XML structure relaxation. Relaxation via the XML model remedies these shortcomings. In this article, a new paradigm for XML approximate query answering is proposed that places users and their demands at the center of the design approach. Based on this paradigm, we develop an XML system that cooperates with users to provide user-specific approximate query answering. More specifically, a relaxation language is introduced that allows users to specify approximate conditions and relaxation control requirements in a posed query. We also develop a relaxation index structure, XTAH, that clusters relaxed twigs into multilevel groups based on relaxation types and their inter-distances. XTAH enables the system to provide user-desired relaxation control as specified in the query. Furthermore, a ranking model is introduced that combines both content and structure similarities in evaluating the overall relevancy of approximate answers returned from query relaxation. Finally, a mediator-based CoXML architecture is presented. The evaluation results using the INEX test collection reveal the effectiveness of the proposed user-centric XML relaxation methodology for providing user-specific relaxation.
ACKNOWLEDGMENTS

The research and development of CoXML has been a team effort. We would like to acknowledge our CoXML members, Tony Lee, Eric Sung, Anna Putnam, Christian Cardenas, Joseph Chen, and Ruzan Shahinian, for their contributions in implementation, testing, and performance evaluation.
BIBLIOGRAPHY
1. S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, and J. Siméon (eds.), XQuery 1.0: An XML Query Language. Available: http://www.w3.org/TR/xquery/.
2. W. W. Chu, Q. Chen, and A. Huang, Query answering via cooperative data inference, J. Intelligent Information Systems (JIIS), 3(1): 57–87, 1994.
3. W. Chu, H. Yang, K. Chiang, M. Minock, G. Chow, and C. Larson, CoBase: A scalable and extensible cooperative information system, J. Intell. Inform. Syst., 6(11), 1996.
4. S. Chaudhuri and L. Gravano, Evaluating top-k selection queries, in Proceedings of the 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, UK.
5. T. Gaasterland, Cooperative answering through controlled query relaxation, IEEE Expert, 12(5): 48–59, 1997.
6. W. W. Chu, Cooperative information systems, in B. Wah (ed.), The Encyclopedia of Computer Science and Engineering, New York: Wiley, 2007.
7. Y. Kanza, W. Nutt, and Y. Sagiv, Queries with incomplete answers over semistructured data, in Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 31–June 2, 1999, Philadelphia, Pennsylvania.
8. Y. Kanza and Y. Sagiv, in Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21–23, 2001, Santa Barbara, California.
9. T. Schlieder, in Proceedings of the 10th International Conference on Extending Database Technology, March 26–31, 2006, Munich, Germany.
10. S. Amer-Yahia, S. Cho, and D. Srivastava, XML tree pattern relaxation, in Proceedings of the 10th International Conference on Extending Database Technology, March 26–31, 2006, Munich, Germany.
11. D. Lee, M. Mani, and W. W. Chu, Schema conversion methods between XML and relational models, in Knowledge Transformation for the Semantic Web, Frontiers in Artificial Intelligence and Applications Vol. 95, IOS Press, 2003, pp. 1–17.
12. J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton, Relational databases for querying XML documents: Limitations and opportunities, in Proceedings of the 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, UK.
13. D. Lee and W. W. Chu, CPI: Constraints-preserving inlining algorithm for mapping XML DTD to relational schema, J. Data and Knowledge Engineering, Special Issue on Conceptual Modeling, 39(1): 3–25, 2001.
14. S. Liu, Q. Zou, and W. Chu, Configurable indexing and ranking for XML information retrieval, in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 25–29, 2004, Sheffield, UK.
15. N. Fuhr and K. Großjohann, XIRQL: A query language for information retrieval in XML documents, in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9–13, 2001, New Orleans, Louisiana.
16. S. Amer-Yahia, N. Koudas, A. Marian, D. Srivastava, and D. Toman, Structure and content scoring for XML, in Proceedings of the 31st International Conference on Very Large Data Bases, August 30–September 2, 2005, Trondheim, Norway.
17. S. Amer-Yahia, C. Botev, and J. Shanmugasundaram, TeXQuery: A full-text search extension to XQuery, in Proceedings of the 13th International World Wide Web Conference, May 17–22, 2004, New York.
18. A. Trotman and B. Sigurbjornsson, Narrowed Extended XPath I (NEXI), in Proceedings of the 3rd Initiative for the Evaluation of XML Retrieval (INEX 2004) Workshop, December 6–8, 2004, Schloss Dagstuhl, Germany.
19. A. Theobald and G. Weikum, Adding relevance to XML, in Proceedings of the 3rd International Workshop on the Web and Databases (WebDB 2000), May 18–19, 2000, Dallas, Texas.
20. A. Marian, S. Amer-Yahia, N. Koudas, and D. Srivastava, Adaptive processing of top-k queries in XML, in Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), April 5–8, 2005, Tokyo, Japan.
21. I. Manolescu, D. Florescu, and D. Kossmann, Answering XML queries on heterogeneous data sources, in Proceedings of the 27th International Conference on Very Large Data Bases, September 11–14, 2001, Rome, Italy.
22. P. Ciaccia, M. Patella, and P. Zezula, M-tree: An efficient access method for similarity search in metric spaces, in Proceedings of the 23rd International Conference on Very Large Data Bases, August 25–29, 1997, Athens, Greece.
23. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983.
24. S. Liu, W. Chu, and R. Shahinian, Vague content and structure retrieval (VCAS) for document-centric XML retrieval, in Proceedings of the 8th International Workshop on the Web and Databases (WebDB 2005), June 16–17, 2005, Baltimore, Maryland.
25. W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Englewood Cliffs, NJ: Prentice Hall, 1992.
26. K. Zhang and D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput., 18(6): 1245–1262, 1989.
27. L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram, XRANK: Ranked keyword search over XML documents, in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, June 9–12, 2003, San Diego, California.
28. T. Finin, D. McKay, R. Fritzson, and R. McEntire, KQML: An information and knowledge exchange protocol, in K. Fuchi and T. Yokoi (eds.), Knowledge Building and Knowledge Sharing, Ohmsha and IOS Press, 1994.
29. S. Liu and W. W. Chu, CoXML: A cooperative XML query answering system, Technical Report #060014, Computer Science Department, UCLA, 2006.
30. T. Schlieder and H. Meuss, Querying and ranking XML documents, J. Amer. Soc. Inf. Sci. Technol., 53(6): 489.
31. J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton, Relational databases for querying XML documents: Limitations and opportunities, in VLDB, 1999.
WESLEY W. CHU
SHAORONG LIU
University of California,
Los Angeles
Los Angeles, California
DATA ANALYSIS
What is data analysis? Nolan (1) gives a definition: a way of making sense of the patterns that are in, or can be imposed on, sets of figures. In concrete terms, data analysis consists of the observation and investigation of given data and the derivation of characteristics from the data. Such characteristics, or features as they are sometimes called, contribute to insight into the nature of the data. Mathematically, the features can be regarded as variables, and the data are modeled as a realization of these variables with some appropriate sets of values. In traditional data analysis (2), the values of the variables are usually numerical and may be transformed into symbolic representation. There are two general types of variables: discrete and continuous. Discrete variables vary in units, such as the number of words in a document or the population in a region. In contrast, continuous variables can vary by less than a unit to a certain degree. The stock price and the height of people are examples of this type. The suitable method for collecting values of discrete variables is counting; for continuous ones, it is measurement.

The task of data analysis is required in various application fields, such as agriculture, biology, economics, government, industry, medicine, the military, psychology, and science. The source data provided for different purposes may be in various forms, such as text, image, or waveform. There are several basic types of purposes for data analysis:

1. Obtain the implicit structure of the data.
2. Derive a classification or clustering of the data.
3. Search for particular objects in the data.

For example, a stockbroker would like to get the future trend of the stock price, a biologist needs to divide animals into taxonomies, and a physician tries to find the related symptoms of a given disease. The techniques for accomplishing these purposes are generally drawn from statistics, which provides well-defined mathematical models and probability laws. In addition, some theories, such as fuzzy-set theory, are also useful for data analysis in particular applications. This article is an attempt to give a brief introduction to these techniques and concepts of data analysis. In the following section, a variety of fundamental data analysis methods are introduced and illustrated by examples. In the second section, the methods for data analysis on two types of Internet data are presented. Advanced methods for Internet data analysis are discussed in the third section. At last, we give a summary of this article and highlight the research trends.
FUNDAMENTAL DATA ANALYSIS METHODS

In data analysis, the goals are to find significant patterns in the data and apply this knowledge to some applications. Data analysis is generally performed in the following stages:

1. Feature selection.
2. Data classification or clustering.
3. Conclusion evaluation.

The first stage consists of the selection of features in the data according to some criteria. For instance, features of people may include their height, skin color, and fingerprints. Considering the effectiveness of human recognition, the fingerprint, which is the least ambiguous, may get the highest priority in feature selection. In the second stage, the data are classified according to the selected features. If the data consist of at least two features, e.g., the height and the weight of people, which can be plotted in a suitable coordinate system, we can inspect so-called scatter plots and detect clusters or contours for data grouping. Furthermore, we can investigate ways to express data similarity. In the final stage, the conclusions drawn from the data are compared with the actual demands. A set of mathematical models has been developed for this evaluation. In the following, we first divide the methods of data analysis into two categories according to different initial conditions and resultant uses. Then, we introduce two famous models for data analysis. Each method will be discussed first, followed by examples. Because feature selection depends on the actual representations of the data, we postpone the discussion of this stage until the next section. In this section, we focus on the classification/clustering procedure based on the given features.
A Categorization of Data Analysis Methods

There are a variety of ways to categorize the methods of data analysis. According to the initial conditions and the resultant uses, there can be two categories: supervised data analysis and unsupervised data analysis. The term supervised means that human knowledge has to be provided for the process. In supervised data analysis, we specify a set of classes called a classification template and select some samples from the data for each class. These samples are then labeled by the names of the associated classes. Based on this initial condition, we can automatically classify the other data, termed to-be-classified data. In unsupervised data analysis, there is no classification template, and the resultant classes depend on the samples. The following are descriptions of supervised and unsupervised data analysis with an emphasis on their differences.

Supervised Data Analysis. The classification template and the well-chosen samples are given as an initial state and contribute to the high accuracy of data classification. Consider the K-nearest-neighbor (K-NN) classifier, which is a typical example of supervised data analysis. The input
to the classifier includes a set of labeled samples S, a constant value K, and a to-be-classified datum X. The output after the classification is a label denoting a class to which X belongs. The classification procedure is as follows:

1. Find the K NNs of X from S.
2. Choose the dominant classes by the K NNs.
3. If only one dominant class exists, label X by this class; otherwise, label X by any dominant class.
4. Add X to S and the process terminates.

The first step selects K samples from S such that the values of the selected features (also called patterns) of these K samples are closest to those of X. Such a similarity may be expressed in a variety of ways. The measurement of distances among the patterns is one of the suitable instruments, for example, the Euclidean distance as shown in Equation (1). Suppose the K samples belong to a set of classes; the second step is to find the set of dominant classes C. A dominant class is a class that contains the majority of the K samples. If there is only one element in C, say class C_i, we assign X to C_i. On the other hand, if C contains more than one element, X is assigned to an arbitrary class in C. After deciding the class of X, we label it and add it into the set S.

d(X, Y) = [ Σ_{k=1}^{m} (X_k − Y_k)² ]^{1/2}    (1)

where each datum is represented by m features.
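The following is a short, self-contained Python sketch of the K-NN procedure and the Euclidean distance of Equation (1). The feature values are invented for illustration, and ties between dominant classes are broken arbitrarily, as in step 3.

from math import dist            # Euclidean distance, Equation (1)
from collections import Counter

def knn_classify(samples, x, k):
    """samples: list of (feature_vector, label); x: to-be-classified feature vector."""
    neighbors = sorted(samples, key=lambda s: dist(s[0], x))[:k]   # step 1
    votes = Counter(label for _, label in neighbors)               # step 2
    label = votes.most_common(1)[0][0]                             # step 3 (arbitrary tie-break)
    samples.append((x, label))                                     # step 4: add X to S
    return label

S = [((25, 30_000), "Poor"), ((28, 60_000), "Fair"),
     ((35, 90_000), "Rich"), ((40, 40_000), "Fair")]
print(knn_classify(S, (30, 55_000), k=3))   # labels X by the dominant class among its 3 NNs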
Example. Suppose there is a dataset about the salaries and ages of people. Table 1 gives such a set of samples S and the corresponding labels. There are three labels that denote three classes: Rich, Fair, and Poor. These classes are determined based on the assumption that richness depends on the values of the salary and age. In Table 1, we also append the rules for assigning labels for each age value. From the above, we can get the set membership of each class:

C_Rich = {Y_1, Y_4, Y_8},  C_Fair = {Y_2, Y_5, Y_6, Y_10},  C_Poor = {Y_3, Y_7, Y_9}
From the above, it can be clearly determined whether X is an element of A. However, many real-world phenomena make such a unique decision impossible. In this case, expressing membership as a degree is more suitable. A fuzzy set A on U can be represented by the set of pairs that describe the membership function MF_A(X): U → [0, 1] as defined in Ref. 5:

A = {(X, MF_A(X)) | X ∈ U, MF_A(X) ∈ [0, 1]}    (5)
Example. Table 3 contains a fuzzy-set representation of the dataset in Table 1. The membership function of each sample is expressed as a possibility that stands for the degree of acceptance that the sample belongs to a class. In the case of supervised data analysis, the to-be-classified datum X needs to be labeled using an appropriate classification procedure. The distance between each sample and X is calculated using the two features and the Euclidean distance:

Table 2. A Summary of Probability Distribution for the Data in Table 1
Sample   Rich  Fair  Poor   Expressions of New Condensed Features
Young    2     2     1      Age is lower than 30
Old      1     2     2      The others
Low      1     2     3      Salary is lower than 36k
Median   1     1     0      The others
High     1     1     0      Salary is higher than 50k

Table 3. Fuzzy-Set Membership Functions for the Data in Table 1
Sample   Rich  Fair  Poor   Estimated Distance Between Each Sample and X
Y_1      0.5   0.2   0.3    11.66
Y_2      0.1   0.5   0.4    20.39
Y_3      0     0.2   0.8    20.09
Y_4      0.6   0.3   0.1    5.38
Y_5      0.2   0.5   0.3    10.19
Y_6      0.2   0.6   0.2    6.4
Y_7      0     0     1      15.52
Y_8      0.9   0.1   0      25.7
Y_9      0     0.3   0.7    11.18
Y_10     0.4   0.6   0      37.69
X        0.2   0.42  0.38

1. Find the K NNs of X from S.
2. Compute the membership function of X for each class.
3. Label X by the class with the maximal membership.
4. Add X to S and stop the process.

The first stage of finding the K samples with minimal distances is the same as before, so we have the same set of 4 NNs,
{Y_4, Y_5, Y_6, Y_9}, when the value of K = 4. Let d(X, Y_j) denote the distance between X and the sample Y_j. In the next stage, we calculate the membership function MF_{C_i}(X) of X for each class C_i as follows:

MF_{C_i}(X) = [ Σ_j MF_{C_i}(Y_j) · d(X, Y_j) ] / [ Σ_j d(X, Y_j) ],   ∀ Y_j ∈ K NNs of X    (6)
MF_{C_Rich}(X) = (0.6×5.38 + 0.2×10.19 + 0.2×6.4 + 0×11.18) / (5.38 + 10.19 + 6.4 + 11.18) ≈ 0.2

MF_{C_Fair}(X) = (0.3×5.38 + 0.5×10.19 + 0.6×6.4 + 0.3×11.18) / (5.38 + 10.19 + 6.4 + 11.18) ≈ 0.42

MF_{C_Poor}(X) = (0.1×5.38 + 0.3×10.19 + 0.2×6.4 + 0.7×11.18) / (5.38 + 10.19 + 6.4 + 11.18) ≈ 0.38
Because the membership of X for class C_Fair is higher than all the others, we label X by C_Fair. The resultant membership directly gives a confidence measure of this classification.
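A small Python sketch of Equation (6), reproducing the membership values computed above from the 4 NNs (Y_4, Y_5, Y_6, Y_9) and their distances in Table 3; the data layout and function name are illustrative.

def fuzzy_membership(neighbors, cls):
    """Equation (6): distance-weighted average of the neighbors' memberships in cls."""
    total_dist = sum(d for _, d in neighbors)
    return sum(mf[cls] * d for mf, d in neighbors) / total_dist

# The 4 NNs of X: ({class: membership}, distance to X)
nns = [({"Rich": 0.6, "Fair": 0.3, "Poor": 0.1}, 5.38),   # Y4
       ({"Rich": 0.2, "Fair": 0.5, "Poor": 0.3}, 10.19),  # Y5
       ({"Rich": 0.2, "Fair": 0.6, "Poor": 0.2}, 6.4),    # Y6
       ({"Rich": 0.0, "Fair": 0.3, "Poor": 0.7}, 11.18)]  # Y9

scores = {c: round(fuzzy_membership(nns, c), 2) for c in ("Rich", "Fair", "Poor")}
print(scores)                          # {'Rich': 0.2, 'Fair': 0.42, 'Poor': 0.38}
print(max(scores, key=scores.get))     # X is labeled Fair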
INTERNET DATA ANALYSIS METHODS

The dramatic growth of information systems over the past years has brought about the rapid accumulation of data and an increasing need for information sharing. The World Wide Web (WWW) combines the technologies of the uniform resource locator (URL) and hypertext to organize the resources on the Internet into a distributed hypertext system (6). As more and more users and servers register on the WWW, data analysis of its rich content is expected to produce useful results for various applications. Many research communities, such as network management (7), information retrieval (8), and database management (9), have been working in this field.

The goal of Internet data analysis is to derive a classification or clustering of Internet data, which can provide a valuable guide for WWW users. Here the Internet data can be two kinds of material: web pages and web logs. Each site within the WWW environment contains one or more web pages. Under this environment, any WWW user can make a request to any site for any web page in it, and the request is then recorded in its log. Moreover, the user can also roam through different sites by means of the anchors provided in each web page. Such an approach leads to essential difficulties for data analysis:

1. Huge amount of data.
2. Frequent changes.
3. Heterogeneous presentations.

Basically, because the Internet data originate from all over the world, the amount of data is huge. As any WWW user can create, delete, and update the data, and change the locations of the data at any time, it is difficult to get a precise view of the data. Furthermore, the various forms of expressing the same data also reveal the state of chaos on the WWW. As a whole, Internet data analysis should be able to handle the large amount of data and to control the uncertainty factors in a practical way. In this section, we first introduce the method for data analysis on web pages and then describe the method for data analysis on web logs.
Web Page Analysis

Many tools for Internet resource discovery (10) use the results of data analysis on the WWW to help users find the correct locations of the desired resources. However, many of these tools essentially keep a keyword-based index of the available web pages. Owing to the imprecise relationship between the semantics of keywords and the web pages (11), this approach clearly does not fit user requests well. From the experiments in Ref. 12, the text-based classifier that is 87% accurate for Reuters (news documents) yields only 32% precision for Yahoo (web pages). Therefore, a new method for data analysis on web pages is required. Our approach is to use the anchor information in each web page, which carries much stronger semantics for connecting user interests with the truly relevant web pages.

A typical data analysis procedure consists of the following stages:

1. Observe the data.
2. Collect the samples.
3. Select the features.
4. Classify the data.
5. Evaluate the results.

In the first stage, we observe the data and conclude with a set of features that may be effective for classifying the data. Next, we collect a set of samples based on a given scope. In the third stage, we estimate the fitness of each feature for the collected samples to determine a set of effective features. Then, we classify the to-be-classified data according to the similarity measure on the selected features. At last, we evaluate the classified results and find a way for further improvement. In the following, we first give some results of the study on the nature of web pages. Then we show the feature selection stage and a procedure to classify the web pages.
Data Observation. In the following, we provide two directions for observing web pages.

Semantic Analysis. We may consider the semantics of a web page as potential features. Keywords contained in a web page can be analyzed to determine its semantics, such as which fields it belongs to or what concepts it provides. There are many ongoing efforts on developing techniques to derive the semantics of a web page. The research results of information retrieval (13,14) can also be applied for this purpose.
Observing the data formats of web pages, we can find several parts expressing the semantics of the web pages to some extent. For example, the title of a web page usually refers to a general concept of the web page. An anchor, which is constructed by the web page designer, provides a URL of another web page and makes a connection between the two web pages. As far as the web page designer is concerned, the anchor texts must sufficiently express the semantics of the whole web page to which the anchor points. From the viewpoint of a WWW user, the motivation to follow an anchor is that the anchor expresses semantics desired by the user. Therefore, we can make a proper connection between the user's interests and the truly relevant web pages. We can group the anchor texts to generate a corresponding classification of the web pages pointed to by these anchor texts. Through this classification, we can relieve WWW users of the difficulties of Internet resource discovery by providing a query facility.

Syntactic Analysis. Because the data formats of web pages follow the standards provided on the WWW, for example, hypertext markup language (HTML), we can find potential features among the web pages. Consider the features shown in Table 4. The white pages, which are web pages with a list of URLs, can be distinguished from ordinary web pages by a large number of anchors and the short distances between two adjacent anchors within a web page. Note that here the distance between two anchors means the number of characters between them. For a publication page, the set of headings has to contain some specified keywords, such as "bibliography" or "references"; the average distance between two adjacent anchors has to be lower than a given threshold; and the placement of anchors has to be concentrated toward the bottom of the web page.

According to these features, some conclusions may be drawn in the form of classification rules. For instance, a web page is designed for publication if it satisfies the requirements of the corresponding features. Obviously, this approach is effective only when the degree of support for such rules is high enough. Selection of effective features is a way to improve the precision of syntactic analysis.
Sample Collection. It is impossible to collect all web pages, and thus choosing a set of representative samples becomes a very important task. On the Internet, we have two approaches to gather these samples:

1. Supervised sampling.
2. Unsupervised sampling.

Supervised sampling means that the sampling process is based on human knowledge specifying the scope of the samples. In supervised data analysis, a classification template that consists of a set of classes exists. The sampling scope can be set based on the template. The sampling is more effective when all classes in the template contain at least one sample. On the other hand, we consider unsupervised sampling if there is not enough knowledge about the scope, as in the case of unsupervised data analysis. The most trivial way to get samples is to choose any subset of web pages. However, this arbitrary sampling may not fit the requirement of random sampling well. We recommend the use of search engines that provide different kinds of web pages in the form of a directory.
Feature Selection. In addition to collecting enough samples, we have to select suitable features for the subsequent classification. No matter how good the classification scheme is, the accuracy of the results would not be satisfactory without effective features. A measure of the effectiveness of a feature is to estimate the degree of class separability. A better feature implies higher class separability. This measure can be formulated as a criterion to select effective features.
Example. Consider the samples shown in Table 5.
From Table 4, there are two potential features for white
pages, the number of anchors (F
0
) and the average dis-
tance between two adjacent anchors (F
1
). We assume that
F
0
_ 30 and F
1
_3 when the sample is a white
page. However, a sample may actually belong to the class
of white pages although it does not satisfy the assumed
conditions. For example, Y
6
is a white page although its F
0
< 30. Therefore, we need to nd a way to select effective
features.
From the labels, the set membership of the two classes is as follows, where the class C_1 refers to the class of white pages:

C_0 = {Y_1, Y_2, Y_3, Y_4, Y_5};  C_1 = {Y_6, Y_7, Y_8, Y_9, Y_10}
For the Gaussian source with density (1/(σ√(2π))) e^(−z²/(2σ²)), consider the squared error distortion function. The corresponding rate-distortion function is

R(D) = (1/2) log(σ²/D) for 0 < D ≤ σ², and R(D) = 0 for D > σ².

For the Gaussian source D_max = σ², and in the binary source example, D_min = 0 and D_max = p. Values of R(D) are interesting only in the interval D_min ≤ D ≤ D_max. Function R(D) is not defined for D < D_min, and it has value zero for all D > D_max. It can be proved that in the open interval D_min < D < D_max the rate-distortion function R(D) is strictly decreasing, continuously differentiable, and convex downward. Its derivative approaches −∞ when D approaches D_min (3,5,4).

Figure 1. The rate-distortion function of (a) the binary source with p = 0.5 and (b) the Gaussian source with σ = 1. Achievable rates reside above the curve.
In Example 3, we had analytic formulas for the rate-distortion functions. In many cases such formulas cannot be found, and numerical calculations of rate-distortion functions become important. Finding R(D) for a given finite iid source is a convex optimization problem and can be solved effectively using, for example, the Arimoto-Blahut algorithm (6,7).
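A minimal sketch of such a numerical computation is given below (Python with NumPy). It follows the standard Blahut-Arimoto iteration for a finite iid source; the source distribution, the distortion matrix, the slope parameter s, and the iteration count are chosen purely for illustration.

    import numpy as np

    def blahut_arimoto(p, d, s, n_iter=200):
        """One point on the rate-distortion curve of a finite iid source.

        p : source probabilities, shape (nx,)
        d : distortion matrix d[x, y], shape (nx, ny)
        s : slope parameter, s <= 0; more negative s gives lower distortion
        Returns (D, R) with R measured in bits.
        """
        nx, ny = d.shape
        q = np.full(ny, 1.0 / ny)            # output marginal, start uniform
        A = np.exp(s * d)                    # "tilted" distortion kernel
        for _ in range(n_iter):
            denom = A @ q                    # denom[x] = sum_y q[y] A[x, y]
            q = q * ((p / denom) @ A)        # Blahut-Arimoto update of q
        denom = A @ q
        Q = (A * q) / denom[:, None]         # conditional distribution Q[y | x]
        D = np.sum(p[:, None] * Q * d)       # expected distortion
        R = np.sum(p[:, None] * Q * np.log2(Q / q))
        return D, R

    # illustration: binary source with p = 0.2 under Hamming distortion
    p = np.array([0.8, 0.2])
    d = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(blahut_arimoto(p, d, s=-3.0))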
QUANTIZATION
In the most important application areas of lossy compres-
sion, the source symbols are numerical values. They can be
color intensities of pixels in image compression or audio
samples in audio compression. Quantization is the process
of representing approximations of numerical values using a small symbol set. In scalar quantization each number is treated separately, whereas in vector quantization several numbers are quantized together. Typically, some numerical transformation is first applied to bring the data into a more suitable form before quantization. Quantization is the only lossy operation in use in many compression algorithms. See Ref. 8 for more details on quantization.
Scalar Quantization
A scalar quantizer is specified by a finite number of available reconstruction values r_1, ..., r_n ∈ R, ordered r_1 < r_2 < ... < r_n, and the corresponding decision intervals. Decision intervals are specified by their boundaries b_1, b_2, ..., b_{n−1} ∈ R, also ordered b_1 < b_2 < ... < b_{n−1}. For each i = 1, ..., n, all input numbers in the decision interval between b_{i−1} and b_i are quantized into r_i, where we use the convention that b_0 = −∞ and b_n = +∞. Only the index i needs to be encoded, and the decoder knows to reconstruct it as value r_i. Note that all numerical digital data have been quantized initially from continuous-valued measurements into discrete values. In digital data compression, we consider additional quantization of discrete data into a coarser representation to improve compression.
improve compression.
The simplest form of quantization is uniform quantiza-
tion. In this case, the quantization step sizes b
i
b
i1
are
constant. This process can be viewed as the process of
rounding each number x into the closest multiple of some
positive constant b. Uniform quantizers are used widely
because of their simplicity.
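For example, a uniform quantizer with step size b can be sketched in a few lines of Python (the step size and input value are arbitrary):

    def uniform_quantize(x, b):
        """Encode x as the index of the closest multiple of step size b."""
        return round(x / b)

    def uniform_reconstruct(i, b):
        """Decode index i back to the reconstruction value i * b."""
        return i * b

    b = 0.5
    x = 3.14159
    i = uniform_quantize(x, b)            # index 6
    print(uniform_reconstruct(i, b))      # 3.0, quantization error 0.14159...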
The choice of the quantizer affects both the distortion
and the bitrate of a compressor. Typically, quantization is
the only lossy step of a compressor, and quantization error
becomes distortion in the reconstruction. For a given distortion function and probability distribution of the input values, one obtains the expected distortion. For the analysis
of bitrate, observe that if the quantizer outputs are entropy
coded and take advantage of their uneven probability dis-
tribution, then the entropy of the distribution provides the
achieved bitrate. The quantizer design problem then is the
problem of selecting the decision intervals and reconstruc-
tion values in a way that provides minimal entropy for
given distortion.
The classic Lloyd-Max (9,10) algorithm ignores the entropy and only minimizes the expected distortion of the quantizer output for a fixed number n of reconstruction values. The algorithm, hence, assumes that the quantizer output is encoded using fixed-length codewords that ignore the distribution. The Lloyd-Max algorithm iteratively repeats the following two steps:
(i) Find reconstruction values r_i that minimize the expected distortion, while keeping the interval boundaries b_i fixed, and
(ii) Find optimal boundaries b_i, while keeping the reconstruction values fixed.
If the squared error distortion is used, the best choice of r_i in step (i) is the centroid (mean) of the distribution inside the corresponding decision interval between b_{i−1} and b_i. If the error is measured in absolute error distortion, then the best choice of r_i is the median of the distribution in that interval. In step (ii), the optimal decision interval boundaries b_i in both distortions are the midpoints (r_i + r_{i+1})/2 between consecutive reconstruction values.
Each iteration of (i) and (ii) lowers the expected distortion, so the process converges toward a local minimum. A Lloyd-Max quantizer is any quantizer that is optimal in the sense that (i) and (ii) do not change it.
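The following sketch (Python with NumPy) runs the two steps on empirical samples rather than on an explicit probability distribution, using squared-error distortion so that the centroid of step (i) becomes a sample mean; the data, the number of levels, and the iteration count are arbitrary choices.

    import numpy as np

    def lloyd_max(samples, n_levels, n_iter=50):
        """Lloyd-Max quantizer design on empirical data, squared-error distortion."""
        samples = np.sort(samples)
        # initialize reconstruction values on a uniform grid over the data range
        r = np.linspace(samples.min(), samples.max(), n_levels)
        for _ in range(n_iter):
            # step (ii): optimal boundaries are midpoints between reconstruction values
            b = (r[:-1] + r[1:]) / 2
            # step (i): optimal reconstruction values are the means of each cell
            idx = np.digitize(samples, b)      # which decision interval each sample falls in
            for i in range(n_levels):
                cell = samples[idx == i]
                if cell.size:                  # keep the old value if a cell is empty
                    r[i] = cell.mean()
        return r, b

    rng = np.random.default_rng(0)
    data = rng.normal(size=10_000)
    r, b = lloyd_max(data, n_levels=4)
    print(np.round(r, 3))   # roughly the classic 4-level quantizer for a Gaussian source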
Modifications of the Lloyd-Max algorithm exist that produce entropy-constrained quantizers (8). These quantizers incorporate the bitrate in the iteration loop (i)-(ii). Also, a shortest-path algorithm in a directed acyclic graph can be used in optimal quantizer design (11).
Vector Quantization
From information theory, it is known that optimal lossy coding of sequences of symbols requires that long blocks of symbols are encoded together, even if they are statistically independent. This means that scalar quantization is suboptimal. In vector quantization, the input sequence is divided into blocks of length k, and each block is viewed as an element of R^k, the k-dimensional real vector space. An n-level vector quantizer is specified by n reconstruction vectors r_1, ..., r_n ∈ R^k and the corresponding decision regions. Individual reconstruction vectors are called code vectors, and their collection often is called a code book.
Decision regions form a partitioning of R^k into n parts, each part representing the input vectors that are quantized into the respective code vector. Decision regions do not need to be formed explicitly; rather, one uses the distortion function to determine, for any given input vector, which of the n available code vectors r_i gives the optimal representation. The encoder determines for each block of k consecutive input values the best code vector r_i and communicates the index i to the decoder. The decoder reconstructs it as the code vector r_i. Typically, each vector gets quantized to the closest code vector available in the code book, but if entropy coding of the code vectors is used, then the bitrate also may be incorporated in the decision.
Theoretically, vector quantizers can encode sources as close to the rate-distortion bound as desired. This encoding, however, requires that the block length k is increased, which makes this fact have more theoretical than practical significance.
The Lloyd-Max quantizer design algorithm easily generalizes into dimension k. The generalized Lloyd-Max algorithm, also known as the LBG algorithm after Linde, Buzo, and Gray (12), commonly is used with training vectors t_1, ..., t_m ∈ R^k. These vectors are sample vectors from representative data to be compressed. The idea is that although the actual probability distribution of vectors from the source may be difficult to find, obtaining large collections of training vectors is easy, and it corresponds to taking random samples from the distribution.
The LBG algorithm iteratively repeats the following two steps. At all times the training vectors are assigned to decision regions.
(i) For each decision region i, find the vector r_i that minimizes the sum of the distortions between r_i and the training vectors previously assigned to region number i.
(ii) Change the assignment of training vectors: Go over the training vectors one by one and find for each training vector t_j the closest code vector r_i under the given distortion. Assign vector t_j to decision region number i.
In step (i), the optimal choice of r_i is either the coordinate-wise mean or median of the training vectors in decision region number i, depending on whether the squared error or absolute error distortion is used.
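A minimal LBG sketch with squared-error distortion is shown below (Python with NumPy); the training data, block length, and code book size are arbitrary, and a region that becomes empty is simply left unchanged.

    import numpy as np

    def lbg(training, n_codevectors, n_iter=30):
        """Generalized Lloyd-Max (LBG) design from training vectors, squared error.

        training : array of shape (m, k), one training vector per row
        Returns the code book as an array of shape (n_codevectors, k).
        """
        rng = np.random.default_rng(0)
        # initialize the code book with randomly chosen training vectors
        code = training[rng.choice(len(training), n_codevectors, replace=False)].copy()
        for _ in range(n_iter):
            # step (ii): assign each training vector to its closest code vector
            d = ((training[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)
            # step (i): move each code vector to the mean of its region
            for i in range(n_codevectors):
                region = training[assign == i]
                if len(region):
                    code[i] = region.mean(axis=0)
        return code

    # toy example: 2-dimensional blocks drawn from a correlated Gaussian
    rng = np.random.default_rng(1)
    x = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=5000)
    print(lbg(x, n_codevectors=4))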
Additional details are needed to specify how to initialize the code book and what is done when a decision region becomes empty. One also can incorporate the bitrate in the iteration loop, which creates entropy-constrained vector quantizers. For more details on vector quantization see, for example, Ref. 13.
DATA TRANSFORMATIONS
In data compression situations, it is rarely the case that the elements of source sequences are independent of each other. For example, consecutive audio samples or neighboring pixel values in images are highly correlated. Any effective compression system must take advantage of these correlations to improve compression; there is no sense in encoding the values independently of each other, as then their common information gets encoded more than once. Vector quantization is one way to exploit correlations. Scalar quantization, however, requires some additional transformations to be performed on the data to remove correlation before quantization.
Prediction-Based Transformations
One way to remove correlation is to use previously processed data values to calculate a prediction x̂_i ∈ R for the next data value x_i ∈ R. As the decoder calculates the same prediction, it is enough to encode the prediction error

e_i = x_i − x̂_i

In lossy compression, the prediction error is scalar quantized. Let ē_i denote e_i quantized. Value ē_i is entropy coded and included in the compressed file. The decoder obtains ē_i and calculates the reconstruction value

x̄_i = ē_i + x̂_i

Note that the reconstruction error x_i − x̄_i is identical to the quantization error e_i − ē_i, so the quantization directly controls the distortion.
In order to allow the encoder and decoder to calculate an identical prediction x̂_i, it is important that the predictions are based on the previously reconstructed values x̄_j, not the original values x_j, j < i. In other words,

x̂_i = f(x̄_{i−1}, x̄_{i−2}, x̄_{i−3}, ...)
where f is the prediction function. In linear prediction, function f is a linear function, that is,

x̂_i = a_1 x̄_{i−1} + a_2 x̄_{i−2} + ... + a_k x̄_{i−k}

for some constants a_1, a_2, ..., a_k ∈ R. Here, k is the order of the predictor. Optimal values of the parameters a_1, a_2, ..., a_k that minimize the squared prediction error can be found by solving a system of linear equations, provided autocorrelation values of the input sequences are known (14).
In prediction-based coding, one has strict control of the maximum allowed reconstruction error in individual source elements x_i. As the reconstruction error is identical to the quantization error, the step size of a uniform quantizer provides an upper bound on the absolute reconstruction error. Compression where a tight bound is imposed on the reconstruction errors of individual numbers, rather than on an average calculated over the entire input sequence, is termed near-lossless compression.
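The sketch below (Python) illustrates this with a first-order predictor, x̂_i = x̄_{i−1}, and a uniform quantizer of step b applied to the prediction error; because the encoder predicts from reconstructed values, every reconstruction error stays within b/2. The step size and the input values are illustrative only.

    def dpcm_encode(x, b):
        """First-order predictive coding with uniform quantization of the residual."""
        indices, prev = [], 0.0                 # prediction of the first sample is 0
        for xi in x:
            e = xi - prev                       # prediction error
            q = round(e / b)                    # quantized residual index
            indices.append(q)
            prev = prev + q * b                 # reconstructed value, as the decoder sees it
        return indices

    def dpcm_decode(indices, b):
        out, prev = [], 0.0
        for q in indices:
            prev = prev + q * b
            out.append(prev)
        return out

    x = [10.0, 10.4, 11.1, 12.9, 12.2]
    rec = dpcm_decode(dpcm_encode(x, b=0.5), b=0.5)
    print([round(abs(a - r), 3) for a, r in zip(x, rec)])   # every error <= 0.25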
Linear Transformations
An alternative to prediction-based coding is to perform a
linear transformation on the input sequence to remove
correlations. In particular, orthogonal transformations
are used because they keep mean squared distances invar-
iant. This method allows easy distortion control: The MSE
quantization error on the transformed data is identical to
the MSE reconstruction error on the original data.
An orthogonal linear transformation is given by an orthogonal square matrix, that is, an n × n matrix M whose rows (and consequently also columns) form an orthonormal basis of R^n. Its inverse matrix is the same as its transpose M^T. The rows of M are the basis vectors of the transformation.
A data sequence of length n is viewed as an n-dimensional column vector x ∈ R^n. This vector can be the entire input, or, more likely, the input sequence is divided into blocks of length n, and each block is transformed separately. The transformed vector is Mx, which is also a sequence of n real numbers. Notice that the elements of the transformed vector are the coefficients in the expression of x as a linear combination of the basis vectors.
The inverse transformation is given by matrix M^T. In the following, we denote the ith coordinate of x by x(i) for i = 1, 2, ..., n. If the transformed data is quantized into y ∈ R^n, then the reconstruction becomes M^T y. Orthogonality guarantees that the squared quantization error ||Mx − y||² is identical to the squared reconstruction error ||x − M^T y||². Here, we use the notation

||r||² = Σ_{i=1}^{n} r(i)²

for the square norm in R^n.
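The following sketch (Python with NumPy) checks this identity numerically for a random orthogonal matrix and an arbitrary quantization of the transform coefficients; all values are synthetic.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8
    # build a random orthogonal matrix M from a QR factorization
    M, _ = np.linalg.qr(rng.normal(size=(n, n)))

    x = rng.normal(size=n)                  # input block
    y = np.round(M @ x, 1)                  # quantize the transform coefficients
    x_rec = M.T @ y                         # reconstruction

    quant_err = np.sum((M @ x - y) ** 2)    # squared quantization error
    recon_err = np.sum((x - x_rec) ** 2)    # squared reconstruction error
    print(np.isclose(quant_err, recon_err)) # True: orthogonality preserves squared error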
KLT Transformation. The goal is to choose a transformation that maximally decorrelates the coordinates of the data vectors. At the same time, one hopes that as many coordinates as possible become almost zero, which results in low entropy in the distribution of quantized coordinate values. The second goal is called energy compaction.
Let us be more precise. Consider a probability distribution on the input data vectors with mean m ∈ R^n. The variance in coordinate i is the expectation of (x(i) − m(i))² when x ∈ R^n is drawn from the distribution. The energy in coordinate i is the expectation of x(i)². The total variance (energy) of the distribution is the sum of the n coordinate-wise variances (energies). An orthogonal transformation keeps the total variance and energy unchanged, but a shift of variances and energies between coordinates can occur. Variance (energy) compaction means that as much variance (energy) as possible gets shifted into as few coordinates as possible.
The covariance in coordinates i and j is the expectation of

(x(i) − m(i))(x(j) − m(j))

Coordinates are called uncorrelated if the covariance is zero. Decorrelation refers to the goal of making the coordinates uncorrelated. Analogously, we can call the expectation of x(i)x(j) the coenergy of the coordinates. Notice that the variances and covariances of a distribution are identical to the energies and coenergies of the distribution of R^n translated by the mean m.
Example 4. Let n = 2, and let us build a distribution of vectors by sampling blocks from a grayscale image. The image is partitioned into blocks of 1 × 2 pixels, and each such block provides a vector (x, y) ∈ R², where x and y are the intensities of the two pixels. See Fig. 2(a) for a plot of the distribution. The two coordinates are highly correlated, which is natural because neighboring pixels typically have similar intensities. The variances and energies in the two coordinates are

          Energy    Variance
x         17224     2907
y         17281     2885
total     34505     5792

The coenergy and covariance between the x and y coordinates are 17174 and 2817, respectively.
Let us perform the orthogonal transformation

M = (1/√2) [ 1   1 ]
           [ 1  −1 ]

that rotates the points 45 degrees clockwise; see Fig. 2(b) for the transformed distribution. The coordinates after the transformation have the following variances and energies:

          Energy    Variance
x         34426     5713
y         79        79
total     34505     5792

Energy and variance have been packed into the first coordinate, whereas the total energy and variance remained invariant. The effect is that the second coordinates became very small (and easy to compress), whereas the first coordinates increased.
After the rotation, the covariance and the coenergy are −29 and 11, respectively. The transformed coordinates are almost, but not totally, uncorrelated.
The two goals of variance compaction and decorrelation coincide. An orthogonal transformation called KLT (Karhunen-Loève transformation) makes all covariances between different coordinates zero. It turns out that at the same time KLT packs variance in the following, optimal way: For every k ≤ n the variance in coordinates 1, 2, ..., k after KLT is as large as in any k coordinates after any orthogonal transformation. In other words, no orthogonal transformation packs more variance in the first coordinate than KLT, nor more variance in two coordinates than KLT, and so forth. The rows of the KLT transformation matrix are orthonormal eigenvectors of the covariance matrix, that is, the matrix whose element i, j is the covariance in coordinates i and j for all 1 ≤ i, j ≤ n. If energy compaction rather than variance compaction is the goal, then one should replace the covariance matrix by the analogous coenergy matrix.
Example 5. In the case of Example 4 the covariance matrix is

[ 2907  2817 ]
[ 2817  2885 ]

The corresponding KLT matrix

[  0.708486  0.705725 ]
[ −0.705725  0.708486 ]

consists of the orthonormal eigenvectors. It is the clockwise rotation of R² by 44.88 degrees.
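The KLT of Example 5 can be reproduced with a few lines of NumPy: the eigen-decomposition of the covariance matrix supplies the orthonormal rows, which is the same computation as principal component analysis.

    import numpy as np

    cov = np.array([[2907.0, 2817.0],
                    [2817.0, 2885.0]])

    # eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    klt = eigvecs[:, order].T               # rows of the KLT matrix

    print(np.round(klt, 6))                 # rows agree with Example 5 up to sign
    print(np.round(klt @ cov @ klt.T, 1))   # diagonal: covariances become zero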
Figure 2. The sample distribution (a) before and (b) after the orthogonal transformation.
KLT transformation, although optimal, is not used widely in compression applications. No fast way exists to take the transformation of given data vectors; one has to multiply the data vector by the KLT matrix. Also, the transformation is distribution dependent, so that the KLT transformation matrix or the covariance matrix has to be communicated to the decoder, which adds overhead to the bitrate. In addition, for real image data it turns out that the decorrelation performance of the widely used DCT transformation is almost as good. KLT is, however, an important tool in data analysis, where it is used to lower the dimension of the data set by removing the less significant dimensions. This method also is known as principal component analysis.
DCT Transformation. The discrete cosine transform (DCT) is an orthogonal linear transformation whose basis vectors are obtained by uniformly sampling the cosine function; see Fig. 3. More precisely, the element i, j of the n × n DCT matrix is

C_i cos( iπ(2j + 1) / (2n) )

where i, j = 0, 1, ..., n − 1 and C_i is the normalization factor

C_i = √(1/n) if i = 0, and C_i = √(2/n) otherwise.
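A direct construction of this matrix, together with a check that its rows are orthonormal, might look as follows (Python with NumPy; the size n = 8 is chosen only as an example):

    import numpy as np

    def dct_matrix(n):
        """n x n DCT matrix: element (i, j) = C_i * cos(i * pi * (2j + 1) / (2n))."""
        i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        c = np.where(i == 0, np.sqrt(1.0 / n), np.sqrt(2.0 / n))
        return c * np.cos(i * np.pi * (2 * j + 1) / (2 * n))

    D = dct_matrix(8)
    print(np.allclose(D @ D.T, np.eye(8)))   # True: the rows are orthonormal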
The Haar transformation is built from the 2 × 2 orthogonal matrix

M = (1/√2) [ 1   1 ]
           [ 1  −1 ]
On the first level, the input signal of length n is partitioned into segments of two samples, and each segment is transformed using M. The result is shuffled by grouping together the first and the second coefficients from all segments, respectively. This shuffling provides two signals of length n/2, where the first signal consists of moving averages and the second signal of moving differences of the input. These two signals are called subbands, and they also can be viewed as the result of filtering the input signal by FIR filters whose coefficients are the rows of matrix M, followed by subsampling where half of the output values are deleted.
The high-frequency subband that contains the moving differences typically has small energy because consecutive input samples are similar. The low-frequency subband that contains the moving averages, however, is a scaled version of the original signal. The second level of the Haar transformation repeats the process on this low-frequency subband. This process again produces two subbands (signals of length n/4 this time) that contain moving averages and differences. The process is repeated on the low-frequency output for as many levels as desired.
The combined effect of all the levels is an orthogonal transformation that splits the signal into a number of subbands. Notice that a sharp transition in the input signal now only affects a small number of values in each subband, so the effect of sharp edges remains localized in a small number of output values.
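A minimal sketch of this multilevel Haar analysis (Python with NumPy; the input length is assumed to be divisible by 2 raised to the number of levels) is given below:

    import numpy as np

    def haar_analysis(signal, levels):
        """Split a signal into Haar subbands by repeated averaging/differencing.

        Returns [detail_level_1, detail_level_2, ..., final_approximation].
        """
        subbands, low = [], np.asarray(signal, dtype=float)
        for _ in range(levels):
            pairs = low.reshape(-1, 2)
            avg = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # moving averages (low-pass)
            diff = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # moving differences (high-pass)
            subbands.append(diff)
            low = avg                                        # iterate on the low-frequency subband
        subbands.append(low)
        return subbands

    x = np.array([2, 2, 2, 9, 9, 9, 9, 9])     # a sharp edge in the signal
    for band in haar_analysis(x, levels=2):
        print(np.round(band, 3))               # only a few values are affected by the edge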
Haar transformation is the simplest subband transformation. Other subband coders are based on the similar idea of filtering the signal with two filters, one low-pass and one high-pass filter, followed by subsampling the outputs by a factor of two. The process is repeated on the low-pass output and iterated in this fashion for as many levels as desired. Different subband coders differ in the filters they use. The Haar transformation uses very simple filters, but other filters with better energy compaction properties exist. See, for example, Ref. 17 for more details on wavelets and subband coding.
When comparing KLT (and DCT) with subband transformations on individual inputs, one notices that although KLT packs energy optimally into fixed, predetermined coefficients, subband transformations pack the energy better into fewer coefficients; however, the positions of those coefficients depend on the input. In other words, for good compression one cannot order the coefficients in a predetermined zigzag order as in JPEG, but one has to specify also the position of each large value.
Among the earliest successful image compression algo-
rithms that use subband transformations are the
embedded zerotree wavelet (EZW) algorithm by J. M.
Shapiro (18) and the set partitioning in hierarchical trees
(SPIHT) algorithm by A. Said and W. A. Pearlman (19).
The JPEG 2000 image compression standard is based on
subband coding as well (20).
CONCLUSION
A wide variety of specic lossy compression algorithms
based on the techniques presented here exist for audio,
speech, image, and video data. In audio coding, linear
prediction commonly is used. Also, subband coding and
wavelets are well suited. If DCT type transformations
are used, they are applied commonly on overlapping seg-
ments of the signal to avoid blocking artifacts.
In image compression, JPEG has been dominant. The
wavelet approach has become more popular after the intro-
duction of the wavelet-based JPEG2000. At high quantiza-
tion levels, JPEG 2000 exhibits blurring and ringing on
edges, but it does not suffer from the blocking artifacts
typical to JPEG.
Video is an image sequence, so it is not surprising that
the same techniques that work in image compression also
work well with video data. The main difference is the
additional temporal redundancy: Consecutive video frames
are very similar, and this similarity can be used to get good
compression. Commonly, achieving this good compression is done through block-based motion compensation, where the frame is partitioned into blocks (say of size 16 × 16) and for each block a motion vector that refers to a reconstructed block in a previously transmitted frame is given. The motion vector specifies the relative location of the most similar block in the previous frame. This reference block is used by the decoder as the first approximation. The difference of the correct block and the approximation then is encoded as a still image using DCT. Common standards such as MPEG-2 are based on this principle. At high compression, the technique suffers from blocking artifacts, especially in the presence of high motion. This difficulty arises because the motion vectors of neighboring blocks can differ significantly, which results in visible block boundaries, and because at high-motion areas the motion compensation leaves large differences to be encoded using DCT, which can be done within the available bit budget only by increasing the level of quantization.
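A toy sketch of block matching (Python with NumPy) is given below; it uses 4 × 4 blocks and an exhaustive search over a small window instead of the 16 × 16 blocks and larger search ranges of real encoders, and all data are synthetic.

    import numpy as np

    def best_motion_vector(prev, cur, top, left, bs=4, search=3):
        """Exhaustive block matching: find the offset into the previous frame
        whose block best matches the current block (sum of squared differences)."""
        block = cur[top:top + bs, left:left + bs]
        best, best_mv = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if 0 <= y <= prev.shape[0] - bs and 0 <= x <= prev.shape[1] - bs:
                    cand = prev[y:y + bs, x:x + bs]
                    err = np.sum((block.astype(float) - cand) ** 2)
                    if best is None or err < best:
                        best, best_mv = err, (dy, dx)
        return best_mv, best

    # synthetic frames: the current frame is the previous one shifted right by 2 pixels
    rng = np.random.default_rng(0)
    prev = rng.integers(0, 256, size=(16, 16))
    cur = np.roll(prev, 2, axis=1)
    print(best_motion_vector(prev, cur, top=4, left=4))   # motion vector (0, -2), error 0.0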
See Refs. 21-24 for more information on lossy compression and for details of specific algorithms.
BIBLIOGRAPHY
1. C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal, 27: 379-423; 623-656, 1948.
2. T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley & Sons, 1991.
3. C. E. Shannon, Coding theorems for a discrete source with a fidelity criterion, IRE Nat. Conv. Rec., 7: 142-163, 1959.
4. T. Berger and J. D. Gibson, Lossy source coding, IEEE Transactions on Information Theory, 44(6): 2693-2723, 1998.
5. T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice Hall, 1971.
6. S. Arimoto, An algorithm for computing the capacity of arbitrary discrete memoryless channels, IEEE Transactions on Information Theory, 18(1): 14-20, 1972.
7. R. E. Blahut, Computation of channel capacity and rate-distortion function, IEEE Trans. Information Theory, 18(4): 460-473, 1972.
8. R. M. Gray and D. L. Neuhoff, Quantization, IEEE Transactions on Information Theory, 44: 2325-2384, 1998.
9. S. P. Lloyd, Least squared quantization in PCM. Unpublished, Bell Lab. 1957. Reprinted in IEEE Trans. Information Theory, 28: 129-137, 1982.
10. J. Max, Quantizing for minimum distortion, IEEE Trans. Information Theory, 6(1): 7-12, 1960.
11. D. Mureson and M. Effros, Quantization as histogram segmentation: Globally optimal scalar quantizer design in network systems, Proc. IEEE Data Compression Conference 2002, 2002, pp. 302-311.
12. Y. Linde, A. Buzo, and R. M. Gray, An algorithm for vector quantizer design, IEEE Trans. Communications, 28: 84-95, 1980.
13. A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. New York: Kluwer Academic Press, 1991.
14. J. D. Markel and A. H. Gray, Linear Prediction of Speech. New York: Springer Verlag, 1976.
15. K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. New York: Academic Press, 1990.
16. W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Compression Standard. Amsterdam, the Netherlands: Van Nostrand Reinhold, 1992.
17. G. Strang and T. Nguyen, Wavelets and Filter Banks. Wellesley-Cambridge Press, 1996.
18. J. M. Shapiro, Embedded image coding using zerotrees of wavelet coefficients, IEEE Trans. Signal Processing, 41(12): 3445-3462, 1993.
19. A. Said and W. A. Pearlman, A new fast and efficient image codec based on set partitioning in hierarchical trees, IEEE Trans. Circuits and Systems for Video Technology, 6: 243-250, 1996.
20. D. S. Taubman and M. W. Marcellin, JPEG 2000: Image Compression, Fundamentals, Standards, and Practice. New York: Kluwer Academic Press, 2002.
21. K. Sayood, Introduction to Data Compression, 2nd ed. Morgan Kaufmann, 2000.
22. D. Salomon, Data Compression: the Complete Reference, 3rd ed. New York: Springer, 2004.
23. M. Nelson and J. L. Gailly, The Data Compression Book, 2nd ed. M & T Books, 1996.
24. M. Rabbani and P. W. Jones, Digital Image Compression Techniques. Society of Photo-Optical Instrumentation Engineers (SPIE) Tutorial Text, Vol. TT07, 1991.
JARKKO KARI
University of Turku
Turku, Finland
DATA HANDLING IN INTELLIGENT
TRANSPORTATION SYSTEMS
INTRODUCTION
Within the domain of computer science, data handling is
the coordinated movement of data within and between
computer systems. The format of the data may be changed
during these movements, possibly losing information dur-
ing conversions. The data may also be ltered for privacy,
security, or efciency reasons. For instance, some informa-
tion transmitted from an accident site may be ltered for
privacy purposes before being presented to the general
public (e.g., license numbers and names). The motivation
for this article is to be an updated reference of the various
mechanisms and best practices currently available for
handling intelligent transportation systems (ITSs) data.
In the sections that follow, data handling within ITS is
discussed in detail.
What are Intelligent Transportation Systems?
Intelligent transportation systems apply computer, com-
munication, and sensor technologies in an effort to improve
surface transportation. The application of these technolo-
gies within surface transportation typically has been lim-
ited (e.g., car electronics and traffic signals). One goal of ITS is to broaden the use of these technologies to integrate more closely travelers, vehicles, and their surrounding infrastructure. Used effectively, ITS helps monitor and manage traffic flow, reduce congestion, provide alternative routes to
travelers, enhance productivity, respond to incidents, and
save lives, time, and money. Over the past ten years, the
public and private sectors have invested billions of dollars
in ITS research and development and in initial deployment
of the resulting products and services.
Given the potential benefits of ITS, the U.S. Congress specifically supported an ITS program in the Transportation Equity Act for the 21st Century (TEA-21) in 1998. As defined by TEA-21, the ITS program provides for the research, development, and operational testing of ITSs aimed at solving congestion and safety problems, improving operating efficiencies in transit and commercial vehicles, and reducing the environmental impact of growing travel demand. Technologies that were cost effective were deployed nationwide within surface transportation systems as part of TEA-21.
Intelligent transportation systems can be broken down
into three general categories: advanced traveler informa-
tion systems, advanced trafc management systems, and
incident management systems.
Advanced Traveler Information Systems (ATISs) deliver data directly to travelers, empowering them to make better choices about alternative routes or modes of transportation. When archived, these historical data provide transportation planners with accurate travel pattern information, optimizing the transportation planning process.
Advanced Traffic Management Systems (ATMSs) employ a variety of relatively inexpensive detectors, cameras, and communication systems to monitor traffic, optimize signal timings on major arterials, and control the flow of traffic.
Incident Management Systems, for their part, provide traffic operators with the tools to allow a quick and efficient response to accidents, hazardous spills, and other emergencies. Redundant communications systems link data collection points, transportation operations centers, and travel information portals into an integrated network that can be operated efficiently and intelligently.
Some example ITS applications include on-board navigation systems, crash notification systems, electronic payment systems, roadbed sensors, traffic video/control technologies, weather information services, variable message signs, fleet tracking, and weigh-in-motion technologies.
MAJOR CHALLENGES
One major challenge in handling ITS data involves managing the broad spectrum of requirements inherent in a transportation system. ITS data must flow from and between a variety of locations and devices such as in-vehicle sensors, police dispatch centers, infrastructure sensors, computers, and databases. Each of these options has different bandwidth, formatting, and security requirements. A one-solution-fits-all approach is not appropriate in this type of environment. Fortunately, many standards can be applied in concert with one another to cover all requirements. These standards and how they are used are discussed in the sections that follow.
DATA HANDLING IN ITS
Data within an ITS originates at various sensors, such as in-pavement inductive loop detectors for measuring the presence of a vehicle, transponders for collecting toll fees, video cameras, and operator input. These data typically are collected and aggregated by a transportation agency for use in managing traffic flow and in detecting and responding to accidents. These data are often archived, and later data-mining techniques are used to detect traffic trends. These data also make their way to information providers for use in radio, television, and Web traffic reports.
The different requirements for ITS data handling depend on the medium used. For instance, the requirements for data handling in communications stress the efficient use of bandwidth and low latency, whereas data handling for data-mining applications requires fast lookup capabilities.
Data Types and Formats
Various types of data are stored and manipulated by an ITS. The types of data are discussed in the sections that follow.
Traffic Statistics. Speed, travel time, volume, and occupancy data or other numeric measurements are used to characterize the flow of vehicles at a specific point or over a specific segment of roadway. These data can be generated from many types of detection systems, such as loop detectors, microwave, infrared, or sonic detectors, video image detection, automatic vehicle identification, license plate matching systems, and wireless phone probes.
Weather/Environmental. Wind speed, wind direction,
temperature, humidity, visibility, and precipitation data
are collected by weather stations. This information can be
used to determine whether road salt is needed because of
icing conditions or whether danger exists to large vehicles
because of high wind speeds.
Video/Images. Video cameras are mounted high to pro-
vide unobstructed views of trafc. Video data are in the
form of encoded data streams or still images. Video feeds
are used to determine the cause and location of incidents
and to determine the overall level of service (congestion).
Incident/Event Reports. Incidents, construction/maintenance, events, and road conditions are some types of data collected. This information usually is entered manually into the system because an automated means of detecting an incident has yet to be developed. The reports provide descriptive information on planned or unplanned occurrences that affect or may affect traffic flow.
Data Attributes. The following attributes are common within ITS data:
Accuracy: How precise is the data?
Detail: Is the data a direct measurement or an estimated/indirect measurement?
Timeliness: How fresh is the data?
Availability/Reliability: How dependable is the flow of data?
Density: How close together or far apart are the data collection sources?
Accessibility: How easily is the data accessed by a data consumer?
Confidence: Is the data trustworthy?
Coverage: Where is the data collected?
Standards for Data Handling
Standards provide a mechanism to ensure compatibility among various disparate systems. These standards reduce costs because software and hardware can be developed to make use of a handful of standards rather than of hundreds of proprietary protocols.
The following organizations are involved in the development of ITS standards: the American Association of State Highway and Transportation Officials (AASHTO), American National Standards Institute (ANSI), American Society for Testing & Materials (ASTM), Consumer Electronics Association (CEA), Institute of Electrical and Electronics Engineers (IEEE), Institute of Transportation Engineers (ITE), Society of Automotive Engineers (SAE), National Electrical Manufacturers Association (NEMA), and the National Transportation Communications for ITS Protocol (NTCIP). Note that NTCIP is a joint effort among AASHTO, ITE, and NEMA.
The Federal Geographic Steering Committee (FGSC)
provides staff to support the U.S. Department of the Inter-
ior to manage and facilitate the National Spatial Data
Infrastructure (NSDI) (1).
Data Formats and Communication Protocols. Standards are built off of one another. For instance, ITS standards make use of standards for data formatting and communications, such as Extensible Markup Language (XML), Common Object Request Broker Architecture (CORBA), or Abstract Syntax Notation number One (ASN.1). To better understand ITS standards, these underlying standards will be discussed first.
Extensible Markup Language. XML is a standard of the World Wide Web Consortium (W3C) (2). XML is a means of encoding data so that computers can send and receive information. XML allows computers to understand the information content of the data and act on that content (e.g., process the information, display the information to a human, store the information in a database, and issue a command to a field device). Unlike most computer encoding standards (e.g., Basic Encoding Rules, Octet Encoding Rules, Packed Encoding Rules, Common Data Representation, and Hypertext Markup Language), no single set of encoding rules exists for XML. Instead, XML encoding rules are customized for different applications. Furthermore, XML encoding rules include a mechanism for identifying each element of an XML document or message. See Ref. 3 for an overview of XML use in ITS applications.
The advantages of using XML for ITS data handling lie in communications. It has pervasive support among software companies, standards organizations, and government agencies. Also, XML-formatted data are often bundled with Web services over the Hypertext Transfer Protocol (HTTP) via SOAP (4) or XML-RPC (5). This bundling allows the data and associated commands to traverse firewalls more easily. Some may see this process as a disadvantage, however, because it basically is hiding the services from network administrators and granting potentially unsecured access to important functions.
XML format is not well suited for data storage, however, because its hierarchical data structure does not lend itself to easy storage and retrieval from relational databases. However, XML databases, which natively can store, query, and retrieve XML-formatted documents, are now available (6).
A disadvantage of XML-formatted data is that the tags and the textual nature of the documents tend to make them much more verbose. This verbosity, in turn, requires higher bandwidth communication links and more data processing to encode and decode ITS data as compared with CORBA or ASN.1 (7,8).
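As a small illustration of this trade-off, the sketch below (Python standard library) encodes one hypothetical traffic-statistics record as XML; the element and attribute names are invented for the example and are not taken from any ITS standard, and even this tiny record costs well over a hundred bytes.

    import xml.etree.ElementTree as ET

    # hypothetical traffic-statistics record; tag and attribute names are illustrative only
    record = ET.Element("trafficStatistics", detectorId="LOOP-0042")
    ET.SubElement(record, "speed", units="km/h").text = "87.5"
    ET.SubElement(record, "volume", units="vehicles/h").text = "1440"
    ET.SubElement(record, "occupancy", units="percent").text = "12.3"

    xml_bytes = ET.tostring(record, encoding="utf-8")
    print(xml_bytes.decode())
    print(len(xml_bytes), "bytes for three numeric measurements")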
Common Object Request Broker Architecture. CORBA is an Object Management Group (OMG) standard for communications between computers (9). As part of CORBA, the interface definition language (IDL) provides a platform- and computer language-independent specification of the data to be transmitted and the services that are available to a CORBA application. The OMG requires CORBA implementations to make use of the Internet Inter-ORB Protocol (IIOP) for encoding messages over the Internet. This protocol ensures that CORBA implementations from different vendors running on different operating systems can interact with one another. NTCIP has a CORBA-based standard for communications between transportation centers, NTCIP 2305 (10). CORBA inherently is object oriented, which allows for the definition of both data structures and commands in the form of interfaces.
One advantage of using CORBA is its relatively lower bandwidth requirements as compared with XML-based communications. CORBA has wide industry support, and many open-source implementations are available (11,12). Because CORBA is based on the IIOP communications standard, the various commercial and open-source implementations can interoperate with one another. CORBA has the advantage, as compared with ASN.1 or XML, that it describes both the structure of the data to be transmitted and what is to be done with the data. Finally, many counterpart extensions to CORBA handle such things as authentication.
The disadvantages of CORBA are that it requires specialized knowledge to integrate it into ITS data handling systems. Also, because CORBA does not make use of the HTTP protocol, it does not traverse Web firewalls as easily as XML-RPC or SOAP.
Abstract Syntax Notation. ASN.1 is a standard of the International Standards Organization (ISO) that defines a formalism for the specification of abstract data types. ASN.1 is a formal notation used for describing data transmission protocols, regardless of language implementation and physical representation of these data, whatever the application, whether complex or very simple. ASN/DATEX is an NTCIP standard (NTCIP 2304). ASN.1 is not a communication or data storage mechanism, but it is a standard for expressing the format of complex data structures in a machine-independent manner. Many encoding rules can encode and decode data via ASN.1, such as the basic encoding rules (BER), canonical encoding rules (CER), and distinguished encoding rules (DER) (13).
ASN.1 has the advantage that it unambiguously defines the structure of ITS data, and the various encoding schemes allow for efficient transmission of the data over even lower bandwidth communication systems. Many software libraries are also available that handle the encoding and decoding of ASN.1 data structures (14).
ITS Data Bus. Chartered in late 1995, the Data Bus Committee is developing the concept of a dedicated ITS data bus that may be installed on a vehicle to work in parallel with existing automotive electronics (15). When complete, the data bus will facilitate the addition of ITS electronics devices to vehicles without endangering any of their existing systems.
The ITS data bus will provide an open architecture to permit interoperability that will allow manufacturers, dealers, and vehicle buyers to install a wide range of electronics equipment in vehicles at any time during the vehicle's lifecycle, with little or no expert assistance required. The goal of the Data Bus Committee is to develop SAE recommended practices and standards that define the message formats, message header codes, node IDs, application services and service codes, data definitions, diagnostic connectors, diagnostic services and test mode codes, ITS data bus-to-vehicle bus gateway services and service codes, network management services/functionality, and other areas as may be needed.
National ITS Architecture. The National ITS Architecture provides a common framework for planning, defining, and integrating intelligent transportation systems (16). It is a mature product that reflects the contributions of a broad cross section of the ITS community: transportation practitioners, systems engineers, system developers, technology specialists, and consultants. The architecture provides high-level definitions of the functions, physical entities, and information flows that are required for ITS. The architecture is meant to be used as a planning tool for state and local governments.
Traffic Management Data Dictionary. The Data Dictionary for Advanced Traveler Information Systems, Society of Automotive Engineers (SAE) standard J2353, provides concise definitions for the data elements used in advanced traveler information systems (ATIS) (17). These definitions provide a bit-by-bit breakdown of each data element using ASN.1. The traffic management data dictionary is meant to be used in conjunction with at least two other standards, one for defining the message sets (e.g., SAE J2354 and SAE J2369) and the other for defining the communication protocol to be used.
TransXML. XML Schemas for the Exchange of Transportation Data (TransXML) is a fledgling standard with the goal of integrating various standards such as LandXML, aecXML, ITS XML, and OpenGIS into a common framework. This project has not had any results yet; see Ref. 18 for more information.
Geography Markup Language. Geography markup language (GML) is an OpenGIS standard based on XML that is used for encoding and storing geographic information, including the geometry and properties of geographic features (19). GML was developed by an international, nonprofit standards organization, the Open GIS Consortium, Inc. GML describes geographic features using a variety of XML elements such as features, coordinate referencing systems, geometry, topology, time, units of measure, and generalized values.
Spatial Data Transfer Standard. The Spatial Data Transfer Standard (SDTS) is a National Institute of Standards (NIST) standard for the exchange of digitized spatial data between computer systems (20). SDTS uses ISO 8211 for its physical file encoding and breaks the file down into modules, each of which is composed of records, which in turn are composed of fields. Thirty-four different types of modules currently are defined by the standard, some of which are specialized for a specific application.
LandXML. LandXML is an industry-developed standard schema for the exchange of data created during land planning, surveying, and civil engineering design processes by different applications software. LandXML was developed by an industry consortium of land developers, universities, and various government agencies. LandXML is used within GIS applications, survey field instruments, civil engineering desktop and computer-aided design (CAD)-based applications, instant three-dimensional (3-D) viewers, and high-end 3-D visualization rendering applications. LandXML has provisions not only for the storage of a feature's geometry, but also for its attributes, such as property owner and survey status. LandXML can be converted readily to GML format.
Scalable Vector Graphics. Scalable vector graphics (SVG) is an XML-based standard for the storage and display of graphics using vector data and graphic primitives. It primarily is used in the design of websites, but it also has application for the display and transfer of vector-based geographic data. It does not, however, have specific provisions for the attributes that are associated with geographic data, such as road names and speed limits.
Location Referencing Data Model. The location referencing data model is a National Cooperative Highway Research Program (NCHRP) project that defines a location referencing system. Within this system, a location can be defined in various formats such as mile points or addresses. It also provides for the conversion of a location between the various formats and for the definition of a location in one, two, or more dimensions (21).
International Standards Organization Technical Committee 211. The International Standards Organization Technical Committee 211 (ISO TC/211) has several geographic standards that are of importance to intelligent transportation systems: Geographic information - Rules for Application Schema; ISO 19123, Geographic information - Schema for coverage geometry and functions; and ISO 19115, Geographic information - Metadata.
Data Mining and Analysis
Data mining and analysis of transportation data are useful for transportation planning purposes (e.g., transit development, safety analysis, and road design). Data quality is especially important because the sensors involved can malfunction and feed erroneous data into the analysis process (22). For this reason, ITS data typically need to be filtered carefully to assure data quality. This filtering is accomplished by throwing out inconsistent data. For instance, a speed detector may be showing a constant free-flow speed while its upstream and downstream counterparts show much slower speeds during peak usage hours. In this case, the inconsistent detector data would be thrown out, and an average of the upstream and downstream detectors would be used instead.
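A much-simplified version of such a consistency check is sketched below (Python); the threshold and the replacement rule are illustrative choices, not part of any ITS standard.

    def filter_speed(upstream, detector, downstream, max_ratio=1.5):
        """Replace a detector reading that is inconsistent with its neighbors.

        If the reading differs from the average of the upstream and downstream
        detectors by more than max_ratio, it is discarded and replaced by that
        average; otherwise it is kept unchanged.
        """
        neighbor_avg = (upstream + downstream) / 2
        if detector > neighbor_avg * max_ratio or detector < neighbor_avg / max_ratio:
            return neighbor_avg          # inconsistent: use the neighbors instead
        return detector                  # consistent: keep the measurement

    # peak hour: neighbors report congestion, the middle detector is stuck at free flow
    print(filter_speed(upstream=35.0, detector=105.0, downstream=30.0))   # -> 32.5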
BIBLIOGRAPHY
1. National Spatial Data Infrastructure. Available: http://
www.geo-one-stop.gov/.
2. Extensible Markup Language (XML) 1.0 Specication. Avail-
able: http://www.w3.org/TR/REC-xml/.
3. XML in ITS Center-to-Center Communications. Available:
http://www.ntcip.org/library/documents/pdf/
9010v0107_XML_in_C2C.pdf.
4. SOAP Specications. Available: http://www.w3.org/TR/soap/.
5. XML-RPC Home Page. Available: http://www.xmlrpc.com/.
6. XML:DB Initiative. Available: http://xmldb-org.sourcefor-
ge.net/.
7. James Kobielus, Taming the XML beast. Available: http://www.networkworld.com/columnists/2005/011005kobielus.html.
8. R. Elfwing, U. Paulsson, and L. Lundberg, Performance of
SOAP in web service environment compared to CORBA,
Proc. of the Ninth Asia-Pacic Software Engineering Confer-
ence, 2002, pp. 84.
9. CORBA IIOP Specication. Available: http://www.omg.org/
technology/documents/formal/corba_iiop.htm.
10. NTCIP2305- ApplicationProle for CORBA. Available: http://
www.ntcip.org/library/documents/pdf/2305v0111b.pdf.
11. MICO CORBA. Available: http://www.mico.org/.
12. The Community OpenORB Project. Available: http://open-
orb.sourceforge.net/.
13. X.690 ASN.1 encoding rules: Specication of Basic Encoding
Rules, Canonical Encoding Rules and Distinguished Encoding
Rules. Available: http://www.itu.int/ITU-T/studygroups/
com17/languages/X.690-0207.pdf.
14. ASN.1 Tool Links. Available: http://asn1.elibel.tm.fr/links.
15. ITS Data Bus Committee. Available: http://www.sae.org/tech-
nicalcommittees/databus.htm.
16. National ITS Architecture. Available: http://itsarch.iteris.-
com/itsarch/.
17. SAE J2353 Advanced Traveler Information Systems (ATIS)
DataDictionary. Available: http://www.standards.its.dot.gov/
Documents/J2353.pdf.
18. XML Schemas for Exchange of Transportation Data
(TransXML). Available: http://www4.trb.org/trb/crp.nsf/0/
32a8d2b6bea6dc3885256d0b006589f9?OpenDocument.
19. OpenGIS
1
Geography Markup Language (GML) Implemen-
tationSpecication. Available: http://www.opengis.org/docs/
02-023r4.pdf.
20. Spatial Data Transfer Standard (SDTS). Available: http://
ssdoo.gsfc.nasa.gov/nost/formats/sdts.html.
21. N. Koncz, and T. M. Adams, A data model for multi-dimen-
sional transportation location referencing systems, Urban
Regional Informa. Syst. Assoc. J., 14(2), 2002. Available:
http://www.urisa.org/Journal/protect/Vol14No2/Koncz.pdf.
22. M. Flinner, and H. Horsey, Trafc data edit procedures pooled
fund study trafc data quality (TDQ), 2000, SPR-2 (182).
JOHN F. DILLENBURG
PETER C. NELSON
University of Illinois at Chicago
Chicago, Illinois
DATA PRIVACY
The definition of privacy varies greatly among societies, across time, and even among individuals; with so much variation and interpretation, providing a single concrete definition is difficult. Several concepts that are related to privacy are as follows:
Solitude: The right of an individual to be undisturbed
by, or invisible from, some or all other members of
society, including protection of the home, work, or a
public place from unwanted intrusion.
Secrecy: The ability to keep personal information
secret from some or all others, including the right
to control the initial and onward distribution of per-
sonal information and the ability to communicate
with others in private.
Anonymity: The right to anonymity by allowing an
individual to remainnameless insocial or commercial
interaction.
LEGAL BACKGROUND
In their influential article entitled "The Right to Privacy" published in the Harvard Law Review in 1890, Samuel D. Warren and Louis D. Brandeis described how "[i]nstantaneous photographs and newspaper enterprise have invaded the sacred precincts of private and domestic life; and numerous mechanical devices threaten to make good the prediction that what is whispered in the closet shall be proclaimed from the house-tops."
Warren and Brandeis described how U.S. common law could be used to protect an individual's right to privacy. The article has been very influential in the subsequent development of legal instruments to protect data privacy. In 1948, the General Assembly of the United Nations passed the Declaration of Human Rights, of which Article 12 states, "[n]o one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks." The majority of countries in the world now consider privacy as a right and have enacted laws describing how and when citizens can expect privacy.
The level of legal protection of privacy varies from one country to another. In the United States, there is no general privacy protection law to regulate private industry; instead, specific federal legislation is introduced to deal with particular instances of the collection and processing of personal information. In addition, some states have amended their constitutions to include specific rights to privacy. In areas not covered by specific federal or state laws, companies are left to self-regulate. The new executive post of Chief Privacy Officer has been adopted by some companies, with responsibilities including the development of privacy policy, tracking legislation, monitoring competitors, training employees, and, if necessary, performing damage control in the press.
The European Union has a comprehensive framework to control the collection and distribution of personally identifiable information by both state and private-sector institutions. The 1998 EU Data Protection Directive is the latest piece of European legislation that follows on from a series of national laws initially developed in the 1970s and earlier EU directives and OECD guidelines.
The EU Directive requires that all personal data are as follows:
programmer is propagated to its members. Thus, if an authorization to read tech-reports is given to G_programmer, it is propagated to the members of that group. It is important to note that roles and groups are two complementary concepts; they are not mutually exclusive.
The enforcement of role-based policies presents several advantages. Authorization management is simplified by the separation of users' identities from the authorizations they need to execute tasks. Several users can be given the same set of authorizations simply by assigning them the same role. Also, if a user's responsibilities change (e.g., because of a promotion), it is sufficient to disable her for the previous roles and enable her for a new set of roles, instead of deleting and inserting the many access authorizations that this responsibility change implies.

Figure 6. Controlling information flow for integrity.

A major advantage of role-based policies
is represented by the fact that authorizations of a role are enabled only when the role is active for a user. This allows the enforcement of the least privilege principle, whereby a process is given only the authorizations it needs to complete successfully. This confinement of the process in a defined workspace is an important defense against attacks aiming at exploiting authorizations (as the Trojan Horse example illustrated). Moreover, the definition of roles and related authorizations fits with the information system organization and allows us to support related constraints, such as, for example, separation of duties (23-25). Separation of duties requires that no user should be given enough privileges to be able to misuse the system. For instance, the person authorizing a paycheck should not be the same person who prepares it. Separation of duties can be enforced statically, by controlling the specification of roles associated with each user and authorizations associated with each role, or dynamically, by controlling the actions actually executed by users when playing particular roles (4,24).
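The sketch below (Python) shows the basic bookkeeping behind these ideas: authorizations attach to roles, a user holds them only while a role is active (least privilege), and a static separation-of-duties constraint refuses to assign two conflicting roles to the same user. Role and permission names are invented for the example.

    class RBAC:
        """Toy role-based access control with a static separation-of-duties check."""

        def __init__(self, role_permissions, conflicting_roles):
            self.role_permissions = role_permissions      # role -> set of permissions
            self.conflicting = conflicting_roles          # pairs of roles no user may combine
            self.user_roles = {}                          # user -> assigned roles
            self.active_roles = {}                        # user -> currently activated roles

        def assign(self, user, role):
            roles = self.user_roles.setdefault(user, set())
            for a, b in self.conflicting:                 # static separation of duties
                if {a, b} <= roles | {role}:
                    raise ValueError(f"{role!r} conflicts with existing roles of {user!r}")
            roles.add(role)

        def activate(self, user, role):
            if role in self.user_roles.get(user, set()):
                self.active_roles.setdefault(user, set()).add(role)

        def check(self, user, permission):
            """Least privilege: only permissions of *active* roles are granted."""
            return any(permission in self.role_permissions.get(r, set())
                       for r in self.active_roles.get(user, set()))

    rbac = RBAC({"clerk": {"prepare_paycheck"}, "manager": {"authorize_paycheck"}},
                conflicting_roles=[("clerk", "manager")])
    rbac.assign("alice", "clerk")
    rbac.activate("alice", "clerk")
    print(rbac.check("alice", "prepare_paycheck"))   # True
    try:
        rbac.assign("alice", "manager")
    except ValueError as e:
        print(e)                                     # separation of duties violated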
AUDIT
Authentication and access control do not guarantee complete security. Indeed, unauthorized or improper uses of the system can still occur. The reasons for this are various. First, security mechanisms, like any other software or hardware mechanism, can suffer from flaws that make them vulnerable to attacks. Second, security mechanisms have a cost, both monetary and in loss of system performance. The more protection is implemented to reduce accidental and deliberate violations, the higher the cost of the system will be. For this reason, organizations often prefer to adopt less secure mechanisms that have little impact on the system's performance over more reliable mechanisms that would introduce processing overhead. Third, authorized users may misuse their privileges. This last aspect is definitely not the least, as misuse of privileges by internal users is one major cause of security violations.
This scenario raises the need for audit control. Audit provides a post facto evaluation of the requests and the accesses that occurred to determine whether violations have occurred or have been attempted. To detect possible violations, all user requests and activities are registered in an audit trail (or log) for their later examination. An audit trail is a set of records of computer events, where a computer event is any action that happens on a computer system (e.g., logging into a system, executing a program, and opening a file). A computer system may have more than one audit trail, each devoted to a particular type of activity. The kind and format of data stored in an audit trail may vary from system to system; however, the information that should be recorded for each event includes the subject making the request, the object to which access is requested, the operation requested, the time and location at which the operation was requested, the response of the access control, and the amount of resources used. An audit trail is generated by an auditing system that monitors system activities. Audit trails have many uses in computer security:
Thus, |A'| = a log n and A' ⊆ A. Then, we find the largest number from the selected subset. Obviously, we can do it in Θ(log n) time. We argue that this element is the correct answer with high probability. The algorithm fails if all elements of the randomly selected subset are less than or equal to the median M. As elements of A' are randomly selected, it is easy to observe that any element of A' is less than or equal to the median M with probability 1/2. Therefore, the probability of failure is P_f = (1/2)^(a log n) = n^(-a). In summary, we can conclude that if the size of the subset A' satisfies |A'| = a log n, then the largest element of A' is a correct result with probability at least 1 - n^(-a). Applying the randomized algorithm, we solve the problem in Θ(log n) time with high probability.
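A minimal sketch of this sampling idea follows, assuming the goal is to return an element from the upper half of A (i.e., one that exceeds the median) and that the parameter a controls the failure probability n^(-a).

import math, random

def element_above_median(A, a=2):
    """Sample about a*log2(n) elements of A and return the largest one.
    The result fails to exceed the median only if every sampled element
    lies in the lower half, which happens with probability roughly
    (1/2)**(a*log2(n)) = n**(-a)."""
    n = len(A)
    k = min(n, max(1, math.ceil(a * math.log2(n)))) if n > 1 else n
    sample = random.sample(A, k)
    return max(sample)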
Parallel Algorithms
Parallel algorithms provide methods to harness the power
of multiple processors working together to solve a problem.
The idea of using parallel computing is natural and has
been explored in multiple ways. Theoretical models for
parallel algorithms have been developed, including the
shared-memory and message-passing (MP) parallel systems.
The PRAM (Parallel Random Access Machine) is a sim-
ple model in which multiple processors access a shared
memory pool. Each read or write operation is executed
asynchronously by the processors. Different processors
collaborate to solve a problem by writing information
that will be subsequently read by others. The PRAM is a
popular model that captures the basic features of parallel
systems, and there is a large number of algorithms devel-
oped for this theoretical model. However, the PRAM is not a realistic model: Real systems have difficulty in guaranteeing concurrent access to memory shared by several processors.
Variations of the main model have been proposed to
overcome the limitations of PRAM. Such models have
additional features, including different memory access
policies. For example, the EREW (exclusive read, exclusive write) model allows only one processor to access the memory for read and write operations. The CREW model (concurrent read, exclusive write) allows multiple processors to read the memory simultaneously, whereas only one processor can write to memory at a time.
MP is another basic model that is extensively used in
parallel algorithms. In MP, all information shared by pro-
cessors is sent via messages; memory is local and accessed
by a single processor. Thus, messages are the only mechan-
ism of cooperation in MP. A major result in parallel pro-
gramming is that the two mechanisms discussed (shared
memory and MP) are equivalent in computational power.
The success of a parallel algorithm is measured by the amount of work that it can evenly share among processors. Therefore, we need to quantify the speedup of a parallel algorithm, given the size of the problem, the time T_S(n) necessary to solve the problem on a sequential computer, and the time T_P(n, p) needed by the parallel algorithm with p processors. The speedup f(n, p) can be computed as f(n, p) = T_S(n) / T_P(n, p), and the asymptotic speedup as f(p) = lim_{n→∞} f(n, p). The desired situation is that all processors can contribute to evenly reduce the computational time. In this case, f(n, p) = Θ(p), and we say that the algorithm has linear speedup.
We show a simple example of a parallel algorithm. In the prefix computation problem, a list L = l_1, . . . , l_n of numbers is given, and the objective is to compute the sequence of values Σ_{i=1}^{j} l_i for j = 1, . . . , n. The following algorithm uses n processors to compute the parallel prefix of n numbers in O(log n) time.
1. Algorithm Parallel-prefix
2. Input: L = l_1, . . . , l_n
3. for k ← 0 to ⌈log n⌉ - 1 do
4.     Execute the following instructions in parallel:
5.     for j ← 2^k + 1 to n do
6.         l_j ← l_{j-2^k} + l_j
7.     end
8. end
As a sequential algorithm for this problem has complexity Θ(n), the speedup of using the parallel algorithm is f(n, n) = Θ(n / log n).
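A minimal sketch that simulates the parallel steps sequentially may help: each pass corresponds to one parallel step with shift 2^k, and values are read from a snapshot so that all updates of a step see the results of the previous step.

import math

def parallel_prefix(values):
    """Sequential simulation of the O(log n)-step parallel prefix scheme:
    in step k, every position j >= 2**k adds the value 2**k places to its
    left; after ceil(log2(n)) steps each position holds its prefix sum."""
    l = list(values)
    n = len(l)
    steps = math.ceil(math.log2(n)) if n > 1 else 0
    for k in range(steps):
        shift = 2 ** k
        prev = list(l)              # snapshot of the previous step
        for j in range(shift, n):   # these updates run in parallel on a PRAM
            l[j] = prev[j - shift] + prev[j]
    return l

# Example: parallel_prefix([1, 2, 3, 4]) == [1, 3, 6, 10]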
Heuristics and Metaheuristics
One of the most popular techniques for solving computa-
tionally hard problems is to apply the so-called heuristic
approaches. A heuristic is an ad hoc algorithm that applies
a set of rules to create or improve solutions to an NP-hard
problem. Heuristics give a practical way of finding solutions that can be satisfactory for most purposes. The main distinctive features of this type of algorithm are fast execution time and the goal of finding good-quality solutions without necessarily providing any guarantee of quality. Heuristics are particularly useful for solving, in practice, hard large-scale problems for which more exact solution methods are not available.
As an example of a simple heuristic, we can consider a local search algorithm. Suppose we need to solve a combinatorial optimization problem, where we minimize an objective function f(x), and the decision variable x is a vector of n binary variables (i.e., x_i ∈ {0, 1} for i = 1, . . . , n). For each problem, we define a neighborhood N(x) of a feasible solution x as the set of feasible solutions x' that can be created by a well-defined modification of x. For example, let x ∈ {0, 1}^n be a feasible solution; then N(x) can be defined as the set of solutions x^k such that

x^k_i = 1 - x_i if i = k, and x^k_i = x_i otherwise,

for k = 1, . . . , n. A point x ∈ {0, 1}^n is called a local optimum if it does not have a neighbor whose objective value is strictly better than f(x). In our example, which we suppose is a minimization problem, this means that if x is a local optimum, then f(x) ≤ f(x^k) for all k = 1, . . . , n. In this case, if we can generate a feasible solution x, then we can check whether x is locally optimal in time O(n) (notice that other neighborhoods may lead to different search times). Let x̄ ∈ N(x), with f(x̄) < f(x). Assigning x ← x̄, we can repeat the aforementioned procedure that searches for a locally optimal solution until the optimality criterion is reached. For many problems, local search-based approaches similar to the one described above have proved to be very successful.
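A minimal sketch of the 1-flip local search just described follows, assuming f maps a binary vector to the number to be minimized.

def local_search(f, x):
    """Repeatedly move to a strictly better neighbor obtained by flipping
    one coordinate; stop when no flip improves f, i.e., at a local optimum."""
    x = list(x)
    improved = True
    while improved:
        improved = False
        for k in range(len(x)):
            neighbor = list(x)
            neighbor[k] = 1 - neighbor[k]   # flip the k-th bit
            if f(neighbor) < f(x):
                x = neighbor                 # accept the improving move
                improved = True
                break
    return x

# Example: minimizing the number of ones converges to the all-zero vector.
# local_search(lambda v: sum(v), [1, 0, 1, 1]) == [0, 0, 0, 0]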
A metaheuristic is a family of heuristics that can be applied to find good solutions for difficult problems. The main difference between metaheuristics and heuristics is that, while a heuristic is an ad hoc procedure, a metaheuristic must follow a well-defined sequence of steps and use some specified data structures. However, the exact implementation of some of the steps in a metaheuristic may be customized according to the target problem. Therefore, a concrete implementation of a metaheuristic provides a heuristic for a specific problem (hence the use of the prefix "meta"). Elements of heuristics such as local search or greedy methods are usually combined in metaheuristics in order to find solutions of better quality.
Examples of metaheuristic methods include simulated
annealing, genetic algorithms, tabu search, GRASP, path
relinking, reactive search, ant colony optimization, and
variable neighborhood search.
BIBLIOGRAPHIC NOTES
Algorithms and data structures have a vast literature, and
we provide only some of the most important references.
Popular books on algorithms and data structures include the ones by Cormen et al. (12), Horowitz et al. (13), Sedgewick (14), and Knuth (15). Networks and graph algorithms are covered in depth in Ahuja et al. (16). Books on data structures include the classics by Aho et al. (17) and Tarjan (18).
NP-hard problems have been discussed in several books, but the best-known reference is the compendium by Garey and Johnson (8). A standard reference on randomized algorithms is Ref. (19). For more information on parallel
algorithms, and especially PRAM algorithms, see the book
by Jaja (20). Good reviews of approximation algorithms are
presented in the book edited by Hochbaum (21), and by
Vazirani (22). For a more detailed discussion of different
metaheuristics and related topics, the reader is referred to
Ref. (23).
Information about the complexity of numerical optimi-
zation problems is available in the paper by Pardalos and
Vavasis (24) and the books listed in Refs. (25) and (26).
BIBLIOGRAPHY
1. E. D. Demaine, Cache-oblivious algorithms and data struc-
tures, in Lecture Notes from the EEF Summer School on
Massive Data Sets, Lecture Notes in Computer Science.
New York: Springer-Verlag, 2002.
2. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran,
Cache-oblivious algorithms, in Proc. of 40th Annual Sympo-
sium on Foundations of Computer Science, New York, 1999.
3. L. Arge, The buffer tree: A new technique for optimal I/O
algorithms, in Proceedings of Fourth Workshop on Algorithms
and Data Structures (WADS), volume 955 of Lecture Notes in
Computer Science. New York: Springer-Verlag, 1995, pp. 334–345.
4. L. Arge, D. Vengroff, and J. Vitter, External-memory algo-
rithms for processing line segments in geographic information
systems, in Proceedings of the Third European Symposium on
Algorithms, volume 979 of Lecture Notes in Computer Science.
New York: Springer-Verlag, 1995, pp. 295–310.
5. J. Abello, P. Pardalos, and M. Resende (eds.), Handbook of
Massive Data Sets. Dordrecht, the Netherlands: Kluwer Aca-
demic Publishers, 2002.
6. C. H. Papadimitriou, Computational Complexity. Reading,
MA: Addison-Wesley, 1994.
7. S. Cook, The complexity of theorem-proving procedures,
in Proc. 3rd Ann. ACM Symp. on Theory of Computing.
New York: Association for Computing Machinery, 1971, pp. 151–158.
8. M. R. Garey and D. S. Johnson, Computers and Intractability -
A Guide to the Theory of NP-Completeness. New York: W. H.
Freeman and Company, 1979.
9. D. Johnson, Approximation algorithms for combinatorial pro-
blems, J. Comp. Sys. Sci., 9(3): 256–278, 1974.
10. N. Christofides, Worst-case analysis of a new heuristic for the travelling salesman problem, in J. F. Traub (ed.), Symposium on New Directions and Recent Results in Algorithms and Complexity. New York: Academic Press, 1976, p. 441.
11. S. Arora and C. Lund, Hardness of approximations, in D.
Hochbaum (ed.), Approximation Algorithms for NP-hard Pro-
blems. Boston, MA: PWS Publishing, 1996.
12. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,
Introduction to Algorithms, 2nd ed., Cambridge, MA: MIT
Press, 2001.
13. E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algo-
rithms. New York: Computer Science Press, 1998.
14. R. Sedgewick, Algorithms. Reading, MA: Addison-Wesley,
1983.
15. D. Knuth, The Art of Computer Programming, Vol. 1: Fundamental Algorithms. Reading, MA: Addison-Wesley, 1997.
16. R. Ahuja, T. Magnanti, and J. Orlin, Network Flows. Engle-
wood Cliffs, NJ: Prentice-Hall, 1993.
17. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, Data Structures
and Algorithms. Computer Science and Information Proces-
sing. Reading, MA: Addison-Wesley, 1982.
18. R. E. Tarjan, Data Structures and Network Algorithms. Regio-
nal Conference Series in Applied Mathematics. Philadelphia,
PA: Society for Industrial and Applied Mathematics, 1983.
19. R. Motwani and P. Raghavan, Randomized Algorithms.
Cambridge: Cambridge University Press, 1995.
20. J. JaJa, An Introduction to Parallel Algorithms. Reading, MA:
Addison-Wesley, 1992.
21. D. Hochbaum (ed.), Approximation Algorithms for NP-hard
Problems. Boston, MA: PWS Publishing, 1996.
22. V. Vazirani, Approximation Algorithms. New York: Springer-
Verlag, 2001.
23. M. Resende and J. de Sousa (eds.), Metaheuristics: Computer
Decision-Making. Dordrecht, the Netherlands: Kluwer Aca-
demic Publishers, 2004.
24. P. M. Pardalos and S. Vavasis, Complexity issues in numerical
optimization, Math. Program. B, 57(2): 1992.
25. P. Pardalos (ed.), Complexity in Numerical Optimization.
Singapore: World Scientific, 1993.
26. P. Pardalos (ed.), Approximation and Complexity in Numerical
Optimization: Continuous and Discrete Problems. Dordrecht,
the Netherlands: Kluwer Academic Publishers, 2000.
CARLOS A.S. OLIVEIRA
Oklahoma State University
Stillwater, Oklahoma
PANOS M. PARDALOS
OLEG A. PROKOPYEV
University of Florida
Gainesville, Florida
DATA WAREHOUSE
INTRODUCTION
For a few decades, the role played by database technology in companies and enterprises has only been that of storing operational data, that is, data generated by daily, routine operations carried out within business processes (such as selling, purchasing, and billing). On the other hand, managers need quick and reliable access to the strategic information that supports decision making. Such information is extracted mainly from the vast amount of operational data stored in corporate databases, through a complex selection and summarization process.
Very quickly, the exponential growth in data volumes made computers the only suitable support for the decisional process run by managers. Thus, starting from the late 1980s, the role of databases began to change, which led to the rise of decision support systems, meant as the suite of tools and techniques capable of extracting relevant information from a set of electronically recorded data. Among decision support systems, data warehousing systems are probably those that have captured the most attention from both the industrial and the academic world.
A typical decision-making scenario is that of a large enterprise, with several branches, whose managers wish to quantify and evaluate the contribution given by each branch to the global commercial return of the enterprise. Because elemental commercial data are stored in the enterprise database, the traditional approach taken by the manager consists in asking the database administrators to write an ad hoc query that can properly aggregate the available data to produce the result. Unfortunately, writing such a query is commonly very difficult, because different, heterogeneous data sources will be involved. In addition, the query will probably take a very long time to execute, because it will involve a huge volume of data, and it will run together with the application queries that are part of the operational workload of the enterprise. Eventually, the manager will get on his desk a report in the form of either a summary table, a diagram, or a spreadsheet, on which he will base his decision.
This approach leads to a useless waste of time and resources, and it often produces poor results. Moreover, mixing these ad hoc analytical queries with the operational ones required by the daily routine causes the system to slow down, which makes all users unhappy. Thus, the core idea of data warehousing is to separate analytical queries, which are commonly called OLAP (On-Line Analytical Processing) queries, from the operational ones, called OLTP (On-Line Transactional Processing) queries, by building a new information repository that integrates the elemental data coming from different sources, organizes them into an appropriate structure, and makes them available for analyses and evaluations aimed at planning and decision making.
Among the areas where data warehousing technologies are employed successfully, we mention but a few: trade, manufacturing, financial services, telecommunications, and health care. On the other hand, the applications of data warehousing are not restricted to enterprises: They also range from epidemiology to demography, from natural sciences to didactics. The common trait of all these fields is the need for tools that enable the user to obtain summary information easily and quickly out of a vast amount of data, to use it for studying a phenomenon and discovering significant trends and correlations; in short, for acquiring useful knowledge for decision support.
BASIC DEFINITIONS
A data warehousing system can be defined as a collection of methods, techniques, and tools that support the so-called knowledge worker (one who works primarily with information or develops and uses knowledge in the workplace: for instance, a corporate manager or a data analyst) in decision making by transforming data into information. Several architectures are adopted in practice for data warehousing systems, among them:
Independent data marts architecture
Bus architecture
Hub-and-spoke architecture
Centralized architecture
Federated architecture
In the independent data mart architecture, different
data marts are designed separately and built in a noninte-
grated fashion (Fig. 1). This architecture, although some-
times initially adopted in the absence of a strong
sponsorship toward an enterprise-wide warehousing
project or when the organizational divisions that make
up the company are coupled loosely, tends to be soon
replaced by other architectures that better achieve data
integration and cross-reporting.
The bus architecture is apparently similar to the previous one, with one important difference: A basic set of conformed dimensions and facts, derived from a careful analysis of the main enterprise processes, is adopted and shared as a common design guideline to ensure logical integration of data marts and an enterprise-wide view of information.
In the hub-and-spoke architecture, much attention is
given to scalability and extensibility and to achieving an
enterprise-wide view of information. Atomic, normalized
data are stored in a reconciled level that feeds a set of data
marts containing summarized data in multidimensional
form (Fig. 2). Users mainly access the data marts, but they
occasionally may query the reconciled level.
The centralized architecture can be viewed as a particular implementation of the hub-and-spoke architecture where the reconciled level and the data marts are collapsed into a single physical repository.
Finally, the federated architecture is sometimes adopted in contexts where preexisting data warehouses/data marts are to be integrated noninvasively to provide a single, cross-organization decision support environment (e.g., in the case of mergers and acquisitions). Each data warehouse/data mart is either virtually or physically integrated with the
others by leaning on a variety of advanced techniques such
as distributed querying, ontologies, and metadata inter-
operability.
ACCESSING THE DATA WAREHOUSE
This section discusses how users can exploit information
stored in the data warehouse for decision making. In the
following subsection, after introducing the particular
features of the multidimensional model, we will survey
the two main approaches for analyzing information:
reporting and OLAP.
The Multidimensional Model
The reasons why the multidimensional model is universally adopted as the paradigm for representing data in data warehouses are its simplicity, its suitability for business analyses, and its intuitiveness for users who are not computer experts, thanks also to the widespread use of spreadsheets as tools for individual productivity. Unfortunately, although some attempts have been made in the literature to formalize the multidimensional model (e.g., Ref. 1), none of them has emerged as a standard so far.
The multidimensional model originates from the obser-
vation that the decisional process is ruled by the facts of the
business world, such as sales, shipments, bank transac-
tions, and purchases. The occurrences of a fact correspond
to events that occur dynamically: For example, every sale or
shipment made is an event. For each fact, it is important to
know the values of a set of measures that quantitatively
describe the events: the revenue of a sale, the quantity
shipped, the amount of a bank transaction, and the dis-
count on a purchase.
The events that happen in the enterprise world are obviously too many to be analyzed one by one. Thus, to make them easily selectable and groupable, we imagine arranging them within an n-dimensional space whose axes, called dimensions of analysis, define different perspectives for their identification. Dimensions commonly are discrete, alphanumeric attributes that determine the minimum granularity for analyzing facts. For instance, the sales in a chain of stores can be represented within a three-dimensional space whose dimensions are the products, the stores, and the dates.
The concept of dimension gave birth to the well-known cube metaphor for representing multidimensional data. According to this metaphor, events correspond to cells of a cube whose edges represent the dimensions of analysis. A cell of the cube is determined uniquely by assigning a value to every dimension, and it contains a value for each measure. Figure 3 shows an intuitive graphical representation of a cube centered on the sale fact. The dimensions are product, store, and date. An event corresponds to the selling of a given product in a given store on a given day, and it is described by two measures: the quantity sold and the revenue. The figure emphasizes that the cube is sparse, i.e., that several events did not happen at all: Obviously, not all products are sold every day in every store.
Normally, each dimension is structured into a hierarchy
of dimension levels (sometimes called roll-up hierarchy)
that group its values in different ways. For instance, pro-
ducts may be grouped according to their type and their
brand, and types may be grouped additionally into cate-
gories. Stores are grouped into cities, which in turn are
grouped into regions and nations. Dates are grouped into
months and years. On top of each hierarchy, a final level exists that groups together all possible values of the hierarchy (all products, all stores, and all dates). Each dimension level may be further described by one or more descriptive attributes (e.g., a product may be described by its name, its color, and its weight).
A brief mention of some alternative terminology used either in the literature or in commercial tools is useful.
Figure 1. Independent data marts and bus
architectures (without and with conformed
dimensions and facts).
Although with the term dimension we refer to the attribute that determines the minimum fact granularity, sometimes the whole hierarchies are named dimensions. Measures are sometimes called variables, metrics, categories, properties, or indicators. Finally, dimension levels are sometimes called parameters or attributes.
We now observe that the cube cells and the data they contain, although summarizing the elemental data stored within operational sources, are still very difficult to analyze because of their huge number. Two basic techniques are used, possibly together, to reduce the quantity of data and thus obtain useful information: restriction and aggregation. For both, hierarchies play a fundamental role because they determine how events may be aggregated and selected.
Restricting data means cutting out a portion of the cube to limit the scope of analysis. The simplest form of restriction is slicing, where the cube dimensionality is reduced by focusing on one single value for one or more dimensions. For instance, as depicted in Fig. 4, by deciding that only sales of store S-Mart are of interest, the decision maker actually cuts a slice of the cube, obtaining a two-dimensional subcube. Dicing is a generalization of slicing in which a subcube is determined by posing Boolean conditions on hierarchy levels. For instance, the user may be interested in sales of products of type Hi-Fi for the stores in Rome during the days of January 2007 (see Fig. 4).
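A minimal sketch of slicing and dicing over a toy cube follows; the cell coordinates and measure values are invented for illustration.

# A toy cube: each cell maps (product, store, date) to its measures.
cells = {
    ("TV-100", "S-Mart", "2007-01-05"): {"quantity": 3, "revenue": 1500},
    ("TV-100", "CityShop", "2007-01-05"): {"quantity": 1, "revenue": 500},
    ("TV-100", "S-Mart", "2007-01-20"): {"quantity": 2, "revenue": 1000},
}

def dice(cube, predicate):
    """Dicing keeps only the cells whose coordinates satisfy a Boolean
    condition; slicing is the special case of fixing one dimension."""
    return {coords: m for coords, m in cube.items() if predicate(*coords)}

# Slice: only the sales of store S-Mart.
s_mart_sales = dice(cells, lambda product, store, date: store == "S-Mart")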
Although restriction is used widely, aggregation plays
the most relevant role in analyzing multidimensional data.
In fact, most often users are not interested in analyzing
events at the maximum level of detail. For instance, it may be interesting to analyze sale events not on a daily basis but by month. In the cube metaphor, this process means grouping, for each product and each store, all cells corresponding to the days of the same month into one macro-cell. In the
aggregated cube obtained, each macro-cell represents a
synthesis of the data stored in the cells it aggregates: in
our example, the total number of items sold in each month
and the total monthly revenue, which are calculated by
summing the values of Quantity and Revenue through the
corresponding cells. Eventually, by aggregating along the
time hierarchy, an aggregated cube is obtained in which
each macro-cell represents the total sales over the whole
time period for each product and store. Aggregation can
also be operated along two or more hierarchies. For
instance, as shown in Fig. 5, sales can be aggregated by
month, product type, and city.
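A minimal sketch of this kind of roll-up follows, grouping toy daily cells into monthly macro-cells by summing their additive measures.

from collections import defaultdict

# Toy cells: (product, store, date) -> additive measures.
cells = {
    ("TV-100", "S-Mart", "2007-01-05"): {"quantity": 3, "revenue": 1500},
    ("TV-100", "S-Mart", "2007-01-20"): {"quantity": 2, "revenue": 1000},
}

def roll_up(cube, group_fn):
    """Map each cell to a coarser macro-cell (e.g., day -> month) and sum
    the measures of all cells that coincide there."""
    macro = defaultdict(lambda: defaultdict(int))
    for coords, measures in cube.items():
        key = group_fn(*coords)
        for name, value in measures.items():
            macro[key][name] += value
    return {k: dict(v) for k, v in macro.items()}

# Group daily cells by (product, store, month):
monthly = roll_up(cells, lambda product, store, date: (product, store, date[:7]))
# monthly[("TV-100", "S-Mart", "2007-01")] == {"quantity": 5, "revenue": 2500}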
Noticeably, not every measure can be aggregated con-
sistently along all dimensions using the sum operator. In
some cases, other operators (such as average or minimum)
can be used instead, whereas in other cases, aggregation is
not possible at all. For details on the two related problems of additivity and summarizability, the reader is referred to Ref. 2.
Reporting
Reporting is oriented to users who need to access, periodically, information structured in a fixed way. For instance, a hospital must send monthly reports of the costs of patient stays to a regional office. These reports always have the same form, so the designer can write the query that generates the report and "freeze" it within an application so that it can be executed at the user's needs.
A report is associated with a query and a presentation.
The query typically entails selecting and aggregating
multidimensional data stored in one or more facts. The
presentation can be in tabular or graphical form (a diagram, a histogram, a pie chart, etc.). Most reporting tools also allow for automatically distributing periodic reports to interested users by e-mail on a subscription basis, or for posting reports on the corporate intranet server for downloading.
OLAP
OLAP, which is probably the best known technique for
querying data warehouses, enables users to explore inter-
actively and analyze information based on the multidimen-
sional model. Although the users of reporting tools
essentially play a passive role, OLAP users can actively define a complex analysis session where each step taken follows from the results obtained at previous steps. The impromptu character of OLAP sessions, the deep knowledge of data required, the complexity of the possible queries, and the orientation toward users not skilled in computer science maximize the importance of the employed tool, whose interface necessarily has to exhibit excellent flexibility and friendliness.
Figure 2. Hub-and-spoke architecture; ODS stands for operational data store.
An OLAP session consists in a navigation path that reflects the course of analysis of one or more facts from different points of view and at different detail levels. Such a path is realized as a sequence of queries, each expressed by difference with reference to the previous one. Query results are multidimensional; as for
reporting, OLAP tools typically represent data in either
tabular or graphical form.
Each step in the analysis session is marked by the
application of an OLAP operator that transforms the pre-
vious query into a new one. The most common OLAP
operators are as follows:
A user can have the DSS solve problems that the user
alone would not even attempt or that would consume a
great deal of time because of their complexity and
magnitude.
where the X_1, . . . , X_k list all free variables.
A rule with n = 0 is either a query or an integrity constraint. When n > 0, the rule is either an IDB rule used to derive data or it may be an integrity constraint that restricts what tuples may be in the database. Rules for which n = 1 and l = 0 are called Horn rules.
DDBs restrict arguments of atomic formulas to constants and variables, whereas in first-order logic atomic formulas may also contain function symbols as arguments; this restriction assures that answers to queries in DDBs are finite. (When there are function symbols, an infinite number of terms may be generated from the finite number of constants and the function symbols; hence, the answer to a query may be infinite.) Rules may be read either declaratively or procedurally. A declarative reading of Formula (1) is:
L_1 or L_2 or . . . or L_n is true if M_1 and M_2 and . . . and M_m and not M_{m+1} and . . . and not M_{m+l} are all true.
A procedural reading of Formula (1) is:
L_1 or L_2 or . . . or L_n are solved if M_1 and M_2 and . . . and M_m and not M_{m+1} and . . . and not M_{m+l} are solved.
The left-hand side of the implication, L_1 or . . . or L_n, is called the head of the rule, whereas the right-hand side, M_1 and M_2 and . . . and M_m and not M_{m+1} and . . . and not M_{m+l}, is called the body of the rule.
Queries to a database, Q(X_1, . . . , X_r), are of the form ∃X_1 . . . ∃X_r (L_1 ∧ L_2 ∧ . . . ∧ L_s), written as L_1, L_2, . . . , L_s, where s ≥ 1, the L_i are literals, and the X_i, 1 ≤ i ≤ r, are the free variables in Q. An answer to a query has the form <a_11, . . . , a_1r>, <a_21, . . . , a_2r>, . . . , <a_k1, . . . , a_kr> such that Q(a_11, . . . , a_1r) ∨ Q(a_21, . . . , a_2r) ∨ . . . ∨ Q(a_k1, . . . , a_kr) is provable from the database, which means that an inference system is used to find answers to queries.
DDBs are closely related to logic programs when the facts are restricted to atomic formulas and the rules have only one atom on the left-hand side. The main difference is that a logic program query search looks for a single answer, proceeding top-down from the query to an answer, whereas in DDBs, searches are bottom-up, starting from the facts, to find all answers. A logic program query might ask for an item supplied by a supplier, whereas in a deductive database, a query asks for all items supplied by a supplier. DDBs restricted to atoms as facts, and to rules that consist of single atoms on the left-hand side and atoms on the right-hand side that do not contain the default rule for negation, not, are called Datalog databases (i.e., rules in Formula (1) where n = 1, m ≥ 0, and l = 0). Rules in Datalog databases may be recursive. A traditional relational database is a DDB where the EDB consists of atoms and the IDB rules are not recursive.
There are several different concepts of the relationship of integrity constraints to the union of the EDB and the IDB in the DDB. Two such concepts are consistency and theoremhood. In the consistency approach (proposed by Kowalski), the ICs must be consistent with EDB ∪ IDB. In the theoremhood approach (proposed by Reiter and by Lloyd and Topor), each integrity constraint must be a theorem of EDB ∪ IDB.
Answering queries that consist of conjunctions of positive and default-negated atoms in Datalog requires that a semantics be associated with negation, because only positive atoms can be derived from Datalog DDBs. How one interprets the semantics of default negation can lead to different answers. Two important semantics for handling default negation are the closed-world assumption (CWA), due to Reiter, and negation-as-finite-failure (NFF), due to Clark. In the CWA, failure to prove a positive atom implies that the negated atom is true. In the NFF, predicates in the EDB and the IDB are considered the "if" portion of the database and are closed by effectively reversing the implication to achieve the "only if" part of the database. The two approaches lead to slightly different results. Negation, as applied to disjunctive theories, is discussed later.
Example 1 (Ancestor). Consider the following database of parents and ancestors. The database consists of two predicates, whose schemas are p(X, Y), intended to mean that Y is a parent of X, and a(X, Y), intended to mean that Y is an ancestor of X. The database consists of five EDB statements and two IDB rules:
r1. p(mike, jack)
r2. p(sally, jack)
r3. p(katie, mike)
r4. p(beverly, mike)
r5. p(roger, sally)
r6. a(X, Y) ← p(X, Y)
r7. a(X, Y) ← p(X, Z), a(Z, Y)
The answer to the question p(mike, X) is jack. The answer to the question a(mike, X) is jack, using rule r6. An answer to the query a(roger, X) is sally, using rule r6. Another answer to the query a(roger, X), jack, is found by using rule r7.
For the query p(katie, jack), the answer by the CWA is no; jack is not a parent of katie. The reason is that there are only five facts, none of which specify p(katie, jack), and there are no rules that can be used to find additional parents.
More expressive power may be obtained in a DDB by allowing negated atoms on the right-hand side of a rule. The semantics associated with such databases depends on how the rule of negation is interpreted, as discussed in the Datalog and Extended Deductive Databases section and the Disjunctive Deductive Databases section.
HISTORICAL BACKGROUND OF DEDUCTIVE DATABASES
The prehistory of DDBs is considered to be from 1957 to 1970. The efforts in this period used primarily ad hoc or simple approaches to perform deduction. The period 1970 to 1978 comprised the formative years, which preceded the start of the field. The period 1979 to 2003 saw the development of a theoretical framework and prototype systems.
Prehistory of Deductive Databases
In 1957, a system called ACSI-MATIC was under develop-
ment to automate work in Army intelligence. An objective
was to derive new data based on given information and
general rules. Chains of related data were sought and the
data contained reliability estimates. A prototype system
was implemented to derive new data whose reliability
values depended on the reliability of the original data.
The deduction used was modus ponens (i.e., from p and
p ! q, one concludes q, where p and q are propositions).
Several DDBs were developed in the 1960s. Although in 1970 Codd founded the field of Relational Databases, relational systems were in use before then. In 1963, using
a relational approach, Levien and Maron developed a sys-
tem, Relational Data File (RDF), that had an inferential
capability, implemented through a language termed
INFEREX. An INFEREX program could be stored in the
system (such as in current systems that store views) and
re-executed, if necessary. A programmer specified reasoning rules via an INFEREX program. The system handled
credibility ratings of sentences in forming deductions.
Theoretical work by Kuhns on the RDF project recognized that there were classes of questions that were, in a sense, not reasonable. For example, let the database consist of the statement, "Reichenbach wrote Elements of Symbolic Logic." Whereas the question, "What books has Reichenbach written?", is reasonable, the questions, "What books has Reichenbach not written?", or, "Who did not write Elements of Symbolic Logic?", are not reasonable. It is one of the first times that the issue of negation in queries was explored.
In 1964, Raphael, for his Ph.D. thesis at M.I.T., developed a system called Semantic Information Retrieval (SIR), which had a limited capability with respect to deduction, using special rules. Green and Raphael subsequently designed and implemented several successors to SIR: QA1, a re-implementation of SIR; QA2, the first system to incorporate the Robinson Resolution Principle developed for automated theorem proving; QA3, which incorporated added heuristics; and QA3.5, which permitted alternative design strategies to be tested within the context of the resolution theorem prover. Green and Raphael were the first to recognize the importance and applicability of the work performed by Robinson in automated theorem proving. They developed the first DDB using formal techniques based on the Resolution Principle, which is a generalization of modus ponens to first-order predicate logic. The Robinson Resolution Principle is the standard method used to deduce new data in DDBs.
Deductive Databases: The Formative Years, 1969–1978
The start of deductive databases is considered to be November 1977, when a workshop, "Logic and Data Bases," was organized in Toulouse, France. The workshop included researchers who had performed work in deduction from 1969 to 1977 and used the Robinson Resolution Principle to perform deduction. The workshop, organized by Gallaire and Nicolas in collaboration with Minker, led to the publication of papers from the workshop in the book Logic and Data Bases, edited by Gallaire and Minker.
Many significant contributions were described in the book. Nicolas and Gallaire discussed the difference between model theory and proof theory. They demonstrated that the approach taken by the database community was model theoretic (i.e., the database represents the truths of the theory and queries are answered by a bottom-up search). However, in logic programming, answers to a query used a proof theoretic approach, starting from the query, in a top-down search. Reiter contributed two papers. One dealt with compiling axioms. He noted that if the IDB contained no recursive axioms, then a theorem prover could be used to generate a new set of axioms where the head of each axiom was defined in terms of relations in a database. Hence, a theorem prover was no longer needed during query operations. His second paper discussed the CWA, whereby in a theory, if one cannot prove that an atomic formula is true, then the negation of the atomic formula is assumed to be true. Reiter's paper elucidated three major issues: the definition of a query, an answer to a query, and how one deals with negation. Clark presented an alternative theory of negation. He introduced the concept of if-and-only-if conditions that underlie the meaning of negation, called negation-as-finite-failure. The Reiter and Clark papers are the first to formally define default negation in logic programs and deductive databases. Several implementations of deductive databases were reported. Chang developed a system called
DEDUCE; Kellogg, Klahr, and Travis developed a system
called Deductively Augmented Data Management System
(DADM); and Minker described a system called Maryland Refutation Proof Procedure 3.0 (MRPPS 3.0). Kowalski discussed the use of logic for data description. Darvas, Futo, and Szeredi presented applications of Prolog to drug data and drug interactions. Nicolas and Yazdanian described the importance of integrity constraints in deductive databases. The book provided, for the first time, a comprehensive description of the interaction between logic and databases.
References to work on the history of the development of the field of deductive databases may be found in Refs. 1 and 2. A brief description of the early systems is contained in Ref. 1.
Deductive Databases: Prototypes, 1979–2004
During the period from 1979 through today, a number of prototype systems were developed based on the Robinson Resolution Principle and bottom-up techniques. Most of the prototypes that were developed during this period either no longer exist or are not supported. In this section, we describe several efforts because these systems contributed to developments that were subsequently incorporated into SQL, as described in the SQL:1999 section.
These systems include Validity, developed by Nicolas and Vieille; Coral, developed by Ramakrishnan at the University of Wisconsin; and NAIL!, developed by Ullman and his group at Stanford University. All of these prototypes introduced new techniques for handling deductive databases.
The Validity system's predecessor was developed at the European Computer Research Consortium, was directed by Nicolas, and started in 1984. It led to the study of algorithms and prototypes: deductive query evaluation methods (QSQ/SLD and others); integrity checking (Soundcheck); hypothetical reasoning and IC checking; and aggregation through recursion.
Implementation at Stanford University, directed by Ullman, started in 1985 on NAIL! (Not Another Implementation of Logic!). The effort led to the first paper on recursion using the magic sets method. See the Datalog Databases section for a discussion of magic sets. Other contributions were aggregation in logical rules and theoretical contributions to negation: stratified negation by Van Gelder, well-founded negation by Van Gelder, Ross, and Schlipf (see the Nonstratified Deductive Databases section for a discussion of stratification and well-foundedness), and modularly stratified negation (3).
Implementation efforts at the University of Wisconsin, directed by Ramakrishnan, on the Coral DDB started in the 1980s. Bottom-up and magic set methods were implemented. The declarative query language supports general Horn clauses augmented with complex terms, set-grouping, aggregation, negation, and relations with tuples that contain universally quantified variables.
DATALOG AND EXTENDED DEDUCTIVE DATABASES
The first generalization of relational databases was to permit function-free recursive Horn rules in a database (i.e., rules in which the head of a rule is an atom and the body of a rule is a conjunction of atoms). So, in Formula (1), n = 1, m ≥ 1, and l = 0. These databases are called DDBs, or Datalog databases.
Datalog Databases
In 1976, van Emden and Kowalski formalized the semantics of logic programs that consist of Horn rules, where the rules are not necessarily function-free. They recognized that the semantics of Horn theories can be characterized in three distinct ways: by model, fixpoint, or proof theory. These three characterizations lead to the same semantics. When the logic program is function-free, their work provides the semantics for Datalog databases.
To better understand model theory, we need to introduce the concept of the Herbrand universe, which is the set of constants of the database. The Herbrand base is the set of all atoms that can be constructed from the predicates using only elements from the Herbrand universe as arguments. A set of atoms from the Herbrand base that satisfies all the rules is called a Herbrand model.
Model theory deals with the collection of models that captures the intended meaning of the database. Fixpoint theory deals with a fixpoint operator that constructs the collection of all atoms that can be inferred to be true from the database. Proof theory provides a procedure that finds answers to queries with respect to the database. van Emden and Kowalski showed that the intersection of all Herbrand models of a Horn DDB is the unique minimal model, which is the same as all of the atoms in the fixpoint and exactly the atoms provable from the theory.
Example 2 (Example of Semantics). Consider Example 1.
The unique minimal Herbrand model of the database is:
M { p(mike, jack), p(sally, jack), p(katie, mike), p(bev-
erly, mike), p(roger, sally), a(mike, jack), a(sally,
jack), a(katie, mike), a(beverly, mike), a(roger, sally),
a(katie, jack), a(beverly, jack), a(roger, jack)}.
These atoms are all true, and when substituted into the
rules in Example 1, they make all of the rules true. Hence,
they form a model. If we were to add another fact to the
model M, say, p(jack, sally), it would not contradict any of
the rules, and it would also be a model. However, this fact
can be eliminated because the original set was a model and
is contained in the expanded model. That is, minimal
Herbrand models are preferred. It is also easy to see that
the atoms in M are the only atoms that can be derived from
the rules and the data. In Example 3, below, we show that these atoms are in the fixpoint of the database.
To find whether the negation of a ground atom is true, one can subtract the minimal Herbrand model from the Herbrand base. If the atom is contained in this set, then it is assumed false. Alternatively, answering queries that consist of negated atoms that are ground may be achieved using negation-as-finite-failure, as described by Clark.
Initial approaches to answering queries in DDBs did not handle recursion and were primarily top-down (or backward reasoning). However, answering queries in relational database systems was bottom-up (or forward reasoning) to find all answers. Several approaches were developed to handle recursion, two of which are called the Alexander and magic set methods; both make use of constants that appear in a query and perform the search by bottom-up reasoning. Rohmer, Lescoeur, and Kerisit introduced the Alexander method. Bancilhon, Maier, Sagiv, and Ullman developed the concept of magic sets. These methods take advantage of constants in the query and effectively compute answers using a combined top-down and bottom-up approach. Bry reconciled the bottom-up and top-down methods to compute recursive queries. He showed that the Alexander and magic set methods based on rewriting and the methods based on resolution implement the same top-down evaluation of the original database rules by means of auxiliary rules processed bottom-up. In principle, handling recursion poses no additional problems. One can iterate the search (referred to as the naive method) until a fixpoint is reached, which can be achieved in a finite number of steps because the database has a finite set of constants and is function-free. However, it is unknown how many steps will be required to obtain the fixpoint. The Alexander and magic set methods improve search time when recursion exists, such as for transitive closure rules.
Example 3 (Fixpoint). The fixpoint of a database is the set of all atoms that satisfy the EDB and the IDB. The fixpoint may be found in a naive manner by iterating until no more atoms can be found. Consider Example 1 again.
Step0 = ∅. That is, nothing is in the fixpoint.
Step1 = {p(mike, jack), p(sally, jack), p(katie, mike), p(beverly, mike), p(roger, sally)}. These are all facts and satisfy r1, r2, r3, r4, and r5. The atoms in Step0 ∪ Step1 now constitute a partial fixpoint.
Step2 = {a(mike, jack), a(sally, jack), a(katie, mike), a(beverly, mike), a(roger, sally)}, found by using the results of Step0 ∪ Step1 on rules r6 and r7. Only rule r6 provides additional atoms when applied. Step0 ∪ Step1 ∪ Step2 becomes the revised partial fixpoint.
Step3 = {a(katie, jack), a(beverly, jack), a(roger, jack)}, which results from the previous partial fixpoint. These atoms were obtained from rule r7, which was the only rule that provided new atoms at this step. The new partial fixpoint is Step0 ∪ Step1 ∪ Step2 ∪ Step3.
Step4 = ∅. No additional atoms can be found that satisfy EDB ∪ IDB. Hence, the fixpoint iteration may be terminated, and the fixpoint is Step0 ∪ Step1 ∪ Step2 ∪ Step3.
Notice that this result is the same as the minimal
model M in Example 2.
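A minimal sketch of this naive bottom-up computation for the ancestor rules of Example 1 follows; tuples stand in for the atoms p(X, Y) and a(X, Y).

def naive_fixpoint(parent_facts):
    """Iterate rules r6 and r7 bottom-up until no new a(X, Y) atoms appear."""
    parent = set(parent_facts)   # EDB facts p(X, Y)
    ancestor = set()             # derived IDB atoms a(X, Y)
    while True:
        new = set()
        new |= parent - ancestor                 # r6: a(X, Y) <- p(X, Y)
        for (x, z) in parent:                    # r7: a(X, Y) <- p(X, Z), a(Z, Y)
            for (z2, y) in ancestor:
                if z == z2 and (x, y) not in ancestor:
                    new.add((x, y))
        if not new:
            return ancestor      # fixpoint reached
        ancestor |= new

edb = {("mike", "jack"), ("sally", "jack"), ("katie", "mike"),
       ("beverly", "mike"), ("roger", "sally")}
# naive_fixpoint(edb) contains, among others, ("katie", "jack").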
Classes of recursive rules exist for which it is known how many iterations will be required. These rules lead to what has been called bounded recursion, noted first by Minker and Nicolas and extended by Naughton and Sagiv. Example 4 illustrates bounded recursion.
Example 4 (Bounded Recursion). If a rule is singular, then it is bound to terminate in a finite number of steps independent of the state of the database. A recursive rule is singular if it is of the form

R ← F ∧ R_1 ∧ . . . ∧ R_n

where F is a conjunction of possibly empty base relations (i.e., empty EDB) and R, R_1, R_2, . . . , R_n are atoms that have the same relation name if:
1. each variable that occurs in an atom R_i and does not occur in R only occurs in R_i;
2. each variable in R occurs in the same argument position in any atom R_i where it appears, except perhaps in at most one atom R_1 that contains all of the variables of R.
Thus, the rule

R(X, Y, Z) ← R(X, Y', Z), R(X, Y, Z')
S 1 P; insertemployeeN; D; S
0
EMP(m; d)
where insert_EMP is an interactive program asking for a
name and a salary for the new manager, so that the
corresponding tuple can be inserted in the relation EMP.
Another important feature of active rules is their ability
to express dynamic dependencies. The particularity of
dynamic dependencies is that they refer to more than
one database state (as opposed to static dependencies
that refer to only one database state). A typical dynamic
dependency, in the context of the database of Fig. 1, is to
state that salaries must never decrease, which corresponds
to the following active rule:
on update
A one-dimensional index such as the B-tree can be used for spatial data by linearizing a multidimensional space using a space-filling curve such as the Z-order.
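A minimal sketch of a Z-order (Morton) key follows; it interleaves the bits of the two coordinates so that a one-dimensional index can store points while roughly preserving spatial locality.

def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions: y
    return key

# Example: z_order(2, 3) == 0b1110 == 14; nearby points tend to get nearby keys.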
Many spatial indices (27) have been explored for multidimensional Euclidean space. Representative indices for point objects include grid files, multidimensional grid files (109), Point-Quad-Trees, and Kd-trees. Representative indices for extended objects include the R-tree family, the Field-tree, Cell-tree, BSP-tree, and Balanced and Nested grid files.
Grid Files. Grid files were introduced by Nievergelt (110). A grid file divides the space into n-dimensional subspaces that fit into equal-size buckets. The structure is not hierarchical and can be used to index static, uniformly distributed data. However, because of its structure, the directory of a grid file can be so sparse and large that a large main memory is required. Several variations of grid files exist to index data efficiently and to overcome these limitations (111,112). An overview of grid files is given in Ref. 27.
Tree Indexes. The R-tree indexes objects in a hierarchical index structure (113). The R-tree is a height-balanced tree that is the natural extension of the B-tree for k dimensions. Spatial objects are represented in the R-tree by their minimum bounding rectangle (MBR). Figure 11 illustrates spatial objects organized as an R-tree index. R-trees can be used to process both point and range queries.
Several variants of the R-tree exist for better query performance and storage use. The R+-tree (114) stores objects by avoiding overlaps among the MBRs, which improves search performance. R*-trees (115) rely on the combined optimization of the area, margin, and overlap of each MBR in the intermediate nodes of the tree, which results in better storage use.
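As a minimal sketch of the MBR-based filter step that underlies R-tree range queries (a real R-tree prunes whole subtrees by testing node MBRs instead of scanning the objects linearly):

from dataclasses import dataclass

@dataclass
class MBR:
    """Axis-aligned minimum bounding rectangle."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "MBR") -> bool:
        # Two MBRs overlap iff their extents overlap on every axis.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

def range_query(objects, window):
    """Report the objects whose MBR intersects the query window."""
    return [obj for obj, mbr in objects if mbr.intersects(window)]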
Figure 9. (a) Query tree, (b) pushing down: select operation, and (c) pushing down may not help.
Figure 10. Space-filling curves (row order, Peano-Hilbert, and Morton/Z-order) to linearize a multidimensional space.
Many R-tree-based index structures (116–119, 120, 121) have been proposed to index spatiotemporal objects. A survey of spatiotemporal access methods is provided in Ref. 122.
The Quadtree (123) is a space-partitioning index structure in which the space is divided recursively into quads. This recursive process is implemented until each quad is homogeneous. Several variations of quad trees are available to store point data, raster data, and object data. Also, other quad tree structures exist to index spatiotemporal data sets, such as overlapping linear quad trees (24) and multiple overlapping features (MOF) trees (125).
The Generalized Search Tree (GiST) (126) provides a framework to build almost any kind of tree index on any kind of data. Tree index structures, such as the B+-tree and R-tree, can be built using GiST. The spatial-partitioning generalized search tree (SP-GiST) (127) is an extensible index structure for space-partitioning trees. Index trees such as the quad tree and kd-tree can be built using SP-GiST.
Graph Indexes. Most spatial access methods provide methods and operators for point and range queries over collections of spatial points, line segments, and polygons. However, it is not clear whether spatial access methods can efficiently support network computations that traverse line segments in a spatial network based on connectivity rather than geographic proximity. A connectivity-clustered access method for spatial networks (CCAM) has been proposed to index spatial networks based on graph partitioning (108) by
supporting network operations. An auxiliary secondary index, such as a B+-tree, can be used as well.
Figure 12. (a) Learning data set: The geometry of the Darr wetland and the locations of the nests, (b) the spatial distribution of vegetation
durability over the marshland, (c) the spatial distribution of water depth, and (d) the spatial distribution of distance to open water.
The solution procedure can estimate Pr(l_i | L_i) from the training data, where L_i denotes the set of labels in the neighborhood of s_i, excluding the label at s_i. It makes this estimate by examining the ratios of the frequencies of class labels to the total number of locations in the spatial framework. Pr(X | l_i, L_i) can be estimated using kernel functions from the observed values in the training data set.
A more detailed theoretical and experimental comparison of these methods can be found in Ref. 140. Although MRF and SAR classification have different formulations, they share a common goal of estimating the posterior probability distribution. However, the posterior probability for the two models is computed differently, with different assumptions. For MRF, the posterior is computed using Bayes' rule, whereas in SAR, the posterior distribution is fitted directly to the data.
Spatial Clustering. Spatial clustering is a process of
grouping a set of spatial objects into clusters so that objects
within a cluster have high similarity in comparison with
one another but are dissimilar to objects in other clusters.
For example, clustering is used to determine the "hot spots" in crime analysis and disease tracking. Many criminal justice agencies are exploring the benefits provided by computer technologies to identify crime hot spots so as to take preventive strategies, such as deploying saturation patrols in hot-spot areas.
Spatial clustering can be applied to group similar spatial objects together; the implicit assumption is that patterns in space tend to be grouped rather than randomly located. However, the statistical significance of spatial clusters should be measured by testing this assumption in the data. One method to compute this measure is based on quadrats (i.e., well-defined areas, often rectangular in shape). Usually, quadrats of random location and orientation are generated, the points falling in the quadrats are counted, and statistics derived from the counts are computed. Another type of statistic is based on distances between patterns; one such type is Ripley's K-function (141). After verification of the statistical significance of the spatial clustering, classic clustering algorithms (142) can be used to discover interesting clusters.
Spatial Outliers. A spatial outlier (143) is a spatially referenced object whose nonspatial attribute values differ significantly from those of other spatially referenced objects in its spatial neighborhood. Figure 13 gives an example of detecting spatial outliers in traffic measurements for sensors on highway I-35W (northbound) for a 24-hour time period. Station 9 seems to be a spatial outlier because it exhibits inconsistent traffic flow compared with its neighboring stations; the reason could be that the sensor at station 9 is malfunctioning. Detecting spatial outliers is useful in many applications of geographic information systems and spatial databases, including transportation, ecology, public safety, public health, climatology, and location-based services.
Spatial attributes are used to characterize location, neighborhood, and distance. Nonspatial attribute dimensions are used to compare a spatially referenced object with its neighbors. The spatial statistics literature provides two kinds of bipartite multidimensional tests, namely graphical tests and quantitative tests. Graphical tests, which are based on the visualization of spatial data, highlight spatial outliers; examples include variogram clouds (141) and Moran scatterplots (144). Quantitative methods provide a precise test to distinguish spatial outliers from the remainder of the data. A unified approach to detect spatial outliers efficiently is discussed in Ref. 145. Reference 146 provides algorithms for multiple spatial-outlier detection.
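A minimal sketch of a simple quantitative test of this kind follows: each location's value is compared with the average over its spatial neighborhood, and locations with a large standardized difference are flagged. The data layout and the threshold are assumptions made for illustration.

import statistics

def spatial_outliers(values, neighbors, threshold=2.0):
    """values: location id -> nonspatial attribute value;
    neighbors: location id -> ids of its spatial neighbors.
    Flag locations whose neighborhood difference deviates from the mean
    difference by more than `threshold` standard deviations."""
    diffs = {loc: values[loc] - statistics.mean(values[n] for n in nbrs)
             for loc, nbrs in neighbors.items() if nbrs}
    mu = statistics.mean(diffs.values())
    sigma = statistics.pstdev(diffs.values())
    return [loc for loc, d in diffs.items()
            if sigma > 0 and abs((d - mu) / sigma) > threshold]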
Spatial Colocation. The colocation pattern discovery process finds frequently colocated subsets of spatial event types given a map of their locations. For example, the analysis of the habitats of animals and plants may identify the colocations of predator-prey species, symbiotic species, or fire events with fuel, ignition sources, and so forth. Figure 14 gives an example of the colocation between roads and rivers in a geographic region.
Approaches to discovering colocation rules can be categorized into two classes, namely spatial statistics and data-mining approaches. Spatial statistics-based approaches use measures of spatial correlation to characterize the relationship between different types of spatial features. Measures of spatial correlation include the cross K-function with Monte Carlo simulation, mean nearest-neighbor distance, and spatial regression models.
Data-mining approaches can be further divided into transaction-based approaches and distance-based approaches. Transaction-based approaches focus on defining transactions over space so that an Apriori-like algorithm can be used. Transactions over space can be defined by a reference-feature centric model. Under this model, transactions are created around instances of one user-specified spatial feature. The association rules are derived using the Apriori (147) algorithm. The rules formed are related to the reference feature. However, it is nontrivial to generalize the paradigm of forming rules related to a reference feature to the case in which no reference feature is specified. Also, defining transactions around locations of instances of all features may yield duplicate counts for many candidate associations.
Figure 13. Spatial outlier (station ID 9) in traffic volume data.
In a distance-based approach (148–150), instances of objects are grouped together based on their Euclidean distance from each other. This approach can be considered to be an event-centric model that finds subsets of spatial
features likely to occur in a neighborhood around instances
of given subsets of event types.
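To make the event-centric idea concrete, the following sketch pairs instances of two hypothetical event types that lie within a neighborhood radius and computes a participation-index-style interest measure. The coordinates, the radius, and the use of the participation index as the measure are illustrative assumptions, not a prescribed algorithm from the references.

```python
# Minimal sketch of distance-based (event-centric) colocation mining for a
# pair of event types: two types are colocated when instances lie within a
# neighborhood radius. Coordinates and the radius are illustrative only.
from math import dist   # Python 3.8+: Euclidean distance between two points

roads  = [(0, 0), (1, 0), (2, 1), (8, 8)]          # instances of type A
rivers = [(0.5, 0.2), (2.2, 1.1), (9, 0)]          # instances of type B
radius = 1.0

# Instances of each type that have at least one neighbor of the other type
a_near = {a for a in roads  for b in rivers if dist(a, b) <= radius}
b_near = {b for b in rivers for a in roads  if dist(a, b) <= radius}

# Participation index: the weaker of the two participation ratios
pi = min(len(a_near) / len(roads), len(b_near) / len(rivers))
print(f"participation index = {pi:.2f}")   # high values suggest colocation
```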
Research Needs
This section presents several research needs in the areas of spatiotemporal data mining and spatiotemporal network mining.
Spatiotemporal Data Mining. Spatiotemporal (ST) data mining aims to develop models and objective functions and to discover patterns that are more suited to spatiotemporal databases and their unique properties (151). An extensive survey of spatiotemporal databases, models, languages, and access methods can be found in Ref. 152. A bibliography of spatiotemporal data mining can be found in Ref. 153.
Spatiotemporal pattern mining focuses on discovering knowledge that frequently is located together in space and time. References 154, 155, and 156 defined the problems of discovering mixed-drove and sustained emerging spatiotemporal co-occurrence patterns and proposed interest measures and algorithms to mine such patterns. Other research needs include conflation, in which a single feature is obtained from several sources or representations. The goal is to determine the optimal or best representation based on a set of rules. Problems tend to occur during maintenance operations and in cases of vertical obstruction.
In several application domains, such as sensor networks, mobile networks, moving-object analysis, and image analysis, the need for spatiotemporal data mining is increasing drastically. It is vital to develop new models and techniques, to define new spatiotemporal patterns, and to formalize monotonic interest measures to mine these patterns (157).
Spatiotemporal Network Mining. In the post-9/11 world of asymmetric warfare in urban areas, many human activities are centered on ST infrastructure networks, such as transportation, oil/gas pipelines, and utilities (e.g., water, electricity, and telephone). Thus, activity reports, for example, crime/insurgency reports, may often use network-based location references, for example, a street address such as "200 Quiet Street, Scaryville, RQ 91101." In addition, spatial interaction among activities at nearby locations may be constrained by network connectivity and network distances (e.g., shortest paths along road or train networks) rather than the geometric distances (e.g., Euclidean or Manhattan distances) used in traditional spatial analysis. Crime prevention may focus on identifying subsets of ST networks with high activity levels, understanding the underlying causes in terms of ST-network properties, and designing ST-network control policies.
Existing spatial analysis methods face several chal-
lenges (e.g., see Ref. 158). First, these methods do not model
the effect of explanatory variables to determine the loca-
tions of network hot spots. Second, existing methods for
network pattern analysis are computationally expensive.
Third, these methods do not consider the temporal aspects
of the activity in the discovery of network patterns. For
example, the routes used by criminals during the day and
night may differ. The periodicity of bus/train schedules can
have an impact on the routes traveled. Incorporating the
time-dependency of transportation networks can improve
the accuracy of the patterns.
SUMMARY
In this chapter we presented the major research accomplishments and techniques that have emerged from the area of spatial databases in the past decade. These accomplishments and techniques include spatial database modeling, spatial query processing, and spatial access methods. We have also identified areas in which more research is needed, such as spatiotemporal databases, spatial data mining, and spatial networks.
Figure 15 provides a summary of topics that continue to drive the research needs of spatial database systems. Increasingly available spatial data in the form of digitized maps, remotely sensed images, spatiotemporal data (for example, from videos), and streaming data from sensors have to be managed and processed efficiently. New querying techniques are needed to visualize spatial data in more than one dimension. Several advances have been
made in computer hardware over the last few years, but
many have yet to be fully exploited, including increases in
main memory, more effective storage using storage area
networks, greater availability of multicore processors, and
powerful graphic processors. A huge impetus for these
advances has been spatial data applications such as land
navigation systems and location-based services. To mea-
sure the quality of spatial database systems, new bench-
marks have to be established. Some benchmarks (159,160)
established earlier have become dated. Newer benchmarks
are needed to characterize the spatial data management
needs of other systems and applications such as spatiotemporal databases, moving-objects databases, and location-based services.
Figure 14. Colocation between roads and rivers in a hilly terrain
(Courtesy: Architecture Technology Corporation).
ACKNOWLEDGMENTS
We thank the professional organizations that have funded
the research on spatial databases, in particular, the
National Science Foundation, Army Research Laboratory,
Topographic Engineering Center, Oak Ridge National
Laboratory, Minnesota Department of Transportation,
and Microsoft Corporation. We thank members of the
spatial database and spatial data mining research group
at the University of Minnesota for refining the content of
this chapter. We also thank Kim Koffolt for improving the
readability of this chapter.
BIBLIOGRAPHY
1. R. H. Guting, An introduction to spatial database systems,
VLDB Journal, Special Issue on Spatial Database Systems,
3(4): 357399, 1994.
2. W. Kim, J. Garza, and A. Kesin, Spatial data management in
database systems, in Advances in Spatial Databases, 3rd
International Symposium, SSD93, Vol. 652. Springer, 1993.
3. Y. Manolopoulos, A. N. Papadopoulos, and M. G. Vassilako-
poulos, Spatial Databases: Technologies, Techniques and
Trends. Idea Group Publishing, 2004.
4. S. Shekhar and S. Chawla, Spatial Databases: A Tour. Pre-
ntice Hall, 2002.
5. Wikipedia. Available: http://en.wikipedia.org/wiki/Spatial_
Database, 2007.
6. M. Worboys andM. Duckham, GIS: AComputingPerspective,
2nd ed. CRC, 2004.
7. J. Schiller, Location-Based Services. Morgan Kaufmann,
2004.
8. A. Stefanidis and S. Nittel, GeoSensor Networks. CRC, 2004.
9. R. Scally, GIS for Environmental Management. ESRI Press,
2006.
10. L. Lang, Transportation GIS, ESRI Press, 1999.
11. P. Elliott, J. C. Wakefield, N. G. Best, and D. J. Briggs, Spatial
Epidemiology: Methods and Applications. Oxford University
Press, 2000.
12. M. R. Leipnik and D. P. Albert, GIS in Law Enforcement:
Implementation Issues and Case Studies, CRC, 2002.
13. D. K. Arctur and M. Zeiler, Designing Geodatabases, ESRI
Press, 2004.
14. E. Beinat, A. Godfrind, and R. V. Kothuri, Pro Oracle Spatial,
Apress, 2004.
15. SQL Server 2008 (Code-name Katmai). Available: http://
www.microsoft.com/sql/prodinfo/futureversion/
default.mspx, 2007.
16. Google Earth. Available: http://earth.google.com, 2006.
17. Microsoft Virtual Earth. Available: http://www.microsoft.
com/virtualearth, 2006.
18. PostGIS. Available: http://postgis.refractions.net/, 2007.
19. MySQLSpatial Extensions. Available: http://dev.mysql.com/
doc/refman/5.0/en/spatial-extensions.html, 2007.
20. Sky Server, 2007. Available: http://skyserver.sdss.org/.
21. D. Chamberlin, Using the NewDB2: IBMs Object Relational
System, Morgan Kaufmann, 1997.
22. M. Stonebraker and D. Moore, Object Relational DBMSs: The
Next Great Wave. Morgan Kaufmann, 1997.
23. OGC. Available: http://www.opengeospatial.org/standards,
2007.
24. P. Rigaux, M. Scholl, and A. Voisard, Spatial Databases: With
Application to GIS, Morgan Kaufmann Series in Data Man-
agement Systems, 2000.
25. R. H. Guting and M. Schneider, Moving Objects Databases
(The Morgan Kaufmann Series in Data Management Sys-
tems). SanFrancisco, CA: MorganKaufmannPublishers Inc.,
2005.
26. S. Shekhar and H. Xiong, Encyclopedia of GIS, Springer,
2008, forthcoming.
27. H. Samet, Foundations of Multidimensional andMetric Data
Structures, Morgan Kaufmann Publishers, 2006.
28. ACM Geographical Information Science Conference. Avail-
able: http://www.acm.org.
29. ACM Special Interest Group on Management Of Data.
Available: http://www.sigmod.org/.
30. Geographic Information Science Center Summer and Winter
Assembly. Available: http://www.gisc.berkeley.edu/.
31. Geoinformatica. Available: http://www.springerlink.com/
content/100268/.
32. IEEE International Conference on Data Engineering. Avail-
able: http://www.icde2007.org/icde/.
33. IEEE Transactions on Knowledge and Data Engineering
(TKDE). Available: http://www.computer.org/tkde/.
34. International Journal of Geographical Information Science.
Available: http://www.tandf.co.uk/journals/tf/13658816.html.
35. International Symposium on Spatial and Temporal Data-
bases. Available: http://www.cs.ust.hk/ sstd07/.
36. Very Large Data Bases Conference. Available: http://
www.vldb2007.org/.
37. UCGIS, 1998.
38. S. Shekhar, R. R. Vatsavai, S. Chawla, and T. E. Burke, Spatial Pictogram Enhanced Conceptual Data Models and Their Translations to Logical Data Models, Integrated Spatial Databases: Digital Images and GIS, Lecture Notes in Computer Science, 1737: 77-104, 1999.
Figure 15. Topics driving future research needs in spatial database systems (panels: Data In: digitized maps, sensors, digital photos, videos, remotely sensed imagery, Internet; Platform: dual-core processors, storage area networks, XML database, stream database; Applications: GIS, location-based services, navigation, cartography, spatial analysis; Data Out: 2D maps, 3D visualization, animation).
39. N. R. Adam and A. Gangopadhyay, Database Issues in Geo-
graphic Information Systems, Norwell, MA: Kluwer Aca-
demic Publishers, 1997.
40. M. Stonebraker and G. Kemnitz, The Postgres Next Genera-
tion Database Management System, Commun. ACM, 34(10):
7892, 1991.
41. P. Bolstad, GIS Fundamentals: A First Text on Geographic
Information Systems, 2nd ed. Eider Press, 2005.
42. T. Hadzilacos and N. Tryfona, An Extended Entity-Relation-
ship Model for Geographic Applications, ACM SIGMOD
Record, 26(3): 2429, 1997.
43. S. Shekhar, R. R. Vatsavai, S. Chawla, andT. E. Burk, Spatial
pictogram enhanced conceptual data models and their trans-
lation to logical data models, Lecture Notes in Computer
Science, 1737: 77104, 2000.
44. F. T. Fonseca and M. J. Egenhofer, Ontology-driven geo-
graphic information systems, in Claudia Bauzer Medeiros
(ed.), ACM-GIS99, Proc. of the 7th International Symposium
on Advances in Geographic Information Systems, Kansas
City, US 1999, pp. 1419. ACM, 1999.
45. Time ontology in owl, Electronic, September 2005.
46. T. Berners-Lee, J. Hendler, and O. Lassila, The semantic web, Scientific American, 2001, pp. 34-43.
47. M. J. Egenhofer, Toward the semantic geospatial web, Proc.
Tenth ACM International Symposium on Advances in Geo-
graphic Information Systems, 2002.
48. F. Fonseca and M. A. Rodriguez, From geo-pragmatics to
derivation ontologies: New directions for the geospatial
semantic web, Transactions in GIS, 11(3), 2007.
49. W3C. Resource Description Framework. Available: http://
www.w3.org/RDF/,2004.
50. G. Subbiah, A. Alam, L. Khan, B. Thuraisingham, An inte-
grated platform for secure geospatial information exchange
through the semantic web, Proc. ACM Workshop on Secure
Web Services (SWS), 2006.
51. OpenGeospatial ConsortiumInc. OpenGISReference Model.
Available: http://orm.opengeospatial.org/, 2006.
52. S. Shekhar and X. Liu, Direction as a Spatial Object: A
Summary of Results, in R. Laurini, K. Makki, and N. Pissi-
nou, (eds.), ACM-GIS 98, Proc. 6th international symposium
onAdvances inGeographic InformationSystems. ACM, 1998,
pp. 6975.
53. S. Skiadopoulos, C. Giannoukos, N. Sarkas, P. Vassiliadis,
T. Sellis, and M. Koubarakis, Computing and managing
cardinal direction relations, IEEE Trans. Knowledge and
Data Engineering, 17(12): 16101623, 2005.
54. ASpatiotemporal Model andLanguage for MovingObjects on
Road Networks, 2001.
55. Modeling and querying moving objects in networks, volume
15, 2006.
56. Modeling and Querying Moving Objects, 1997.
57. On Moving Object Queries, 2002.
58. K. K. L. Chan and C. D. Tomlin, Map Algebra as a Spatial
Language, in D. M. Mark and A. U. Frank, (eds.), Cognitive
and Linguistic Aspects of Geographic Space. Dordrecht,
Netherlands: Kluwer Academic Publishers, 1991, pp. 351
360.
59. W. Kainz, A. Riedl, and G. Elmes, eds, A Tetrahedronized
Irregular Network Based DBMS Approach for 3D Topo-
graphic Data, Springer Berlin Heidelberg, September 2006.
60. C. Arens, J. Stoter, and P. Oosterom, Modeling 3D Spatial
Objects ina Geo-DBMSUsing a 3DPrimitive, Computers and
Geosciences, 31(2): 165177, act 2005.
61. E. Kohler, K. Langkau, and M. Skutella, Time-expanded
graphs for ow-dependent transit times, in ESA02: Proceed-
ings of the 10thAnnual EuropeanSymposiumon Algorithms.
London, UK: Springer-Verlag, 2002, pp. 599611.
62. B. George and S. Shekhar, Time-aggregated graphs for mod-
eling spatiotemporal networks, in ER (Workshops), 2006,
pp. 8599.,
63. K. Eickhorst, P. Agouris, and A. Stefanidis, Modeling and
Comparing Spatiotemporal Events, in Proc. 2004 annual
national conference on Digital government research. Digital
Government Research Center, 2004, pp. 110.
64. Geographic Markup Language. Available: http://www.open-
gis.net/gml/, 2007.
65. S. Shekhar, R. R. Vatsavai, N. Sahay, T. E. Burk, andS. Lime,
WMSandGMLbasedInteroperable WebMapping System, in
9th ACM International Symposium on Advances in Geo-
graphic Information Systems, ACMGIS01. ACM, November
2001.
66. CityGML, 2007. Available: http://www.citygml.org/.
67. Keyhole Markup Language. Available: http://code.google.-
com/apis/kml/documentation/, 2007.
68. V. Gaede and O. Gunther, Multidimensional access methods,
ACM Computing Surveys, 30, 1998.
69. G. R. Hjaltason and H. Samet, Ranking in spatial data
bases, in Symposium on Large Spatial Databases, 1995,
pp. 8395.
70. N. Roussopoulos, S. Kelley, and F. Vincent, Nearest neighbor
queries, in SIGMOD 95: Proc. 1995 ACM SIGMOD interna-
tional conference on Management of data, New York: ACM
Press, 1995, pp. 7179.
71. D. Papadias, Y. Tao, K. Mouratidis, and C. K. Hui, Aggregate
nearest neighbor queries in spatial databases, ACM Trans.
Database Systems, 30(2): 529576, 2005.
72. F. Korn and S. Muthukrishnan, Inuence Sets Based on
Reverse Nearest Neighbor Queries, in Proc. ACM Interna-
tional Conference on Management of Data, SIGMOD, 2000,
pp. 201212.
73. J. M. Kang, M. Mokbel, S. Shekhar, T. Xia, and D. Zhang,
Continuous evaluation of monochromatic and bichromatic
reverse nearest neighbors, in Proc. IEEE 23rd International
Conference on Data Engineering (ICDE), 2007.
74. I. Stanoi, D. Agrawal, and A. ElAbbadi, Reverse Nearest
Neighbor Queries for Dynamic Databases, in ACM SIGMOD
Workshop on Research Issues in Data Mining and Knowledge
Discovery, 2000, pp. 4453.
75. A. Nanopoulos, Y. Theodoridis, and Y. Manolopoulos, C2P:
Clustering based on Closest Pairs, in Proc. International
Conference on Very Large Data Bases, VLDB, 2001, pp.
331340.
76. T. Xia and D. Zhang, Continuous Reverse Nearest Neighbor
Monitoring, in Proc. International Conference on Data Engi-
neering, ICDE, 2006.
77. T. Brinkoff, H.-P. Kriegel, R. Schneider, andB. Seeger, Multi-
step processing of spatial joins, In Proc. ACM International
Conference onManagement of Data, SIGMOD, 1994, pp. 197
208.
78. L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J.S.
Vitter, Scalable sweeping-based spatial join, in Proc. Very
Large Data Bases (VLDB), 1998, pp. 570581.
79. N. Mamoulis and D. Papadias, Slot index spatial join, IEEE
Trans. Knowledge and Data Engineering, 15(1): 211231,
2003.
80. M.-J. Lee, K.-Y. Whang, W.-S. Han, and I.-Y. Song, Trans-
form-space view: Performing spatial join in the transform
space using original-space indexes, IEEE Trans. Knowledge
and Data Engineering, 18(2): 245260, 2006.
81. S. Shekhar, C.-T. Lu, S. Chawla, andS. Ravada, Efcient join-
index-basedspatial-joinprocessing: Aclustering approach, in
IEEE Trans. Knowledge and Data Engineering, 14(6): 1400
1421, 2002.
82. M. Zhu, D. Papadias, J. Zhang, and D. L. Lee, Top-k spatial
joins, IEEE Trans. Knowledge and Data Engineering, 17(4):
567579, 2005.
83. H. Hu and D. L. Lee, Range nearest-neighbor query, IEEE
Trans. Knowledge and Data Engineering, 18(1): 7891, 2006.
84. M. L. Yiu, N. Mamoulis, and D. Papadias, Aggregate nearest
neighbor queries in road networks, IEEE Trans. Knowledge
and Data Engineering, 17(6): 820833, 2005.
85. J. van den Bercken and B. Seeger, An evaluation of generic
bulk loading techniques, in VLDB 01: Proc. 27th Interna-
tional Conference on Very Large Data Bases. San Francisco,
CA: Morgan Kaufmann Publishers Inc., 2001, pp. 461470.
86. A. Aggarwal, B. Chazelle, L. Guibas, C. ODunlaing, and C.
Yap, Parallel computational geometry. Proc. 25th IEEESym-
posium on Foundations of Computer Science, 1985, pp. 468
477.
87. S. G. Akl and K. A. Lyons, Parallel Computational Geometry,
Englewood Cliffs, NJ: Prentice Hall, 1993.
88. R. Sridhar, S. S. Iyengar, and S. Rajanarayanan, Range
search in parallel using distributed data structures, Interna-
tional Conference on Databases, Parallel Architectures, and
Their Applications, 1990, pp. 1419.
89. M. P. Armstrong, C. E. Pavlik, and R. Marciano, Experiments
in the measurement of spatial association using a parallel
supercomputer, Geographical Systems, 1: 267288, 1994.
90. G. Brunetti, A. Clematis, B. Falcidieno, A. Sanguineti, andM.
Spagnuolo, Parallel processing of spatial data for terrain
characterization, Proc. ACM Geographic Information Sys-
tems, 1994.
91. W. R. Franklin, C. Narayanaswami, M. Kankanahalli, D.
Sun, M. Zhou, and P. Y. F. Wu, Uniform grids: A technique
for intersection detection on serial and parallel machines,
Proc. 9th Automated Cartography, 1989, pp. 100109.
92. E. G. Hoel and H. Samet, Data Parallel RTree Algorithms,
Proc. International Conference on Parallel Processing, 1993.
93. E. G. Hoel and H. Samet, Performance of dataparallel spatial
operations, Proc. of the 20thInternational Conference onVery
Large Data Bases, 1994, pp. 156167.
94. V. Kumar, A. Grama, and V. N. Rao, Scalable load balancing
techniques for parallel computers, J. Parallel andDistributed
Computing, 22(1): 6069, July 1994.
95. F. Wang, A parallel intersection algorithmfor vector polygon
overlay, IEEE Computer Graphics and Applications, 13(2):
7481, 1993.
96. Y. Zhou, S. Shekhar, and M. Coyle, Disk allocation methods
for parallelizing grid files, Proc. of the Tenth International
Conference on Data Engineering, IEEE, 1994, pp. 243
252.
97. S. Shekhar, S. Ravada, V. Kumar, D. Chubband, and G.
Turner, Declustering and load-balancing methods for paral-
lelizing spatial databases, IEEE Trans. Knowledge and Data
Engineering, 10(4): 632655, 1998.
98. M. T. Fang, R. C. T. Lee, and C. C. Chang, The idea of
de-clustering and its applications, Proc. of the International
Conference on Very Large Data Bases, 1986, pp. 181188.
99. D. R. Liu and S. Shekhar, A similarity graph-based approach
to declustering problem and its applications, Proc. of the
Eleventh International Conference on Data Engineering,
IEEE, 1995.
100. A. Belussi and C. Faloutsos. Estimating the selectivity of
spatial queries using the correlation fractal dimension, in
Proc. 21st International Conference on Very Large Data
Bases, VLDB, 1995, pp. 299310.
101. Y. Theodoridis, E. Stefanakis, and T. Sellis, Cost models for
join queries in spatial databases, in Proceedings of the IEEE
14thInternational Conference onDataEngineering, 1998, pp.
476483.
102. M. Erwig, R. Hartmut Guting, M. Schneider, and M. Vazir-
giannis, Spatiotemporal datatypes: Anapproachto modeling
and querying moving objects in databases, GeoInformatica,
3(3): 269296, 1999.
103. R. H. Guting, M. H. Bohlen, M. Erwig, C. S. Jensen, N. A.
Lorentzos, M. Schneider, and M. Vazirgiannis, A foundation
for representing and querying moving objects, ACMTransac-
tions on Database Systems, 25(1): 142, 2000.
104. S. Borzsonyi, D. Kossmann, and K. Stocker, The skyline
operator, In Proc. the International Conference on Data Engi-
neering, Heidelberg, Germany, 2001, pp. 421430.
105. J. M. Hellerstein and M. Stonebraker, Predicate migration:
Optimizing queries with expensive predicates, In Proc. ACM-
SIGMOD International Conference on Management of Data,
1993, pp. 267276.
106. Y. Theodoridis and T. Sellis, A model for the prediction of
r-tree performance, in Proceedings of the 15th ACM Sympo-
sium on Principles of Database Systems PODS Symposium,
ACM, 1996, pp. 161171.
107. T. Asano, D. Ranjan, T. Roos, E. Wiezl, and P. Widmayer,
Space-lling curves and their use in the design of geometric
data structures, Theoretical Computer Science, 181(1): 315,
July 1997.
108. S. Shekhar and D.R. Liu, A connectivity-clustered access
method for networks and network computation, IEEETrans.
Knowledge and Data Engineering, 9(1): 102119, 1997.
109. J. Lee, Y. Lee, K. Whang, and I. Song, A physical database design method for multidimensional file organization, Information Sciences, 120(1): 31-65, November 1997.
110. J. Nievergelt, H. Hinterberger, and K. C. Sevcik, The grid file: An adaptable, symmetric multikey file structure, ACM Transactions on Database Systems, 9(1): 38-71, 1984.
111. M. Ouksel, The interpolation-based grid file, Proc. of Fourth
ACM SIGACT-SIGMOD Symposium on Principles of Data-
base Systems, 1985, pp. 2027.
112. K. Y. Whang and R. Krishnamurthy, Multilevel grid files,
IBM Research Laboratory Yorktown, Heights, NY, 1985.
113. A. Guttman, R-trees: A Dynamic Index Structure for Spatial
Searching, Proc. of SIGMOD International Conference on
Management of Data, 1984, pp. 4757.
114. T. Sellis, N. Roussopoulos, and C. Faloutsos, The R+-tree: A dynamic index for multidimensional objects, Proc. 13th International Conference on Very Large Data Bases, September 1987, pp. 507-518.
115. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, The R*-Tree [entry truncated in the source; entries 116-120 are missing].
121. [authors missing in the source] An Optimized Spatiotemporal Access Method for Predictive Queries, in VLDB, 2003, pp. 790-801.
122. M. F. Mokbel, T. M. Ghanem, andW. G. Aref, Spatio-temporal
access methods, IEEE Data Engineering Bulletin, 26(2): 40
49, 2003.
123. R. A. Finkel and J. L. Bentley, Quadtrees: A data structure for retrieval on composite keys, Acta Informatica, 4: 1-9, 1974.
124. T. Tzouramanis, M. Vassilakopoulos, and Y. Manolopoulos,
Overlapping linear quadtrees: A Spatiotemporal access
method, ACM-Geographic Information Systems, 1998, pp.
17.
125. Y. Manolopoulos, E. Nardelli, A. Papadopoulos, and G.
Proietti, MOF-Tree: A Spatial Access Method to Manipulate
Multiple Overlapping Features, Information Systems, 22(9):
465481, 1997.
126. J. M. Hellerstein, J. F. Naughton, and A. Pfeffer, Generalized
Search Trees for Database System, Proc. 21th International
Conference on Very Large Database Systems, September 11
15 1995.
127. W. G. Aref and I. F. Ilyas. SP-GiST: An Extensible Database
Index for Supporting Space Partitioning Trees, J. Intell. Inf.
Sys., 17(23): 215240, 2001.
128. M. Kornacker and D. Banks, High-Concurrency Locking in
R-Trees, 1995.
129. M. Kornacker, C. Mohan, and Joseph M. Hellerstein, Con-
currency and recovery in generalized search trees, in SIG-
MOD 97: Proc. 1997 ACM SIGMOD international
conference on Management of data. New York, ACM Press,
1997, pp. 6272.
130. S. Shekhar, P. Zhang, Y. Huang, and R. R. Vatsavai, Spatial
data mining, in Hillol Kargupta and A. Joshi, (eds.), Book
Chapter in Data Mining: Next Generation Challenges and
Future Directions.
131. N. A. C. Cressie, Statistics for Spatial Data. NewYork: Wiley-
Interscience, 1993.
132. R. Haining, Spatial Data Analysis : Theory and Practice,
Cambridge University Press, 2003.
133. Y. Jhung and P. H. Swain, Bayesian Contextual Classification Based on Modified M-Estimates and Markov Random
Fields, IEEE Trans. Pattern Analysis and Machine Intelli-
gence, 34(1): 6775, 1996.
134. A. H. Solberg, T. Taxt, and A. K. Jain, A Markov Random Field Model for Classification of Multisource Satellite Imagery,
IEEE Trans. Geoscience and Remote Sensing, 34(1): 100
113, 1996.
135. D. A. Griffith, Advanced Spatial Statistics. Kluwer Academic
Publishers, 1998.
136. J. LeSage, Spatial Econometrics. Available: http://www.spa-
tial-econometrics.com/, 1998.
137. S. Shekhar, P. Schrater, R. Raju, and W. Wu, Spatial con-
textual classication and prediction models for mining geos-
patial data, IEEE Trans. Multimedia, 4(2): 174188, 2002.
138. L. Anselin, Spatial Econometrics: methods and models, Dor-
drecht, Netherlands: Kluwer, 1988.
139. S. Z. Li, Markov Random Field Modeling in Computer Vision.
Springer Verlag, 1995.
140. S. Shekhar et al., Spatial Contextual Classification and Prediction Models for Mining Geospatial Data, IEEE Transactions on Multimedia, 4(2), 2002.
141. N. A. Cressie, Statistics for Spatial Data (revised edition).
New York: Wiley, 1993.
142. J. Han, M. Kamber, and A. Tung, Spatial Clustering Methods
in Data Mining: A Survey, Geographic Data Mining and
Knowledge Discovery. Taylor and Francis, 2001.
143. V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd ed.
New York: John Wiley, 1994.
144. L. Anselin, Local Indicators of Spatial Association: LISA, Geo-
graphical Analysis, 27(2): 93115, 1995.
145. S. Shekhar, C.-T. Lu, and P. Zhang, A unified approach to
detecting spatial outliers, GeoInformatica, 7(2), 2003.
146. C.-T. Lu, D. Chen, and Y. Kou, Algorithms for Spatial Outlier
Detection, IEEE International Conference on Data Mining,
2003.
147. R. Agrawal and R. Srikant, Fast algorithms for mining asso-
ciation rules, in J. B. Bocca, M. Jarke, and C. Zaniolo (eds.),
Proc. 20th Int. Conf. Very Large Data Bases, VLDB, Morgan
Kaufmann, 1994, pp. 487499.
148. Y. Morimoto, Mining Frequent Neighboring Class Sets in
Spatial Databases, in Proc. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2001.
149. S. Shekhar and Y. Huang, Co-location Rules Mining: A Sum-
mary of Results, Proc. Symposium on Spatial and Spatio
temporal Databases, 2001.
150. Y. Huang, S. Shekhar, and H. Xiong, Discovering Co-location
Patterns from Spatial Datasets: A General Approach, IEEE
Trans. Knowledge and Data Engineering, 16(12): 14721485,
December 2005.
151. J. F. Roddick and B. G. Lees, Paradigms for spatial and
Spatiotemporal data mining. Taylor and Frances, 2001.
152. M. Koubarakis, T. K. Sellis, A. U. Frank, S. Grumbach,
R. Hartmut Guting, C. S. Jensen, N. A. Lorentzos, Y. Man-
olopoulos, E. Nardelli, B. Pernici, H.-J. Schek, M. Scholl, B.
Theodoulidis, and N. Tryfona, eds, Spatiotemporal Data-
bases: The CHOROCHRONOS Approach, Vol. 2520 of Lec-
ture Notes in Computer Science. Springer, 2003.
153. J. F. Roddick, K. Hornsby, and M. Spiliopoulou, An updated
bibliography of temporal, spatial, and Spatiotemporal data
mining research, Proc. First International Workshop on Tem-
poral, Spatial and Spatiotemporal Data Mining, 2001, pp.
147164.
154. M. Celik, S. Shekhar, J. P. Rogers, and J. A. Shine, Sustained
emerging Spatiotemporal co-occurrence pattern mining: A
summary of results, 18th IEEE International Conference on
Tools with Articial Intelligence, 2006, pp. 106115.
155. M. Celik, S. Shekhar, J. P. Rogers, J. A. Shine, and J. S. Yoo,
Mixed-drove Spatiotemporal co-occurrence pattern mining:
Asummaryof results, SixthInternational Conference onData
Mining, IEEE, 2006, pp. 119128.
156. M. Celik, S. Shekhar, J. P. Rogers, J. A. Shine, andJ. M. Kang,
Mining at most top-k mixed-drove Spatiotemporal co-occur-
rencepatterns: Asummaryof results, inProc. of the Workshop
on Spatiotemporal Data Mining (In conjunction with ICDE
2007), 2008, forthcoming.
157. J. F. Roddick, E. Hoel, M. J. Egenhofer, D. Papadias, and B.
Salzberg, Spatial, temporal and Spatiotemporal databases -
hot issues and directions for phd research, SIGMOD record,
33(2), 2004.
158. O. Schabenberger and C. A. Gotway, Statistical Methods for
Spatial Data Analysis. Chapman & Hall/CRC, 2004.
159. M. Stonebraker, J. Frew, K. Gardels, and J. Meredith, The
Sequoia 2000 Benchmark, in Peter Buneman and Sushil
Jajodia (eds.), Proc. 1993 ACM SIGMOD International Con-
ference on Management of Data. ACM Press, 1993, pp. 2-11.
160. J. M. Patel, J.-B. Yu, N. Kabra, K. Tufte, B. Nag, J. Burger, N.
E. Hall, K. Ramasamy, R. Lueder, C. Ellmann, J. Kupsch, S.
Guo, D. J. DeWitt, and J. F. Naughton, Building a scaleable
geo-spatial dbms: Technology, implementation, and evalua-
tion, in ACM SIGMOD Conference, 1997, pp. 336347.
VIJAY GANDHI
JAMES M. KANG
SHASHI SHEKHAR
University of Minnesota
Minneapolis, Minnesota
STATISTICAL DATABASES
INTRODUCTION
Statistical databases are databases that contain statistical
information. Such databases normally are released by
national statistical institutes, but on occasion they can
also be released by health-care authorities (epidemiology)
or by private organizations (e.g., consumer surveys). Sta-
tistical databases typically come in three formats:
• Tabular data, that is, tables with counts or magnitudes, which are the classic output of official statistics.
• Queryable databases, that is, online databases to which the user can submit statistical queries (sums, averages, etc.).
• Microdata, that is, files in which each record contains information on an individual (a citizen or a company).
The peculiarity of statistical databases is that they should provide useful statistical information, but they should not reveal private information on the individuals to whom they refer (respondents). Indeed, supplying data to national statistical institutes is compulsory in most countries, but in return those institutes commit to preserving the privacy of respondents. Inference control in statistical databases, also known as statistical disclosure control (SDC), is a discipline that seeks to protect data in statistical databases so that they can be published without revealing confidential information that can be linked to specific individuals among those to whom the data correspond. SDC is applied to protect respondent privacy in areas such as official statistics, health statistics, and e-commerce (sharing of consumer data). Because data protection ultimately means data modification, the challenge for SDC is to achieve protection with minimum loss of the accuracy sought by database users.
In Ref. 1, a distinction is made between SDC and other technologies for database privacy, like privacy-preserving data mining (PPDM) or private information retrieval (PIR): What makes the difference between those technologies is whose privacy they seek to protect. Although SDC is aimed at respondent privacy, the primary goal of PPDM is to protect owner privacy when several database owners wish to cooperate in joint analyses across their databases without giving away their original data to each other. On its side, the primary goal of PIR is user privacy, that is, to allow the user of a database to retrieve some information item without the database knowing exactly which item was recovered.
The literature on SDC started in the 1970s, with the seminal contribution by Dalenius (2) in the statistical community and the works by Schlorer (3) and Denning et al. (4) in the database community. The 1980s saw moderate activity in this field. An excellent survey of the state of the art at the end of the 1980s is in Ref. 5. In the 1990s, interest was renewed in the statistical community, and the discipline developed even further under the names of statistical disclosure control in Europe and statistical disclosure limitation in America. Subsequent evolution has resulted in at least three clearly differentiated subdisciplines:
• Tabular data protection. The goal here is to publish static aggregate information, i.e., tables, in such a way that no confidential information on specific individuals among those to which the table refers can be inferred. See Ref. 6 for a conceptual survey.
• Queryable databases. The aggregate information obtained by a user as a result of successive queries should not allow him to infer information on specific individuals. Since the late 1970s, this has been known to be a difficult problem, which is subject to the tracker attack (4,7). SDC strategies here include perturbation, query restriction, and camouflage (providing interval answers rather than exact answers).
• Microdata protection. It is only recently that data collectors (statistical agencies and the like) have been persuaded to publish microdata. Therefore, microdata protection is the youngest subdiscipline, and it has experienced continuous evolution in the last few years. Its purpose is to mask the original microdata so that the masked microdata still are useful analytically but cannot be linked to the original respondents.
Several areas of application of SDC techniques exist, which include but are not limited to the following:
• Official statistics. Most countries have legislation that compels national statistical agencies to guarantee statistical confidentiality when they release data collected from citizens or companies. It justifies the research on SDC undertaken by several countries, among them the European Union (e.g., the CASC project) and the United States.
• Health information. This area is one of the most sensitive regarding privacy. For example, in the United States, the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) requires the strict regulation of protected health information for use in medical research. In most Western countries, the situation is similar.
• E-commerce. Electronic commerce results in the automated collection of large amounts of consumer data. This wealth of information is very useful to companies, which are often interested in sharing it with their subsidiaries or partners. Such consumer information transfer should not result in public profiling of individuals and is subject to strict regulation.
THEORY
Formal Definition of Data Formats
A microdata file X with s respondents and t attributes is an $s \times t$ matrix, where $X_{ij}$ is the value of attribute j for respondent i. Attributes can be numerical (e.g., age or salary) or categorical (e.g., gender or job).
The attributes in a microdata set can be classified in four categories that are not necessarily disjoint:
• Identifiers. These attributes unambiguously identify the respondent. Examples are the passport number, social security number, and name-surname.
• Quasi-identifiers or key attributes. These attributes identify the respondent with some degree of ambiguity. (Nonetheless, a combination of key attributes may provide unambiguous identification.) Examples are address, gender, age, and telephone number.
• Confidential outcome attributes. These attributes contain sensitive information on the respondent. Examples are salary, religion, political affiliation, and health condition.
• Nonconfidential outcome attributes. Other attributes contain nonsensitive information on the respondent.
From microdata, tabular data can be generated by crossing one or more categorical attributes. Formally, a table is a function

$$T : D(X_{i_1}) \times D(X_{i_2}) \times \cdots \times D(X_{i_l}) \rightarrow \mathbb{R} \text{ or } \mathbb{N}$$

where $l \leq t$ is the number of crossed categorical attributes and $D(X_{i_j})$ is the domain where attribute $X_{i_j}$ takes its values.
Two kinds of tables exist: frequency tables that display the count of respondents at the crossing of the categorical attributes (in $\mathbb{N}$) and magnitude tables that display information on a numeric attribute at the crossing of the categorical attributes (in $\mathbb{R}$). For example, given some census microdata containing attributes Job and Town, one can generate a frequency table that displays the count of respondents doing each job type in each town. If the census microdata also contain the Salary attribute, one can generate a magnitude table that displays the average salary for each job type in each town. The number n of cells in a table is normally much less than the number s of respondent records in a microdata file. However, tables must satisfy several linear constraints: marginal row and column totals. Additionally, a set of tables is called linked if they share some of the crossed categorical attributes: For example, Job x Town is linked to Job x Gender.
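A minimal sketch of this derivation follows, assuming a tiny invented microdata set with categorical attributes Job and Town and numerical attribute Salary; it builds both a frequency table and a magnitude (average-salary) table by crossing Job x Town.

```python
# Minimal sketch of deriving tabular data from microdata by crossing two
# categorical attributes (Job x Town). The records below are invented.
from collections import defaultdict

microdata = [
    {"Job": "baker",  "Town": "A", "Salary": 20000},
    {"Job": "baker",  "Town": "A", "Salary": 22000},
    {"Job": "doctor", "Town": "A", "Salary": 60000},
    {"Job": "doctor", "Town": "B", "Salary": 65000},
]

counts, totals = defaultdict(int), defaultdict(float)
for rec in microdata:
    cell = (rec["Job"], rec["Town"])
    counts[cell] += 1                    # frequency table (values in N)
    totals[cell] += rec["Salary"]        # used for the magnitude table

frequency = dict(counts)
magnitude = {cell: totals[cell] / n for cell, n in counts.items()}  # mean salary (in R)
print(frequency)
print(magnitude)
```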
Overview of Methods
Statistical disclosure control will first be reviewed for tabular data, then for queryable databases, and finally for microdata.
Methods for Tabular Data
Even though tables display aggregate information, a risk of disclosure exists in tabular data release. Several attacks are conceivable:
• External attack. For example, let a frequency table Job x Town be released where a single respondent exists for job $J_i$ and town $T_j$. Then, if a magnitude table is released with the average salary for each job type and each town, the exact salary of the only respondent with job $J_i$ working in town $T_j$ is disclosed publicly.
• Internal attack. Even if two respondents for job $J_i$ and town $T_j$ exist, the salary of each of them is disclosed to the other.
• Dominance attack. If one (or a few) respondents dominate in the contribution to a cell of a magnitude table, the dominant respondent(s) can upper-bound the contributions of the rest (e.g., if the table displays the total salary for each job type and town and one individual contributes 90% of that salary, he knows that his colleagues in the town are not doing very well).
SDC methods for tables fall into two classes: nonperturbative and perturbative. Nonperturbative methods do not modify the values in the tables; the best-known method in this class is cell suppression (CS). Perturbative methods output a table with some modified values; well-known methods in this class include controlled rounding (CR) and the recent controlled tabular adjustment (CTA).
The idea of CS is to suppress those cells that are identified as sensitive, i.e., from which the above attacks can extract sensitive information, by so-called sensitivity rules (e.g., the dominance rule, which identifies a cell as sensitive if it is vulnerable to a dominance attack). Sensitive cells are the primary suppressions. Then additional suppressions (secondary suppressions) are performed to prevent primary suppressions from being computed or even inferred within a prescribed protection interval using the row and column constraints (marginal row and column totals). Usually, one attempts to minimize either the number of secondary suppressions or their pooled magnitude, which results in complex optimization problems. Most optimization methods used are heuristic, based on mixed integer linear programming or network flows (8), and most of them are implemented in the t-Argus free software package (9). CR rounds values in the table to multiples of a rounding base, which may entail rounding the marginal totals as well. On its side, CTA modifies the values in the table to prevent the inference of values of sensitive cells within a prescribed protection interval. The idea of CTA is to find the closest table to the original one that ensures such protection for all sensitive cells. It requires optimization methods, which are typically based on mixed integer linear programming. Usually CTA entails less information loss than does CS.
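As an illustration of how primary suppressions might be identified, the following sketch applies an (n, k)-dominance rule to invented cell contributions; the particular parameters n = 1 and k = 90 and the contribution values are assumptions made for the example, not values mandated by any standard.

```python
# Minimal sketch of an (n, k)-dominance rule for magnitude tables: a cell is
# sensitive (a primary suppression) if its n largest contributions account
# for more than k percent of the cell total. Contributions are invented.
def is_sensitive(contributions, n=1, k=90.0):
    """Return True if the n largest respondents dominate the cell."""
    total = sum(contributions)
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return total > 0 and 100.0 * top_n / total > k

cells = {
    ("baker",  "TownA"): [20000, 22000, 21000],
    ("doctor", "TownB"): [300000, 15000, 12000],   # one respondent dominates
}
primary = [cell for cell, contribs in cells.items() if is_sensitive(contribs)]
print(primary)   # [('doctor', 'TownB')]
```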
Methods for Queryable Databases
In SDC of queryable databases, three main approaches exist to protect a confidential vector of numeric data from disclosure through answers to user queries:
• Data perturbation. Perturbing the data is a simple and effective approach whenever the users do not require deterministically correct answers to queries that are functions of the confidential vector. Perturbation can be applied to the records on which queries are computed (input perturbation) or to the query result after computing it on the original data (output perturbation). Perturbation methods can be found in Refs. 10-12.
• Query restriction. This approach is right if the user does require deterministically correct answers and these answers have to be exact (i.e., a number). Because exact answers to queries provide the user with very powerful information, it may become necessary to refuse to answer certain queries at some stage to avoid disclosure of a confidential datum. Several criteria decide whether a query can be answered; one of them is query set size control, that is, to refuse answers to queries that affect a set of records that is too small (see the sketch after this list). An example of the query restriction approach can be found in Ref. 13.
• Camouflage. If deterministically correct, nonexact answers (i.e., small-interval answers) suffice, confidentiality via camouflage [CVC, (14)] is a good option. With this approach, unlimited answers to any conceivable query types are allowed. The idea of CVC is to camouflage the confidential vector a by making it part of the relative interior of a compact set P of vectors. Then each query $q = f(a)$ is answered with an interval $[q^-, q^+]$ that contains $[f^-, f^+]$, where $f^-$ and $f^+$ are, respectively, the minimum and the maximum of f over P.
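The sketch below illustrates query set size control for the query restriction approach. The records, the minimum query set size of three, and the average-salary query are all invented for the example; a real system would combine this check with further defenses against tracker attacks.

```python
# Minimal sketch of query set size control: answer a statistical query only
# if it aggregates at least `min_size` records. Records and the threshold
# are illustrative; real systems also need to guard against trackers.
records = [
    {"town": "A", "age": 34, "salary": 31000},
    {"town": "A", "age": 51, "salary": 45000},
    {"town": "A", "age": 29, "salary": 28000},
    {"town": "B", "age": 62, "salary": 52000},
]

def average_salary(predicate, min_size=3):
    query_set = [r for r in records if predicate(r)]
    if len(query_set) < min_size:
        return None                       # refuse: query set too small
    return sum(r["salary"] for r in query_set) / len(query_set)

print(average_salary(lambda r: r["town"] == "A"))   # answered (3 records)
print(average_salary(lambda r: r["town"] == "B"))   # refused -> None
```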
Methods for Microdata
Microdata protection methods can generate the protected microdata set X':
• Either by masking original data, i.e., generating X' as a modified version of the original microdata set X;
• Or by generating synthetic data X' that preserve some statistical properties of the original data X.
Masking methods can in turn be divided into two categories depending on their effect on the original data (6):
• Perturbative. The microdata set is distorted before publication. In this way, unique combinations of scores in the original dataset may disappear, and new, unique combinations may appear in the perturbed dataset; such confusion is beneficial for preserving statistical confidentiality. The perturbation method used should be such that statistics computed on the perturbed dataset do not differ significantly from the statistics that would be obtained on the original dataset. Noise addition, microaggregation, data/rank swapping, microdata rounding, resampling, and PRAM are examples of perturbative masking methods (see Ref. 8 for details).
• Nonperturbative. Nonperturbative methods do not alter data; rather, they produce partial suppressions or reductions of detail in the original dataset. Sampling, global recoding, top and bottom coding, and local suppression are examples of nonperturbative masking methods.
Although a reasonable overview of the methods for protecting tables or queryable databases has been given above, microdata protection methods are more diverse, so that a description of some of them is needed.
Some Microdata Protection Methods
Additive Noise. Additive noise is a family of perturbative masking methods. The noise addition algorithms in the literature are as follows:
• Masking by uncorrelated noise addition. The vector of observations $x_j$ for the jth attribute of the original dataset $X_j$ is replaced by a vector $z_j = x_j + e_j$, where $e_j$ is a vector of normally distributed errors drawn from a random variable $\varepsilon_j \sim N(0, \sigma^2_{\varepsilon_j})$, such that $\mathrm{Cov}(\varepsilon_t, \varepsilon_l) = 0$ for all $t \neq l$. This algorithm preserves neither variances nor correlations.
• Masking by correlated noise addition. Correlated noise addition also preserves means and additionally allows preservation of correlation coefficients. The difference with the previous method is that the covariance matrix of the errors is now proportional to the covariance matrix of the original data, i.e., $e \sim N(0, \Sigma_e)$, where $\Sigma_e = \alpha\Sigma$, with $\Sigma$ being the covariance matrix of the original data.
• Masking by noise addition and linear transformation. In Ref. 15, a method is proposed that ensures by additional transformations that the sample covariance matrix of the masked attributes is an unbiased estimator for the covariance matrix of the original attributes.
• Masking by noise addition and nonlinear transformation. Combining simple additive noise and nonlinear transformation has also been proposed, in such a way that application to discrete attributes is possible and univariate distributions are preserved. Unfortunately, the application of this method is very time-consuming and requires expert knowledge on the dataset and the algorithm. See Ref. 8 for more details.
In practice, only simple noise addition (the first two variants) or noise addition with linear transformation is used. When using linear transformations, a decision has to be made whether to reveal them to the data user to allow for bias adjustment in the case of subpopulations.
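The following sketch illustrates the two simple variants on an invented continuous microdata matrix; choosing the noise variance as a fraction alpha of each attribute's variance (and alpha times the covariance matrix in the correlated case) is an assumption made for the example.

```python
# Minimal sketch of masking by noise addition (the two simple variants used
# in practice). X is an invented continuous microdata matrix; alpha controls
# the amount of noise relative to the variance of the original data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[40, 30000], scale=[10, 5000], size=(1000, 2))
alpha = 0.1

# Uncorrelated noise: e_j ~ N(0, alpha * var(X_j)), Cov(e_t, e_l) = 0 for t != l
Z_uncorr = X + rng.normal(0.0, np.sqrt(alpha * X.var(axis=0)), size=X.shape)

# Correlated noise: e ~ N(0, alpha * Sigma), Sigma = covariance of the original data
Sigma = np.cov(X, rowvar=False)
Z_corr = X + rng.multivariate_normal(np.zeros(X.shape[1]), alpha * Sigma, size=X.shape[0])

# Correlated noise approximately preserves the correlation coefficients
print(np.corrcoef(X, rowvar=False)[0, 1], np.corrcoef(Z_corr, rowvar=False)[0, 1])
```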
In general, additive noise is not suitable to protect
categorical data. On the other hand, it is well suited for
continuous data for the following reasons:
• It makes no assumptions on the range of possible values for $X_i$ (which may be infinite).
• The noise being added typically is continuous and with mean zero, which suits continuous original data well.
Microaggregation. Microaggregation is a family of SDC techniques for continuous microdata. The rationale behind microaggregation is that confidentiality rules in use allow publication of microdata sets if records correspond to groups of k or more individuals, where no individual dominates (i.e., contributes too much to) the group and k is a threshold value. Strict application of such confidentiality rules leads to replacing individual values with values computed on small aggregates (microaggregates) before publication. This is the basic principle of microaggregation.
To obtain microaggregates in a microdata set with n records, these are combined to form g groups of size at least k. For each attribute, the average value over each group is computed and is used to replace each of the original averaged values. Groups are formed using a criterion of maximal similarity. Once the procedure has been completed, the resulting (modified) records can be published.
The optimal k-partition (from the information loss point of view) is defined to be the one that maximizes within-group homogeneity; the higher the within-group homogeneity, the lower the information loss, because microaggregation replaces values in a group by the group centroid. The sum-of-squares criterion is common to measure homogeneity in clustering. The within-groups sum of squares SSE is defined as

$$\mathrm{SSE} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)'(x_{ij} - \bar{x}_i)$$

The lower the SSE, the higher the within-group homogeneity. Thus, in terms of sums of squares, the optimal k-partition is the one that minimizes SSE.
Given a microdata set that consists of p attributes, these can be microaggregated together or partitioned into several groups of attributes. Also, the way to form groups may vary. Several taxonomies are possible to classify the microaggregation algorithms in the literature: (1) fixed group size (16-18) versus variable group size (19-21), (2) exact optimal [only for the univariate case, (22)] versus heuristic microaggregation (the rest of the microaggregation literature), and (3) categorical (18,23) versus continuous (the rest of the references cited in this paragraph).
To illustrate, we next give a heuristic algorithm called MDAV [maximum distance to average vector, (18,24)] for multivariate fixed-group-size microaggregation on unprojected continuous data. We designed and implemented MDAV for the μ-Argus package (17).
1. Compute the average record $\bar{x}$ of all records in the dataset. Consider the most distant record $x_r$ to the average record $\bar{x}$ (using the squared Euclidean distance).
2. Find the most distant record $x_s$ from the record $x_r$ considered in the previous step.
3. Form two groups around $x_r$ and $x_s$, respectively. One group contains $x_r$ and the $k-1$ records closest to $x_r$. The other group contains $x_s$ and the $k-1$ records closest to $x_s$.
4. If at least $3k$ records do not belong to any of the two groups formed in Step 3, go to Step 1, taking as a new dataset the previous dataset minus the groups formed in the last instance of Step 3.
5. If between $3k-1$ and $2k$ records do not belong to any of the two groups formed in Step 3: (a) compute the average record $\bar{x}$ of the remaining records, (b) find the most distant record $x_r$ from $\bar{x}$, (c) form a group containing $x_r$ and the $k-1$ records closest to $x_r$, and (d) form another group containing the rest of the records. Exit the algorithm.
6. If fewer than $2k$ records do not belong to the groups formed in Step 3, form a new group with those records and exit the algorithm.
The above algorithm can be applied independently to
each group of attributes that results from partitioning
the set of attributes in the dataset.
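A compact sketch of the MDAV steps above, assuming fixed group size k and invented numeric data, might look as follows; it also reports the within-groups sum of squares SSE, which here equals the total squared difference between the original records and their group centroids.

```python
# Minimal sketch of the MDAV heuristic described above for fixed-group-size
# multivariate microaggregation (group size k); the data values are invented.
import numpy as np

def mdav(X, k):
    """Return the microaggregated dataset: each record replaced by its group centroid."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    remaining = list(range(len(X)))

    def closest(center_idx, pool):
        d = np.sum((X[pool] - X[center_idx]) ** 2, axis=1)
        return [pool[i] for i in np.argsort(d)[:k]]

    while len(remaining) >= 3 * k:                        # Steps 1-4
        centroid = X[remaining].mean(axis=0)
        r = remaining[int(np.argmax(np.sum((X[remaining] - centroid) ** 2, axis=1)))]
        s = remaining[int(np.argmax(np.sum((X[remaining] - X[r]) ** 2, axis=1)))]
        for center in (r, s):
            group = closest(center, remaining)
            out[group] = X[group].mean(axis=0)
            remaining = [i for i in remaining if i not in group]

    if len(remaining) >= 2 * k:                           # Step 5: 2k to 3k-1 records left
        centroid = X[remaining].mean(axis=0)
        r = remaining[int(np.argmax(np.sum((X[remaining] - centroid) ** 2, axis=1)))]
        group = closest(r, remaining)
        out[group] = X[group].mean(axis=0)
        remaining = [i for i in remaining if i not in group]
    out[remaining] = X[remaining].mean(axis=0)            # Step 6: final group of < 2k records
    return out

X = np.random.default_rng(1).normal(size=(20, 3))
X_masked = mdav(X, k=4)
sse = float(np.sum((X - X_masked) ** 2))                  # within-groups sum of squares
print(sse)
```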
Data Swapping and Rank Swapping. Data swapping originally was presented as a perturbative SDC method for databases that contain only categorical attributes. The basic idea behind the method is to transform a database by exchanging values of confidential attributes among individual records. Records are exchanged in such a way that low-order frequency counts or marginals are maintained.
Even though the original procedure was not used often in practice, its basic idea had a clear influence on subsequent methods. A variant of data swapping for microdata is rank swapping, which will be described next in some detail.
Although originally described only for ordinal attributes (25), rank swapping can also be used for any numeric attribute. First, values of an attribute $X_i$ are ranked in ascending order; then each ranked value of $X_i$ is swapped with another ranked value randomly chosen within a restricted range (e.g., the rank of two swapped values cannot differ by more than p% of the total number of records, where p is an input parameter). This algorithm is used independently on each original attribute in the original dataset.
It is reasonable to expect that multivariate statistics
computed from data swapped with this algorithm will be
less distorted than those computed after an unconstrained
swap.
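A minimal sketch of rank swapping for a single numeric attribute follows; the salary column, the window parameter p = 15, and the choice to swap each value at most once are illustrative assumptions.

```python
# Minimal sketch of rank swapping for one numeric attribute: each value is
# swapped with another value whose rank lies within p percent of the number
# of records. The data column and p are illustrative.
import random

def rank_swap(values, p=15, seed=0):
    rng = random.Random(seed)
    n = len(values)
    window = max(1, n * p // 100)                        # maximum rank distance
    order = sorted(range(n), key=lambda i: values[i])    # indices in ascending order
    swapped = list(values)
    free = list(range(n))                                # ranks not yet swapped
    while free:
        r = free.pop(0)
        candidates = [s for s in free if s - r <= window]
        if not candidates:
            continue                                     # leave this value unswapped
        s = rng.choice(candidates)
        free.remove(s)
        i, j = order[r], order[s]
        swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

salaries = [21000, 35000, 29000, 52000, 41000, 27000, 60000, 33000, 45000, 38000]
print(rank_swap(salaries))
```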
PRAM. The post-randomization method [PRAM, (26)] is a probabilistic, perturbative method for disclosure protection of categorical attributes in microdata files. In the masked file, the scores on some categorical attributes for certain records in the original file are changed to a different score according to a prescribed probability mechanism, namely a Markov matrix called the PRAM matrix. The Markov approach makes PRAM very general because it encompasses noise addition, data suppression, and data recoding.
Because the PRAM matrix must contain a row for each
possible value of each attribute to be protected, PRAM
cannot be used for continuous data.
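The following sketch applies a PRAM matrix to one categorical attribute; the categories, the transition probabilities, and the records are invented for illustration.

```python
# Minimal sketch of PRAM for one categorical attribute: each original score is
# replaced by a score drawn according to its row of a Markov (PRAM) matrix.
# Categories, the matrix, and the records are illustrative only.
import random

categories = ["employed", "unemployed", "retired"]
pram_matrix = {                        # row = original score, columns = published score
    "employed":   [0.90, 0.05, 0.05],
    "unemployed": [0.10, 0.85, 0.05],
    "retired":    [0.05, 0.05, 0.90],
}

def pram(column, seed=0):
    rng = random.Random(seed)
    return [rng.choices(categories, weights=pram_matrix[value])[0] for value in column]

original = ["employed", "retired", "unemployed", "employed", "retired"]
print(pram(original))
```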
Sampling. This approach is a nonperturbative masking method. Instead of publishing the original microdata file, what is published is a sample S of the original set of records (6).
Sampling methods are suitable for categorical microdata, but for continuous microdata they probably should be combined with other masking methods. The reason is that sampling alone leaves a continuous attribute $X_i$ unperturbed for all records in S. Thus, if attribute $X_i$ is present in an external administrative public file, unique matches with the published sample are very likely: Indeed, given a continuous attribute $X_i$ and two respondents $o_1$ and $o_2$, it is highly unlikely that $X_i$ will take the same value for both $o_1$ and $o_2$ unless $o_1 = o_2$ (this is true even if $X_i$ has been truncated to represent it digitally).
If, for a continuous identifying attribute, the score of a respondent is only approximately known by an attacker, it might still make sense to use sampling methods to protect that attribute. However, assumptions on restricted attacker resources are perilous and may prove definitely too optimistic if good-quality external administrative files are at hand.
Global Recoding. This approach is a nonperturbative masking method, also known sometimes as generalization. For a categorical attribute $X_i$, several categories are combined to form new (less specific) categories, which thus results in a new $X'_i$ with $|D(X'_i)| < |D(X_i)|$, where $|\cdot|$ is the cardinality operator. For a continuous attribute, global recoding means replacing $X_i$ by another attribute $X'_i$, which is a discretized version of $X_i$. In other words, a potentially infinite range $D(X_i)$ is mapped onto a finite range $D(X'_i)$. This technique is used in the μ-Argus SDC package (17).
This technique is more appropriate for categorical microdata, where it helps disguise records with strange combinations of categorical attributes. Global recoding is used heavily by statistical offices.
Example. If a record exists with Marital status = Widow/er and Age = 17, global recoding could be applied to Marital status to create a broader category Widow/er or divorced so that the probability of the above record being unique would diminish.
Global recoding can also be used on a continuous attribute, but the inherent discretization very often leads to an unaffordable loss of information. Also, arithmetical operations that were straightforward on the original $X_i$ are no longer easy or intuitive on the discretized $X'_i$.
Top and Bottom Coding. Top and bottom coding are special cases of global recoding that can be used on attributes that can be ranked, that is, continuous or categorical ordinal attributes. The idea is that top values (those above a certain threshold) are lumped together to form a new category. The same is done for bottom values (those below a certain threshold). See Ref. 17.
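The sketch below illustrates global recoding of a continuous attribute into broader bands together with top coding above a threshold; the band edges and the salary threshold are arbitrary choices made for the example.

```python
# Minimal sketch of global recoding (discretizing a continuous attribute into
# broader bands) and top coding (lumping all values above a threshold).
# Band edges and the threshold are illustrative.
def recode_age(age):
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"                      # top coding: everything above 64

def top_code_salary(salary, threshold=100000):
    return f">={threshold}" if salary >= threshold else salary

records = [(17, 9000), (34, 31000), (71, 250000)]
print([(recode_age(a), top_code_salary(s)) for a, s in records])
```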
Local Suppression. This approach is a nonperturbative masking method in which certain values of individual attributes are suppressed with the aim of increasing the set of records agreeing on a combination of key values. Ways to combine local suppression and global recoding are implemented in the μ-Argus SDC package (17).
If a continuous attribute $X_i$ is part of a set of key attributes, then each combination of key values probably is unique. Because it does not make sense to systematically suppress the values of $X_i$, we conclude that local suppression is oriented to categorical attributes.
Synthetic Microdata Generation. Publication of synthetic, i.e., simulated, data was proposed long ago as a way to guard against statistical disclosure. The idea is to generate data randomly with the constraint that certain statistics or internal relationships of the original dataset should be preserved.
More than ten years ago, Rubin suggested in Ref. 27 to create an entirely synthetic dataset based on the original survey data and multiple imputation. A simulation study of this approach was given in Ref. 28.
We next sketch the operation of the original proposal by Rubin. Consider an original microdata set X of size n records drawn from a much larger population of N individuals, where background attributes A, nonconfidential attributes B, and confidential attributes C exist. Background attributes are observed and available for all N individuals in the population, whereas B and C are only available for the n records in the sample X. The first step is to construct from X a multiply-imputed population of N individuals. This population consists of the n records in X and M (the number of multiple imputations, typically between 3 and 10) matrices of (B, C) data for the $N - n$ nonsampled individuals. The variability in the imputed values ensures, theoretically, that valid inferences can be obtained on the multiply-imputed population. A model for predicting (B, C) from A is used to multiply-impute (B, C) in the population. The choice of the model is a nontrivial matter. Once the multiply-imputed population is available, a sample Z of $n'$ records can be drawn from it whose structure looks like that of a sample of $n'$ records drawn from the original population. This process can be done M times to create M replicates of (B, C) values. The result is M multiply-imputed synthetic datasets. To make sure no original data are in the synthetic datasets, it is wise to draw the samples from the multiply-imputed population excluding the n original records from it.
Synthetic data are appealing in that, at first glance, they seem to circumvent the reidentification problem: because published records are invented and do not derive from any original record, it might be concluded that no individual can complain about having been reidentified. At a closer look, this advantage is less clear. If, by chance, a published synthetic record matches a particular citizen's nonconfidential attributes (age, marital status, place of residence, etc.) and confidential attributes (salary, mortgage, etc.), reidentification using the nonconfidential attributes is easy, and that citizen may feel that his confidential attributes have been unduly revealed. In that case, the citizen is unlikely to be happy with or even understand the explanation that the record was generated synthetically.
On the other hand, limited data utility is another problem of synthetic data. Only the statistical properties explicitly captured by the model used by the data protector are preserved. A logical question at this point is why not directly publish the statistics one wants to preserve rather than release a synthetic microdata set.
One possible justification for synthetic microdata would be whether valid analyses could be obtained on several subdomains, i.e., similar results were obtained in several subsets of the original dataset and the corresponding subsets of the synthetic dataset. Partially synthetic or hybrid microdata are more likely to succeed in staying useful for subdomain analysis. However, when using partially synthetic or hybrid microdata, we lose the attractive feature of purely synthetic data that the number of records in the protected (synthetic) dataset is independent of the number of records in the original dataset.
EVALUATION
Evaluation of SDC methods must be carried out in terms of data utility and disclosure risk.

Measuring Data Utility

Defining what a generic utility loss measure is can be a tricky issue. Roughly speaking, such a definition should capture the amount of information loss for a reasonable range of data uses.
We will attempt a definition on the data with maximum granularity, that is, microdata. Similar definitions apply to rounded tabular data; for tables with cell suppressions, utility normally is measured as the reciprocal of the number of suppressed cells or their pooled magnitude. As to queryable databases, they can be viewed logically as tables as far as data utility is concerned: a denied query answer is equivalent to a cell suppression, and a perturbed answer is equivalent to a perturbed cell.
We will say little information loss occurs if the protected dataset is analytically valid and interesting according to the following definitions by Winkler (29):
- A protected microdata set is analytically valid if it approximately preserves the following with respect to the original data (some conditions apply only to continuous attributes):
  1. Means and covariances on a small set of subdomains (subsets of records and/or attributes)
  2. Marginal values for a few tabulations of the data
  3. At least one distributional characteristic
- A microdata set is analytically interesting if six attributes on important subdomains are provided that can be validly analyzed.
More precise conditions of analytical validity and analytical interest cannot be stated without taking specific data uses into account. As imprecise as they may be, the above definitions suggest some possible measures:

- Compare raw records in the original and the protected dataset. The more similar the SDC method is to the identity function, the less the impact (but the higher the disclosure risk!). This process requires pairing records in the original dataset and records in the protected dataset. For masking methods, each record in the protected dataset is paired naturally to the record in the original dataset from which it originates. For synthetic protected datasets, pairing is less obvious.
- Compare some statistics computed on the original and the protected datasets. The above definitions list some statistics that should be preserved as much as possible by an SDC method.
A strict evaluation of information loss must be based on the data uses to be supported by the protected data. The greater the differences between the results obtained on original and protected data for those uses, the higher the loss of information. However, very often microdata protection cannot be performed in a data use-specific manner, for the following reasons:

- Potential data uses are very diverse, and it may even be hard to identify them all at the moment of data release by the data protector.
- Even if all data uses could be identified, releasing several versions of the same original dataset so that the ith version has an information loss optimized for the ith data use may result in unexpected disclosure.

Because data often must be protected with no specific data use in mind, generic information loss measures are desirable to guide the data protector in assessing how much harm is being inflicted on the data by a particular SDC technique.
Information loss measures for numeric data. Assume a microdata set with n individuals (records) I_1, I_2, ..., I_n and p continuous attributes Z_1, Z_2, ..., Z_p. Let X be the matrix that represents the original microdata set (rows are records and columns are attributes). Let X' be the matrix that represents the protected microdata set. The following tools are useful to characterize the information contained in the dataset:
- Covariance matrices V (on X) and V' (on X').
- Correlation matrices R and R'.
- Correlation matrices RF and RF' between the p attributes and the p factors PC_1, ..., PC_p obtained through principal components analysis.
- Communality between each of the p attributes and the first principal component PC_1 (or other principal components PC_i's). Communality is the percent of each attribute that is explained by PC_1 (or PC_i). Let C be the vector of communalities for X and C' be the corresponding vector for X'.
- Factor score coefficient matrices F and F'. Matrix F contains the factors that should multiply each attribute in X to obtain its projection on each principal component. F' is the corresponding matrix for X'.
A single quantitative measure does not seem to reflect completely those structural differences. Therefore, we proposed in Refs. 30 and 31 to measure information loss through the discrepancies between the matrices X, V, R, RF, C, and F obtained on the original data and the corresponding X', V', R', RF', C', and F' obtained on the protected dataset. In particular, discrepancy between correlations is related to the information loss for data uses such as regressions and cross tabulations.
Matrix discrepancy can be measured in at least three
ways:
Mean square error. Sum of squared component-wise
differences between pairs of matrices, divided by the
number of cells in either matrix.
Mean absolute error. Sum of absolute component-
wise differences between pairs of matrices, divided
by the number of cells in either matrix.
Mean variation. Sum of absolute percent variation of
components in the matrix computed on protected
data with respect to components in the matrix com-
puted on original data, divided by the number of cells
in either matrix. This approach has the advantage of
not being affected by scale changes of attributes.
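A minimal NumPy sketch of these three discrepancy measures applied to whole matrices (the table referenced below restricts the sums to the relevant triangles for symmetric matrices; the example values are invented):

```python
import numpy as np

def mse(M, Mp):
    """Mean square error between two matrices of equal shape."""
    return np.mean((M - Mp) ** 2)

def mae(M, Mp):
    """Mean absolute error between two matrices."""
    return np.mean(np.abs(M - Mp))

def mean_variation(M, Mp):
    """Mean absolute percent variation of the protected-data matrix with
    respect to the original-data matrix (scale-free, but assumes the
    original matrix has no zero entries)."""
    return np.mean(np.abs(M - Mp) / np.abs(M))

# Example: discrepancy between original and protected covariance matrices
V  = np.array([[4.0, 1.2], [1.2, 2.5]])
Vp = np.array([[4.3, 1.0], [1.0, 2.6]])
print(mse(V, Vp), mae(V, Vp), mean_variation(V, Vp))
```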
The following table summarizes the measures proposed in the above references. In this table, p is the number of attributes, n is the number of records, and the components of matrices are represented by the corresponding lowercase letters (e.g., x_ij is a component of matrix X). Regarding X - X' measures, it also makes sense to compute them on the averages of attributes rather than on all data (call this variant $\bar{X} - \bar{X}'$). Similarly, for V - V' measures, it would also be sensible to use them to compare only the variances of the attributes, i.e., to compare the diagonals of the covariance matrices rather than the whole matrices (call this variant S - S').
Information loss measures for categorical data. These measures can be based on a direct comparison of categorical values, a comparison of contingency tables, or Shannon's entropy. See Ref. 30 for more details.
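As a hedged illustration (not taken from Ref. 30), a contingency-table comparison for a pair of categorical attributes could be sketched as follows; the function and attribute names are placeholders:

```python
import numpy as np
import pandas as pd

def contingency_distance(orig, prot, attr_a, attr_b):
    """Illustrative contingency-table comparison for two categorical
    attributes: cross-tabulate both datasets and average the absolute
    cell differences."""
    t_orig = pd.crosstab(orig[attr_a], orig[attr_b])
    t_prot = pd.crosstab(prot[attr_a], prot[attr_b])
    # Align the two tables so that categories missing in one count as zero
    t_orig, t_prot = t_orig.align(t_prot, fill_value=0)
    return np.abs(t_orig - t_prot).to_numpy().mean()
```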
Bounded information loss measures. The information loss measures discussed above are unbounded (i.e., they do not take values in a predefined interval). On the other hand, as discussed below, disclosure risk measures are naturally bounded (the risk of disclosure is naturally bounded between 0 and 1). Defining bounded information loss measures may be convenient to enable the data protector to trade off information loss against disclosure risk. In Ref. 32, probabilistic information loss measures bounded between 0 and 1 are proposed for continuous data.
Measuring Disclosure Risk
In the context of statistical disclosure control, disclosure risk can be defined as the risk that a user or an intruder can use the protected dataset X' to derive confidential information on an individual among those in the original dataset X.
Disclosure risk can be regarded from two different perspectives:

1. Attribute disclosure. This approach to disclosure is defined as follows: disclosure takes place when an attribute of an individual can be determined more accurately with access to the released statistic than is possible without access to that statistic.
2. Identity disclosure. Attribute disclosure does not imply a disclosure of the identity of any individual.
Summary of information loss measures for numeric data (rows are the compared structures, columns the three discrepancy measures):

X - X':
  Mean square error: $\sum_{j=1}^{p}\sum_{i=1}^{n}(x_{ij}-x'_{ij})^{2} / (np)$
  Mean absolute error: $\sum_{j=1}^{p}\sum_{i=1}^{n}|x_{ij}-x'_{ij}| / (np)$
  Mean variation: $\sum_{j=1}^{p}\sum_{i=1}^{n}\frac{|x_{ij}-x'_{ij}|}{|x_{ij}|} / (np)$

V - V':
  Mean square error: $\sum_{j=1}^{p}\sum_{1\le i\le j}(v_{ij}-v'_{ij})^{2} / \bigl(p(p+1)/2\bigr)$
  Mean absolute error: $\sum_{j=1}^{p}\sum_{1\le i\le j}|v_{ij}-v'_{ij}| / \bigl(p(p+1)/2\bigr)$
  Mean variation: $\sum_{j=1}^{p}\sum_{1\le i\le j}\frac{|v_{ij}-v'_{ij}|}{|v_{ij}|} / \bigl(p(p+1)/2\bigr)$

R - R':
  Mean square error: $\sum_{j=1}^{p}\sum_{1\le i<j}(r_{ij}-r'_{ij})^{2} / \bigl(p(p-1)/2\bigr)$
  Mean absolute error: $\sum_{j=1}^{p}\sum_{1\le i<j}|r_{ij}-r'_{ij}| / \bigl(p(p-1)/2\bigr)$
  Mean variation: $\sum_{j=1}^{p}\sum_{1\le i<j}\frac{|r_{ij}-r'_{ij}|}{|r_{ij}|} / \bigl(p(p-1)/2\bigr)$

RF - RF':
  Mean square error: $\sum_{j=1}^{p}w_{j}\sum_{i=1}^{p}(rf_{ij}-rf'_{ij})^{2} / p^{2}$
  Mean absolute error: $\sum_{j=1}^{p}w_{j}\sum_{i=1}^{p}|rf_{ij}-rf'_{ij}| / p^{2}$
  Mean variation: $\sum_{j=1}^{p}w_{j}\sum_{i=1}^{p}\frac{|rf_{ij}-rf'_{ij}|}{|rf_{ij}|} / p^{2}$

C - C':
  Mean square error: $\sum_{i=1}^{p}(c_{i}-c'_{i})^{2} / p$
  Mean absolute error: $\sum_{i=1}^{p}|c_{i}-c'_{i}| / p$
  Mean variation: $\sum_{i=1}^{p}\frac{|c_{i}-c'_{i}|}{|c_{i}|} / p$

F - F':
  Mean square error: $\sum_{j=1}^{p}w_{j}\sum_{i=1}^{p}(f_{ij}-f'_{ij})^{2} / p^{2}$
  Mean absolute error: $\sum_{j=1}^{p}w_{j}\sum_{i=1}^{p}|f_{ij}-f'_{ij}| / p^{2}$
  Mean variation: $\sum_{j=1}^{p}w_{j}\sum_{i=1}^{p}\frac{|f_{ij}-f'_{ij}|}{|f_{ij}|} / p^{2}$
Identity disclosure takes place when a record in the protected dataset can be linked with a respondent's identity. Two main approaches usually are employed for measuring identity disclosure risk: uniqueness and reidentification.

2.1 Uniqueness. Roughly speaking, the risk of identity disclosure is measured as the probability that rare combinations of attribute values in the released protected data are indeed rare in the original population from which the data come. This approach is used typically with nonperturbative statistical disclosure control methods and, more specifically, with sampling. The reason that uniqueness is not used with perturbative methods is that, when protected attribute values are perturbed versions of original attribute values, it does not make sense to investigate the probability that a rare combination of protected values is rare in the original dataset, because that combination is most probably not found in the original dataset.

2.2 Reidentification. This empirical approach evaluates the risk of disclosure. In this case, record linkage software is constructed to estimate the number of reidentifications that might be obtained by a specialized intruder. Reidentification through record linkage provides a more unified approach than do uniqueness methods because the former can be applied to any kind of masking and not just to nonperturbative masking. Moreover, reidentification can also be applied to synthetic data.
Trading off Information Loss and Disclosure Risk
The mission of SDC, to modify data in such a way that sufficient protection is provided at minimum information loss, suggests that a good SDC method is one achieving a
good tradeoff between disclosure risk and information
loss. Several approaches have been proposed to handle
this tradeoff. We discuss SDC scores, R-U maps, and k-
anonymity.
Score Construction. Following this idea, Domingo-
Ferrer and Torra (30) proposed a score for method perfor-
mance rating based on the average of information loss and
disclosure risk measures. For each method M and para-
meterization P, the following score is computed:
$$\mathrm{Score}(X, X') = \frac{IL(X, X') + DR(X, X')}{2}$$

where IL is an information loss measure, DR is a disclosure risk measure, and X' is the protected dataset obtained after applying method M with parameterization P to an original dataset X.
In Ref. 30, IL and DR were computed using a weighted
combination of several information loss and disclosure risk
measures. With the resulting score, a ranking of masking
methods (and their parameterizations) was obtained. To
illustrate how a score can be constructed, we next describe
the particular score used in Ref. 30.
Example. Let X and X' be matrices that represent the original and protected datasets, respectively, where all attributes are numeric. Let V and R be the covariance matrix and the correlation matrix of X, respectively; let $\bar{X}$ be the vector of attribute averages for X, and let S be the diagonal of V. Define V', R', $\bar{X}'$, and S' analogously from X'. The information loss (IL) is computed by averaging the mean variations of X - X', $\bar{X} - \bar{X}'$, V - V', S - S', and the mean absolute error of R - R', and by multiplying the resulting average by 100. Thus, we obtain the following expression for information loss:

$$IL = \frac{100}{5}\left(\frac{\sum_{j=1}^{p}\sum_{i=1}^{n}\frac{|x_{ij}-x'_{ij}|}{|x_{ij}|}}{np} + \frac{\sum_{j=1}^{p}\frac{|\bar{x}_{j}-\bar{x}'_{j}|}{|\bar{x}_{j}|}}{p} + \frac{\sum_{j=1}^{p}\sum_{1\le i\le j}\frac{|v_{ij}-v'_{ij}|}{|v_{ij}|}}{p(p+1)/2} + \frac{\sum_{j=1}^{p}\frac{|v_{jj}-v'_{jj}|}{|v_{jj}|}}{p} + \frac{\sum_{j=1}^{p}\sum_{1\le i<j}|r_{ij}-r'_{ij}|}{p(p-1)/2}\right)$$
The expression of the overall score is obtained by combining information loss and disclosure risk as follows:

$$\mathrm{Score} = \frac{IL + \dfrac{0.5\,DLD + 0.5\,PLD + ID}{2}}{2}$$
Here, DLD (distance linkage disclosure risk) is the percentage of correctly linked records using distance-based record linkage (30), PLD (probabilistic record linkage disclosure risk) is the percentage of correctly linked records using probabilistic linkage (33), ID (interval disclosure) is the percentage of original records falling in the intervals around their corresponding masked values, and IL is the information loss measure defined above.
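For illustration only, the score could be computed as in the following sketch; the weights follow directly from the formula above, and the numeric inputs are hypothetical:

```python
def sdc_score(il, dld, pld, id_disclosure):
    """Sketch of the overall score from the example above: information
    loss IL and the disclosure risk percentages DLD, PLD and ID (all on
    a 0-100 scale) combined with implicit weights 1/2, 1/8, 1/8, 1/4."""
    disclosure_risk = (0.5 * dld + 0.5 * pld + id_disclosure) / 2
    return (il + disclosure_risk) / 2

# Hypothetical measures for two masking methods: lower score is better
print(sdc_score(il=20.0, dld=5.0, pld=7.0, id_disclosure=30.0))   # 19.0
print(sdc_score(il=35.0, dld=1.0, pld=2.0, id_disclosure=10.0))   # 20.375
```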
Based on the above score, Domingo-Ferrer and Torra (30) found that, for the benchmark datasets and the intruder's external information they used, two good performers among the set of methods and parameterizations they tried were as follows: (1) rank swapping with parameter p around 15 (see the description above) and (2) multivariate microaggregation on unprojected data taking groups of three attributes at a time (Algorithm MDAV above with partitioning of the set of attributes).
Using a score permits regarding the selection of a masking method and its parameters as an optimization problem: a masking method can be applied to the original data file, and then a postmasking optimization procedure can be applied to decrease the score obtained.
On the negative side, no specific score weighting can do justice to all methods. Thus, when ranking methods, the values of all measures of information loss and disclosure risk should be supplied along with the overall score.
R-U maps. A tool that may be enlightening when trying to construct a score or, more generally, to optimize the tradeoff between information loss and disclosure risk is a graphical representation of pairs of measures (disclosure risk, information loss) or their equivalents (disclosure risk, data utility). Such maps are called R-U confidentiality maps (34). Here, R stands for disclosure risk and U stands for data utility. In its most basic form, an R-U confidentiality map is the set of paired values (R, U) of disclosure risk and data utility that correspond to various strategies for data release (e.g., variations on a parameter). Such (R, U) pairs typically are plotted in a two-dimensional graph, so that the user can easily grasp the influence of a particular method and/or parameter choice.
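A small, hypothetical sketch of such a map, plotting (R, U) pairs for a few parameter choices of some masking method (all values and labels are invented for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical (disclosure risk, data utility) pairs, one per parameter
# choice of a masking method (e.g., increasing noise amplitude a).
risk    = [0.9, 0.6, 0.4, 0.25, 0.15]
utility = [0.95, 0.85, 0.7, 0.5, 0.3]
labels  = ["a=0.01", "a=0.05", "a=0.1", "a=0.2", "a=0.5"]

plt.plot(risk, utility, marker="o")
for r, u, lab in zip(risk, utility, labels):
    plt.annotate(lab, (r, u))            # label each release strategy
plt.xlabel("Disclosure risk (R)")
plt.ylabel("Data utility (U)")
plt.title("R-U confidentiality map (illustrative values)")
plt.show()
```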
k-Anonymity. A different approach to facing the conflict between information loss and disclosure risk is suggested by Samarati and Sweeney (35). A protected dataset is said to satisfy k-anonymity for k > 1 if, for each combination of key attribute values (e.g., address, age, and gender), at least k records exist in the dataset sharing that combination. Now if, for a given k, k-anonymity is assumed to be enough protection, one can concentrate on minimizing information loss with the only constraint that k-anonymity should be satisfied. This method is a clean way of solving the tension between data protection and data utility. Because k-anonymity usually is achieved via generalization (equivalent to global recoding, as said above) and local suppression, minimizing information loss usually translates to reducing the number and/or the magnitude of suppressions.
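As an illustration, checking k-anonymity of a released table against a set of key (quasi-identifier) attributes can be sketched as follows; the attribute names and data are hypothetical:

```python
import pandas as pd

def satisfies_k_anonymity(df, key_attributes, k):
    """Check that every combination of key attribute values appears in at
    least k records of the released dataset."""
    counts = df.groupby(key_attributes).size()
    return bool((counts >= k).all())

released = pd.DataFrame({
    "age_band": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "gender":   ["F", "F", "F", "M", "M"],
    "salary":   [41000, 38000, 45000, 52000, 50000],
})
print(satisfies_k_anonymity(released, ["age_band", "gender"], k=2))  # True
print(satisfies_k_anonymity(released, ["age_band", "gender"], k=3))  # False
```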
k-Anonymity bears some resemblance to the underlying principle of multivariate microaggregation and is a useful concept because key attributes usually are categorical or can be categorized [i.e., they take values in a finite (and ideally reduced) range]. However, reidentification is not necessarily based on categorical key attributes: sometimes, numeric outcome attributes, which are continuous and often cannot be categorized, give enough clues for reidentification. Microaggregation was suggested in Ref. 18 as a possible way to achieve k-anonymity for numeric, ordinal, and nominal attributes: the idea is to use multivariate microaggregation on the key attributes of the dataset.
Future
Many open issues in SDC exist, some of which hopefully can be solved with additional research and some of which are likely to stay open because of the inherent nature of SDC. We first list some of the issues that probably could and should be settled in the near future:

- Identifying a comprehensive listing of data uses (e.g., regression models and association rules) that would allow the definition of data use-specific information loss measures broadly accepted by the community; those new measures could complement and/or replace the generic measures currently used. Work in this line was started in Europe in 2006 under the CENEX SDC project sponsored by Eurostat.
- Devising disclosure risk assessment procedures that are as universally applicable as record linkage while being less greedy in computational terms.
- Identifying, for each domain of application, the external data sources that intruders typically can access to attempt reidentification. This capability would help data protectors figure out in more realistic terms the disclosure scenarios against which they should protect data.
- Creating one or several benchmarks to assess the performance of SDC methods. Benchmark creation currently is hampered by the confidentiality of the original datasets to be protected. Data protectors should agree on a collection of nonconfidential, original-looking datasets (financial datasets, population datasets, etc.), which can be used by anybody to compare the performance of SDC methods. The benchmark should also incorporate state-of-the-art disclosure risk assessment methods, which requires continuous update and maintenance.
Other issues exist whose solution seems less likely in the near future, because of the very nature of SDC methods. If an intruder knows the SDC algorithm used to create a protected dataset, he can mount algorithm-specific reidentification attacks that can disclose more confidential information than conventional data mining attacks. Keeping the SDC algorithm secret would seem a solution, but in many cases the protected dataset gives some clues on the SDC algorithm used to produce it. Such is the case for a rounded, microaggregated, or partially suppressed microdata set. Thus, it is unclear to what extent the SDC algorithm used can be kept secret.
CROSS-REFERENCES
statistical disclosure control. See Statistical Databases.
statistical disclosure limitation. See Statistical Data-
bases.
inference control. See Statistical Databases.
privacy in statistical databases. See Statistical Data-
bases.
BIBLIOGRAPHY

1. J. Domingo-Ferrer, A three-dimensional conceptual framework for database privacy, in Secure Data Management-4th VLDB Workshop SDM 2007, Lecture Notes in Computer Science, vol. 4721. Berlin: Springer-Verlag, 2007, pp. 193-202.
2. T. Dalenius, The invasion of privacy problem and statistics production. An overview, Statistik Tidskrift, 12: 213-225, 1974.
3. J. Schlorer, Identification and retrieval of personal records from a statistical data bank, Methods Inform. Med., 14 (1): 7-13, 1975.
4. D. E. Denning, P. J. Denning, and M. D. Schwartz, The tracker: a threat to statistical database security, ACM Trans. Database Syst., 4 (1): 76-96, 1979.
5. N. R. Adam and J. C. Wortmann, Security-control for statistical databases: a comparative study, ACM Comput. Surv., 21 (4): 515-556, 1989.
6. L. Willenborg and T. DeWaal, Elements of Statistical Disclosure Control, New York: Springer-Verlag, 2001.
7. J. Schlorer, Disclosure from statistical databases: quantitative aspects of trackers, ACM Trans. Database Syst., 5: 467-492, 1980.
8. A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, R. Lenz, J. Longhurst, E. Schulte-Nordholt, G. Seri, and P.-P. DeWolf, Handbook on Statistical Disclosure Control (version 1.0). Eurostat (CENEX SDC Project Deliverable), 2006. Available: http://neon.vb.cbs.nl/CENEX/.
9. A. Hundepool, A. van de Wetering, R. Ramaswamy, P.-P. de Wolf, S. Giessing, M. Fischetti, J.-J. Salazar, J. Castro, and P. Lowthian, t-ARGUS v. 3.2 Software and User's Manual, CENEX SDC Project Deliverable, 2007. Available: http://neon.vb.cbs.nl/casc/TAU.html.
10. G. T. Duncan and S. Mukherjee, Optimal disclosure limitation strategy in statistical databases: deterring tracker attacks through additive noise, J. Am. Statist. Assoc., 45: 720-729, 2000.
11. K. Muralidhar, D. Batra, and P. J. Kirs, Accessibility, security and accuracy in statistical databases: the case for the multiplicative fixed data perturbation approach, Management Sci., 41: 1549-1564, 1995.
12. J. F. Traub, Y. Yemini, and H. Wozniakowski, The statistical security of a statistical database, ACM Trans. Database Syst., 9: 672-679, 1984.
13. F. Y. Chin and G. Ozsoyoglu, Auditing and inference control in statistical databases, IEEE Trans. Software Engin., SE-8: 574-582, 1982.
14. R. Gopal, R. Garfinkel, and P. Goes, Confidentiality via camouflage: the CVC approach to disclosure limitation when answering queries to databases, Operations Res., 50: 501-516, 2002.
15. J. J. Kim, A method for limiting disclosure in microdata based on random noise and transformation, Proceedings of the Section on Survey Research Methods, Alexandria, VA, 1986, pp. 303-308.
16. D. Defays and P. Nanopoulos, Panels of enterprises and confidentiality: the small aggregates method, Proc. of 92 Symposium on Design and Analysis of Longitudinal Surveys, Ottawa: Statistics Canada, 1993, pp. 195-204.
17. A. Hundepool, A. Van de Wetering, R. Ramaswamy, L. Franconi, A. Capobianchi, P.-P. DeWolf, J. Domingo-Ferrer, V. Torra, R. Brand, and S. Giessing, m-ARGUS version 4.0 Software and User's Manual, Statistics Netherlands, Voorburg NL, 2005. Available: http://neon.vb.cbs.nl/casc.
18. J. Domingo-Ferrer and V. Torra, Ordinal, continuous and heterogeneous k-anonymity through microaggregation, Data Mining Knowl. Discov., 11 (2): 195-212, 2005.
19. J. Domingo-Ferrer and J. M. Mateo-Sanz, Practical data-oriented microaggregation for statistical disclosure control, IEEE Trans. Knowledge Data Engin., 14 (1): 189-201, 2002.
20. M. Laszlo and S. Mukherjee, Minimum spanning tree partitioning algorithm for microaggregation, IEEE Trans. Knowledge Data Engineer., 17 (7): 902-911, 2005.
21. J. Domingo-Ferrer, F. Sebé, and A. Solanas, A polynomial-time approximation to optimal multivariate microaggregation, Computers Mathemat. Applicat., in press.
22. S. L. Hansen and S. Mukherjee, A polynomial algorithm for optimal univariate microaggregation, IEEE Trans. Knowl. Data Engineer., 15 (4): 1043-1044, 2003.
23. V. Torra, Microaggregation for categorical variables: a median based approach, in Privacy in Statistical Databases-PSD 2004, Lecture Notes in Computer Science, vol. 3050. Berlin: Springer-Verlag, 2004, pp. 162-174.
24. J. Domingo-Ferrer, A. Martínez-Ballesté, J. M. Mateo-Sanz, and F. Sebé, Efficient multivariate data-oriented microaggregation, VLDB Journal, 15: 355-369, 2006.
25. B. Greenberg, Rank swapping for ordinal data, 1987, Washington, DC: U.S. Bureau of the Census (unpublished manuscript).
26. J. M. Gouweleeuw, P. Kooiman, L. C. R. J. Willenborg, and P.-P. DeWolf, Post randomisation for statistical disclosure control: theory and implementation, 1997. Research Paper no. 9731. Voorburg: Statistics Netherlands.
27. D. B. Rubin, Discussion of statistical disclosure limitation, J. Official Stat., 9 (2): 461-468, 1993.
28. J. P. Reiter, Satisfying disclosure restrictions with synthetic data sets, J. Official Stat., 18 (4): 531-544, 2002.
29. W. E. Winkler, Re-identification methods for evaluating the confidentiality of analytically valid microdata, Res. in Official Stat., 1 (2): 50-69, 1998.
30. J. Domingo-Ferrer and V. Torra, A quantitative comparison of disclosure control methods for microdata, in P. Doyle, J. I. Lane, J. J. M. Theeuwes, and L. Zayatz (eds.), Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, Amsterdam: North-Holland, 2001, pp. 111-134.
31. F. Sebé, J. Domingo-Ferrer, J. M. Mateo-Sanz, and V. Torra, Post-masking optimization of the tradeoff between information loss and disclosure risk in masked microdata sets, in Inference Control in Statistical Databases, Lecture Notes in Computer Science, vol. 2316. Springer-Verlag, 2002, pp. 163-171.
32. J. M. Mateo-Sanz, J. Domingo-Ferrer, and F. Sebé, Probabilistic information loss measures in confidentiality protection of continuous microdata.
33. I. P. Fellegi and A. B. Sunter, A theory for record linkage, J. Am. Stat. Associat., 64 (328): 1183-1210, 1969.
34. G. T. Duncan, S. E. Fienberg, R. Krishnan, R. Padman, and S. F. Roehrig, Disclosure limitation methods and information loss for tabular data, in P. Doyle, J. I. Lane, J. J. Theeuwes, and L. V. Zayatz (eds.), Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, Amsterdam: North-Holland, 2001, pp. 135-166.
35. P. Samarati and L. Sweeney, Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression, Technical Report, SRI International, 1998.

FURTHER READING

A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, R. Lenz, J. Longhurst, E. Schulte-Nordholt, G. Seri, and P.-P. DeWolf, Handbook on Statistical Disclosure Control (version 1.0), Eurostat (CENEX SDC Project Deliverable), 2006. Available: http://neon.vb.cbs.nl/CENEX/.
L. Willenborg and T. DeWaal, Elements of Statistical Disclosure Control, New York: Springer-Verlag, 2001.
JOSEP DOMINGO-FERRER
Rovira i Virgili University
Tarragona, Catalonia, Spain
SYSTEM MONITORING
The term system refers to a computer system that is composed of hardware and software for data processing. System
monitoring collects information about the behavior of a
computer system while the system is running. What is of
interest here is run-time information that cannot be
obtained by static analysis of programs. All collected infor-
mation is essentially about system correctness or perfor-
mance. Such information is vital for understanding how a
system works. It can be used for dynamic safety checking
and failure detection, program testing and debugging,
dynamic task scheduling and resource allocation, per-
formance evaluation and tuning, system selection and
design, and so on.
COMPONENTS AND TECHNIQUES FOR SYSTEM
MONITORING
System monitoring has three components. First, the jobs to be run and the items to be measured are determined. Then, the system to be monitored is modified to run the jobs and
take the measurements. This is the major component.
Monitoring is accomplished in two operations: triggering
and recording (1). Triggering, also called activation, is the
observation and detection of specified events during system
execution. Recording is the collection and storage of data
pertinent to those events. Finally, the recorded data are
analyzed and displayed.
The selection and characterization of the jobs to be run
for monitoring is important, because it is the basis for
interpreting the monitoring results and guaranteeing
that the experiments are repeatable. A collection of jobs
to be run is called a test workload (2-4); for performance
monitoring, this refers mainly to the load rather than the
work, or job. A workload can be real or synthetic. A real
workload consists of jobs that are actually performed by the
users of the system to be monitored. A synthetic workload,
usually called a benchmark, consists of batch programs or
interactive scripts that are designed to represent the actual
jobs of interest. Whether a workload is real or synthetic
does not affect the monitoring techniques.
Items to be measured are determined by the applica-
tions. They can be about the entire system or about differ-
ent levels of the system, from user-level application
programs to operating systems to low-level hardware cir-
cuits. For the entire system, one may need to know whether
jobs are completed normally and performance indices such
as job completion time, called turnaround time in batch
systems and response time in interactive systems, or the
number of jobs completed per unit of time, called through-
put (3). For application programs, one may be interested in
how often a piece of code is executed, whether a variable is
read between two updates, or how many messages are sent
by a process. For operating systems, one may need to know
whether the CPU is busy at certain times, how often paging occurs, or how long an I/O operation takes. For hardware circuits, one may need to know how often a cache element is
replaced, or whether a network wire is busy.
Monitoring can use either event-driven or sampling
techniques (3). Event-driven monitoring is based on obser-
ving changes of system state, either in software programs
or hardware circuits, that are caused by events of interest,
such as the transition of the CPU from busy to idle. It is
often implemented as special instructions for interrupt
intercept that are inserted into the system to be monitored.
Sampling monitoring is based on probing at selected time
intervals, into either software programs or hardware cir-
cuits, to obtain data of interest, such as what kinds of
processes occupy the CPU. It is often implemented as timer
interrupts during which the state is recorded. Note that the
behavior of the system under a given workload can be
simulated by a simulation tool. Thus monitoring that
should be performed on the real system may be carried
out on the simulation tool. Monitoring simulation tools is
useful, or necessary, for understanding the behavior of
models of systems still under design.
Monitoring can be implemented using software, hard-
ware, or both (1,35). Software monitors are programs that
are inserted into the system to be monitored. They are
triggered upon appropriate interrupts or by executing
the inserted code. Data are recorded in buffers in the
working memory of the monitored system and, when neces-
sary, written to secondary storage. Hardware monitors are
electronic devices that are connected to specic system
points. They are triggered upon detecting signals of inter-
est. Data are recorded in separate memory independent of
the monitored system. Hybrid monitors combine techni-
ques from software and hardware. Often, the triggering is
carried out using software, and the data recording is carried
out using hardware.
The data collected by a monitor must be analyzed and
displayed. Based on the way in which results are analyzed
and displayed, a monitor is classified as an on-line monitor
or a batch monitor. On-line monitors analyze and display
the collected data in real-time, either continuously or at
frequent intervals, while the system is still being moni-
tored. This is also called continuous monitoring (6). Batch
monitors collect data rst and analyze and display them
later using a batch program. In either case, the analyzed
data can be presented using many kinds of graphic charts,
as well as text and tables.
ISSUES IN SYSTEM MONITORING
Major issues of concern in monitoring are at what levels we
can obtain information of interest, what modifications to the system are needed to perform the monitoring, the disturbance of such modifications to the system behavior, and the cost of implementing such modifications. There are
also special concerns for monitoring real-time systems,
parallel architectures, and distributed systems.
Activities and data structures visible to the user process
can be monitored at the application-program level. These
include function and procedure calls and returns, assign-
ments to variables, loopings and branchings, inputs and
outputs, as well as synchronizations. Activities and data
structures visible to the kernel can be monitored at the
operating-system level; these include system state transi-
tions, external interrupts, system calls, as well as data
structures such as process control blocks. At the hardware
level, various patterns of signals on the buses can be
monitored. Obviously, certain high-level information can-
not be obtained by monitoring at a lower level, and vice
versa. It is worth noting that more often high-level infor-
mation can be used to infer low-level information if one
knows enough about all the involved components, such as
the compilers, but the converse is not true, simply because
more often multiple high-level activities are mapped to the
same low-level activity.
In general, the software and hardware of a system are
not purposely designed to be monitored. This often restricts what can be monitored in a system. To overcome these restrictions, modifications to the system, called instrumen-
tation, are often required, for example, inserting interrupt
instructions or attaching hardware devices. The informa-
tion obtainable with a monitor and the cost of measure-
ments determine the measurability of a computer system
(3,7). At one extreme, every system component can be
monitored at the desired level of detail, while at the other
extreme, only the external behavior of the system as a
whole can be monitored. When a low degree of detail is
required, a macroscopic analysis, which requires measure-
ment of global indices such as turnaround time and
response time, is sufficient. When a high degree of detail
is needed, a microscopic analysis, which requires, say, the
time of executing each instruction or loading each indivi-
dual page, must be performed.
Monitoring often interferes with the system behavior,
since it may consume system resources, due to the time of
performing monitoring activities and the space of storing
collected data, which are collectively called the overhead of
monitoring. A major issue in monitoring is to reduce the
perturbation. It is easy to see that a macroscopic analysis
incurs less interference than a microscopic analysis.
Usually, sampling monitoring causes less interference than event-driven monitoring. In terms of implementation, software monitors always interfere and sometimes interfere greatly with the system to be monitored, but hardware
monitors cause little or no interference.
Implementing monitoring usually has a cost, since it
requires modification to the system to be monitored. There-
fore, an important concern is to reduce the cost. Software
monitors are simply programs, so they are usually less
costly to develop and easier to change. In contrast, hard-
ware monitors require separate hardware devices and thus
are usually more difficult to build and modify.
Finally, special methods and techniques are necessary
for monitoring real-time systems, parallel architectures,
and distributed systems. Real-time systems have real-time
constraints, so interference becomes much more critical.
For parallel architectures, monitoring needs to handle
issues arising from interprocessor communication and
scheduling, cache behavior, and shared memory behavior.
For distributed systems, monitoring must take into account
ordering of distributed events, message passing, synchro-
nization, as well as various kinds of failures.
MONITORING PRINCIPLES
A set of principles is necessary to address all the issues
involved in monitoring. The major task is to determine the
monitoring techniques needed based on the applications
and the trade-offs. Methods and tools that facilitate mon-
itoring are also needed.
Consider the major task. Given the desired information,
one first needs to determine all levels that can be monitored to obtain the information. For each possibility, one determines all modifications of the system that are needed to
perform the monitoring. Then one needs to assess the
perturbation that the monitoring could cause. Finally,
one must estimate the cost of the implementations. Clearly,
unacceptable perturbation or cost helps reduce the possi-
bilities. Then, one needs to evaluate all possibilities based
on the following trade-offs.
First, monitoring at a higher level generally requires
less modification to the system and has smaller implemen-
tation cost, but it may have larger interference with the
system behavior. Thus one principle is to monitor at the
highest level whose interference is acceptable. This implies
that, if a software monitor has acceptable interference, one
should avoid using a hardware monitor. Furthermore, to
reduce implementation cost, for a system being designed or that is difficult to measure, one can use simulation tools
instead of the real system if credibility can be established.
Second, macroscopic analysis generally causes less per-
turbation to the system behavior than microscopic analysis, and it often requires less modification to the system and has
smaller cost. Therefore, a second principle is to use macro-
scopic analysis instead of microscopic analysis if possible.
While sampling is a statistical technique that records data
only at sampled times, event detection is usually used to
record all potentially interesting events and construct the
execution trace. Thus one should avoid using tracing if the
desired information can be obtained by sampling.
Additionally, one should consider workload selection
and data analysis. Using benchmarks instead of real work-
load makes the experiments repeatable and facilitates
comparison of monitoring results. It can also reduce the
cost, since running real jobs could be expensive or impos-
sible. Thus using benchmarks is preferred, but a number of
common mistakes need to be carefully avoided (4). Data
analysis involves a separate trade-off: the on-line method
adds time overhead but can reduce the space overhead.
Thus even when monitoring results do not need to be
presented in an on-line fashion, on-line analysis can be
used to reduce the space overhead and, when needed,
separate processors can be used to reduce also the time
overhead.
Finally, special applications determine special monitor-
ing principles. For example, for monitoring real-time sys-
tems, perturbation is usually not tolerable, but a full trace
is often needed to understand system behavior. To address
this problem, one may perform microscopic monitoring
based on event detection and implement monitoring in
hardware so as to sense signals on buses at high speed
and with low overhead. If monitoring results are needed in
an on-line fashion, separate resources for data analysis
must be used. Of course, all these come at a cost.
To facilitate monitoring, one needs methods and tools for instrumenting the system, efficient data structures and
algorithms for storing and manipulating data, and tech-
niques for relating monitoring results to the source
program to identify problematic code sections. Instrumentation of programs can be done via program transformation,
by augmenting the source code, the target code, the run-
time environment, the operating system, or the hardware.
Often, combinations of these techniques are used. Efficient
data structures and algorithms are needed to handle
records of various execution information, by organizing
them in certain forms of tables and linked structures.
They are critical for reducing monitoring overhead. Addi-
tional information from the compiler and other involved
components can be used to relate monitoring results with
points in source programs. Monitoring results can also help
select candidate jobs for further monitoring.
In summary, a number of trade-offs are involved in
determining the monitoring techniques adopted for a par-
ticular application. Tools should be developed and used to
help instrument the system, reduce the overhead, and
interpret the monitoring results.
WORKLOAD SELECTION
To understand how a complex system works, one first needs
to determine what to observe. Thus before determining how
to monitor a system, one must determine what to monitor
and why it is important to monitor them. This enables one
to determine the feasibility of the monitoring, based on the
perturbation and the cost, and then allows repeating and
justifying the experiments.
Selecting candidate jobs to be run and measurements to
be taken depends on the objectives of monitoring. For
monitoring that is aimed at performance behavior, such
as system tuning or task scheduling, one needs to select the representative load of work. For monitoring that is aimed at
functional correctness, such as for debugging and fault-
tolerance analysis, one needs to isolate the buggy or
faulty parts.
A real workload best reflects system behavior under
actual usage, but it is usually unnecessarily expensive,
complicated, or even impossible to use as a test workload.
Furthermore, the test results are not easily repeated and
are not good for comparison. Therefore, a synthetic work-
load is normally used. For monitoring the functional cor-
rectness of a system, a test suite normally consists of data
that exercise various parts of the system, and monitoring at
those parts is set up accordingly. For performance monitor-
ing, the load of work, rather than the actual jobs, is the
major concern, and the approaches below have been used
for obtaining test workloads (3,4,8).
The addition instruction was used to measure early com-
puters, which had mainly a few kinds of instructions.
Instruction mixes, each specifying various instructions
together with their usage frequencies, were used when
the varieties of instructions grew. Then, when pipelining,
instruction caching, and address translation mechanisms
made computer instruction times highly variable, kernels,
which are higher-level functions, such as matrix inver-
sion and Ackermann's function, which represent services
provided by the processor, came into use. Later on, as input
and output became an important part of real workload,
synthetic programs, which are composed of exerciser
loops that make a specied number of service calls or I/O
requests, came into use. For domain-specic kinds of
applications, such as banking or airline reservation, appli-
cation benchmarks, representative subsets of the functions
in the application that make use of all resources in the
system, are used. Kernels, synthetic programs, and appli-
cation benchmarks are all called benchmarks. Popular
benchmarks include the sieve kernel, the LINPACK
benchmarks, the debit-credit benchmark, and the SPEC
benchmark suite (4).
Consider monitoring the functional behavior of a system. For general testing, the test suite should have complete coverage; that is, all components of the system should be exercised. For debugging, one needs to select jobs that isolate the problematic parts. This normally involves repeatedly selecting more specialized jobs and more focused monitoring points based on monitoring results. For correctness checking at given points, one needs to select jobs that lead to different possible results at those points and monitor at those points. Special methods are used for special classes of applications; for example, for testing fault-tolerance in distributed systems, message losses or process failures can be included in the test suite.
For system performance monitoring, selection should consider the services exercised as well as the level of detail and representativeness (4). The starting point is to consider the system as a service provider and to select the workload and metrics that reflect the performance of services provided at the system level and not at the component level. The amount of detail in recording user requests should be determined. Possible choices include the most frequent request, the frequency of request types, the sequence of requests with time stamps, and the average resource demand. The test workload should also be representative of the real application. Representativeness is reflected at different levels (3): at the physical level, the consumption of hardware and software resources should be representative; at the virtual level, the logical resources that are closer to the user's point of view, such as virtual memory space, should be representative; at the functional level, the test workload should include applications that perform the same functions as the real workload.
Workload characterization is the quantitative descrip-
tion of a workload (3,4). It is usually done in terms of
workload parameters that can affect system behavior.
These parameters are about service requests, such as
arrival rate and duration of request, or about measured
quantities, such as CPU time, memory space, amount of
read and write, or amount of communication, for which
system-independent parameters are preferred. In addition, various techniques have been used to obtain statistical quantities, such as frequencies of instruction types, mean time for executing certain I/O operations, and probabilities
of accessing certain devices. These techniques include
averaging, histograms, Markov models, and clustering.
Markov models specify the dependency among requests
using a transition diagram. Clustering groups similar com-
ponents in a workload in order to reduce the large number
of parameters for these components.
TRIGGERING MECHANISM
Monitoring can use either event-driven or sampling tech-
niques for triggering and data recording (3). Event-driven
techniques can lead to more detailed and accurate informa-
tion, while sampling techniques are easier to implement
and have smaller overhead. These two techniques are not
mutually exclusive; they can coexist in a single tool.
Event-Driven Monitoring
An event in a computer system is any change of the system's state, such as the transition of a CPU from busy to idle, the change of content in a memory location, or the occurrence of a pattern of signals on the memory bus. Therefore, a way of collecting data about system activities is to capture all associated events and record them in the order they occur. A software event is an event associated with a program's function, such as the change of content in a memory location or the start of an I/O operation. A hardware event is a combination of signals in the circuit of a system, such as a pattern of signals on the memory bus or signals sent to the disk drive.
Event-driven monitoring using software is done by inserting a special trap code or hook in specific places of the application program or the operating system. When an event to be captured occurs, the inserted code causes control to be transferred to an appropriate routine. The routine records the occurrence of the event and stores relevant data in a buffer area, which is to be written to secondary storage and/or analyzed, possibly at a later time. Then the control is transferred back. The recorded events and data form an event trace. It can provide more information than any other method on certain aspects of a system's behavior.
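A toy sketch of this kind of application-level instrumentation; the hook names, buffer size, and log file are assumptions of the sketch, not part of any particular monitoring tool:

```python
import time

event_buffer = []          # in-memory buffer, flushed to disk when full

def record_event(name, **data):
    """Hook routine: store the event and relevant data with a timestamp."""
    event_buffer.append((time.perf_counter(), name, data))
    if len(event_buffer) >= 10_000:          # keep the buffer bounded
        flush_buffer()

def flush_buffer():
    """Write buffered events to secondary storage and clear the buffer."""
    with open("trace.log", "a") as f:
        for ts, name, data in event_buffer:
            f.write(f"{ts:.6f} {name} {data}\n")
    event_buffer.clear()

def read_sensor(sensor_id):
    record_event("read_sensor.enter", sensor=sensor_id)    # inserted hook
    value = 42                                             # ... original work ...
    record_event("read_sensor.exit", sensor=sensor_id, value=value)
    return value
```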
Producing full event traces using software has high
overhead, since it can consume a great deal of CPU time
by collecting and analyzing a large amount of data. There-
fore, event tracing in software should be selective, since
intercepting too many events may slow down the normal
execution of the system to an unacceptable degree. Also, to
keep buffer space limited, buffer content must be written to
secondary storage with some frequency, which also con-
sumes time; the system may decide to either wait for the
completion of the buffer transfer or continue normally with
some data loss.
In most cases, event-driven monitoring using software is difficult to implement, since it requires that the application program or the operating system be modified. It may also introduce errors. To modify the system, one must understand its structure and function and identify safe places for the modifications. In some cases, instrumentation is not possible when the source code of the system is not available.
Event-driven monitoring in hardware uses the same
techniques as in software, conceptually and in practice,
for handling events. However, since hardware uses sepa-
rate devices for trigger and recording, the monitoring over-
head is small or zero. Some systems are even equipped with
hardware that makes event tracing easier. Such hardware
can help evaluate the performance of a system as well as
test and debug the hardware or software. Many hardware
events can also be detected via software.
Sampling Monitoring
Sampling is a statistical technique that can be used when
monitoring all the data about a set of events is unnecessary,
impossible, or too expensive. Instead of monitoring the
entire set, one can monitor a part of it, called a sample.
From this sample, it is then possible to estimate, often with
a high degree of accuracy, some parameters that charac-
terize the entire set. For example, one can estimate the
proportion of time spent in different code segments by
sampling program counters instead of recording the event
sequence and the exact event count; samples can also be
taken to estimate how much time different kinds of processes occupy the CPU, how much memory is used, or how often
a printer is busy during certain runs.
In general, sampling monitoring can be used for mea-
suring the fractions of a given time interval each system
component spends in its various states. It is easy to imple-
ment using periodic interrupts generated by a timer. Dur-
ing an interrupt, control is transferred to a data-collection
routine, where relevant data in the state are recorded. The
data collected during the monitored interval are later
analyzed to determine what happened during the interval,
in what ratios the various events occurred, and how differ-
ent types of activities were related to each other. Besides
timer interrupts, most modern architectures also include
hardware performance counters, which can be used for
generating periodic interrupts (9). This helps reduce the
need for additional hardware monitoring.
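A minimal sketch of sampling monitoring driven by a periodic timer interrupt (POSIX-only, using SIGALRM; the "system state" being sampled is a stand-in variable for whatever quantity a real monitor would probe):

```python
import collections
import signal
import time

state_counts = collections.Counter()     # sampled states, kept in a counter
current_state = "idle"                   # toy observable standing in for real state

def sample(signum, frame):
    """Data-collection routine run on each timer interrupt: record the
    state observed at this sampling instant."""
    state_counts[current_state] += 1

# Sample every 10 ms using a POSIX interval timer.
signal.signal(signal.SIGALRM, sample)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)

for phase in ("startup", "busy", "busy", "io_wait", "idle"):
    current_state = phase
    time.sleep(0.2)                      # stand-in for real work in each phase

signal.setitimer(signal.ITIMER_REAL, 0, 0)   # stop sampling
total = sum(state_counts.values())
for state, n in state_counts.items():
    print(f"{state}: {100 * n / total:.0f}% of samples")
```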
The accuracy of the results is determined by how repre-
sentative a sample is. When one has no knowledge of
the monitored system, random sampling can ensure repre-
sentativeness if the sample is sufficiently large. It should
be noted that, since the sampled quantities are functions
of time, the workload must be stationary to guarantee
validity of the results. In practice, operating-system
workload is rarely stationary during long periods of time,
but relatively stationary situations can usually be obtained
by dividing the monitoring interval into short periods of,
say, a minute and grouping homogeneous blocks of data
together.
Sampling monitoring has two major advantages. First,
the monitored program need not be modified. Therefore,
knowledge of the structure and function of the monitored
program, and often the source code, is not needed for
sampling monitoring. Second, sampling allows the system
to spend much less time in collecting and analyzing a much
smaller amount of data, and the overhead can be kept less
than 5% (3,9,10). Furthermore, the frequency of the inter-
rupts can easily be adjusted to obtain appropriate sample
size and appropriate overhead. In particular, the overhead
can also be estimated easily. All these make sampling
monitoring particularly good for performance monitoring
and dynamic system resource allocation.
IMPLEMENTATION
System monitoring can be implemented using software
or hardware. Software monitors are easier to build and
modify and are capable of capturing high-level events and
relating them to the source code, while hardware monitors
can capture rapid events at circuit level and have lower
overhead.
Software Monitoring
Software monitors are used to monitor application pro-
grams and operating systems. They consist solely of instru-
mentation code inserted into the system to be monitored.
Therefore, they are easier to build and modify. At each
activation, the inserted code is executed and relevant data
are recorded, using the CPU and memory of the monitored
system. Thus software monitors affect the performance and
possibly the correctness of the monitored system and are
not appropriate for monitoring rapid events. For example, if the monitor executes 100 instructions at each activation, and each instruction takes 1 μs, then each activation takes 0.1 ms; to limit the time overhead to 1%, the monitor must be activated at intervals of 10 ms or more, that is, fewer than 100 monitored events should occur per second.
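The arithmetic of this overhead budget, with the 1 μs-per-instruction figure assumed above, can be written out as follows:

```python
def max_event_rate(instructions_per_activation, seconds_per_instruction,
                   overhead_budget):
    """Worked version of the overhead estimate above: how many monitored
    events per second keep the monitoring overhead within budget."""
    activation_time = instructions_per_activation * seconds_per_instruction
    min_interval = activation_time / overhead_budget   # e.g. 0.1 ms / 1% = 10 ms
    return 1.0 / min_interval

# 100 instructions per activation, 1 microsecond per instruction, 1% budget
print(max_event_rate(100, 1e-6, 0.01))    # -> 100.0 events per second
```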
Software monitors can use both event-driven and sampling techniques. Obviously, a major issue is how to reduce the monitoring overhead while obtaining sufficient information. When designing monitors, there may first be a tendency to collect as much data as possible by tracing or sampling many activities. It may even be necessary to add a considerable amount of load to the system or to slow down the program execution. After analyzing the initial results, it will be possible to focus the experiments on specific activities in more detail. In this way, the overhead can usually be kept within reasonable limits. Additionally, the amount of data collected may be kept to a minimum by using efficient data structures and algorithms for storage and analysis. For example, instead of recording the state at each activation, one may only need to maintain a counter for the number of times each particular state has occurred, and these counters may be maintained in a hash table (9).
Inserting code into the monitored system can be done in
three ways: (1) adding a program, (2) modifying the appli-
cation program, or (3) modifying the operating system
(3). Adding a program is simplest and is generally pre-
ferred to the other two, since the added program can easily
be removed or added again. Also, it maintains the inte-
grity of the monitored program and the operating system.
It is adequate for detecting the activity of a system or a
program as a whole. For example, adding a program that
reads the system clock before and after execution of a
program can be used to measure the execution time.
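For instance, such an added wrapper program might be sketched as follows; the measured command is arbitrary and the wrapper does not touch the monitored program at all:

```python
import subprocess
import time

def timed_run(command):
    """Read the clock before and after running the monitored program as a
    whole, leaving the program itself unmodified."""
    start = time.perf_counter()
    result = subprocess.run(command)          # run the unmodified program
    elapsed = time.perf_counter() - start
    return result.returncode, elapsed

# Hypothetical job; any command line could be substituted here.
rc, seconds = timed_run(["sleep", "1"])
print(f"exit status {rc}, elapsed {seconds:.2f} s")
```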
Modifying the application program is usually used for event-driven monitoring, which can produce an execution trace or an exact profile for the application. It is based on the use of software probes, which are groups of instructions inserted at critical points in the program to be monitored. Each probe detects the arrival of the flow of control at the point where it is placed, allowing the execution path and the number of times these paths are executed to be known. Also, relevant data in registers and in memory may be examined when these paths are executed. It is possible to perform sampling monitoring by using the kernel interrupt service from within an application program, but it can be performed more efficiently by modifying the kernel.
Modifying the kernel is usually used for monitoring the system as a service provider. For example, instructions can be inserted to read the system clock before and after a service is provided in order to calculate the turnaround time or response time; this interval cannot be obtained from within the application program. Sampling monitoring can be performed efficiently by letting an interrupt handler directly record relevant data. The recorded data can be analyzed to obtain information about the kernel as well as the application programs.
Software monitoring, especially event-driven monitor-
ing in the application programs, makes it easy to obtain
descriptive data, such as the name of the procedure that is
called last inthe application programor the name of the le
that is accessed most frequently. This makes it easy to
correlate the monitoring results with the source program,
to interpret them, and to use them.
There are two special kinds of software monitors. One keeps system accounting logs (4,6) and is usually built into the operating system to keep track of resource usage; thus additional monitoring might not be needed. The other is the program execution monitor (4,11), often used for finding the performance bottlenecks of application programs. It typically produces an execution profile, based on event detection or statistical sampling. For event-driven precise profiling, efficient algorithms have been developed to keep the overhead to a minimum (12). For sampling profiling, optimizations have been implemented that yield an overhead of 1% to 3%, so the profiling can be employed continuously (9).
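Python's built-in cProfile module is one example of a program execution monitor; a typical use produces an execution profile and lists the most expensive calls:

import cProfile
import pstats

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Produce an execution profile of fib(25) and print the costliest calls.
cProfile.run("fib(25)", "fib.prof")
pstats.Stats("fib.prof").sort_stats("cumulative").print_stats(5)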
Hardware Monitoring
With hardware monitoring, the monitor uses hardware to interface to the system to be monitored (5,13-16). The hardware passively detects events of interest by snooping on electric signals in the monitored system. The monitored system is not instrumented, and the monitor does not share any of the resources of the monitored system. The main advantage of hardware monitoring is that the monitor does not interfere with the normal functioning of the monitored system and that rapid events can be captured. The disadvantage of hardware monitoring is its cost and the fact that it is usually machine dependent or at least processor dependent: the snooping device and the signal interpretation are bus and processor dependent.
In general, hardware monitoring is used to monitor the run-time behavior of either hardware devices or software modules. Hardware devices are generally monitored to examine issues such as cache accesses, cache misses, memory access times, total CPU times, total execution times, I/O requests, I/O grants, and I/O busy times. Software modules are generally monitored to debug the modules or to examine issues such as the bottlenecks of a program, the deadlocks, or the degree of parallelism.
A hardware monitor generally consists of a probe, an event filter, a recorder, and a real-time clock. The probe is a set of high-impedance detectors that interface with the buses of the system to be monitored to latch the signals on the buses. The signals collected by the probe are manipulated by the event filter to detect events of interest. The data relevant to the detected event, along with the value of the real-time clock, are saved by the recorder. Based on the implementation of the event filter, hardware tools can be classified as fixed hardware tools, wired-program hardware tools, and stored-program hardware tools (5,13).
With fixed hardware tools, the event filtering mechanism is completely hard-wired. The user can select neither the events to be detected nor the actions to be performed upon detection of an event. Such tools are generally designed to measure specific parameters and are often incorporated into a system at design time. Examples of fixed hardware tools are timing meters and counting meters. Timing meters, or timers, measure the duration of an activity or an execution time, and counting meters, or counters, count occurrences of events, for example, references to a memory location. When a certain value is reached in a timer (or a counter), an electronic pulse is generated as an output of the timer (or the counter), which may be used to activate certain operations, for instance, to generate an interrupt to the monitored system.
Wired-program hardware tools allow the user to detect different events by setting the event filtering logic. The event filter of a wired-program hardware tool consists of a set of combinational and sequential logic elements. The interconnections between these elements can be selected and manually manipulated by the user so as to match different signal patterns and sequences for different events. Thus wired-program tools are more flexible than fixed hardware tools.
With stored-program hardware tools, filtering functions can be configured and set up by software. Generally, a stored-program hardware tool has its own processor, that is, its own computer. The computer executes programs to set up filtering functions, to define actions in response to detected events, and to process and display collected data. Their ability to control filtering makes stored-program tools more flexible and easier to use. Logical state analyzers are typical examples of stored-program hardware tools. With a logical state analyzer, one can specify states to be traced, define triggering sequences, and specify actions to be taken when certain events are detected. In newer logical state analyzers, all of this can be accomplished through a graphical user interface, making them very user-friendly.
Hybrid Monitoring
One drawback of the hardware monitoring approach is that as integrated circuit techniques advance, more functions are built on-chip. Thus desired signals might not be accessible, and the accessible information might not be sufficient to determine the behavior inside the chip. For example, with increasingly sophisticated caching algorithms implemented for on-chip caches, the information collected from external buses may be insufficient to determine what data need to be stored. Prefetched instructions and data might not be used by the processor, and some events can only be identified by a sequence of signal patterns rather than by a single address or instruction. Therefore passively snooping on the bus might not be effective. Hybrid monitoring is an attractive compromise between intrusive software monitoring and expensive nonintrusive hardware monitoring.
Hybrid monitoring uses both software and hardware to perform monitoring activities (5,16-18). In hybrid monitoring, triggering is accomplished by instrumented software and recording is performed by hardware. The instrumented program writes the selected data to a hardware interface. The hardware device records the data at the hardware interface along with other data such as the current time. Perturbation to the monitored system is reduced by using hardware to store the collected data in a separate storage device.
Current hybrid monitoring techniques use two different
triggering approaches. One has a set of selected memory
addresses to trigger data recording. When a selected
address is detected on the system address bus, the moni-
toring device records the address and the data on the
system data bus. This approach is called memory-mapped
monitoring. The other approach uses the coprocessor
instructions to trigger event recording. The recording
unit acts as a coprocessor that executes the coprocessor
instructions. This is called coprocessor monitoring.
With memory-mapped monitoring, the recording part of the monitor acts like a memory-mapped output device with a range of the computer's address space allocated to it (5,16,17). The processor can write to the locations in that range in the same way as to the rest of memory. The system or program to be monitored is instrumented to write to the memory locations representing different events. The recording section of the monitor generally contains a comparator, a clock and timer, an overflow control, and an event buffer. The clock and timer provide the time reference for events; the resolution of the clock guarantees that no two successive events have the same time stamp. The comparator is responsible for checking the monitored system's address bus for designated events. Once such an address is detected, the matched address, the time, and the data on the monitored system's data bus are stored in the event buffer. The overflow control is used to detect events lost due to buffer overflow.
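On the software side, the instrumentation for memory-mapped monitoring reduces to ordinary writes into the reserved address range. The sketch below is purely illustrative and assumes the recorder's event buffer is exposed as a memory-mappable device file; the device path and record layout are hypothetical.

import mmap
import struct
import time

EVENT_FMT = "<II"   # hypothetical record: 32-bit event code, 32-bit payload

# Assumption: the hardware recorder appears as a mappable device file.
with open("/dev/event_recorder", "r+b") as dev:
    buf = mmap.mmap(dev.fileno(), 4096)

    def emit_event(code, payload=0):
        """Write an event record; the recorder latches the write and timestamps it."""
        buf.seek(0)
        buf.write(struct.pack(EVENT_FMT, code, payload))

    emit_event(1)          # e.g., "entered critical section"
    time.sleep(0.001)
    emit_event(2)          # e.g., "left critical section"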
With coprocessor monitoring, the recording part is attached to the monitored processor through a coprocessor interface, like a floating-point coprocessor (18). The recorder contains a set of data registers, which can be accessed directly by the monitored processor through coprocessor instructions. The system to be monitored is instrumented using two types of coprocessor instructions: data instructions and event instructions. Data instructions are used to send event-related information to the data registers of the recorder. Event instructions are used to inform the recorder of the occurrence of an event. When an event
instruction is received by the recorder, the recorder saves
its data registers, the event type, and a time stamp.
DATA ANALYSIS AND PRESENTATION
The collected data are voluminous and usually not in a readable or directly usable form, especially low-level data collected in hardware. Presenting these data requires automated analyses, which may be simple or complicated, depending on the application. When monitoring results are not needed in an on-line fashion, one can store all collected data, at the expense of storage space, and analyze them off-line; this reduces the time overhead of monitoring caused by the analysis. For monitoring that requires on-line data analysis, efficient on-line algorithms are needed to process the collected data incrementally, but such algorithms are sometimes difficult to design.
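As an example of such an incremental computation, the following sketch maintains a running mean and variance of a measured quantity (using Welford's method), so each new observation is folded in without storing the whole data stream:

class RunningStats:
    """Incrementally maintains count, mean, and variance of observations."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0      # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for response_time in [12.0, 15.5, 11.2, 14.8, 13.1]:   # e.g., measured response times (ms)
    stats.update(response_time)
print(stats.n, round(stats.mean, 2), round(stats.variance, 2))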
The collected data can take various forms (4). First, they can be either qualitative or quantitative. Qualitative data form a finite category, classification, or set, such as the set {busy, idle} or the set of weekdays; the elements can be ordered or unordered. Quantitative data are expressed numerically, for example, using integers or floating-point numbers, and can be discrete or continuous. Each kind of data can be represented in a high-level programming language and can be displayed directly as text or numbers.
These data can be organized into various data struc-
tures during data analysis, as well as during data col-
lection, and presented as tables or diagrams. Tables and
diagrams such as line charts, bar charts, pie charts,
and histograms are commonly used for all kinds of data
presentation, not just for monitoring. The goal is to make
the most important information the most obvious, and
concentrate on one theme in each table or graph; for
example, concentrate on CPU utilization over time, or
on the proportion of time various resources are used.
With the advancement of multimedia technology, moni-
tored data are now frequently animated. Visualization
helps greatly in interpreting the measured data. Moni-
tored data may also be presented using hypertext or
hypermedia, allowing details of the data to be revealed
in a step-by-step fashion.
A number of graphic charts have been developed spe-
cially for computer system performance analysis. These
include Gantt charts and Kiviat graphs (4).
Gantt charts are used for showing system resource
utilization, in particular, the relative duration of a number
of Boolean conditions, each denoting whether a resource is
busy or idle. Figure 1 is a sample Gantt chart. It shows the
utilization of three resources: CPU, I/O channel, and net-
work. The relative sizes and positions of the segments are
arranged to show the relative overlap. For example, the
CPU utilization is 60%, I/O 50%, and network 65%. The
overlap between CPU and I/O is 30%, all three are used
during 20% of the time, and the network is used alone 15%
of the time.
A Kiviat graph is a circle of unit radius in which different radial axes represent different performance metrics. Each axis represents the fraction of the total time during which the condition associated with the axis is true. The points corresponding to the values on the axes can be connected by straight-line segments, thereby defining a polygon. Figure 2 is a sample Kiviat graph. It shows the utilization of the CPU and the I/O channel. For example, the CPU utilization is 60%, I/O 50%, and their overlap 30%. Typical shapes of Kiviat graphs indicate how loaded and balanced a system is. Most often, an even number of metrics is used, and metrics for which high is good alternate in the graph with metrics for which low is good.
APPLICATIONS
From the perspective of application versus system, mon-
itoring can be classied into two categories: that required
by the user of a system and that required by the system
itself. For example, for performance monitoring, the former
concerns the utilization of resources, including evaluating
performance, controlling usage, and planning additional
resources, and the latter concerns the management of the
system itself, so as to allow the system to adapt itself
dynamically to various factors (3).
From a user's point of view, applications of monitoring can be divided into two classes: (1) testing and debugging, and (2) performance analysis and tuning. Dynamic system management is an additional class that can use techniques from both.
Testing and Debugging
Testing and debugging are aimed primarily at system cor-
rectness. Testing checks whether a system conforms to its
requirements, while debugging looks for sources of bugs.
They are two major activities of all software development.
Figure 1. A sample Gantt chart for utilization profile (utilization of CPU, I/O channel, and network, with their overlaps).
Figure 2. A sample Kiviat graph for utilization profile (CPU busy 60%, I/O busy 50%, CPU and I/O busy 30%).
Systems are becoming increasingly complex, and static methods, such as program verification, have not caught up. As a result, it is essential to look for potential problems by monitoring dynamic executions.
Testing involves monitoring system behavior closely while it runs a test suite and comparing the monitoring results with the expected results. The most general strategy for testing is bottom-up: unit test, integration test, and system test. Starting by running and monitoring the functionality of each component separately helps reduce the total amount of monitoring needed. If any difference between the monitoring results and the expected results is found, then debugging is needed.
Debugging is the process of locating, analyzing, and correcting suspected errors. Two main monitoring techniques are used: single stepping and tracing. In single-step mode, an interrupt is generated after each instruction is executed, and any data in the state can be selected and displayed; the user then issues a command to let the system take another step. In trace mode, the user selects the data to be displayed after each instruction is executed and starts the execution at a specified location. Execution continues until a specified condition on the data holds. Tracing slows down the execution of the program, so special hardware devices are needed to monitor real-time operations.
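In an interpreted setting, the flavor of trace mode can be sketched with Python's tracing hook, which invokes a user-supplied function after each executed line and lets selected data be displayed; as noted above, real-time code requires hardware support instead.

import sys

def tracer(frame, event, arg):
    if event == "line":
        # Display selected data after each executed line.
        print(f"{frame.f_code.co_name}:{frame.f_lineno}  locals={frame.f_locals}")
    return tracer

def accumulate(values):
    total = 0
    for v in values:
        total += v
    return total

sys.settrace(tracer)        # enter trace mode
accumulate([1, 2, 3])
sys.settrace(None)          # leave trace mode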
Performance Evaluation and Tuning
One of the most important applications of monitoring is performance evaluation and tuning (3,4,8,13). All engineered systems are subject to performance evaluation, and monitoring is the first and key step in this process. It is used to measure performance indices, such as turnaround time, response time, and throughput.
Monitoring results can be used for performance evaluation and tuning in at least the following six ways (4,6). First, monitoring results help identify heavily used segments of code so that their performance can be optimized; they can also lead to the discovery of inefficient data structures that cause excessive amounts of memory access. Second, monitoring can be used to measure system resource utilization and find performance bottlenecks; this is the most popular use of computer system monitoring (6). Third, monitoring results can be used to tune system performance by balancing resource utilization and favoring interactive jobs; one can repeatedly adjust system parameters and measure the results. Fourth, monitoring results can be used for workload characterization and capacity planning; the latter requires ensuring that sufficient computer resources will be available to run future workloads with satisfactory performance. Fifth, monitoring can be used to compare machine performance for selection evaluation, and monitoring of simulation tools can also be used in evaluating the design of a new system. Finally, monitoring results can be used to obtain parameters of system models and to validate models, that is, to verify the representativeness of a model; this is done by comparing measurements taken on the real system and on the model.
Dynamic System Management
For a system to manage itself dynamically, monitoring is typically performed continuously, and data are analyzed in an on-line fashion to provide dynamic feedback. Such feedback can be used for managing both the correctness and the performance of the system.
An important class of applications is dynamic safety
checking and failure detection. It is becoming increasingly
important as computers take over more complicated
and safety-critical tasks, and it has wide applications in
distributed systems, in particular. Monitoring system
state, checking whether it is in an acceptable range, and
notifying appropriate agents of any anomalies are essen-
tial for the correctness of the system. Techniques for
testing and debugging can be used for such monitoring
and checking.
Another important class of applications is dynamic task scheduling and resource allocation. It is particularly important for real-time systems and service providers, both of which are becoming increasingly widely used. For example, monitoring enables periodic review of program priorities on the basis of their CPU utilization, and analysis of page usage so that more frequently used pages can replace less frequently used pages. Methods and techniques for performance monitoring and tuning can be used for these purposes; they have low overhead and therefore allow the system to maintain a satisfactory level of performance.
MONITORING REAL-TIME, PARALLEL,
AND DISTRIBUTED SYSTEMS
In a sequential system, the execution of a process is deterministic; that is, the process generates the same output in every execution in which it is given the same input. This is not true in parallel systems. In a parallel system, the execution behavior of a parallel program in response to a fixed input is indeterminate; that is, the results may be different in different executions, depending on the race conditions present among processes and the synchronization sequences exercised by processes (1). Monitoring interference may cause the program to face different sets of race conditions and exercise different synchronization sequences. Thus instrumentation may change the behavior of the system. The converse is also true: removing instrumentation code from a monitored system may cause the system to behave differently.
Testing and debugging parallel programs are very difficult because an execution of a parallel program cannot easily be repeated, unlike that of a sequential program. One challenge in monitoring parallel programs for testing and debugging is to collect enough information with minimum interference so that the execution of the program can be repeated or replayed. The execution behavior of a parallel program is bound by the input, the race conditions, and the synchronization sequences exercised in that execution. Thus data related to the input, race conditions, and synchronization sequences need to be collected; these events are identified as process-level events (1). To eliminate the behavior change caused by removing instrumentation code, instrumentation code for process-level events may be kept in the monitored system permanently. The performance penalty can be compensated for by using faster hardware.
To monitor a parallel or distributed system, all three approaches (software, hardware, and hybrid) may be employed, and all the techniques described above are applicable. However, some issues are special to parallel, distributed, and real-time systems. These are discussed below.
To monitor single-processor systems, only one event-detection mechanism is needed, because only one event of interest may occur at a time. In a multiprocessor system, several events may occur at the same time. With hardware and hybrid monitoring, detection devices may be used for each local memory bus and for the bus to the shared memory and I/O, and the data collected can be stored in a common storage device. To monitor distributed systems, each node of the system needs to be monitored. Such a node is a single-processor or multiprocessor computer in its own right, so each node should be monitored accordingly, as if it were an independent computer.
Events generally need to be recorded with the times at which they occurred, so that the order of events can be determined and the elapsed time between events can be measured. The time can be obtained from the system being monitored. In single-processor or tightly coupled multiprocessor systems, there is only one system clock, so it is guaranteed that an event with an earlier time stamp occurred before an event with a later time stamp; in other words, events are totally ordered by their time stamps. However, in distributed systems, each node has its own clock, which may have a different reading from the clocks on other nodes. There is no guarantee that an event with an earlier time stamp occurred before an event with a later time stamp in a distributed system (1).
In distributed systems, monitoring is distributed to each node of the monitored system by attaching a monitor to each node. The monitor detects events and records the data on that node. In order to understand the behavior of the system as a whole, the global state of the monitored system at certain times needs to be constructed. To do this, the data collected at each individual node must be transferred to a central location where the global state can be built. Also, the recorded times for the events on different nodes must have a common reference to order them. There are two options for transferring data to the central location. One option is to let the monitor use the network of the monitored system; this approach can cause interference with the communication of the monitored system. To avoid such interference, an independent network for the monitor can be used, allowing it to have a different topology and transmission speed than the network of the monitored system. For the common time reference, each node has a local clock and a synchronizer, and the clock is synchronized with the clocks on the other nodes by the synchronizer.
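Once the clocks are synchronized, building a global event stream at the central location amounts to merging the per-node records by time stamp; a small sketch of that final step (the record format is hypothetical):

import heapq

# Hypothetical per-node logs: (timestamp, node, event) tuples, each already
# sorted locally by the node's synchronized clock.
node_a = [(10.002, "A", "send m1"), (10.020, "A", "recv m2")]
node_b = [(10.005, "B", "recv m1"), (10.015, "B", "send m2")]

# Merge into a single globally ordered event stream at the central collector.
for ts, node, event in heapq.merge(node_a, node_b):
    print(f"{ts:.3f}  {node}  {event}")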
The recorded event data on each node can be transmitted immediately to a central collector or temporarily stored locally and transferred later to the central location. Which method is appropriate depends on how the collected data will be used. If the data are used in an on-line fashion for dynamic display or for monitoring system safety constraints, the data should be transferred immediately. This may require a high-speed network to reduce the latency between the system state and the display of that state; if the data are transferred immediately over a high-speed network, little local storage is needed. If the data are used in an off-line fashion, they can be transferred at any time, even after the monitoring is done. In this case, each node should have mass storage to store its local data. A disadvantage of this approach is that if the amount of recorded data is not evenly distributed across nodes, too much data could be stored at one node, and building sufficiently large data storage for every node can be very expensive.
In monitoring real-time systems, a major challenge is
how to reduce the interference caused by the monitoring.
Real-time systems are those whose correctness depends not
only on the logical computation but also on the times at
which the results are generated. Real-time systems must
meet their timing constraints to avoid disastrous conse-
quences. Monitoring interference is unacceptable in most
real-time systems (1,14), since it may change not only the
logical behavior but also the timing behavior of the mon-
itored system. Software monitoring generally is unaccep-
table for real-time monitoring unless monitoring is
designed as part of the system (19). Hardware monitoring
has minimal interference to the monitored system, so it is
the best approach for monitoring real-time systems. How-
ever, it is very expensive to build, and sometimes it might
not provide the needed information. Thus hybrid monitor-
ing may be employed as a compromise.
CONCLUSION
Monitoring is an important technique for studying the
dynamic behavior of computer systems. Using collected
run-time information, users or engineers can analyze,
understand, and improve the reliability and performance
of complex systems. This article discussed basic concepts
and major issues in monitoring, techniques for event-
driven monitoring and sampling monitoring, and their
implementation in software monitors, hardware monitors,
and hybrid monitors. With the rapid growth of computing power, the use of larger and more complex computer systems has increased dramatically, which poses larger challenges to system monitoring (20-22). Possible topics for future study include:
c_1^T A c_2, where elements of the matrix A denote
the similarity between colors. In addition, VisualSEEK uses features such as the centroid and the minimum bounding box of each region. Measurements such as distances between regions and relative spatial locations are also used in the query.
VisualSEEK allows the user to sketch spatial arrangements of color regions, position them on the query grid, and assign them properties of color, size, and absolute location. The system then finds the images that contain the most similar arrangements of regions. The system automatically extracts and indexes salient color regions from the images. By using efficient indexing techniques on color, region sizes, and both absolute and relative spatial locations, a wide variety of color/spatial visual queries can be computed.
WebSEEK searches the World Wide Web for images and video by looking for file extensions such as .jpg, .mov, .gif, and .mpeg. The system extracts keywords from the URL and hyperlinks. It creates a histogram of the keywords to find the most frequent ones, which are put into classes and subclasses; an image can be put into more than one class. An image taxonomy is constructed in a semiautomatic way using text features (such as associated HTML tags, captions, and articles) and visual features (such as color, texture, and spatial layout) in the same way as they are used in VisualSEEK. The user can search for images by walking through the taxonomy. The system displays a selection of images, which the user can search by using similarity of color histograms. The histogram can be modified either manually or through relevance feedback.
BIBLIOGRAPHIC NOTES
Comprehensive surveys exist on the topic of content-based image retrieval, including surveys by Aigrain et al. (43), Rui et al. (44), Yoshitaka and Ichikawa (45), Smeulders et al. (16), and, more recently, Datta et al. (46,47). Multimedia information retrieval as a broader research area covering video, audio, image, and text analysis has been surveyed extensively by Sebe et al. (48), Snoek and Worring (32), and Lew et al. (49). Surveys also exist on closely related topics such as indexing methods for visual data by Marsicoi et al. (50), face recognition by Zhao et al. (51), applications of content-based retrieval to medicine by Müller et al. (52), and applications to cultural and historical imaging by Chen et al. (53).
Image feature extraction is of fundamental importance to content-based retrieval. Comparisons of different image measures for content-based image retrieval are given in Refs. (16,17, and 54).
Example visual database systems that support content-based retrieval include IBM QBIC (6), Virage (55), and NEC AMORE (56) in the commercial domain and MIT Photobook (57), UIUC MARS (34,58), Columbia VisualSEEK (42), and UCSB NeTra (59) in the academic domain. Practical issues such as system implementation and architecture, their limitations and how to overcome them, intuitive result visualization, and system evaluation were discussed by Smeulders et al. in Ref. (16). Many of these systems, such as AMORE, or their extensions, such as UIUC WebMARS (60) and Columbia WebSEEK (19), support image search on the World Wide Web. Webseer (18) and ImageScape (61) are systems designed for Web-based image retrieval. Kher et al. (62) provide a survey of these Web-based image retrieval systems.
Interested readers may refer to Refs. (16,17,49, and 63) for surveys and technical details of bridging the semantic gap. Important early work that introduced relevance feedback into image retrieval includes Ref. (34), which was implemented in the MARS system (58). A review of techniques for relevance feedback is given by Zhou and Huang (64).
Readers who are interested in the quadtree and related hierarchical data structures may consult the survey (65) by Samet. An encyclopedic description of high-dimensional metric data structures was published recently as a book (37). Böhm et al. (66) give a review of high-dimensional indexing techniques for multimedia data.
Relational database management systems have been studied extensively for decades and have resulted in numerous efficient optimization and implementation techniques. For concepts and techniques in relational database management, readers may consult popular database textbooks, such as Ref. (40). Common machine learning and data mining algorithms for data clustering and classification can be found in data mining textbooks, such as Ref. (67). Image feature extraction depends on image processing algorithms and techniques, which are well explained in image processing textbooks, such as Refs. (22 and 23).
BIBLIOGRAPHY
1. T. Kunii, S. Weyl, and J. M. Tenenbaum, A relational database scheme for describing complex pictures with color and texture, Proc. 2nd Internat. Joint Conf. Pattern Recognition, Lyngby-Copenhagen, Denmark, 1974, pp. 73-89.
2. S. K. ChangandK. S. Fu, (eds.), Pictorial InformationSystems,
Lecture Notes inComputer Science 80, Berlin: Springer-Verlag,
1980.
3. S. K. Chang and T. L. Kunii, Pictorial database systems, IEEE
Computer, 14 (11): 1321, 1981.
4. S. K. Chang, (ed.), Special issue on pictorial database systems,
IEEE Computer, 14 (11): 1981.
5. H. Tamura and N. Yokoya, Image database systems: a survey,
Pattern Recognition, 17 (1): 2943, 1984.
6. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang,
B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele,
and P. Yanker, Query by image and video content: the QBIC
system, IEEE Computer, 28 (9): 2332, 1995.
7. N. Roussopoulos, C. Faloutsos, and T. Sellis, An efcient
pictorial database system for PSQL, IEEE Trans. Softw.
Eng., 14 (5): 639650, 1988.
8. M. M. Zloof, Query-by-example: a database language, IBM
Syst. J., 16 (4): 324343, 1977.
9. N. S. Chang and K. S. Fu, Query-by-pictorial example, IEEE
Trans. Softw. Eng., 6 (6): 519524, 1980.
10. T. Joseph and A. F. Cardenas, PICQUERY: a high level query
language for pictorial database management, IEEE Trans.
Softw. Eng., 14 (5): 630638, 1988.
11. L. Yang and J. K. Wu, Towards a semantic image database system, Data and Knowledge Engineering, 22 (2): 207-227, 1997.
12. K. Yamaguchi and T. L. Kunii, PICCOLO logic for a picture
database computer and its implementation, IEEETrans. Com-
put., C-31 (10): 983996, 1982.
13. J. A. Orenstein and F. A. Manola, PROBE spatial data model-
ing and query processing in an image database application,
IEEE Trans. Softw. Eng., 14 (5): 611629, 1988.
14. S. K. Chang, C. W. Yan, D. C. Dimitroff, and T. Arndt, An
intelligent image database system, IEEE Trans. Softw. Eng.,
14 (5): 681688, 1988.
15. A. Brink, S. Marcus, and V. Subrahmanian, Heterogeneous
multimedia reasoning, IEEE Computer, 28 (9): 3339, 1995.
16. A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R.
Jain, Content-based image retrieval at the end of the early
years, IEEE Trans. Patt. Anal. Machine Intell., 22 (12): 1349
1380, 2000.
17. Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma, A survey of content-
based image retrieval with high-level semantics, Pattern
Recognition, 40 (1): 262282, 2007.
18. C. Frankel, M. J. Swain, and V. Athitsos, Webseer: an image
search engine for the world wide web, Technical Report TR-96-
14, Chicago, Ill.: University of Chicago, 1996.
19. J. R. Smith and S.-F. Chang, Visually searching the web for
content, IEEE Multimedia, 4 (3): 1220, 1997.
20. A. B. Benitez, M. Beigi, and S.-F. Chang, Using relevance
feedback in content-based image metasearch, IEEE Internet
Computing, 2 (4): 5969, 1998.
21. A. F. Smeaton, P. Over, and W. Kraaij, Evaluation campaigns
and TRECVid, MIR06: Proc. 8th ACM Internat. Workshop on
Multimedia Information Retrieval, Santa Barbara, California,
2006, pp. 321330.
22. J. C. Russ, The Image Processing Handbook, 5th Ed. Boca
Raton, FL: CRC Press, 2006.
23. R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd
Ed. Englewood Cliffs, NJ: Prentice Hall, 2007.
24. W.-Y Ma and B. S. Manjunath, A texture thesaurus for brows-
ing large aerial photographs, J. Am. Soc. Informa. Sci. 49 (7):
633648, 1998.
25. M. Turk and A. Pentland, Eigenfaces for recognition, J. Cognitive Neuroscience, 3 (1): 71-86, 1991.
26. J. B. Tenenbaum, V. deSilva, and J. C. Langford, A global
geometric framework for nonlinear dimensionality reduction,
Science, 290: 23192323, 2000.
27. S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduc-
tion by locally linear embedding, Science, 290: 23232326,
2000.
28. L. Yang, Alignment of overlapping locally scaled patches for
multidimensional scaling and dimensionality reduction, IEEE
Trans. Pattern Analysis Machine Intell., In press.
29. A. D. Bimbo and P. Pala, Content-based retrieval of 3Dmodels,
ACM Trans. Multimedia Computing, Communicat. Applicat.,
2 (1): 2043, 2006.
30. H. Du and H. Qin, Medial axis extraction and shape manipula-
tionof solidobjects usingparabolic PDEs, Proc. 9thACMSymp.
Solid Modeling and Applicat., Genoa, Italy, 2004, pp. 2535.
31. M. Hilaga, Y. Shinagawa, T. Kohmura, and T. L. Kunii, Topol-
ogy matching for fully automatic similarity estimation of 3D
shapes, Proc. ACM SIGGRAPH 2001, Los Angeles, CA, 2001,
pp. 203212.
32. C. G. Snoek and M. Worring, Multimodal video indexing: a
reviewof the state-of-the-art, Multimedia Tools and Applicat.,
25 (1): 535, 2005.
33. J.-K. Wu, Content-based indexing of multimedia databases,
IEEE Trans. Knowledge Data Engineering, 9 (6): 978989,
1997.
34. Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, Relevance
feedback: A power tool in interactive content-based image
retrieval, IEEE Trans. Circuits Systems Video Technol., 8
(5): 644655, 1998.
35. M. Kobayashi andK. Takeda, Informationretrieval onthe web,
ACM Comput. Surv., 32 (2): 144173, 2000.
36. Y. ChenandJ. Z. Wang, Aregion-basedfuzzy feature matching
approach to content-based image retrieval, IEEE Trans. Pat-
tern Analysis and Machine Intelligence, 24 (9): 12521267,
2002.
37. H. Samet, Foundations of Multidimensional and Metric Data
Structures, San Francisco, CA: Morgan Kaufmann, 2006.
38. J. B. MacQueen, Some methods for classication and analysis
of multivariate observations, Proc. 5th Berkeley Symp.
Mathematical Statistics and Probability, Berkeley, CA,
1967, pp. 281297.
39. T. Kohonen, Self-Organizating Maps, 3rd Ed. Berlin: Springer,
2000.
40. R. Ramakrishnan and J. Gehrke, Database Management Sys-
tems, 3rd Ed. New York: McGraw-Hill, 2002.
41. S. Chaudhuri and L. Gravano, Optimizing queries over multi-
media repositories, Proc. 1996 ACM SIGMOD Internat. Conf.
Management of Data (SIGMOD96), Montreal, Quebec,
Canada, 1996, pp. 91102.
42. J. R. Smith and S.-F. Chang, VisualSEEk: A fully automated
content-based image query system, Proc. 4th ACM
Internat. Conf. Multimedia, Boston, MA, 1996, pp. 8798.
43. P. Aigrain, H. Zhang, and D. Petkovic, Content-based repre-
sentation and retrieval of visual media: A state-of-the-art
review, Multimedia Tools and Applicat., 3 (3): 179202, 1996.
44. Y. Rui, T. S. Huang, and S.-F. Chang, Image retrieval: current
techniques, promising directions and open issues, J. Visual
Communicat. Image Represent., 10 (1): 3962, 1999.
45. A. Yoshitaka and T. Ichikawa, A survey on content-based
retrieval for multimedia databases, IEEE Trans. Knowledge
and Data Engineering, 11 (1): 8193, 1999.
46. R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image retrieval: Ideas,
inuences, and trends of the new age, ACM Comput. Surv.
In press.
47. R. Datta, J. Li, and J. Z. Wang, Content-based image retrieval:
approaches and trends of the new age, Proc. 7th ACM SIGMM
Internat. WorkshoponMultimediaInformationRetrieval (MIR
2005), 2005, pp. 253262.
48. N. Sebe, M. S. Lew, X. S. Zhou, T. S. Huang, and E. M. Bakker,
The state of the art in image and video retrieval, Proc. 2nd
Internat. Conf. Image and Video Retrieval, Lecture Notes in
Computer Science2728, Urbana-Champaign, IL, 2003. pp. 18.
49. M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, Content-based
multimedia information retrieval: State of the art and chal-
lenges, ACM Trans. Multimedia Computing, Communicat.
Applicat., 2 (1): 119, 2006.
50. M. D. Marsicoi, L. Cinque, and S. Levialdi, Indexing pictorial
documents by their content: A survey of current techniques,
Image and Vision Computing, 15 (2): 119141, 1997.
51. W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, Face
recognition: a literature survey, ACM Comput. Surv., 35 (4):
399458, 2003.
52. H. Müller, N. Michoux, D. Bandon, and A. Geissbuhler, A review of content-based image retrieval systems in medical applications: clinical benefits and future directions, Internat. J. Medical Informatics, 73 (1): 1-23, 2004.
53. C.-C. Chen, H. D. Wactlar, J. Z. Wang, and K. Kiernan, Digital
imagery for signicant cultural and historical materials - an
emerging research eld bridging people, culture, and technol-
ogies, Internat. J. Digital Libraries, 5 (4): 275286, 2005.
54. B. M. Mehtre, M. S. Kankanhalli, and W. F. Lee, Shape
measures for content based image retrieval: a comparison,
Information Processing and Management, 33 (3): 319337,
1997.
55. A. Gupta and R. Jain, Visual information retrieval, Commun.
ACM, 40 (5): 7079, 1997.
56. S. Mukherjea, K. Hirata, andY. Hara, Amore: aworldwide web
image retrieval engine, World Wide Web, 2 (3): 115132,
1999.
57. A. Pentland, R. W. Picard, and S. Sclaroff, Photobook: Tools for
content-based manipulation of image databases. Technical
Report TR-255, MIT Media Lab, 1996.
58. M. Ortega, Y. Rui, K. Chakrabarti, K. Porkaew, S. Mehrotra,
andT. S. Huang, Supporting ranked booleansimilarity queries
in MARS, IEEE Trans. Knowledge Data Engineering, 10 (6):
905925, 1998.
59. W.-Y. Ma and B. S. Manjunath, Netra: a toolbox for navigating
large image databases, Multimedia Systems, 7 (3): 184198,
1999.
60. M. Ortega-Binderberger, S. Mehrotra, K. Chakrabarti, and K.
Porkaew, WebMARS: a multimedia search engine. Technical
Report TR-DB-00-01, Informationand Computer Science, Uni-
versity of California at Irvine, 2000.
61. M. S. Lew, Next-generation web searches for visual content,
IEEE Computer, 33 (11): 4653, 2000.
62. M. L. Kher, D. Ziou, and A. Bernardi, Image retrieval from the world wide web: Issues, techniques, and systems, ACM Comput. Surv., 36 (1): 35-67, 2004.
63. N. Vasconcelos, From pixels to semantic spaces: advances in
content-based image retrieval, IEEE Computer, 40 (7): 2026,
2007.
64. X. S. Zhou and T. S. Huang, Relevance feedback in image
retrieval: a comprehensive review, Multimedia Systems, 8
(6): 536544, 2003.
65. H. Samet, The quadtree and related hierarchical data struc-
tures, ACM Comput. Surv., 16 (2): 187260, 1984.
66. C. Böhm, S. Berchtold, and D. A. Keim, Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases, ACM Comput. Surv., 33 (3): 322-373, 2001.
67. J. Han and M. Kamber, Data Mining: Concepts and Techni-
ques, 2nd Ed. San Francisco, CA: Morgan Kaufmann,
2005.
LI YANG
Western Michigan University
Kalamazoo, Michigan
TOSIYASU L. KUNII
Kanazawa Institute of
Technology
Tokyo, Japan
Foundation and
Theory
ALGEBRAIC GEOMETRY
INTRODUCTION
Algebraic geometry is the mathematical study of geometric objects by means of algebra. Its origins go back to the coordinate geometry introduced by Descartes. A classic example is the circle of radius 1 in the plane, which is the geometric object defined by the algebraic equation x^2 + y^2 = 1. This generalizes to the idea of a system of polynomial equations in many variables. The solution sets of systems of equations are called varieties and are the geometric objects to be studied, whereas the equations and their consequences are the algebraic objects of interest.
In the twentieth century, algebraic geometry became much more abstract, with the emergence of commutative algebra (rings, ideals, and modules) and homological algebra (functors, sheaves, and cohomology) as the foundational language of the subject. This abstract trend culminated in Grothendieck's scheme theory, which includes not only varieties but also large parts of algebraic number theory. The result is a subject of great power and scope: Wiles' proof of Fermat's Last Theorem makes essential use of schemes and their properties. At the same time, this abstraction made it difficult for beginners to learn algebraic geometry. Classic introductions include Refs. 1 and 2, both of which require a considerable mathematical background.
As the abstract theory of algebraic geometry was being developed in the middle of the twentieth century, a parallel development was taking place concerning the algorithmic aspects of the subject. Buchberger's theory of Gröbner bases showed how to manipulate systems of equations systematically, so that (for example) one can determine algorithmically whether two systems of equations have the same solutions over the complex numbers. Applications of Gröbner bases are described in Buchberger's classic paper (3) and now include areas such as computer graphics, computer vision, geometric modeling, geometric theorem proving, optimization, control theory, communications, statistics, biology, robotics, coding theory, and cryptography.
Gröbner basis algorithms, combined with the emergence of powerful computers and the development of computer algebra (see SYMBOLIC COMPUTATION), have led to different approaches to algebraic geometry. There are now several accessible introductions to the subject, including Refs. 4-6.
In practice, most algebraic geometry is done over a field, and the most commonly used fields are as follows:

- The rational numbers Q, used in symbolic computation.
- The real numbers R, used in geometric applications.
- The complex numbers C, used in many theoretical situations.
- The finite field F_q with q = p^m elements (p prime), used in cryptography and coding theory.

In what follows, k will denote a field, which for concreteness can be taken to be one of the above. We now explore the two main flavors of algebraic geometry: affine and projective.
AFFINE ALGEBRAIC GEOMETRY
Given a field k, we have n-dimensional affine space k^n, which consists of all n-tuples of elements of k. In some books, k^n is denoted A^n(k). The corresponding algebraic object is the polynomial ring k[x_1, ..., x_n] consisting of all polynomials in the variables x_1, ..., x_n with coefficients in k. By polynomial, we mean a finite sum of terms, each of which is an element of k multiplied by a monomial

x_1^{a_1} x_2^{a_2} ... x_n^{a_n},

where a_1, ..., a_n are non-negative integers. Polynomials can be added and multiplied, and these operations are commutative, associative, and distributive. This is why k[x_1, ..., x_n] is called a commutative ring.
Given polynomials f_1, ..., f_s in k[x_1, ..., x_n], the affine variety V(f_1, ..., f_s) consists of all points (u_1, ..., u_n) in k^n that satisfy the system of equations

f_1(u_1, ..., u_n) = ... = f_s(u_1, ..., u_n) = 0.

Some books (such as Ref. 1) call V(f_1, ..., f_s) an affine algebraic set.
The algebraic object corresponding to an affine variety is called an ideal. Ideals arise naturally from a system of equations f_1 = ... = f_s = 0 as follows. Multiply the first equation by a polynomial h_1, the second by h_2, and so on, and add. This gives the equation

h = h_1 f_1 + ... + h_s f_s = 0,

which is called a polynomial consequence of the original system. Note that h(u_1, ..., u_n) = 0 for every (u_1, ..., u_n) in V(f_1, ..., f_s). The ideal ⟨f_1, ..., f_s⟩ consists of all polynomial consequences of the system f_1 = ... = f_s = 0. Thus, elements of ⟨f_1, ..., f_s⟩ are linear combinations of f_1, ..., f_s, where the coefficients are allowed to be arbitrary polynomials.
A general definition of ideal applies to any commutative ring. The Hilbert Basis Theorem states that all ideals in a polynomial ring are of the form ⟨f_1, ..., f_s⟩. We say that f_1, ..., f_s is a basis of ⟨f_1, ..., f_s⟩ and that ⟨f_1, ..., f_s⟩ is generated by f_1, ..., f_s. This notion of basis differs from how the term is used in linear algebra because linear independence fails. For example, x, y is a basis of the ideal ⟨x, y⟩ in k[x, y], even though y · x + (-x) · y = 0.
A key result is that V(f_1, ..., f_s) = V(g_1, ..., g_t) whenever ⟨f_1, ..., f_s⟩ = ⟨g_1, ..., g_t⟩. This is useful in practice because switching to a different basis may make it easier to understand the solutions of the equations. From the
theoretical point of view, this shows that an affine variety depends only on the ideal I generated by the defining equations, so that the affine variety can be denoted V(I). Thus, every ideal gives an affine variety.
We can also reverse this process. Given an affine variety V, let I(V) consist of all polynomials that vanish on all points of V. This satisfies the abstract definition of ideal. Thus, every affine variety gives an ideal, and one can show that we always have

V = V(I(V)).

However, the reverse equality may fail. In other words, there are ideals I such that

I ≠ I(V(I)).

An easy example is provided by I = ⟨x^2⟩ in k[x], because I(V(I)) = ⟨x⟩ ≠ I. Hence, the correspondence between ideals and affine varieties is not a perfect match. Over the complex numbers, we will see below that there is nevertheless a nice relation between I and I(V(I)).
One can prove that the union and intersection of affine varieties are again affine varieties. In fact, given ideals I and J, one has

V(I) ∪ V(J) = V(I ∩ J)
V(I) ∩ V(J) = V(I + J),

where

I ∩ J = {g | g is in both I and J}
I + J = {g + h | g is in I and h is in J}

are the intersection and sum of I and J (note that I ∩ J and I + J are analogous to the intersection and sum of subspaces of a vector space). In this way, algebraic operations on ideals correspond to geometric operations on varieties. This is part of the ideal-variety correspondence explained in Chapter 4 of Ref. 4.
Sometimes an affine variety can be written as a union of strictly smaller affine varieties. For example,

V((x - y)(x^2 + y^2 - 1)) = V(x - y) ∪ V(x^2 + y^2 - 1)

expresses the affine variety V((x - y)(x^2 + y^2 - 1)) as the union of the line y = x and the unit circle (Fig. 1).
In general, an affine variety is irreducible if it has no such decomposition. In books such as Ref. 1, varieties are always assumed to be irreducible.
One can show that every affine variety V can be written as

V = V_1 ∪ ... ∪ V_m,

where each V_i is irreducible and no V_i is contained in V_j for j ≠ i. We say that the V_i are the irreducible components of V. Thus, irreducible varieties are the building blocks out of which all varieties are built. Algebraically, the above decomposition means that the ideal of V can be written as

I(V) = P_1 ∩ ... ∩ P_m,

where each P_i is prime (meaning that if a product ab lies in P_i, then so does a or b) and no P_i contains P_j for j ≠ i. This again illustrates the close connection between the algebra and the geometry. (For arbitrary ideals, things are a bit more complicated: the above intersection of prime ideals has to be replaced with what is called a primary decomposition; see Chapter 4 of Ref. 4.)
Every variety has a dimension. Over the real numbers R, this corresponds to our geometric intuition. But over the complex numbers C, one needs to be careful. The affine space C^2 has dimension 2, even though it looks four-dimensional from the real point of view. The dimension of a variety is the maximum of the dimensions of its irreducible components, and irreducible affine varieties of dimensions 1, 2, and 3 are called curves, surfaces, and 3-folds, respectively. An affine variety in k^n is called a hypersurface if every irreducible component has dimension n - 1.
PROJECTIVE ALGEBRAIC GEOMETRY
One problem with affine varieties is that intersections sometimes occur at infinity. An example is given by the intersection of a hyperbola with one of its asymptotes in Fig. 2. (Note that a line has a single point at infinity.) Points at infinity occur naturally in computer graphics, where the horizon in a perspective drawing is the line at infinity where parallel lines meet. Adding points at infinity to affine space leads to the concept of projective space.
Figure 1. Union of a line and a circle.
Figure 2. A hyperbola and one of its asymptotes meet at the same point at infinity.
GRÖBNER BASES
Buchberger introduced Gröbner bases in 1965 in order to do algorithmic computations on ideals in polynomial rings. For example, suppose we are given polynomials f, f_1, ..., f_s in k[x_1, ..., x_n], where k is a field whose elements can be represented exactly on a computer (e.g., k is a finite field or the field of rational numbers). From the point of view of pure mathematics, either f lies in the ideal ⟨f_1, ..., f_s⟩ or it does not. But from a practical point of view, one wants an algorithm for deciding which of these two possibilities actually occurs. This is the ideal membership question.
In the special case of two univariate polynomials f, g in k[x], f lies in ⟨g⟩ if and only if f is a multiple of g, which we can decide by the division algorithm from high-school algebra. Namely, dividing g into f gives f = qg + r, where the remainder r has degree strictly smaller than the degree of g. Then f is a multiple of g if and only if the remainder is zero. This solves the ideal membership question in our special case.
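For instance, the division can be carried out with the SymPy computer algebra system; here f = x^3 - x is a multiple of g = x^2 - 1, so the remainder is zero and f lies in the ideal ⟨g⟩ (the polynomials are chosen only for illustration):

from sympy import symbols, div

x = symbols("x")
f = x**3 - x
g = x**2 - 1

q, r = div(f, g, x)      # f = q*g + r with deg(r) < deg(g)
print(q, r)              # x, 0  -> remainder 0, so f lies in <g>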
To adapt this strategy to k[x_1, ..., x_n], we first need to order the monomials. In k[x], this is obvious: the monomials are 1, x, x^2, etc. But there are many ways to do this when there are two or more variables. A monomial order > is an order relation on monomials U, V, W, ... in k[x_1, ..., x_n] with the following properties:

1. Given monomials U and V, exactly one of U > V, U = V, or U < V is true.
2. If U > V, then UW > VW for all monomials W.
3. If U ≠ 1, then U > 1; i.e., 1 is the least monomial with respect to >.

These properties imply that > is a well ordering, meaning that any strictly decreasing sequence with respect to > is finite. This is used to prove termination of various algorithms. An example of a monomial order is lexicographic order, where
x_1^{a_1} x_2^{a_2} ... x_n^{a_n} > x_1^{b_1} x_2^{b_2} ... x_n^{b_n}

provided

a_1 > b_1; or a_1 = b_1 and a_2 > b_2; or a_1 = b_1, a_2 = b_2, and a_3 > b_3; etc.
Other important monomial orders are graded lexico-
graphic order and graded reverse lexicographic order.
These are described in Chapter 2 of Ref. 4.
Now fix a monomial order >. Given a nonzero polynomial f, we let lt(f) denote the leading term of f, namely the nonzero term of f whose monomial is maximal with respect to > (in the literature, lt(f) is sometimes called the initial term of f, denoted in(f)). Given f_1, ..., f_s, the division algorithm produces polynomials q_1, ..., q_s and r such that

f = q_1 f_1 + ... + q_s f_s + r,

where every nonzero term of r is divisible by none of lt(f_1), ..., lt(f_s). The remainder r is sometimes called the normal form of f with respect to f_1, ..., f_s. When s = 1 and f and f_1 are univariate, this reduces to the high-school division algorithm mentioned earlier.
In general, multivariate division behaves poorly. To correct this, Buchberger introduced a special kind of basis of an ideal. Given an ideal I and a monomial order, its ideal of leading terms lt(I) (or initial ideal in(I)) is the ideal generated by lt(f) for all f in I. Then elements g_1, ..., g_t of I form a Gröbner basis of I provided that lt(g_1), ..., lt(g_t) form a basis of lt(I). Buchberger showed that a Gröbner basis is in fact a basis of I and that, given generators f_1, ..., f_s of I, there is an algorithm (the Buchberger algorithm) for producing the corresponding Gröbner basis. A description of this algorithm can be found in Chapter 2 of Ref. 4.
The complexity of the Buchberger algorithm has been studied extensively. Examples are known where the input polynomials have degree at most d, yet the corresponding Gröbner basis contains polynomials of degree 2^(2^d). Theoretical results show that this doubly exponential behavior is the worst that can occur (for precise references, see Chapter 2 of Ref. 4). However, there are many geometric situations where the complexity is less. For example, if the equations have only finitely many solutions over C, then the complexity drops to a single exponential. Furthermore, obtaining geometric information about an ideal, such as the dimension of its associated variety, often has single exponential complexity. When using graded reverse lexicographic order, complexity is related to the regularity of the ideal. This is discussed in Ref. 7. Below we will say more about the practical aspects of Gröbner basis computations.
Using the properties of Gröbner bases, one gets the following ideal membership algorithm: given f, f_1, ..., f_s, use the Buchberger algorithm to compute a Gröbner basis g_1, ..., g_t of ⟨f_1, ..., f_s⟩, and use the division algorithm to compute the remainder of f on division by g_1, ..., g_t. Then f is in the ideal ⟨f_1, ..., f_s⟩ if and only if the remainder is zero.
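This algorithm is available in general-purpose computer algebra systems. For example, the following SymPy sketch computes a Gröbner basis and reduces a polynomial against it (the polynomials are chosen only for illustration; here f = 2y^2 - 1 does lie in the ideal):

from sympy import symbols, groebner, reduced

x, y = symbols("x y")
f1 = x**2 + y**2 - 1
f2 = x - y

G = groebner([f1, f2], x, y, order="lex")   # Buchberger-style computation
print(G.exprs)                              # a Groebner basis of <f1, f2>

# Ideal membership: f is in <f1, f2> iff its remainder on division by G is 0.
f = 2*y**2 - 1
quotients, r = reduced(f, G.exprs, x, y, order="lex")
print(r)    # 0, so f belongs to the ideal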
Another important use of Gröbner bases occurs in elimination theory. For example, in geometric modeling, one encounters surfaces in R^3 parametrized by polynomials, say

x = f(s, t), y = g(s, t), z = h(s, t).

To obtain the equation of the surface, we need to eliminate s, t from the above equations. We do this by considering the ideal

⟨x - f(s, t), y - g(s, t), z - h(s, t)⟩

in the polynomial ring R[s, t, x, y, z] and computing a Gröbner basis for this ideal using lexicographic order, where the variables to be eliminated are listed first. The Elimination Theorem (see Chapter 3 of Ref. 4) implies that the equation of the surface is the only polynomial in the Gröbner basis not involving s, t. In practice, elimination is often done by other methods (such as resultants) because of complexity issues. See also the entry on SURFACE MODELING.
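A small SymPy computation along these lines, with a deliberately simple parametrization chosen for illustration: eliminating s and t from x = s + t, y = s - t, z = s^2 + t^2 leaves the surface equation x^2 + y^2 - 2z = 0 among the basis elements free of s and t.

from sympy import symbols, groebner

s, t, x, y, z = symbols("s t x y z")

ideal = [x - (s + t), y - (s - t), z - (s**2 + t**2)]

# Lexicographic order with the variables to eliminate (s, t) listed first.
G = groebner(ideal, s, t, x, y, z, order="lex")

# The basis polynomials free of s and t describe the surface itself.
surface = [g for g in G.exprs if not g.has(s) and not g.has(t)]
print(surface)    # expect a multiple of x**2 + y**2 - 2*z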
Our final application concerns a system of equations f_1 = ... = f_s = 0 in n variables over C. Let I = ⟨f_1, ..., f_s⟩, and compute a Gröbner basis of I with respect to any monomial order. The Finiteness Theorem asserts that the following are equivalent:

1. The equations have finitely many solutions in C^n.
2. The Gröbner basis contains elements whose leading terms are pure powers of the variables (i.e., x_1 to a power, x_2 to a power, etc.) up to constants.
3. The quotient ring C[x_1, ..., x_n]/I is a finite-dimensional vector space over C.

The equivalence of the first two items gives an algorithm for determining whether there are finitely many solutions over C. From here, one can find the solutions by several methods, including eigenvalue methods and homotopy continuation. These and other methods are discussed in Ref. 8.
The software PHCpack (9) is a freely available implementation of homotopy continuation. Using homotopy techniques, systems with 10^5 solutions have been solved. Without homotopy methods but using a robust implementation of the Buchberger algorithm, systems with 1000 solutions have been solved, and in the context of computational biology, some highly structured systems with over 1000 equations have been solved.
However, although solving systems is an important practical application of Gröbner basis methods, we want to emphasize that many theoretical objects in algebraic geometry, such as Hilbert polynomials, free resolutions (see below), and sheaf cohomology (also discussed below), can also be computed by these methods. As more and more of these theoretical objects find applications, the ability to compute them is becoming increasingly important.
Gröbner basis algorithms have been implemented in computer algebra systems such as Maple (10) and Mathematica (11). For example, the solve command in Maple and the Solve command in Mathematica make use of Gröbner basis computations. We should also mention CoCoA (12), Macaulay 2 (13), and Singular (14), which are freely available on the Internet. These powerful programs are used by researchers in algebraic geometry and commutative algebra for a wide variety of experimental and theoretical computations. With the help of books such as Ref. 5 for Macaulay 2, Ref. 15 for CoCoA, and Ref. 16 for Singular, these programs can be used by beginners. The program Magma (17) is not free but has a powerful implementation of the Buchberger algorithm.
MODULES
Besides rings, ideals, and quotient rings, another important algebraic structure to consider is the concept of a module over a ring. Let R denote the polynomial ring k[x_0, ..., x_n]. Then saying that M is an R-module means that M has addition and scalar multiplication with the usual properties, except that the scalars are now elements of R. For example, the free R-module R^m consists of m-tuples of elements of R. We can clearly add two such m-tuples and multiply an m-tuple by an element of R.
A more interesting example of an R-module is given by an ideal I = ⟨f_1, ..., f_s⟩ in R. If we choose the generating set f_1, ..., f_s to be as small as possible, we get a minimal basis of I. But when s ≥ 2, f_1, ..., f_s cannot be linearly independent over R, because f_j f_i + (−f_i) f_j = 0 when i ≠ j. To see how badly the f_i fail to be independent, consider

R^s → I → 0,

where the first arrow is defined using the dot product with (f_1, ..., f_s) and the second arrow is a standard way of saying the first arrow is onto, which is true because I = ⟨f_1, ..., f_s⟩. The kernel or nullspace of the first arrow measures the failure of the f_i to be independent. This kernel is an R-module and is called the syzygy module of f_1, ..., f_s, denoted Syz(f_1, ..., f_s).
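As a small illustration (a standard example, not drawn from the text), take I = ⟨x, y⟩ in R = k[x, y], with generators f_1 = x and f_2 = y; in LaTeX notation:

```latex
\[
  f_2 \cdot f_1 + (-f_1) \cdot f_2 \;=\; y\,x - x\,y \;=\; 0,
  \qquad\text{so}\quad (y, -x) \in \operatorname{Syz}(x, y).
\]
% In fact every syzygy of (x, y) is an R-multiple of (y, -x), so
% Syz(x, y) = R*(y, -x) is a free R-module of rank 1.
```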
The Hilbert Basis Theorem applies here, so that there are finitely many syzygies h_1, ..., h_ℓ in Syz(f_1, ..., f_s) such that every syzygy is a linear combination, with coefficients in R, of h_1, ..., h_ℓ. Each h_i is a vector of polynomials; if we assemble these into a matrix, then matrix multiplication gives a map

R^ℓ → R^s

whose image is Syz(f_1, ..., f_s). This looks like linear algebra, except that we are working over a ring instead of a field. If we think of the variables in R = k[x_1, ..., x_n] as parameters, then we are doing linear algebra with parameters.
The generating syzygies h_i may fail to be independent, so that the above map may have a nonzero kernel. Hence we can iterate this process, although the Hilbert Syzygy Theorem implies that the kernel is eventually zero. The result is a collection of maps

0 → R^t → ··· → R^ℓ → R^s → I → 0,

where at each stage, the image of one map equals the kernel of the next. We say that this is a free resolution of I. By adapting Gröbner basis methods to modules, one obtains algorithms for computing free resolutions.

Furthermore, when I is a homogeneous ideal, the whole resolution inherits a graded structure that makes it straightforward to compute the Hilbert polynomial of I. Given what we know about Hilbert polynomials, this gives an algorithm for determining the dimension and degree of a projective variety. A discussion of modules and free resolutions can be found in Ref. 18.
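Continuing the example of I = ⟨x, y⟩ in R = k[x, y] (again a standard illustration rather than something from the text), the iteration stops after one step and yields the free resolution

```latex
\[
  0 \longrightarrow R
    \xrightarrow{\;\begin{pmatrix} y \\ -x \end{pmatrix}\;} R^{2}
    \xrightarrow{\;(x \;\; y)\;} I
    \longrightarrow 0 .
\]
% The map R^2 -> I sends (a, b) to ax + by; its kernel is Syz(x, y),
% which is exactly the image of the first map.  The first map is
% injective, so the resolution terminates, as the Hilbert Syzygy
% Theorem guarantees it eventually must.
```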
Although syzygies may seem abstract, there are situa-
tions in geometric modeling where syzygies arise naturally
as moving curves and moving surfaces (see Ref. 19). This
and other applications show that algebra needs to be added
to the list of topics that fall under the rubric of applied
mathematics.
LOCAL PROPERTIES
In projective space P^n(k), let U_i denote the Zariski open subset where x_i ≠ 0. Earlier we noted that U_0 looks like the affine space k^n; the same is true for the other U_i. This means that P^n(k) locally looks like affine space. Furthermore, if V is a projective variety in P^n(k), then one can show that V_i = V ∩ U_i is an affine variety for all i. Thus, every projective variety locally looks like an affine variety.
In algebraic geometry, one can get even more local.
For example, let p = [u_0, ..., u_n] be a point of P^n(k). Then let O_p consist of all rational functions on P^n(k) defined at p. Then O_p is clearly a ring, and the subset consisting of those functions that vanish at p is a maximal ideal. More surprising is the fact that this is the unique maximal ideal of O_p. We call O_p the local ring of P^n(k) at p, and in general, a commutative ring with a unique maximal ideal is called a local ring. In a similar way, a point p of an affine or projective variety V has a local ring O_{V,p}.
Many important properties of a variety at a point are reflected in its local ring. As an example, we give the definition of multiplicity that occurs in Bezout's Theorem. Recall the statement: Distinct irreducible curves in P^2(C) of degrees m and n intersect at mn points, counted with multiplicity. By picking suitable coordinates, we can assume that the points of intersection lie in C^2 and that the curves are defined by equations f = 0 and g = 0 of degrees m and n, respectively. If p is a point of intersection, then its multiplicity is given by

mult(p) = dim_C O_p/⟨f, g⟩,   O_p = local ring of P^2(C) at p,

and the precise version of Bezout's Theorem states that

mn = Σ_{f(p)=g(p)=0} mult(p).

A related notion of multiplicity is the Hilbert–Samuel multiplicity of an ideal in O_p, which arises in geometric modeling when considering the influence of a basepoint on the degree of the defining equation of a parametrized surface.
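As a concrete illustration (a standard example, not taken from the text), let the two curves be the line f = y and the parabola g = y − x², which meet only at the origin p = (0, 0):

```latex
\[
  \operatorname{mult}(p)
    = \dim_{\mathbb{C}} \mathcal{O}_p / \langle y,\; y - x^2 \rangle
    = \dim_{\mathbb{C}} \mathcal{O}_p / \langle y,\; x^2 \rangle
    = 2 ,
\]
% a basis of the quotient is given by the classes of 1 and x.  The
% line is tangent to the parabola at p, and this tangency is what
% pushes the multiplicity above 1; Bezout's count mn = 1*2 = 2 is
% accounted for entirely by this single point.
```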
SMOOTH AND SINGULAR POINTS
In multivariable calculus, the gradient ∇f = (∂f/∂x) i + (∂f/∂y) j is perpendicular to the level curve defined by f(x, y) = 0. When one analyzes this carefully, one is led to the following concepts for a point on the level curve:

• A smooth point, where ∇f is nonzero and can be used to define the tangent line to the level curve.

• A singular point, where ∇f is zero and the level curve has no tangent line at the point.
These concepts generalize to arbitrary varieties. For any variety, most points are smooth, whereas the others (those in the singular locus) are singular. Singularities can be important. For example, when one uses a variety to describe the possible states of a robot arm, the singularities of the variety often correspond to positions where the motion of the arm is less predictable (see Chapter 6 of Ref. 4 and the entry on ROBOTICS).
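A minimal sketch of how the singular locus of a plane curve is found in practice, using SymPy on a hypothetical example (the nodal cubic y² = x³ + x², not a curve from the text):

```python
from sympy import symbols, diff, solve

x, y = symbols('x y')

f = y**2 - x**3 - x**2   # the nodal cubic

# Singular points: where f and both partial derivatives vanish.
print(solve([f, diff(f, x), diff(f, y)], [x, y], dict=True))
# [{x: 0, y: 0}] -- the node at the origin is the only singular point
```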
A variety is smooth or nonsingular when every point is smooth. When a variety has singular points, one can use blowing up to obtain a new variety that is less singular. When working over an algebraically closed field of characteristic 0 (meaning fields that contain a copy of Q), Hironaka proved in 1964 that one can always find a sequence of blowing ups that results in a smooth variety. This is called resolution of singularities. Resolution of singularities over a field of characteristic p (fields that contain a copy of F_p) is still an open question. Reference 20 gives a nice introduction to resolution of singularities. More recently, various groups of people have figured out how to do this algorithmically, and work has been done on implementing these algorithms, for example, the software desing described in Ref. 21. We also note that singularities can be detected numerically using condition numbers (see Ref. 22).
SHEAVES AND COHOMOLOGY
For an affine variety, modules over its coordinate ring play an important role. For a projective variety V, the corresponding objects are sheaves of O_V-modules, where O_V is the structure sheaf of V. Locally, V looks like an affine variety, and with a suitable hypothesis called quasi-coherence, a sheaf of O_V-modules locally looks like a module over the coordinate ring of an affine piece of V.
From sheaves, one is led to the idea of sheaf cohomology, which (roughly speaking) measures how the local pieces of the sheaf fit together. Given a sheaf F on V, the sheaf cohomology groups are denoted H^i(V, F). We will see below that the sheaf cohomology groups are used in the classification of projective varieties. For another application of sheaf cohomology, consider a finite collection V of points in P^n(k). From the sheaf point of view, V is defined by an ideal sheaf I_V. In interpolation theory, one wants to model arbitrary functions on V using polynomials of a fixed degree, say m. If m is too small, this may not be possible, but we always succeed if m is large enough. A precise description of which degrees m work is given by sheaf cohomology. The ideal sheaf I_V has a twist denoted I_V(m). Then all functions on V come from polynomials of degree m if and only if H^1(P^n(k), I_V(m)) = 0. We also note that vanishing theorems for sheaf cohomology have been used in geometric modeling (see Ref. 23).

References 1, 2, and 24 discuss sheaves and sheaf cohomology. Sheaf cohomology is part of homological algebra. An introduction to homological algebra, including sheaves and cohomology, is given in Ref. 5.
SPECIAL VARIETIES
We next discuss some special types of varieties that have
been studied extensively.
1. Elliptic Curves and Abelian Varieties. Beginning
with the middle of the eighteenth century, elliptic
integrals have attracted a lot of attention. The study
of these integrals led to both elliptic functions and
elliptic curves. The latter are often described by an
equation of the form
y
2
= ax
3
bx
2
cx d;
where ax
3
bx
2
cx d is a cubic polynomial with
distinct roots. However, to get the best properties,
one needs to work in the projective plane, where the
above equation is replaced with the homogeneous
equation
y
2
z = ax
3
bx
2
z cxz
2
dz
3
:
The resulting projective curve E has an extra struc-
ture: Given two points on E, the line connecting them
intersects E at a third point by Bezouts Theorem.
This leads to agroupstructure onEwhere the point at
innity is the identity element.
Over the eld of rational numbers Q, elliptic curves
have a remarkably rich theory. The group structure is
related to the BirchSwinnerton-Dyer Conjecture, and
Wiless proof of Fermats Last Theoremwas a corollary
of his solution of a large part of the Taniyama
Shimura Conjecture for elliptic curves over Q. On
the other hand, elliptic curves over nite elds are
used in cryptography (see Ref. 25). The relation
between elliptic integrals and elliptic curves has
been generalized to Hodge theory, which is described
in Ref. 24. Higher dimensional analogs of elliptic
curves are called abelian varieties.
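For curves written in the short Weierstrass form y² = x³ + ax + b (the form used in the classification discussion below), the chord construction just described can be made completely explicit. The following are the standard addition formulas, not taken from the text, for two points P = (x_1, y_1) and Q = (x_2, y_2) with x_1 ≠ x_2:

```latex
\[
  \lambda = \frac{y_2 - y_1}{x_2 - x_1}, \qquad
  x_3 = \lambda^2 - x_1 - x_2, \qquad
  y_3 = \lambda (x_1 - x_3) - y_1 .
\]
% P + Q = (x_3, y_3): the line through P and Q meets the curve in the
% third point (x_3, -y_3), and reflecting it in the x-axis gives the
% sum.  The point at infinity serves as the identity element.
```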
2. Grassmannians and Schubert Varieties. In P^n(k), we use homogeneous coordinates [u_0, ..., u_n], where [u_0, ..., u_n] = [v_0, ..., v_n] if both lie on the same line through the origin in k^{n+1}. Hence points of P^n(k) correspond to one-dimensional subspaces of k^{n+1}. More generally, the Grassmannian G(N, m)(k) consists of all m-dimensional subspaces of k^N. Thus, G(n + 1, 1)(k) = P^n(k).

Points of G(N, m)(k) have natural coordinates, which we describe for m = 2. Given a two-dimensional subspace W of k^N, consider a 2 × N matrix

( u_1  u_2  ...  u_N )
( v_1  v_2  ...  v_N )

whose rows give a basis of W. Let p_{ij}, i < j, be the determinant of the 2 × 2 matrix formed by the ith and jth columns. The M = (N choose 2) numbers p_{ij} are the Plücker coordinates of W. These give a point in P^{M−1}(k) that depends only on W and not on the chosen basis. Furthermore, the subspace W can be reconstructed from its Plücker coordinates. The Plücker coordinates satisfy the Plücker relations

p_{ij} p_{kl} − p_{ik} p_{jl} + p_{il} p_{jk} = 0,

and any set of numbers satisfying these relations comes from a subspace W. It follows that the Plücker relations define G(N, 2)(k) as a projective variety in P^{M−1}(k). In general, G(N, m)(k) is a smooth projective variety of dimension m(N − m).
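To make the coordinates concrete, here is a small Python sketch (with a hypothetical choice of basis vectors, not from the text) that computes the six Plücker coordinates of a 2-dimensional subspace of k^4 and checks the single Plücker relation that defines G(4, 2) in P^5:

```python
from itertools import combinations

# A hypothetical basis of a 2-dimensional subspace W of k^4
# (k = Q, N = 4, so M = 6 Plucker coordinates).
u = [1, 2, 3, 4]
v = [0, 1, 1, 2]

# p_ij = 2 x 2 determinant of columns i and j of the matrix (u; v)
p = {(i, j): u[i]*v[j] - u[j]*v[i] for i, j in combinations(range(4), 2)}
print(p)

# The single Plucker relation for G(4, 2):
print(p[(0, 1)]*p[(2, 3)] - p[(0, 2)]*p[(1, 3)] + p[(0, 3)]*p[(1, 2)])  # 0
```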
The Grassmannian G(N, m)(k) contains interesting varieties called Schubert varieties. The Schubert calculus describes how these varieties intersect. Using the Schubert calculus, one can answer questions such as: How many lines in P^3(k) intersect four lines in general position? (The answer is two.) An introduction to Grassmannians and Schubert varieties can be found in Ref. 26.
The question about lines in P^3(k) is part of enumerative algebraic geometry, which counts the number of geometrically interesting objects of various types. Bezout's Theorem is another result of enumerative algebraic geometry. Another famous enumerative result states that a smooth cubic surface in P^3(C) contains exactly 27 lines.
3. Rational and Unirational Varieties. An irreducible variety V of dimension n over C is rational if there is a one-to-one rational parametrization U → V, where U is a Zariski open subset of C^n. The simplest example of a rational variety is P^n(C). Many curves and surfaces that occur in geometric modeling are rational.

More generally, an irreducible variety of dimension n is unirational if there is a rational parametrization U → V whose image fills up most of V, where U is a Zariski open subset of C^m, m ≥ n. For varieties of dimension 1 and 2, unirational and rational coincide,
but in dimensions 3 and greater, they differ. For example, a smooth cubic hypersurface in P^4(C) is unirational but not rational.
A special type of rational variety is a toric variety. In algebraic geometry, a torus is (C^*)^n, which is the Zariski open subset of C^n where all coordinates are nonzero. A toric variety V is an n-dimensional irreducible variety that contains a copy of (C^*)^n as a Zariski open subset in a suitably nice manner. Both C^n and P^n(C) are toric varieties. There are strong relations between toric varieties and polytopes, and toric varieties also have interesting applications in geometric modeling (see Ref. 27), algebraic statistics, and computational biology (see Ref. 28). The latter includes significant applications of Gröbner bases.
4. Varieties over Finite Fields. A set of equations defining a projective variety V over F_p also defines V as a projective variety over F_{p^m} for every m ≥ 1. As P^n(F_{p^m}) is finite, we let N_m denote the number of points of V when regarded as lying in P^n(F_{p^m}). To study the asymptotic behavior of N_m as m gets large, it is convenient to assemble the N_m into the zeta function

Z(V, t) = exp( Σ_{m=1}^{∞} N_m t^m / m ).

The behavior of Z(V, t) is the subject of some deep theorems in algebraic geometry, including the Riemann hypothesis for smooth projective varieties over finite fields, proved by Deligne in 1974.

Suppose for example that V is a smooth curve. The genus g of V is defined to be the dimension of the sheaf cohomology group H^1(V, O_V). Then the Riemann hypothesis implies that

|N_m − p^m − 1| ≤ 2g p^{m/2}.

Zeta functions, the Riemann hypothesis, and other tools of algebraic geometry such as the Riemann–Roch Theorem have interesting applications in algebraic coding theory. See Ref. 29 and the entry on ALGEBRAIC CODING THEORY. References 18 and 30 discuss aspects of coding theory that involve Gröbner bases.
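For example (a standard computation, not part of the text), take V = P^1 over F_p. Then N_m = p^m + 1, and the series sums in closed form:

```latex
\[
  Z(\mathbb{P}^1, t)
    = \exp\Bigl(\sum_{m \ge 1} (p^m + 1)\,\frac{t^m}{m}\Bigr)
    = \frac{1}{(1 - t)(1 - pt)} ,
\]
% consistent with the Riemann hypothesis bound, since the projective
% line has genus g = 0 and |N_m - p^m - 1| = 0 for every m.
```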
CLASSIFICATION QUESTIONS
One of the enduring questions in algebraic geometry concerns the classification of geometric objects of various types. Here is a brief list of some classification questions that have been studied.
1. Curves. For simplicity, we work over C. The main invariant of a smooth projective curve is its genus g, defined above as the dimension of H^1(V, O_V). When the genus is 0, the curve is P^1(C), and when the genus is 1, the curve is an elliptic curve E. After a coordinate change, the affine equation can be written as

y^2 = x^3 + ax + b,   4a^3 + 27b^2 ≠ 0.

The j-invariant j(E) is defined to be

j(E) = 2^8 · 3^3 · a^3 / (4a^3 + 27b^2),

and two elliptic curves over C are isomorphic as varieties if and only if they have the same j-invariant. It follows that isomorphism classes of elliptic curves correspond to complex numbers; one says that C is the moduli space for elliptic curves. Topologically, all elliptic curves look like a torus (the surface of a donut), but algebraically, they are the same if and only if they have the same j-invariant.

Now consider curves of genus g ≥ 2 over C. Topologically, these look like a surface with g holes, but algebraically, there is a moduli space of dimension 3g − 3 that records the algebraic structure. These moduli spaces and their compactifications have been studied extensively. Curves of genus g ≥ 2 also have strong connections with non-Euclidean geometry.
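The formula for j(E) is easy to experiment with; the short Python sketch below simply evaluates it for two hypothetical curves chosen for illustration (they are not examples from the text):

```python
from fractions import Fraction

def j_invariant(a, b):
    # j-invariant of y^2 = x^3 + a*x + b, using the formula above;
    # requires the discriminant condition 4a^3 + 27b^2 != 0.
    disc = 4*a**3 + 27*b**2
    if disc == 0:
        raise ValueError("not an elliptic curve: 4a^3 + 27b^2 = 0")
    return Fraction(2**8 * 3**3 * a**3, disc)

print(j_invariant(1, 0))   # 1728  (the curve y^2 = x^3 + x)
print(j_invariant(0, 1))   # 0     (the curve y^2 = x^3 + 1)
```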
2. Surfaces. Smooth projective surfaces over C have a richer structure and hence a more complicated classification. Such a surface S has its canonical bundle ω_S, which is a sheaf of O_S-modules that (roughly speaking) locally looks like multiples of dx dy for local coordinates x, y. Then we get the associated bundle ω_S^m, which locally looks like multiples of (dx dy)^m. The dimension of the sheaf cohomology group H^0(S, ω_S^m) grows like a polynomial in m, and the degree of this polynomial is the Kodaira dimension κ of S, where the zero polynomial has degree −∞. Using the Kodaira dimension, we get the following Enriques–Kodaira classification:

κ = −∞: Rational surfaces and ruled surfaces over curves of genus > 0.

κ = 0: K3 surfaces, abelian surfaces, and Enriques surfaces.

κ = 1: Surfaces mapping to a curve of genus ≥ 2 whose generic fiber is an elliptic curve.

κ = 2: Surfaces of general type.
One can also define the Kodaira dimension for curves, where the possible values κ = −∞, 0, 1 correspond to the classification by genus g = 0, 1, or ≥ 2. One difference in the surface case is that blowing up causes problems. One needs to define the minimal model of a surface, which exists in most cases, and then the minimal model gets classified by describing its moduli space. These moduli spaces are well understood except for surfaces of general type, where many unsolved problems remain.
To say more about how this classification works, we need some terminology. Two irreducible varieties are birational if they have Zariski open subsets that are isomorphic. Thus, a variety over C is rational if and only if it
is birational to P^n(C), and two smooth projective surfaces are birational if and only if they have the same minimal model. As for moduli, consider the equation

a (x_0^4 + x_1^4 + x_2^4 + x_3^4) + x_0 x_1 x_2 x_3 = 0.

This defines a K3 surface in P^3(C) provided a ≠ 0. As we vary a, we get different K3 surfaces that can be deformed into each other. This (very roughly) is what happens in a moduli space, although a lot of careful work is needed to make this idea precise.

The Enriques–Kodaira classification is described in detail in Ref. 31. This book also discusses the closely related classification of smooth complex surfaces, not necessarily algebraic.
3. Higher Dimensions. Recall that a three-fold is a variety of dimension 3. As in the surface case, one uses the Kodaira dimension to break up all three-folds into classes, this time according to κ = −∞, 0, 1, 2, 3. One new feature for three-folds is that although minimal models exist, they may have certain mild singularities. Hence, the whole theory is more sophisticated than the surface case. The general strategy of the minimal model program is explained in Ref. 32.
4. Hilbert Schemes. Another kind of classification question concerns varieties that live in a fixed ambient space. For example, what sorts of surfaces of small degree exist in P^4(C)? There is also the Hartshorne conjecture, which asserts that a smooth variety V of dimension n in P^N(C), where N < (3/2)n, is a complete intersection, meaning that V is defined by a system of exactly N − n equations.

In general, one can classify all varieties in P^n(C) of given degree and dimension. One gets a better classification by looking at all varieties with a given Hilbert polynomial. This leads to the concept of a Hilbert scheme. There are many unanswered questions about Hilbert schemes.
5. Vector Bundles. A vector bundle of rank r on a variety V is a sheaf that locally looks like a free module of rank r. For example, the tangent planes to a smooth surface form its tangent bundle, which is a vector bundle of rank 2.

Vector bundles of rank 1 are called line bundles or invertible sheaves. When V is smooth, line bundles can be described in terms of divisors, which are formal sums a_1 D_1 + ··· + a_m D_m, where a_i is an integer and D_i is an irreducible hypersurface. Furthermore, line bundles are isomorphic if and only if their corresponding divisors are rationally equivalent. The set of isomorphism classes of line bundles on V forms the Picard group Pic(V).
There has also been a lot of work classifying vector bundles on P^n(C). For n = 1, a complete answer is known. For n ≥ 2, one classifies vector bundles E according to their rank r and their Chern classes c_i(E). One important problem is understanding how to compactify the corresponding moduli spaces. This involves the concepts of stable and semistable bundles. Vector bundles also have interesting connections with mathematical physics (see Ref. 33).
6. Algebraic Cycles. Given an irreducible variety V of dimension n, a variety W contained in V is called a subvariety. Divisors on V are integer combinations of irreducible subvarieties of dimension n − 1. More generally, an m-cycle on V is an integer combination of irreducible subvarieties of dimension m. Cycles are studied using various equivalence relations, including rational equivalence, algebraic equivalence, numerical equivalence, and homological equivalence. The Hodge Conjecture concerns the behavior of cycles under homological equivalence, whereas the Chow groups are constructed using rational equivalence.

Algebraic cycles are linked to other topics in algebraic geometry, including motives, intersection theory, and variations of Hodge structure. An introduction to some of these ideas can be found in Ref. 34.
REAL ALGEBRAIC GEOMETRY
In algebraic geometry, the theory usually works best over C or other algebraically closed fields. Yet many applications of algebraic geometry deal with real solutions of polynomial equations. We will explore several aspects of this question.

When dealing with equations with finitely many solutions, there are powerful methods for estimating the number of solutions, including a multivariable version of Bezout's Theorem and the more general BKK bound, both of which deal with complex solutions. But these bounds can differ greatly from the number of real solutions.
An example from Ref. 35 is the system

a x y z^m + b x + c y + d = 0
a′ x y z^m + b′ x + c′ y + d′ = 0
a″ x y z^m + b″ x + c″ y + d″ = 0

where m is a positive integer and a, b, ..., c″, d″ are random real coefficients. The BKK bound tells us that there are m complex solutions, and yet there are at most two real solutions.
Questions about the number of real solutions go back to Descartes' Rule of Signs for the maximum number of positive and negative roots of a real univariate polynomial. There is also Sturm's Theorem, which gives the number of real roots in an interval. These results now have multivariable generalizations. Precise statements can be found in Refs. 18 and 30.
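As a small illustration of the univariate starting point (the polynomial below is a hypothetical example, not from the text), Descartes' Rule of Signs can be applied mechanically by counting sign changes in the coefficient sequence:

```python
from sympy import symbols, Poly

x = symbols('x')

p = Poly(x**3 - 3*x**2 + 4*x - 2, x)   # a hypothetical polynomial

def sign_changes(coeffs):
    # Count sign changes in a coefficient sequence, ignoring zeros.
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Descartes' Rule: the number of positive real roots equals the number
# of sign changes of p(x), or is smaller than it by an even number.
print(sign_changes(p.all_coeffs()))                                  # 3
# For negative roots, apply the same rule to p(-x).
print(sign_changes(Poly(p.as_expr().subs(x, -x), x).all_coeffs()))   # 0
```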
Real solutions also play an important role in enumerative algebraic geometry. For example, a smooth cubic surface S defined over R has 27 lines when we regard S as lying in P^3(C). But how many of these lines are real? In other words, how many lines lie on S when it is regarded as lying in P^3(R)? (The answer is 27, 15, 7, or 3, depending on the equation of the surface.) This and other examples from real enumerative geometry are discussed in Ref. 35.
Over the real numbers, one can define geometric objects using inequalities as well as equalities. For example, a solid sphere of radius 1 is defined by x^2 + y^2 + z^2 ≤ 1. In general, a finite collection of polynomial equations and inequalities defines what is known as a semialgebraic variety. Inequalities arise naturally when one does quantifier elimination. For example, given real numbers b and c, the question

Does there exist x in R with x^2 + bx + c = 0?

is equivalent to the inequality

b^2 − 4c ≥ 0

by the quadratic formula. The theory of real quantifier elimination is due to Tarski, although the first practical algorithmic version is Collins' cylindrical algebraic decomposition. A brief discussion of these issues appears in Ref. 30. Semialgebraic varieties arise naturally in robotics and motion planning, because obstructions like floors and walls are defined by inequalities (see ROBOTICS).
SCHEMES
An affine variety V is the geometric object corresponding to the algebraic object given by its coordinate ring k[V]. More generally, given any commutative ring R, Grothendieck defined the affine scheme Spec(R) to be the geometric object corresponding to R. The points of Spec(R) correspond to prime ideals of R, and Spec(R) also has a structure sheaf O_{Spec(R)} that generalizes the sheaves O_V mentioned earlier.

As an example, consider the coordinate ring C[V] of an affine variety V in C^n. We saw earlier that the points of V correspond to maximal ideals of C[V]. As maximal ideals are prime, it follows that Spec(C[V]) contains a copy of V. The remaining points of Spec(C[V]) correspond to the other irreducible varieties lying in V. In fact, knowing Spec(C[V]) is equivalent to knowing V in a sense that can be made precise.
Affine schemes have good properties with regard to maps between rings, and they can be patched together to get more general objects called schemes. For example, every projective variety has a natural scheme structure. One way to see the power of schemes is to consider the intersection of the curves in C^2 defined by f = 0 and g = 0, as in our discussion of Bezout's Theorem. As varieties, this intersection consists of just points, but if we consider the intersection as a scheme, then it has the additional structure consisting of the ring O_p/⟨f, g⟩ at every intersection point p. So the scheme-theoretic intersection knows the multiplicities. See Ref. 36 for an introduction to schemes. Scheme theory is also discussed in Refs. 1 and 2.
BIBLIOGRAPHY
1. R. Hartshorne, Algebraic Geometry, NewYork: Springer, 1977.
2. I. R. Shafarevich, Basic Algebraic Geometry, New York:
Springer, 1974.
3. B. Buchberger, Gröbner bases: An algorithmic method in polynomial ideal theory, in N. K. Bose (ed.), Recent Trends in Multidimensional Systems Theory, Dordrecht: D. Reidel, 1985.
4. D. Cox, J. Little, and D. O'Shea, Ideals, Varieties, and Algorithms, 3rd ed., New York: Springer, 2007.
5. H. Schenck, Computational Algebraic Geometry, Cambridge:
Cambridge University Press, 2003.
6. K. Smith, L. Kahanpää, P. Kekäläinen, and W. Traves, An Invitation to Algebraic Geometry, New York: Springer, 2000.
7. D. Bayer and D. Mumford, What can be computed in algebraic
geometry? in D. Eisenbud and L. Robbiano (eds.), Computa-
tional Algebraic Geometry and Commutative Algebra,
Cambridge: Cambridge University Press, 1993.
8. A. Dickenstein and I. Emiris, Solving Polynomial Systems,
New York: Springer, 2005.
9. PHCpack, a general purpose solver for polynomial systems by
homotopy continuation. Available: http://www.math.uic.edu/
~jan/PHCpack/phcpack.html.
10. Maple. Available: http://www.maplesoft.com.
11. Mathematica. Available: http://www.wolfram.com.
12. CoCoA, Computational Commutative Algebra. Available:
http://www.dima.unige.it.
13. Macaulay 2, a software system for research in algebraic geometry. Available: http://www.math.uiuc.edu/Macaulay2.
14. Singular, a computer algebra system for polynomial computa-
tions. Available: http://www.singular.uni-kl.de.
15. M. Kreuzer and L. Robbiano, Computational Commutative
Algebra 1, New York: Springer, 2000.
16. G.-M. Greuel and G. Pfister, A Singular Introduction to Commutative Algebra, New York: Springer, 2002.
17. Magma, The Magma Computational Algebra System. Avail-
able: http://magma.maths.usyd.edu.au/magma/.
18. D. Cox, J. Little, and D. O'Shea, Using Algebraic Geometry, 2nd ed., New York: Springer, 2005.
19. T. W. Sederberg and F. Chen, Implicitization using moving curves and surfaces, in S. G. Mair and R. Cook (eds.), Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1995), New York: ACM Press, 1995, pp. 301–308.
20. H. Hauser, The Hironaka theorem on resolution of singularities (or: a proof we always wanted to understand), Bull. Amer. Math. Soc., 40: 323–403, 2003.
21. G. Bodnár and J. Schicho, Automated resolution of singularities for hypersurfaces, J. Symbolic Computation, 30: 401–429, 2000. Available: http://www.rise.uni-linz.ac.at/projects/basic/adjoints/blowup.
22. H. Stetter, Numerical Polynomial Algebra, Philadelphia:
SIAM, 2004.
23. D. Cox, R. Goldman, and M. Zhang, On the validity of implicitization by moving quadrics for rational surfaces with no base points, J. Symbolic Comput., 29: 419–440, 2000.
24. P. Griffiths and J. Harris, Principles of Algebraic Geometry, New York: Wiley, 1978.
25. N. Koblitz, A Course in Number Theory and Cryptography, 2nd ed., New York: Springer, 1994.
26. S. L. Kleiman and D. Laksov, Schubert calculus, Amer. Math. Monthly, 79: 1061–1082, 1972.
27. R. Goldman and R. Krasauskas (eds.), Topics in Algebraic Geometry and Geometric Modeling, Providence, RI: AMS, 2003.
28. L. Pachter and B. Sturmfels (eds.), Algebraic Statistics for
Computational Biology, Cambridge: Cambridge University
Press, 2005.
29. C. Moreno, Algebraic Curves over Finite Fields, Cambridge:
Cambridge University Press, 1991.
30. A. M. Cohen, H. Cuypers, and H. Sterk (eds.), Some Tapas of
Computer Algebra, New York: Springer, 1999.
31. W. P. Barth, C. A. Peters, and A. A. van de Ven, Compact
Complex Surfaces, New York: Springer, 1984.
32. C. Cadman, I. Coskun, K. Jarbusch, M. Joyce, S. Kovács, M. Lieblich, F. Sato, M. Szczesny, and J. Zhang, A first glimpse at the minimal model program, in R. Vakil (ed.), Snowbird Lectures in Algebraic Geometry, Providence, RI: AMS, 2005.
33. V. S. Vardarajan, Vector bundles and connections in physics and mathematics: Some historical remarks, in V. Lakshmibai, V. Balaji, V. B. Mehta, K. R. Nagarajan, K. Paranjape, P. Sankaran, and R. Sridharan (eds.), A Tribute to C. S. Seshadri, Basel: Birkhäuser-Verlag, 2003, pp. 502–541.
34. W. Fulton, Introduction to Intersection Theory in Algebraic
Geometry, Providence, RI: AMS, 1984.
35. F. Sottile, Enumerative real algebraic geometry, in S. Basu and L. Gonzalez-Vega (eds.), Algorithmic and Quantitative Real Algebraic Geometry (Piscataway, NJ, 2001), Providence, RI: AMS, 2003, pp. 139–179.
36. D. Eisenbud and J. Harris, The Geometry of Schemes,
New York: Springer, 2000.
DAVID A. COX
Amherst College
Amherst, Massachusetts
COMPUTATIONAL COMPLEXITY THEORY
Complexity theory is the part of theoretical computer science that attempts to prove that certain transformations from input to output are impossible to compute using a reasonable amount of resources. Theorem 1 below illustrates the type of impossibility proof that can sometimes be obtained (1); it talks about the problem of determining whether a logic formula in a certain formalism (abbreviated WS1S) is true.

Theorem 1. Any circuit of AND, OR, and NOT gates that takes as input a WS1S formula of 610 symbols and outputs a bit that says whether the formula is true must have at least 10^125 gates.
This is a very compelling argument that no such circuit will ever be built; if the gates were each as small as a proton, such a circuit would fill a sphere having a diameter of 40 billion light years! Many people conjecture that somewhat similar intractability statements hold for the problem of factoring 1000-bit integers; many public-key cryptosystems are based on just such assumptions.
It is important to point out that Theorem 1 is specific to a particular circuit technology; to prove that there is no efficient way to compute a function, it is necessary to be specific about what is performing the computation. Theorem 1 is a compelling proof of intractability precisely because every deterministic computer that can be purchased today can be simulated efficiently by a circuit constructed with AND, OR, and NOT gates. The inclusion of the word deterministic in the preceding paragraph is significant; some computers are constructed with access to devices that are presumed to provide a source of random bits. Probabilistic circuits (which are allowed to have some small chance of producing an incorrect output) might be a more powerful model of computing. Indeed, the intractability result for this class of circuits (1) is slightly weaker:

Theorem 2. Any probabilistic circuit of AND, OR, and NOT gates that takes as input a WS1S formula of 614 symbols and outputs a bit that says whether the formula is true (with error probability at most 1/3) must have at least 10^125 gates.
The underlying question of the appropriate model of computation to use is central to the question of how relevant the theorems of computational complexity theory are. Both deterministic and probabilistic circuits are examples of classical models of computing. In recent years, a more powerful model of computing that exploits certain aspects of the theory of quantum mechanics has captured the attention of the research communities in computer science and physics. It seems likely that some modification of Theorems 1 and 2 holds even for quantum circuits. For the factorization problem, however, the situation is different. Although many people conjecture that classical (deterministic or probabilistic) circuits that compute the factors of 1000-bit numbers must be huge, it is known that small quantum circuits can compute factors (2). It remains unknown whether it will ever be possible to build quantum circuits, or to simulate the computation of such circuits efficiently. Thus, complexity theory based on classical computational models continues to be relevant.
The three most widely studied general-purpose realistic models of computation today are deterministic, probabilistic, and quantum computers. There is also interest in restricted models of computing, such as algebraic circuits or comparison-based algorithms. Comparison-based models arise in the study of sorting algorithms. A comparison-based sorting algorithm is one that sorts n items and is not allowed to manipulate the representations of those items, other than being able to test whether one is greater than another. Comparison-based sorting algorithms require time Ω(n log n), whereas faster algorithms are sometimes possible if they are allowed to access the bits of the individual items. Algebraic circuits operate under similar restrictions; they cannot access the individual bits of the representations of the numbers that are provided as input, but instead they can only operate on those numbers via arithmetic operations such as +, −, and ×. Interestingly, there is also a great deal of interest in unrealistic models of computation, such as nondeterministic machines. Before we explain why unrealistic models of computation are of interest, let us see the general structure of an intractability proof.
DIAGONALIZATION AND REDUCIBILITY
Any intractability proof has to confront a basic question: How can one prove that there is not a clever algorithm for a certain problem? Here is the basic strategy that is used to prove Theorems 1 and 2. There are three steps.

Step 1 involves showing that there is a program A that uses roughly 2^n bits of memory on inputs of size n such that, for every input length n, the function that A computes on inputs of length n requires circuits as large as are required by any function on n bits. The algorithm A is presented by a diagonalization argument (so-called because of similarity to Cantor's diagonal argument from set theory). The same argument carries through essentially unchanged for probabilistic and quantum circuits. The problem computed by A is hard to compute, but this by itself is not very interesting, because it is probably not a problem that anyone would ever want to compute.
Step 2 involves showing that there is an efficiently computable function f that transforms any input instance x for A into a WS1S formula φ (i.e., f(x) = φ) with the property that A outputs 1 on input x if and only if the formula φ is true. If there were a small circuit deciding whether a formula is true, then there would be a small circuit for the problem computed by A. As, by Step 1, there is
no such small circuit for A, it follows that there is no small circuit deciding whether a formula is true.

Step 3 involves a detailed analysis of the first two steps, in order to obtain the concrete numbers that appear in Theorem 1.
Let us focus on Step 2. The function f is called a reduction. Any function g such that x is in B if and only if g(x) is in C is said to reduce B to C. (This sort of reduction makes sense when the computational problems B and C are problems that require a yes or no answer; thus, they can be viewed as sets, where x is in B if B outputs yes on input x. Any computational problem can be viewed as a set this way. For example, computing a function h can be viewed as the set {(x, i) : the ith bit of h(x) is 1}.)

Efficient reducibility provides a remarkably effective tool for classifying the computational complexity of a great many problems of practical importance. The amazing thing about the proof of Step 2 (and this is typical of many theorems in complexity theory) is that it makes no use at all of the algorithm A, other than the fact that A uses at most 2^n memory locations. Every problem that uses at most this amount of memory is efficiently reducible to the problem of deciding whether a formula is true. This provides motivation for a closer look at the notion of efficient reducibility.
EFFICIENT COMPUTATION, POLYNOMIAL TIME
Many notions of efficient reducibility are studied in computational complexity theory, but without question the most important one is polynomial-time reducibility. The following considerations explain why this notion of reducibility arises.

Let us consider the informal notion of easy functions (functions that are easy to compute). Here are some conditions that one might want to satisfy, if one were trying to make this notion precise:

• If f and g are easy, then the composition f ∘ g is also easy.

• If f is computable in time n^2 on inputs of length n, then f is easy.
These conditions might seem harmless, but taken together, they imply that some easy functions take time n^100 to compute. (This is because there is an easy function f that takes input of length n and produces output of length n^2. Composing this function with itself takes time n^4, etc.) At one level, it is clearly absurd to call a function easy if it requires time n^100 to compute. However, this is precisely what we do in complexity theory! When our goal is to show that certain problems require superpolynomial running times, it is safe to consider a preprocessing step requiring time n^100 as a negligible factor.
A polynomial-time reduction f is simply a function that is computed by some program that runs in time bounded by p(n) on inputs of length n, for some polynomial p. (That is, for some constant k, the running time of the program computing f is at most n^k + k on inputs of length n.) Note that we have not been specific about the programming language in which the program is written. Traditionally, this is made precise by saying that f is computed by a Turing machine in the given time bound, but exactly the same class of polynomial-time reductions results if we use any other reasonable programming language, with the usual notion of running time. This is a side-benefit of our overly generous definition of what it means to be easy to compute.

If f is a polynomial-time reduction of A to B, we denote this A ≤_m^p B. Note that this suggests an ordering, where B is larger (i.e., harder to compute) than A. Any efficient algorithm for B yields an efficient algorithm for A; if A is hard to compute, then B must also be hard to compute. If A ≤_m^p B and B ≤_m^p A, this is denoted A ≡_m^p B.
One thing that makes complexity theory useful is that naturally arising computational problems tend to clump together into a shockingly small number of equivalence classes of the ≡_m^p relation. Many thousands of problems have been analyzed, and most of these fall into about a dozen equivalence classes, with perhaps another dozen classes picking up some other notably interesting groups of problems.

Many of these equivalence classes correspond to interesting time and space bounds. To explain this connection, first we need to talk about complexity classes.
COMPLEXITY CLASSES AND COMPLETE SETS
A complexity class is a set of problems that can be computed within certain resource bounds on some model of computation. For instance, P is the class of problems computable by programs that run in time at most n^k + k for some constant k; P stands for polynomial time. Another important complexity class is EXP: the set of problems computed by programs that run for time at most 2^{n^k + k} on inputs of length n. P and EXP are both defined in terms of time complexity. It is also interesting to bound the amount of memory used by programs; the classes PSPACE and EXPSPACE consist of the problems computed by programs whose space requirements are polynomial and exponential in the input size, respectively.
An important relationship exists between EXP and the game of checkers; this example is suitable to introduce the concepts of hardness and completeness.

Checkers is played on an 8-by-8 grid. When the rules are adapted for play on a 10-by-10 grid, the game is known as draughts (which can also be played on boards of other sizes). Starting from any game position, there is an optimal strategy. The task of finding an optimal strategy is a natural computational problem. N × N-Checkers is the function that takes as input a description of an N × N draughts board with locations of the pieces, and returns as output the move that a given player should make, using the optimal strategy. It is known that there is a program computing the optimal strategy for N × N-Checkers that runs in time exponential in N^2; thus, N × N-Checkers ∈ EXP.

More interestingly, it is known that for every problem A ∈ EXP, A ≤_m^p N × N-Checkers. We say that N × N-Checkers is hard for EXP (3).
More generally, if C is any class of problems, and B is a problem such that A ≤_m^p B for every A ∈ C, then we say that B is hard for C. If B is hard for C and B ∈ C, then we say that B is complete for C. Thus, in particular, N × N-Checkers is complete for EXP. This means that the complexity of N × N-Checkers is well understood, in the sense that the fastest program for this problem cannot be too much faster than the currently known program. Here is why: We know (via a diagonalization argument) that there is some problem A in EXP that cannot be computed by any program that runs in time asymptotically less than 2^n. As N × N-Checkers is complete for EXP, we know there is a reduction from A to N × N-Checkers computable in time n^k for some k, and thus N × N-Checkers requires running time that is asymptotically at least 2^{n^{1/k}}.
It is significant to note that this yields only an asymptotic lower bound on the time complexity of N × N-Checkers. That is, it says that the running time of any program for this problem must be very slow on large enough inputs, but (in contrast to Theorem 1) it says nothing about whether this problem is difficult for a given fixed input size. For instance, it is still unknown whether there could be a handheld device that computes optimal strategies for 100 × 100-Checkers (although this seems very unlikely). To mimic the proof of Theorem 1, it would be necessary to show that there is a problem in EXP that requires large circuits. Such problems are known to exist in EXPSPACE; whether such problems exist in EXP is one of the major open questions in computational complexity theory.
The complete sets for EXP (such as N × N-Checkers) constitute one of the important ≡_m^p-equivalence classes; many other problems are complete for PSPACE and EXPSPACE (and of course every nontrivial problem that can be solved in polynomial time is complete for P under ≤_m^p reductions). However, this accounts for only a few of the several ≡_m^p-equivalence classes that arise when considering important computational problems. To understand these other computational problems, it turns out to be useful to consider unrealistic models of computation.
UNREALISTIC MODELS: NONDETERMINISTIC MACHINES
AND THE CLASS NP
Nondeterministic machines appear to be a completely unrealistic model of computation; if one could prove this to be the case, one would have solved one of the most important open questions in theoretical computer science (and even in all of mathematics).

A nondeterministic Turing machine can be viewed as a program with a special guess subroutine; each time this subroutine is called, it returns a random bit, zero or one. Thus far, it sounds like an ordinary program with a random bit generator, which does not sound so unrealistic. The unrealistic aspect comes with the way that we define how the machine produces its output. We say that a nondeterministic machine accepts its input (i.e., it outputs one) if there is some sequence of bits that the guess routine could return that causes the machine to output one; otherwise it is said to reject its input. If we view the guess bits as independent coin tosses, then the machine rejects its input if and only if the probability of outputting one is zero; otherwise it accepts. If a nondeterministic machine runs for t steps, the machine can flip t coins, and thus, a nondeterministic machine can do the computational equivalent of finding a needle in a haystack: If there is even one sequence r of length t (out of 2^t possibilities) such that sequence r leads the machine to output one on input x, then the nondeterministic machine will accept x, and it does it in time t, rather than being charged time 2^t for looking at all possibilities.
A classic example that illustrates the power of nondeterministic machines is the Traveling Salesman Problem. The input consists of a labeled graph, with nodes (cities) and edges (listing the distances between each pair of cities), along with a bound B. The question to be solved is as follows: Does there exist a cycle visiting all of the cities, having length at most B? A nondeterministic machine can solve this quickly, by using several calls to the guess subroutine to obtain a sequence of bits r that can be interpreted as a list of cities, and then outputting one if r visits all of the cities and the edges used sum up to at most B.
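In deterministic terms, the guess-and-verify behavior just described amounts to saying that a claimed tour is easy to check. The following Python sketch (with a hypothetical 4-city instance invented for illustration) is such a verifier; what the nondeterministic machine does, in effect, is run this check on a guessed sequence of cities.

```python
def verify_tour(dist, tour, bound):
    # Check a guessed certificate: does `tour` visit every city exactly
    # once, and does the cycle (including the return edge) have total
    # length at most `bound`?  dist[i][j] is the distance between i and j.
    n = len(dist)
    if sorted(tour) != list(range(n)):
        return False                      # not a tour of all cities
    length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return length <= bound

# A hypothetical 4-city instance
dist = [[0, 1, 4, 3],
        [1, 0, 2, 5],
        [4, 2, 0, 1],
        [3, 5, 1, 0]]
print(verify_tour(dist, [0, 1, 2, 3], 8))   # True: 1 + 2 + 1 + 3 = 7 <= 8
```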
Nondeterministic machines can also be used to factor numbers; given an n-bit number x, along with two other numbers a and b with a < b, a nondeterministic machine can accept if and only if there is a factor of x that lies between a and b.

Of course, this nondeterministic program is of no use at all in trying to factor numbers or to solve the Traveling Salesman Problem on realistic computers. In fact, it is hard to imagine that there will ever be an efficient way to simulate a nondeterministic machine on computers that one could actually build. This is precisely why this model is so useful in complexity theory; the following paragraph explains why.
The class NP is the class of problems that can be solved by nondeterministic machines running in time at most n^k + k on inputs of size n, for some constant k; NP stands for Nondeterministic Polynomial time. The Traveling Salesman Problem is one of many hundreds of very important computational problems (arising in many seemingly unrelated fields) that are complete for NP. Although it is more than a quarter-century old, the volume by Garey and Johnson (4) remains a useful catalog of NP-complete problems. The NP-complete problems constitute the most important ≡_m^p-equivalence class whose complexity is unresolved. If any one of the NP-complete problems lies in P, then P = NP. As explained above, it seems much more likely that P is not equal to NP, which implies that any program solving any NP-complete problem has a worst-case running time greater than n^{100,000} on all large inputs of length n.

Of course, even if P is not equal to NP, we would still have the same situation that we face with N × N-Checkers, in that we would not be able to conclude that instances of some fixed size (say n = 1,000) are hard to compute. For that, we would seem to need the stronger assumption that there are problems in NP that require very large circuits; in fact, this is widely conjectured to be true.
Although it is conjectured that deterministic machines require exponential time to simulate nondeterministic machines, it is worth noting that the situation is very different when memory bounds are considered instead. A classic theorem of complexity theory states that a nondeterministic machine using space s(n) can be simulated by a deterministic machine in space s(n)^2. Thus, if we define NPSPACE and NEXPSPACE by analogy to PSPACE and EXPSPACE using nondeterministic machines, we obtain the equalities PSPACE = NPSPACE and EXPSPACE = NEXPSPACE.

We thus have the following six complexity classes:

P ⊆ NP ⊆ PSPACE ⊆ EXP ⊆ NEXP ⊆ EXPSPACE
Diagonalization arguments tell us that P ≠ EXP, NP ≠ NEXP, and PSPACE ≠ EXPSPACE. All other relationships are unknown. For instance, it is unknown whether P = PSPACE, and it is also unknown whether PSPACE = NEXP (although at most one of these two equalities can hold). Many in the community conjecture that all of these classes are distinct and that no significant improvement on any of these inclusions can be proved. (That is, many people conjecture that there are problems in NP that require exponential time on deterministic machines, that there are problems in PSPACE that require exponential time on nondeterministic machines, that there are problems in EXP that require exponential space, etc.) These conjectures have remained unproven since they were first posed in the 1970s.
A THEORY TO EXPLAIN OBSERVED DIFFERENCES IN
COMPLEXITY
It is traditional to draw a distinction between mathematics and empirical sciences such as physics. In mathematics, one starts with a set of assumptions and derives (with certainty) the consequences of the assumptions. In contrast, in a discipline such as physics, one starts with external reality and formulates theories to try to explain (and make predictions about) that reality.

For some decades now, the field of computational complexity theory has dwelt in the uncomfortable region between mathematics and the empirical sciences. Complexity theory is a mathematical discipline; progress is measured by the strength of the theorems that are proved. However, despite rapid and exciting progress on many fronts, the fundamental question of whether P is equal to NP remains unsolved.
Until that milestone is reached, complexity theory can still offer to the rest of the computing community some of the benefits of an empirical science, in the following sense. All of our observations thus far indicate that certain problems (such as the Traveling Salesman Problem) are intractable. Furthermore, we can observe that, with surprisingly few exceptions, natural and interesting computational problems can usually be shown to be complete for one of a handful of well-studied complexity classes. Even though we cannot currently prove that some of these complexity classes are distinct, the fact that these complexity classes correspond to natural or unnatural models of computation gives us an intuitively appealing explanation for why these classes appear to be distinct. That is, complexity theory gives us a vocabulary and a set of plausible conjectures that helps explain our observations about the differing computational difficulty of various problems.
NP AND PROVABILITY
There are important connections between NP and mathematical logic. One equivalent way of defining NP is to say that a set A is in NP if and only if there are short proofs of membership in A. For example, consider the Traveling Salesman Problem. If there is a short cycle that visits all cities, then there is a short proof of this fact: Simply present the cycle and compute its length. Contrast this with the task of trying to prove that there is not a short cycle that visits all cities. For certain graphs this is possible, but nobody has found a general approach that is significantly better than simply listing all (exponentially many) possible cycles and showing that all of them are too long. That is, for NP-complete problems A, it seems to be the case that the complement of A (denoted co-A) is not in NP.

The complexity class coNP is defined to be the set of all complements of problems in NP; coNP = {co-A : A ∈ NP}.
This highlights what appears to be a fundamental difference between deterministic and nondeterministic computation. On a deterministic machine, a set and its complement always have similar complexity. On nondeterministic machines, this does not appear to be true (although if one could prove this, one would have a proof that P is different from NP).
To discuss the connections among NP, coNP, and logic in more detail, we need to give some definitions related to propositional logic. A propositional logic formula consists of variables (which can take on the values TRUE and FALSE), along with the connectives AND, OR, and NOT. A formula is said to be satisfiable if there is some assignment of truth values to the variables that causes it to evaluate to TRUE; it is said to be a tautology if every assignment of truth values to the variables causes it to evaluate to TRUE. SAT is the set of all satisfiable formulas; TAUT is the set of all tautologies. Note that the formula φ is in SAT if and only if NOT φ is not in TAUT.
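These definitions can be illustrated with a tiny brute-force checker in Python; the formulas below are hypothetical examples, and the obvious method tries all 2^n assignments, which is exactly the exhaustive behavior that better algorithms are not known to avoid in general.

```python
from itertools import product

def satisfiable(formula, num_vars):
    # Does some assignment of truth values make `formula` true?
    # `formula` is a Python function taking a tuple of booleans.
    return any(formula(bits)
               for bits in product([False, True], repeat=num_vars))

def tautology(formula, num_vars):
    # A formula is a tautology iff its negation is unsatisfiable.
    return not satisfiable(lambda bits: not formula(bits), num_vars)

f = lambda b: b[0] or b[1]                        # (x OR y)
print(satisfiable(f, 2))                          # True
print(tautology(f, 2))                            # False: x = y = FALSE
print(tautology(lambda b: b[0] or not b[0], 1))   # True: (x OR NOT x)
```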
SAT is complete for NP; TAUT is complete for coNP. [This famous theorem is sometimes known as Cook's Theorem (5) or the Cook–Levin Theorem (6).]
Logicians are interested in the question of how to prove that a formula is a tautology. Many proof systems have been developed; they are known by such names as resolution, Frege systems, and Gentzen calculus. For some of these systems, such as resolution, it is known that certain tautologies of n symbols require proofs of length nearly 2^n (7). For Frege systems and proofs in the Gentzen calculus, it is widely suspected that similar bounds hold, although this remains unknown. Most logicians suspect that for any reasonable proof system, some short tautologies will require very long proofs. This is equivalent to the conjecture that NP and coNP are different classes; if every tautology had a short proof, then a nondeterministic machine could
guess the proof and accept if the proof is correct. As TAUT
is complete for coNP, this would imply that NP = coNP.
The P versus NP question also has a natural interpretation in terms of logic. Two tasks that occupy mathematicians are as follows:

1. Finding proofs of theorems.
2. Reading proofs that other people have found.

Most mathematicians find the second task to be considerably simpler than the first one. This can be posed as a computational problem. Let us say that a mathematician wants to prove a theorem φ and wants the proof to be at most 40 pages long. A nondeterministic machine can take as input φ followed by 40 blank pages, and guess a proof, accepting if it finds a legal proof. If P = NP, the mathematician can thus determine fairly quickly whether there is a short proof. A slight modification of this idea allows the mathematician to efficiently construct the proof (again, assuming that P = NP). That is, the conjecture that P is different than NP is consistent with our intuition that finding proofs is more difficult than verifying that a given proof is correct.
In the 1990s, researchers in complexity theory discovered a very surprising (and counterintuitive) fact about logical proofs. Any proof of a logic statement can be encoded in such a way that it can be verified by picking a few bits at random and checking that these bits are sufficiently consistent. More precisely, let us say that you want to be 99.9% sure that the proof is correct. Then there is some constant k and a procedure such that, no matter how long the proof is, the procedure flips O(log n) coins and picks k bits of the encoding of the proof, and then does some computation, with the property that, if the proof is correct, the procedure accepts with probability one, and if the proof is incorrect, then the procedure detects that there is a flaw with probability at least 0.999. This process is known as a probabilistically checkable proof. Probabilistically checkable proofs have been very useful in proving that, for many optimization problems, it is NP-complete not only to find an optimal solution, but even to get a very rough approximation to the optimal solution.
Some problems in NP are widely believed to be intractable to compute but are not believed to be NP-complete. Factoring provides a good example. The problem of computing the prime factorization of a number can be formulated in several ways; perhaps the most natural way is as the set FACTOR = {(x, i, b) : the ith bit of the encoding of the prime factorization of x is b}. By making use of the fact that primality testing lies in P (8), the set FACTOR is easily seen to lie in NP ∩ coNP. Thus, FACTOR cannot be NP-complete unless NP = coNP.
OTHER COMPLEXITY CLASSES: COUNTING,
PROBABILISTIC, AND QUANTUM COMPUTATION
Several computational problems that appear to be intermediate in complexity between NP and PSPACE are related to the problem of counting how many accepting paths a nondeterministic machine has. The class #P is the class of functions f for which there is an NP machine M with the property that, for each string x, f(x) is the number of guess sequences r that cause M to accept input x. #P is a class of functions, instead of being a class of sets like all other complexity classes that we have discussed. #P is equivalent in complexity to the class PP (probabilistic polynomial time), defined as follows. A set A is in PP if there is an NP machine M such that, for each string x, x is in A if and only if more than half of the guess sequences cause M to accept x. If we view the guess sequences as flips of a fair coin, this means that x is in A if and only if the probability that M accepts x is greater than one half. It is not hard to see that both NP and coNP are subsets of PP; thus this is not a very practical notion of probabilistic computation.
In practice, when people use probabilistic algorithms, they want to receive the correct answer with high probability. The complexity class that captures this notion is called BPP (bounded-error probabilistic polynomial time). Some problems in BPP are not known to lie in P; a good example of such a problem takes two algebraic circuits as input and determines whether they compute the same function.
Early in this article, we mentioned quantum computation. The class of problems that can be solved in polynomial time with low error probability using quantum machines is called BQP (bounded-error quantum polynomial time). FACTOR (the problem of finding the prime factorization of a number) lies in BQP (2). The following inclusions are known:

P ⊆ BPP ⊆ BQP ⊆ PP ⊆ PSPACE
P ⊆ NP ⊆ PP

No relationship is known between NP and BQP or between NP and BPP. Many people conjecture that neither NP nor BQP is contained in the other.
In contrast, many people now conjecture that BPP = P, because it has been proved that if there is any problem computable in time 2^n that requires circuits of nearly exponential size, then there is an efficient deterministic simulation of any BPP algorithm, which implies that P = BPP (9). This theorem is one of the most important in a field that has come to be known as derandomization, which studies how to simulate probabilistic algorithms deterministically.
INSIDE P
Polynomial-time reducibility is a very useful tool for clarifying the complexity of seemingly intractable problems, but it is of no use at all in trying to draw distinctions among problems in P. It turns out that some very useful distinctions can be made; to investigate them, we need more refined tools.
Logspace reducibility is one of the most widely used notions of reducibility for investigating the structure of P. A logspace reduction f is a polynomial-time reduction with the additional property that there is a Turing machine computing f that has (1) a read-only input tape, (2) a write-only output tape, and (3) as its only other data structure a read/write worktape, of which it uses only O(log n) locations on inputs of length n. If A is logspace-reducible to B, then we denote this by A ≤_m^log B. Imposing this very stringent memory restriction seems to place severe limitations on polynomial-time computation; many people conjecture that many functions computable in polynomial time are not logspace-computable. However, it is also true that the full power of polynomial time is not exploited in most proofs of NP-completeness. For essentially all natural problems that have been shown to be complete for the classes NP, PP, PSPACE, EXP, and so on using polynomial-time reducibility, it is known that they are also complete under logspace reducibility. That is, for large classes, logspace reducibility is essentially as useful as polynomial-time reducibility, but logspace reducibility offers the advantage that it can be used to find distinctions among problems in P.
Logspace-bounded Turing machines give rise to some natural complexity classes inside P. If the characteristic function of a set A is a logspace reduction as defined in the preceding paragraph, then A lies in the complexity class L. The analogous class, defined in terms of nondeterministic machines, is known as NL. The class #P also has a logspace analog, known as #L. These classes are of interest primarily because of their complete sets. Some important complete problems for L are the problem of determining whether two trees are isomorphic, testing whether a graph can be embedded in the plane, and the problem of determining whether an undirected graph is connected (10). Determining whether a directed graph is connected is a standard complete problem for NL, as is the problem of computing the length of the shortest path between two vertices in a graph. The complexity class #L characterizes the complexity of computing the determinant of an integer matrix as well as several other problems in linear algebra.
There are also many important complete problems for P under logspace reducibility, such as the problem of evaluating a Boolean circuit, linear programming, and certain network flow computations. In fact, there is a catalog of P-complete problems (11) that is nearly as impressive as the list of NP-complete problems (4). Although many P-complete problems have very efficient algorithms in terms of time complexity, there is a sense in which they seem to be resistant to extremely fast parallel algorithms. This is easiest to explain in terms of circuit complexity. The size of a Boolean circuit can be measured in terms of either the number of gates or the number of wires that connect the gates. Another important measure is the depth of the circuit: the length of the longest path from an input gate to the output gate. The problems in L, NL, and #L all have circuits of polynomial size and very small depth, O(log^2 n). In contrast, all polynomial-size circuits for P-complete problems seem to require a depth of at least n^{1/k}.
Even a very small complexity class such as L has an interesting structure inside it that can be investigated using a more restricted notion of reducibility than ≤_m^log, defined in terms of very restricted circuits. Further information about these small complexity classes can be found in the textbook by Vollmer (12).
We have the inclusions L ⊆ NL ⊆ P ⊆ NP ⊆ PP ⊆ PSPACE. Diagonalization shows that NL ≠ PSPACE, but no other separations are known. In particular, it remains unknown whether the large complexity class PP actually coincides with the small class L.
TIME-SPACE TRADEOFFS
Logspace reducibility (and, in general, the notion of Turing machines that have very limited memory resources) allows the investigation of another aspect of complexity: the tradeoff between time and space. Take, for example, the problem of determining whether an undirected graph is connected. This problem can be solved using logarithmic space (10), but currently all space-efficient algorithms that are known for this problem are so slow that they will never be used in practice, particularly because this problem can be solved in linear time (using linear space) using a standard depth-first-search algorithm. However, there is no strong reason to believe that no fast small-space algorithm for graph connectivity exists (although there have been some investigations of this problem, using restricted models of computation, of the type that were discussed at the start of this article).
Some interesting time-space tradeoffs have been proved for the NP-complete problem SAT. Recall that it is still unknown whether SAT lies in the complexity class L. Also, although it is conjectured that SAT is not solvable in time n^k for any k, it remains unknown whether SAT is solvable in time O(n). However, it is known that if SAT is solvable in linear time, then any such algorithm must use much more than logarithmic space. In fact, any algorithm that solves SAT in time n^{1.7} must use memory n^{1/k} for some k (13,14).
CONCLUSION
Computational complexity theory has been very successful in providing a framework that allows us to understand why several computational problems have resisted all efforts to find efficient algorithms. In some instances, it has been possible to prove very strong intractability theorems, and in many other cases, a widely believed set of conjectures explains why certain problems appear to be hard to compute. The field is evolving rapidly; several developments discussed here are only a few years old. Yet the central questions (such as the infamous P vs. NP question) remain out of reach today.
By necessity, a brief article such as this can touch on only a small segment of a large field such as computational complexity theory. The reader is urged to consult the texts listed below for a more comprehensive treatment of the area.
FURTHER READING
D.-Z. Du and K.-I. Ko, Theory of Computational Complexity. New York: Wiley, 2000.
L. A. Hemaspaandra and M. Ogihara, The Complexity Theory Companion. London: Springer-Verlag, 2002.
D. S. Johnson, A catalog of complexity classes, in J. van Leeuwen (ed.), Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity. Cambridge, MA: MIT Press, 1990, pp. 69-161.
D. Kozen, Theory of Computation. London: Springer-Verlag, 2006.
C. Papadimitriou, Computational Complexity. Reading, MA: Addison-Wesley, 1994.
I. Wegener, Complexity Theory: Exploring the Limits of Efficient Algorithms. Berlin: Springer-Verlag, 2005.
BIBLIOGRAPHY
1. L. Stockmeyer and A. R. Meyer, Cosmological lower bound on the circuit complexity of a small problem in logic, J. ACM, 49: 753-784, 2002.
2. P. Shor, Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer, SIAM J. Comput., 26: 1484-1509, 1997.
3. J. M. Robson, N by N Checkers is EXPTIME complete, SIAM J. Comput., 13: 252-267, 1984.
4. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: Freeman, 1979.
5. S. Cook, The complexity of theorem proving procedures, Proc. 3rd Annual ACM Symposium on Theory of Computing (STOC), 1971, pp. 151-158.
6. L. Levin, Universal search problems, Problemy Peredachi Informatsii, 9: 265-266, 1973 (in Russian). English translation in: B. A. Trakhtenbrot, A survey of Russian approaches to perebor (brute-force search) algorithms, Ann. History Comput., 6: 384-400, 1984.
7. A. Haken, The intractability of resolution, Theor. Comput. Sci., 39: 297-308, 1985.
8. M. Agrawal, N. Kayal, and N. Saxena, PRIMES is in P, Ann. Math., 160: 781-793, 2004.
9. R. Impagliazzo and A. Wigderson, P = BPP unless E has subexponential circuits, Proc. 29th ACM Symposium on Theory of Computing (STOC), 1997, pp. 220-229.
10. O. Reingold, Undirected ST-connectivity in log-space, Proc. 37th Annual ACM Symposium on Theory of Computing (STOC), 2005, pp. 376-385.
11. R. Greenlaw, H. J. Hoover, and W. L. Ruzzo, Limits to Parallel Computation: P-Completeness Theory. New York: Oxford University Press, 1995.
12. H. Vollmer, Introduction to Circuit Complexity. Berlin: Springer-Verlag, 1999.
13. L. Fortnow, R. Lipton, D. van Melkebeek, and A. Viglas, Time-space lower bounds for satisfiability, J. ACM, 52: 835-865, 2005.
14. R. Williams, Better time-space lower bounds for SAT and related problems, Proc. 20th Annual IEEE Conference on Computational Complexity (CCC), 2005, pp. 40-49.
ERIC ALLENDER
Rutgers University
Piscataway, New Jersey
COMPUTATIONAL NUMBER THEORY
In the contemporary study of mathematics, number theory stands out as a peculiar branch, for many reasons. Most development of mathematical thought is concerned with the identification of certain structures and relations in these structures. For example, the study of algebra is concerned with different types of operators on objects, such as the addition and multiplication of numbers, the permutation of objects, or the transformation of geometric objects, and with the classification of the many such types of operators. Similarly, the study of analysis is concerned with the properties of operators that satisfy conditions of continuity.
Number theory, however, is the study of the properties of those few systems that arise naturally, beginning with the natural numbers (which we shall usually denote N), progressing to the integers (Z), the rationals (Q), the reals (R), and the complex numbers (C). Rather than identifying very general principles, number theory is concerned with very specific questions about these few systems.
For that reason, for many centuries, mathematicians thought of number theory as the purest form of inquiry. After all, it was not inspired by physics or astronomy or chemistry or other applied aspects of the physical universe. Consequently, mathematicians could indulge in number-theoretic pursuits while being concerned only with the mathematics itself.
But number theory is a field with great paradoxes. This purest of mathematical disciplines, as we will see below, has served as the source for arguably the most important set of applications of mathematics in many years!
Another curious aspect of number theory is that it is possible for a very beginning student of the subject to pose questions that can baffle the greatest minds. One example is the following: It is an interesting observation that there are many triples of natural numbers (in fact, an infinite number) that satisfy the equation x^2 + y^2 = z^2. For example, 3^2 + 4^2 = 5^2, 5^2 + 12^2 = 13^2, 7^2 + 24^2 = 25^2, and so on. However, one might easily be led to the question, can we find (nonzero) natural numbers x, y, and z such that x^3 + y^3 = z^3? Or, indeed, such that x^n + y^n = z^n, for any natural number n > 2 and nonzero integers x, y, and z?
The answer to this simple question was announced by the famous mathematician Pierre de Fermat, in his last written work in 1637. Unfortunately, Fermat did not provide a proof but only wrote the announcement in the margins of his writing. Subsequently, this simply stated problem became known as Fermat's Last Theorem, and the answer eluded the mathematics community for 356 years, until 1993, when it was finally solved by Andrew Wiles (1).
The full proof runs to over 1000 pages of text (no, it will not be reproduced here) and involves mathematical techniques drawn from a wide variety of disciplines within mathematics. It is thought to be highly unlikely that Fermat, despite his brilliance, could have understood the true complexity of his Last Theorem. (Lest the reader leave for want of the answer, what Wiles proved is that there are no possible nonzero x, y, z, and n > 2 that satisfy the equation.)
Other questions that arise immediately in number theory are even more problematic than Fermat's Last Theorem. For example, a major concern in number theory is the study of prime numbers: those natural numbers that are evenly divisible only by themselves and 1. For example, 2, 3, 5, 7, 11, and 13 are prime numbers, whereas 9, 15, and any even number except 2 are not (1, by convention, is not considered a prime).
One can easily observe that small even numbers can be described as the sum of two primes: 2 + 2 = 4, 3 + 3 = 6, 3 + 5 = 8, 3 + 7 = 5 + 5 = 10, 5 + 7 = 12, 7 + 7 = 14, and so on. One could ask, can all even numbers be expressed as the sum of two primes? Unfortunately, no one knows the answer to this question. It is known as the Goldbach Conjecture, and fame and fortune (well, fame, anyway) await the person who successfully answers the question (2).
In this article, we will attempt to describe some principal areas of interest in number theory and then to indicate what current research has shown to be an extraordinary application of this purest form of mathematics to several very current and very important applications.
Although this will in no way encompass all of the areas of development in number theory, we will introduce:
Divisibility: At the heart of number theory is the study of the multiplicative structure of the integers. What numbers divide (i.e., are factors of) other numbers? What are all the factors of a given number? Which numbers are prime?
Multiplicative functions: In analyzing the structure of numbers and their factors, one is led to the consideration of functions that are multiplicative: in other words, a function f is multiplicative if f(a·b) = f(a)·f(b) for all a and b.
Congruence: Two integers a and b are said to be congruent modulo n (where n is also an integer), written a ≡ b (mod n), if their difference is a multiple of n; alternatively, if a and b yield the same remainder when divided (integer division) by n. The study of the integers under congruence yields many interesting properties and is fundamental to number theory. The modular systems so developed are called modulo n arithmetic and are denoted either Z/nZ or Z_n.
Residues: In Z_n systems, solutions of equations (technically, congruences) of the form x^2 ≡ a (mod n) are often studied. In this instance, if there is a solution for x, a is called a quadratic residue of n. Otherwise, it is called a quadratic nonresidue of n.
Prime numbers: The prime numbers, with their special property that they have no positive divisors other than themselves and 1, have been of continuing interest to number theorists. In this section, we will see, among other things, an estimate of how many prime numbers there are less than some fixed number n.
Diophantine equations: The term Diophantine equation is used to apply to a family of algebraic equations in a number system such as Z or Q. A good deal of research in this subject has been directed at polynomial equations with integer or rational coefficients, the most famous of which being the class of equations x^n + y^n = z^n, the subject of Fermat's Last Theorem.
Elliptic curves: A final area of discussion in number theory will be the theory of elliptic curves. Although generally beyond the scope of this article, this theory has been so important in contemporary number theory that some discussion of the topic is in order. An elliptic curve represents the set of points in some appropriate number system that are the solutions to an equation of the form y^2 = Ax^3 + Bx^2 + Cx + D, where A, B, C, D ∈ Z.
Applications: The final section of this article will address several important applications of number theory in business, economics, engineering, and computing. It is remarkable that this, the purest form of mathematics, has found such important applications, often of theory that is hundreds of years old, to very current problems in the aforementioned fields!
It should perhaps be noted here that many of the results indicated below are given without proof. Indeed, because of space limitations, proofs will only be given when they are especially instructive. Several references are given later in which proofs of all of the results cited can be found.
DIVISIBILITY
Many questions arising in number theory have as their basis the study of the divisibility of the integers. An integer n is divisible by k if another integer m exists such that k·m = n. We indicate divisibility of n by k by writing k | n, and k ∤ n if n is not divisible by k.
A fundamental result is the division algorithm. Given m, n ∈ Z, with n > 0, unique integers c and d exist such that m = c·n + d and 0 ≤ d < n.
Equally as fundamental is the Euclidean algorithm.
Theorem 1. Let m, n ∈ Z, both m, n ≠ 0. A unique integer c exists satisfying c > 0; c | m; c | n; and if d | m and d | n, then d | c.
Proof. Consider the set D = {d > 0 : d = am + bn for some a, b ∈ Z}. Let c* be the smallest natural number in this set; say c* = am + bn. Then c* satisfies the given conditions.
Clearly c* > 0. Also c* | m because, by the division algorithm, s and t exist such that m = c*s + t with 0 ≤ t < c*. Thus t = m − c*s = m(1 − as) − n(bs). As t < c* and c* is the smallest positive number of this form, this implies that t = 0, and thus m = c*s, or c* | m. Similarly c* | n. Moreover, if d | m and d | n, then d | (am + bn) = c*. Finally, c* is unique because, if c also meets the conditions, then c* | c and c | c*, so c = c*.
The greatest common divisor of two integers m and n is the largest positive integer [denoted GCD(m,n)] such that GCD(m,n) | m and GCD(m,n) | n.
Theorem 2 (Greatest Common Divisor). The equation am + bn = r has integer solutions a, b if and only if GCD(m,n) | r.
Proof. Suppose am + bn = r with r ≠ 0. Because GCD(m,n) | m and GCD(m,n) | n, it follows that GCD(m,n) | (am + bn) = r. Conversely, suppose GCD(m,n) | r. By Theorem 1, there exist a′, b′ such that a′m + b′n = GCD(m,n). Thus, a″ = ra′/GCD(m,n) and b″ = rb′/GCD(m,n) represent an integer solution to a″m + b″n = r.
A related concept to the GCD is the least common multiple (LCM). It can be defined as the smallest positive integer that both m and n divide. It is also worth noting that GCD(m,n)·LCM(m,n) = m·n.
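As an illustration of the division-based Euclidean algorithm and the GCD/LCM identity above, here is a minimal Python sketch (the function names are our own):

def gcd(m, n):
    """Euclidean algorithm: repeatedly replace (m, n) by (n, m mod n)."""
    while n:
        m, n = n, m % n
    return abs(m)

def lcm(m, n):
    """Least common multiple via the identity GCD(m, n) * LCM(m, n) = m * n."""
    return abs(m * n) // gcd(m, n)

print(gcd(1024, 243))   # 1: 1024 and 243 are relatively prime
print(lcm(12, 18))      # 36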
Primes
A positive integer p > 1 is called prime if it is divisible only by itself and 1. An integer greater than 1 that is not prime is called composite. Two numbers m and n with the property that GCD(m,n) = 1 are said to be relatively prime (or sometimes coprime).
Here are two subtle results.
Theorem 3. Every integer greater than 1 is a prime or a product of primes.
Proof. Suppose otherwise, and let n be the least integer that is neither; thus, n is composite. Then n = ab with 1 < a, b < n. Thus, a and b are either primes or products of primes, and thus so is their product n, which is a contradiction.
Theorem 4. There are infinitely many primes.
Proof (Euclid). If not, let p_1, ..., p_n be a list of all the primes, ordered from smallest to largest. Then consider q = (p_1 p_2 ··· p_n) + 1. By the previous theorem,

q = p′_1 p′_2 ··· p′_k   (1)

for some primes p′_i. Now p′_1 must be one of p_1, ..., p_n, say p_j (as these were all the primes), but p_j ∤ q, because dividing q by p_j leaves a remainder of 1. This contradiction proves the theorem.
Theorem 5 (Unique Factorization). Every positive integer has a unique factorization into a product of primes.
Proof. Let

a = p_1^{a_1} ··· p_k^{a_k} = P = q_1^{b_1} ··· q_m^{b_m} = Q   (2)

be two factorizations of the same number. For each p_i, p_i | Q implies p_i | q_s^{b_s} for some s, 1 ≤ s ≤ m. Since p_i and q_s are both primes, p_i = q_s. Thus, k = m and Q = p_1^{b_1} ··· p_k^{b_k}. We only need to show that the a_i and b_i are equal. Divide both decompositions P and Q by p_i^{a_i}. If a_i ≠ b_i, one of the two decompositions will still contain the factor p_i and the other will not. This contradiction proves the theorem.
What has occupied many number theorists is a quest for a formula that would generate all of the infinitely many prime numbers.
For example, Marin Mersenne (1644) examined numbers of the form M_p = 2^p − 1, where p is a prime (3). He discovered that some of these numbers were, in fact, primes. Generally, the numbers he studied are known as Mersenne numbers, and those that are prime are called Mersenne primes. For example, M_2 = 2^2 − 1 = 3 is prime, as are M_3 = 2^3 − 1 = 7, M_5 = 2^5 − 1 = 31, and M_7 = 2^7 − 1 = 127. Alas, M_11 = 2^11 − 1 = 2047 = 23 · 89 is not.
Any natural number ≥ 2 ending in an even digit 0, 2, 4, 6, or 8 is divisible by 2. Any number ≥ 5 ending in 5 is divisible by 5. There are also convenient tests for divisibility by 3 and 9 (if the sum of the digits of a number is divisible by 3 or 9, then so is the number) and by 11 (if the sum of the digits in the even decimal places of a number, minus the sum of the digits in the odd decimal places, is divisible by 11, then so is the number).
Several other Mersenne numbers have also been determined to be prime. At the current writing, the list includes 42 numbers: M_n, where n = 2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127, 521, 607, 1279, 2203, 2281, 3217, 4253, 4423, 9689, 9941, 11213, 19937, 21701, 23209, 44497, 86243, 110503, 132049, 216091, 756839, 859433, 1257787, 1398269, 2976221, 3021377, 6972593, 13466917, 20996011, 24036583, 25964951.
With current technology, particularly mathematical software packages such as Mathematica or Maple, many of these computations can be done rather easily. For example, a one-line program in Mathematica, executing for 6.27 hours on a Pentium PC, was capable of verifying the primality of all M_n up to M_1000, and thus determining the first 14 Mersenne primes (up to M_607). It is not known whether an infinite number of the Mersenne numbers are prime.
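In practice, the primality of Mersenne numbers is usually checked with the Lucas-Lehmer test, which is not described in the text above; the following minimal Python sketch (function name ours) reproduces the small entries of the list:

def lucas_lehmer(p):
    """Lucas-Lehmer test: for an odd prime p, M_p = 2**p - 1 is prime
    iff s_(p-2) == 0 (mod M_p), where s_0 = 4 and s_k = s_(k-1)**2 - 2."""
    if p == 2:
        return True            # M_2 = 3 is prime
    m = (1 << p) - 1           # the Mersenne number 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

print([p for p in [2, 3, 5, 7, 11, 13, 17, 19, 31, 61, 89, 107, 127]
       if lucas_lehmer(p)])
# [2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127] -- 11 is correctly excluded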
It is known that m^c − 1 is composite if m > 2 or if c is composite; for if c = de, we have

m^c − 1 = (m^e − 1)(m^{e(d−1)} + m^{e(d−2)} + ··· + m^e + 1)   (3)

We can also show that m^c + 1 is composite if m is odd or if c has an odd factor. Certainly if m is odd, then m^c is odd, and m^c + 1 is even and thus composite. If c = d(2k + 1), then

m^c + 1 = (m^d + 1)(m^{2dk} − m^{d(2k−1)} + m^{d(2k−2)} − ··· + 1)   (4)

and m^d + 1 > 1.
Another set of numbers with interesting primality properties are the Fermat numbers, F_n = 2^{2^n} + 1. Fermat's conjecture was that they were all primes. He was able to verify this for F_1 = 5, F_2 = 17, F_3 = 257, and F_4 = 2^16 + 1 = 65537. But then, Euler showed that
Theorem 6. 641 | F_5.
Proof. 641 = 2^4 + 5^4 = 5·2^7 + 1; thus 2^4 = 641 − 5^4. Since

2^32 = 2^4 · 2^28 = (641 − 5^4) · 2^28 = 641·2^28 − (5·2^7)^4 = 641·2^28 − (641 − 1)^4 = 641k − 1

for some integer k, it follows that F_5 = 2^32 + 1 = 641k.
To this date, no Fermat numbers F_n with n > 4 have been shown to be prime. It has been determined that the other Fermat numbers through F_20 are composite.
MULTIPLICATIVE FUNCTIONS
Functions that preserve the multiplicative structure of the number systems studied in number theory are of particular interest, not only intrinsically, but also for their use in determining other relationships among numbers.
A function f defined on the integers that takes values in a set closed under multiplication is called a number theoretic function. If the function preserves multiplication for numbers that are relatively prime [f(m)·f(n) = f(m·n) whenever GCD(m,n) = 1], it is called a multiplicative function; it is called completely multiplicative if the restriction GCD(m,n) = 1 can be lifted.
Consider any number theoretic function f, and define a new function F by the sum of the values of f taken over the divisors of n:

F(n) = Σ_{d|n} f(d) = Σ_{d|n} f(n/d),   the latter being the former sum in reverse order   (5)
Theorem 7. If f is a multiplicative function, then so is F.
Proof. Let GCD(m, n) = 1; then every d | mn can be uniquely expressed as d = gh, where g | m, h | n, and GCD(g,h) = 1. Letting h_1, ..., h_k denote the divisors of n,

F(mn) = Σ_{d|mn} f(d) = Σ_{g|m} Σ_{h|n} f(gh) = Σ_{g|m} Σ_{h|n} f(g) f(h)
      = Σ_{g|m} f(g) f(h_1) + ··· + Σ_{g|m} f(g) f(h_k)
      = F(m) Σ_{h|n} f(h)
      = F(m) F(n)   (6)
Two multiplicative functions of note are the divisor function τ(n), defined as the number of positive divisors of n, and the function σ(n), defined as the sum of the positive divisors of n.
Theorem 8. τ and σ are multiplicative.
Proof. Both τ and σ are the uppercase functions for the obviously multiplicative functions 1(n) = 1 for all n and i(n) = n for all n. In other words,

τ(n) = Σ_{d|n} 1(d)   and   σ(n) = Σ_{d|n} i(d)   (7)

In order to compute τ and σ on a prime power p^n, note that

τ(p^n) = 1 + n   (because the divisors of p^n are 1, p, p^2, ..., p^n)   (8)

σ(p^n) = 1 + p + p^2 + ··· + p^n = (p^{n+1} − 1)/(p − 1)   (9)

Thus, for any n = p_1^{n_1} p_2^{n_2} ··· p_k^{n_k},

τ(n) = Π_{i=1}^{k} (1 + n_i)   and   σ(n) = Π_{i=1}^{k} (p_i^{n_i + 1} − 1)/(p_i − 1)   (10)
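The closed forms in (10) translate directly into code. The following Python sketch (function names ours) factors n by trial division and then applies the product formulas:

def factorize(n):
    """Return the prime factorization of n as a dict {p: exponent} (trial division)."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def tau(n):
    """Number of positive divisors, via tau(n) = prod of (1 + n_i)."""
    result = 1
    for e in factorize(n).values():
        result *= (1 + e)
    return result

def sigma(n):
    """Sum of positive divisors, via sigma(n) = prod of (p^(n_i+1) - 1)/(p - 1)."""
    result = 1
    for p, e in factorize(n).items():
        result *= (p ** (e + 1) - 1) // (p - 1)
    return result

print(tau(28), sigma(28))   # 6 56 -- sigma(28) = 2*28, so 28 is perfect (see below)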
In very ancient times, numbers were sometimes considered to have mystical properties. Indeed, the Greeks identified numbers that they called perfect: numbers that are exactly the sums of all their proper divisors; in other words, numbers for which σ(n) = 2n. A few examples are 6 = 1 + 2 + 3, 28 = 1 + 2 + 4 + 7 + 14, and 496 = 1 + 2 + 4 + 8 + 16 + 31 + 62 + 124 + 248. It is not known whether there are any odd perfect numbers, or whether an infinite number of perfect numbers exists.
Theorem 9. n is even and perfect if and only if n = 2^{p−1}(2^p − 1), where both p and 2^p − 1 are primes.
In other words, there is one perfect number for each Mersenne prime.
Another multiplicative function of considerable importance is the Möbius function μ:
μ(1) = 1; μ(n) = 0 if n has a square factor; μ(p_1 p_2 ··· p_k) = (−1)^k if p_1, ..., p_k are distinct primes.
Theorem 10. μ is multiplicative, and its uppercase function Σ_{d|n} μ(d) is 0 unless n = 1, when it takes the value 1.
Theorem 11 (Möbius Inversion Formula). If f is a number theoretic function and F(n) = Σ_{d|n} f(d), then

f(n) = Σ_{d|n} F(d) μ(n/d) = Σ_{d|n} F(n/d) μ(d)   (11)

Proof.

Σ_{d|n} μ(d) F(n/d)
  = Σ_{d_1 d_2 = n} μ(d_1) F(d_2)   (taking pairs d_1 d_2 = n)   (12)
  = Σ_{d_1 d_2 = n} [ μ(d_1) Σ_{d|d_2} f(d) ]   (definition of F)   (13)
  = Σ_{d_1 d | n} μ(d_1) f(d)   (multiplying the terms in brackets)   (14)
  = Σ_{d|n} f(d) Σ_{d_1 | (n/d)} μ(d_1)   (collecting multiples of f(d))   (15)
  = f(n)

where the last step follows from Theorem 10, because the inner sum vanishes unless n/d = 1, that is, unless d = n.
Theorem 12. If F is a multiplicative function, then so is f.
A final multiplicative function of note is the Euler function φ(n). It is defined, for n, to be the number of positive integers less than n and relatively prime to n. For primes p, φ(p) = p − 1.
Theorem 13. φ is multiplicative.
CONGRUENCE
The study of congruence leads to the definition of new algebraic systems derived from the integers. These systems, called residue systems, are interesting in and of themselves, but they also have properties that allow for important applications.
Consider integers a, b, n with n > 0. Note that a and b could be positive or negative. We will say that a is congruent to b, modulo n [written a ≡ b (mod n)], if and only if n | (a − b). Alternatively, a and b yield the same remainder when divided by n. In such a congruence, n is called the modulus and b is called a residue of a.
An algebraic system can be defined by considering classes of all numbers satisfying a congruence with fixed modulus. It is observed that congruence is an equivalence relation and that the definition of addition and multiplication of integers can be extended to the equivalence classes. Thus, for example, in the system with modulus 5 (also called mod 5 arithmetic), the equivalence classes are {..., −5, 0, 5, ...}, {..., −4, 1, 6, ...}, {..., −3, 2, 7, ...}, {..., −2, 3, 8, ...}, and {..., −1, 4, 9, ...}. It is customary to denote each class by the (unique) representative of the class between 0 and n − 1. Thus the five classes in mod 5 arithmetic are denoted 0, 1, 2, 3, 4. Formally, the mod n system can be defined as the algebraic quotient of the integers Z and the subring defined by the multiples of n (nZ). Thus, the mod n system is often written Z/nZ. An alternative, and more compact, notation is Z_n.
Addition and multiplication are defined naturally in Z_n. Under addition, every Z_n forms an Abelian group [that is, the addition operation is closed, associative, and commutative; 0 is an identity; and each element has an additive inverse: for any a, the element b = n − a always yields a + b ≡ 0 (mod n)].
In the multiplicative structure, however, only closure, associativity, commutativity, and the identity (1) are assured. It is not necessarily the case that each element will have an inverse. In other words, the congruence ax ≡ 1 (mod n) will not always have a solution.
Technically, an algebraic system with the properties described above is called a commutative ring with identity. If it is also known that each (nonzero) element of Z_n has an inverse, the system is called a field.
Theorem 14. Let a, b, n be integers with n > 0. Then ax ≡ 1 (mod n) has a solution if and only if GCD(a,n) = 1. If x_0 is a solution, then there are exactly GCD(a,n) solutions, given by {x_0, x_0 + n/GCD(a,n), x_0 + 2n/GCD(a,n), ..., x_0 + (GCD(a,n) − 1)·n/GCD(a,n)}.
Proof. This theorem is a restatement of Theorem 2.
For an element a in Z_n to have an inverse (alternatively, to be a unit), by the above theorem, it is necessary and sufficient for a and n to be relatively prime; i.e., GCD(a,n) = 1. Thus, by the earlier definition of the Euler function, the number of units in Z_n is φ(n).
The set of units in Z_n is denoted (Z_n)*, and it is easily verified that this set forms an Abelian group under multiplication. As an example, consider Z_12, or Z/12Z. Note that (Z_12)* = {1, 5, 7, 11}, and that each element is its own inverse: 1·1 ≡ 5·5 ≡ 7·7 ≡ 11·11 ≡ 1 (mod 12). Furthermore, closure is observed because 5·7 ≡ 11, 5·11 ≡ 7, and 7·11 ≡ 5 (mod 12).
Theorem 15. If p is a prime number, then Z_p is a field with p elements. If n is composite, Z_n is not a field.
Proof. If p is a prime number, every nonzero element a of Z_p is relatively prime to p; that is, GCD(a,p) = 1. Thus ax ≡ 1 (mod p) always has a solution. As every nonzero element has an inverse, Z_p is a field. If n is composite, there are integers 1 < k, l < n such that kl = n. Thus, kl ≡ 0 (mod n), and so it is impossible that k could have an inverse; otherwise, l = (k^{−1}k)·l = k^{−1}·(kl) = k^{−1}·0 = 0 (mod n), which contradicts the assumption that 1 < l < n.
One of the most important results of elementary number theory is the so-called Chinese Remainder Theorem (4). It is given this name because a version was originally derived by the Chinese mathematician Sun Tse almost 2000 years ago. The Chinese Remainder Theorem establishes a method of solving simultaneously a system of linear congruences in several modular systems.
Theorem 16 (Chinese Remainder). Given a system x ≡ a_i (mod n_i), i = 1, 2, ..., m, suppose that, for all i ≠ j, GCD(n_i, n_j) = 1. Then there is a unique common solution modulo n = n_1 n_2 ··· n_m.
Proof (By construction). Let n_i′ = n/n_i, i = 1, ..., m. Note that GCD(n_i, n_i′) = 1. Thus, an integer n_i″ exists such that n_i′ n_i″ ≡ 1 (mod n_i). Then

x = a_1 n_1′ n_1″ + a_2 n_2′ n_2″ + ··· + a_m n_m′ n_m″ (mod n)   (16)

is the solution. As n_i | n_j′ if i ≠ j, all terms except the ith vanish modulo n_i, so x ≡ a_i n_i′ n_i″ ≡ a_i (mod n_i). The solution is also unique: if both x and y are common solutions, then x − y ≡ 0 (mod n_i) for every i, and hence x − y ≡ 0 (mod n).
An interesting consequence of this theorem is that there is a 1-1 correspondence, preserved by addition and multiplication, between integers modulo n and m-tuples of integers modulo the n_i (5). Consider {n_1, n_2, n_3, n_4} = {7, 11, 13, 17}, and n = 17017. Then

95 ↔ (a_1, a_2, a_3, a_4) = (95 mod 7, 95 mod 11, 95 mod 13, 95 mod 17) = (4, 7, 4, 10);   also 162 ↔ (1, 8, 6, 9)

Performing addition and multiplication tuple-wise:

(4, 7, 4, 10) + (1, 8, 6, 9) = (5 mod 7, 15 mod 11, 10 mod 13, 19 mod 17) = (5, 4, 10, 2)
(4, 7, 4, 10) · (1, 8, 6, 9) = (4 mod 7, 56 mod 11, 24 mod 13, 90 mod 17) = (4, 1, 11, 5)

Now verify that 95 + 162 = 257 and 95 · 162 = 15,390 are represented by (5, 4, 10, 2) and (4, 1, 11, 5) by reducing each number mod n_1, ..., n_4.
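The constructive proof of Theorem 16 can be written directly as code. A minimal Python sketch (function name ours; it assumes the moduli are pairwise relatively prime and uses pow(x, -1, m), available in Python 3.8+, for the modular inverse):

def crt(residues, moduli):
    """Chinese Remainder Theorem, following the construction in Theorem 16:
    x = sum of a_i * (n/n_i) * ((n/n_i)^(-1) mod n_i), reduced mod n."""
    n = 1
    for n_i in moduli:
        n *= n_i
    x = 0
    for a_i, n_i in zip(residues, moduli):
        n_i_prime = n // n_i
        x += a_i * n_i_prime * pow(n_i_prime, -1, n_i)
    return x % n

# Reconstruct 95 and 257 from their residue tuples in the example above:
print(crt([4, 7, 4, 10], [7, 11, 13, 17]))   # 95
print(crt([5, 4, 10, 2], [7, 11, 13, 17]))   # 257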
Another series of important results involving the products of elements in modular systems are the theorems of Euler, Fermat, and Wilson. Fermat's Theorem, although extremely important, is very easily proved; thus, it is sometimes called the Little Fermat Theorem, in contrast to the famous Fermat's Last Theorem described earlier.
Theorem 17 (Euler). If GCD(a,n) = 1, then a^{φ(n)} ≡ 1 (mod n).
Theorem 18 (Fermat). If p is prime, then a^p ≡ a (mod p).
Proof (of Euler's Theorem). Suppose A = {a_1, ..., a_{φ(n)}} is a list of the set of units in Z_n. By definition, each of the a_i has an inverse a_i^{−1}. Now consider the product b = a_1 a_2 ··· a_{φ(n)}. It also has an inverse, in particular b^{−1} = a_1^{−1} a_2^{−1} ··· a_{φ(n)}^{−1}. Choose any of the units; suppose it is a. Now consider the set A′ = {a a_1, a a_2, ..., a a_{φ(n)}}. We need to show that, as a set, A′ = A. It is sufficient to show that the a a_i are all distinct: as there are φ(n) of them, and they are all units, they then represent all of the elements of A.
Suppose that a a_i ≡ a a_j for some i ≠ j. Then, as a is a unit, we can multiply by a^{−1}, yielding (a^{−1}a) a_i ≡ (a^{−1}a) a_j, or a_i ≡ a_j (mod n), which is a contradiction. Thus, the a a_i are all distinct and A′ = A.
Now compute

Π_{i=1}^{φ(n)} (a a_i) ≡ b (mod n)   (because A′ = A)
a^{φ(n)} b ≡ b (mod n)
a^{φ(n)} b b^{−1} ≡ b b^{−1} (mod n)   (multiplying by b^{−1})
a^{φ(n)} ≡ 1 (mod n)   (18)

As φ(p) = p − 1 for p a prime, Fermat's Theorem is a direct consequence of Euler's Theorem.
Although the Chinese Remainder Theorem gives a solution for linear congruences, one would also like to consider nonlinear or higher degree polynomial congruences. For polynomials in one variable, the most general case is

f(x) ≡ 0 (mod n)   (19)

If n = p_1^{a_1} ··· p_k^{a_k} is the factorization of n, then the Chinese Remainder Theorem assures us that (19) has a solution if and only if each of

f(x) ≡ 0 (mod p_i^{a_i}),   i = 1, 2, ..., k   (20)

has a solution.
Other important results in the theory of polynomial congruences are as follows.
Theorem 19 (Lagrange). If f(x) is a nonzero polynomial of degree n whose coefficients are elements of Z_p for a prime p, then f(x) cannot have more than n roots.
Theorem 20 (Chevalley). If f(x_1, ..., x_n) is a polynomial with degree less than n, then the congruence

f(x_1, x_2, ..., x_n) ≡ 0 (mod p)

has either zero or at least two solutions.
The Lagrange Theorem can be used to demonstrate the result of Wilson noted above.
Theorem 21 (Wilson). If p is a prime, then (p − 1)! ≡ −1 (mod p).
Proof. If p = 2, the result is obvious. For p an odd prime, let

f(x) = x^{p−1} − (x − 1)(x − 2)···(x − (p − 1)) − 1   (17)

Consider any number 1 ≤ k ≤ p − 1. Substituting k for x causes the term (x − 1)(x − 2)···(x − (p − 1)) to vanish; also, by Fermat's theorem, k^{p−1} ≡ 1 (mod p). Thus, f(k) ≡ 0 (mod p). But f has degree less than p − 1, and so by Lagrange's theorem, f(x) must be identically zero modulo p, which means all of its coefficients must be divisible by p. The constant coefficient is

−(p − 1)! − 1 (mod p)

and thus

(p − 1)! ≡ −1 (mod p)
QUADRATIC RESIDUES
Having considered general polynomial congruences, we now restrict consideration to quadratics. The study of quadratic residues leads to some useful techniques as well as to important and perhaps surprising results.
The most general quadratic congruence (in one variable) is of the form ax^2 + bx + c ≡ 0 (mod m). Such a congruence can always be reduced to a simpler form. For example, as indicated in the previous section, by the Chinese Remainder Theorem we can assume the modulus is a prime power. As in the case p = 2 we can easily enumerate the solutions, we will henceforth consider only odd primes. Finally, we can use the technique of completing the square from elementary algebra to transform the general quadratic into one of the form x^2 ≡ a (mod p).
If p ∤ a, then if x^2 ≡ a (mod p) is soluble, a is called a quadratic residue mod p; if not, a is called a quadratic nonresidue mod p.
Theorem 22. Exactly one half of the integers a, 1 ≤ a ≤ p − 1, are quadratic residues mod p.
Proof. Consider the set of mod p values QR = {1^2, 2^2, ..., ((p − 1)/2)^2}. Each of these values is a quadratic residue. If the values in QR are all distinct, there are at least (p − 1)/2 quadratic residues. Suppose two of these values, say t and u with t ≠ u, were to have t^2 ≡ u^2 (mod p). Then t^2 − u^2 ≡ 0 (mod p) implies t + u ≡ 0 (mod p) or t − u ≡ 0 (mod p). Since t and u are distinct, the second case is not possible; and since t and u are both at most (p − 1)/2, we have t + u < p, so neither is the first case. Thus the values in QR are distinct, and there are at least (p − 1)/2 quadratic residues.
If x_0 solves x^2 ≡ a (mod p), then so does p − x_0, since (p − x_0)^2 = p^2 − 2px_0 + x_0^2 ≡ x_0^2 ≡ a (mod p), and p − x_0 ≢ x_0 (mod p). Thus we have found (p − 1)/2 additional elements of Z_p that square to elements of QR, and therefore p − 1 elements overall. Thus there can be no more quadratic residues outside of QR, so the result is proved.
An important number theoretic function used to evaluate quadratic residues is the Legendre symbol. For p an odd prime, the Legendre symbol for a, written (a/p), is

(a/p) = 1 if a is a quadratic residue mod p
      = 0 if p | a
      = −1 if a is a quadratic nonresidue mod p

One method of evaluating the Legendre symbol uses Euler's criterion. If p is an odd prime and GCD(a,p) = 1, then

(a/p) = 1 if and only if a^{(p−1)/2} ≡ 1 (mod p)

Equivalently, (a/p) ≡ a^{(p−1)/2} (mod p).
Here are some other characteristics of the Legendre symbol.
Theorem 23. (i) (ab/p) = (a/p)(b/p); (ii) if a ≡ b (mod p), then (a/p) = (b/p); (iii) (a^2/p) = 1 and (1/p) = 1; (iv) (−1/p) = (−1)^{(p−1)/2}.
Suppose we want to solve x^2 ≡ 518 (mod 17). Then compute (518/17) = (8/17) = (2/17). But (2/17) = 1, since 6^2 = 36 ≡ 2 (mod 17). Thus, x^2 ≡ 518 (mod 17) is soluble.
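Euler's criterion gives a direct way to compute the Legendre symbol. A minimal Python sketch (function name ours) reproduces the example just given:

def legendre(a, p):
    """Legendre symbol (a/p) for an odd prime p, via Euler's criterion:
    (a/p) is congruent to a^((p-1)/2) (mod p)."""
    a %= p
    if a == 0:
        return 0
    r = pow(a, (p - 1) // 2, p)   # fast modular exponentiation
    return 1 if r == 1 else -1

print(legendre(518, 17))   # 1: x^2 = 518 (mod 17) is soluble
print(legendre(3, 17))     # -1: 3 is a quadratic nonresidue mod 17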
Computation of the Legendre symbol is aided by the following results. First, define the absolute least residue of a modulo p as the representative of the equivalence class of a mod p that has the smallest absolute value.
Theorem 24 (Gauss' Lemma). Let GCD(a,p) = 1. If d is the number of elements of {a, 2a, ..., ((p − 1)/2)a} whose absolute least residues modulo p are negative, then

(a/p) = (−1)^d

Theorem 25. 2 is a quadratic residue (respectively, quadratic nonresidue) of primes of the form 8k ± 1 (respectively, 8k ± 3). That is,

(2/p) = (−1)^{(p^2 − 1)/8}

Theorem 26. (i) If k > 1, p = 4k + 3, and p is prime, then 2p + 1 is also prime if and only if 2^p ≡ 1 (mod 2p + 1).
(ii) If 2p + 1 is prime, then 2p + 1 | M_p, the pth Mersenne number, and M_p is composite.
A concluding result for the computation of the Legendre symbol is one that, by itself, is one of the most famous and surprising results in all of mathematics. It is called Gauss' Law of Quadratic Reciprocity. What makes it so astounding is that it manages to relate prime numbers and their residues that seemingly bear no relationship to one another (6).
Suppose that we have two odd primes, p and q. Then the Law of Quadratic Reciprocity relates the computation of their Legendre symbols; that is, it determines the quadratic residue status of each prime with respect to the other.
The proof, although derived from elementary principles, is long and would not be possible to reproduce here. Several sources for the proof are listed below.
Theorem 27 (Gauss' Law of Quadratic Reciprocity). (p/q)(q/p) = (−1)^{(p−1)(q−1)/4}.
A consequence of the Law of Quadratic Reciprocity is as follows.
Theorem 28. Let p and q be distinct odd primes, and a ≥ 1. If p ≡ q (mod 4a), then (a/p) = (a/q).
An extension of the Legendre symbol is the Jacobi symbol. The Legendre symbol is defined only for odd primes p. By a natural extension, the Jacobi symbol, also denoted (a/n), is defined for any odd n > 0, assuming that the prime factorization of n is p_1 ··· p_k, by

(a/n) [Jacobi symbol] = (a/p_1)(a/p_2)···(a/p_k) [Legendre symbols]
PRIME NUMBERS
The primes themselves have been a subject of much inquiry in number theory. We have observed earlier that there are an infinite number of primes, and that they are most useful in finite fields and modular arithmetic systems.
One subject of interest has been the development of a function to approximate the frequency of occurrence of primes. This function is usually called π(n); it denotes the number of primes less than or equal to n.
In addition to establishing various estimates for π(n), concluding with the so-called Prime Number Theorem, we will also state a number of famous unproven conjectures involving prime numbers.
An early estimate for π(n), by Chebyshev, follows.
Theorem 29 (Chebyshev). If n > 1, then n/(8 log n) < π(n) < 6n/log n.
The Chebyshev result tells us that, up to a constant factor, the number of primes is of the order of n/log n. In addition to the frequency of occurrence of primes, the greatest gap between successive primes is also of interest (7).
Theorem 30 (Bertrand's Postulate). If n ≥ 2, there is a prime p between n and 2n.
Two other estimates for series of primes are as follows.
Theorem 31. Σ_{p ≤ n} (log p)/p = log n + O(1).
Theorem 32. Σ_{p ≤ n} 1/p = log log n + a + O(1/log n), for a constant a.
Another very remarkable result involving the generation of primes is from Dirichlet.
Theorem 33 (Dirichlet). Let a and b be fixed positive integers such that GCD(a,b) = 1. Then there are an infinite number of primes in the sequence {a + bn | n = 1, 2, ...}.
Finally, we have the best known approximation to the number of primes (8).
Theorem 34 (Prime Number Theorem). π(n) ~ n/log n.
The conjectures involving prime numbers are legion, and even some of the simplest ones have proven elusive for mathematicians. A few examples are the Goldbach conjecture, the twin primes conjecture, the interval problem, the Dirichlet series problem, and the Riemann hypothesis.
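The quality of the n/log n approximation in the Prime Number Theorem is easy to observe numerically. The following Python sketch (function name ours) counts primes with a simple sieve and compares the count against n/log n; the ratio approaches 1 only slowly as n grows.

from math import log

def prime_pi(n):
    """Count the primes <= n with a basic Sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, n + 1, p)))
    return sum(sieve)

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, prime_pi(n), round(n / log(n)))
# pi(10^3) = 168, pi(10^4) = 1229, pi(10^5) = 9592, pi(10^6) = 78498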
Twin Primes. Two primes p and q are called twins if q = p + 2. Examples are (p,q) = (5,7), (11,13), (17,19), (521,523). If π_2(n) counts the number of twin primes less than n, the twin prime conjecture is that π_2(n) → ∞ as n → ∞. It is known, however, that there are infinitely many pairs of numbers (p,q), where p is prime, q = p + 2, and q has at most two factors.
Goldbach Conjecture. As stated, this conjecture is that every even number > 2 is the sum of two primes. What is known is that every large even number can be expressed as p + q, where p is prime and q has at most two factors. Also, it is known that every large odd integer is the sum of three primes.
Interval Problems. It was demonstrated earlier that there is always a prime number between n and 2n. It is not known, however, whether the same is true for other intervals, for example, between n^2 and (n + 1)^2.
Dirichlet Series. In Theorem 33, a series containing an infinite number of primes was demonstrated. It was not known whether there are other series that have a greater frequency of prime occurrences, at least until recent research by Friedlander and Iwaniec (9), who showed that series of the form {a^2 + b^4} not only contain an infinite number of primes, but also that the primes occur more rapidly than in the Dirichlet series.
Riemann Hypothesis. Although the connection to prime numbers is not immediately apparent, the Riemann hypothesis has been an extremely important pillar in the theory of primes. It states that, for the complex function

ζ(s) = Σ_{n=1}^{∞} n^{−s},   s = σ + it ∈ C,

there are zeros at s = −2, −4, −6, ..., and no more zeros outside of the critical strip 0 ≤ σ ≤ 1. The Riemann hypothesis states that all zeros of ζ in the critical strip lie on the line s = 1/2 + it.
Examples of important number theoretic problems whose answer depends on the Riemann hypothesis are (1) the existence of an algorithm to find a quadratic nonresidue mod p in polynomial time and (2) the fact that, if n is composite, there is at least one small b for which neither b^t ≡ 1 (mod n) nor b^{2^r t} ≡ −1 (mod n) for any r (writing n − 1 = 2^s t with t odd). This latter example is important in algorithms needed to find large primes (10).
DIOPHANTINE EQUATIONS
The term Diophantine equation is used to apply to a family of algebraic equations in a number system such as Z or Q. To date, we have certainly observed many examples of Diophantine equations. A good deal of research in this subject has been directed at polynomial equations with integer or rational coefficients, the most famous of which being the class of equations x^n + y^n = z^n, the subject of Fermat's Last Theorem.
One result in this study, from Legendre, is as follows.
Theorem 35. Let a, b, c ∈ Z be such that (i) a > 0 and b, c < 0; (ii) a, b, and c are square-free; and (iii) GCD(a,b) = GCD(b,c) = GCD(a,c) = 1. Then

ax^2 + by^2 + cz^2 = 0

has a nontrivial integer solution if and only if −ab is a quadratic residue of c, −bc is a quadratic residue of a, and −ca is a quadratic residue of b.
Example. Consider the equation 3x^2 − 5y^2 − 7z^2 = 0. With a = 3, b = −5, and c = −7, apply Theorem 35. Note that −ab = 15 ≡ 1 (mod 7), −ca = 21 ≡ 1 (mod 5), and −bc = −35 ≡ 1 (mod 3). Thus, all three products are quadratic residues, and the equation has an integer solution. Indeed, the reader may verify that x = 3, y = 2, and z = 1 is one such solution.
Another result, which consequently proves Fermat's Last Theorem in the case n = 4, follows. (Incidentally, it has also long been known that Fermat's Last Theorem holds for n = 3.)
Theorem 36. x^4 + y^4 = z^2 has no nontrivial solutions in the integers.
A final class of Diophantine equations is known generically as Mordell's equation: y^2 = x^3 + k. In general, solutions to Mordell's equation in the integers are not known. Two particular nonexistence results are as follows.
Theorem 37. y^2 = x^3 + m^3 − jn^2 has no solution in the integers if
(i) j = 4, m ≡ 3 (mod 4), and p ≢ 3 (mod 4) whenever p | n; or
(ii) j = 1, m ≡ 2 (mod 4), n is odd, and p ≢ 3 (mod 4) whenever p | n.
Theorem 38. y^2 = x^3 + 2a^3 − 3b^2 has no solution in the integers if ab ≠ 0, a ≢ 1 (mod 3), 3 ∤ b, a is odd if b is even, and p = t^2 + 27u^2 is soluble in integers t and u whenever p | a and p ≡ 1 (mod 3).
ELLIPTIC CURVES
Many recent developments in number theory have come as the byproduct of the extensive research performed in a branch of mathematics known as elliptic curve theory (11).
An elliptic curve represents the set of points in some appropriate number system that are the solutions to an equation of the form y^2 = Ax^3 + Bx^2 + Cx + D, where A, B, C, D ∈ Z.
A major result in this theory is the theorem of Mordell and Weil: if K is any algebraic field, and C(K) is the set of K-rational points on an elliptic curve, then C(K) forms a finitely generated Abelian group.
APPLICATIONS
In the modern history of cryptology, including the related area of authentication or digital signatures, dating roughly from the beginning of the computer era, there have been several approaches that have had enormous importance for business, government, engineering, and computing. In order of their creation, they are (1) the Data Encryption Standard, or DES (1976); (2) the Rivest-Shamir-Adelman Public Key Cryptosystem, or RSA (1978); (3) the Elliptic Curve Public Key Cryptosystem, or ECC (1993); and (4) Rijndael, the Advanced Encryption Standard, or AES (2001). RSA, DSS (defined below), and ECC rely heavily on techniques described in this article.
Data Encryption Standard (DES)
DES was developed and published as a U.S. national standard for encryption in 1976 (12). It was designed to transform 64-bit messages into 64-bit ciphers using 56-bit keys. Its structure is important to understand as the model for many other systems such as AES. In particular, the essence of DES is a family of nonlinear transformations known as the S-boxes. The S-box design criteria were never published, however, and so there was often reluctance in the acceptance of the DES. By the mid-1990s, moreover, effective techniques to cryptanalyze the DES had been developed, and so research began to find better approaches.
Historically, cryptology required that both the sending and the receiving parties possess exactly the same information about the cryptosystem. Consequently, the information that they both must possess must be communicated in some way. Encryption methods with this requirement, such as DES and AES, are also referred to as private key or symmetric key cryptosystems.
The Key Management Problem
Envision the development of a computer network consisting of 1000 subscribers, where each pair of users requires a separate key for private communication. (It might be instructive to think of the complete graph on n vertices, representing the users, with the n(n − 1)/2 edges corresponding to the need for key exchanges. Thus, in the 1000-user network, approximately 500,000 keys must be exchanged in some way, other than by the network!)
In considering this problem, Diffie and Hellman asked the following question: Is it possible to consider that a key might be broken into two parts, k = (k_p, k_s), such that only k_p is necessary for encryption, while the entire key k = (k_p, k_s) would be necessary for decryption (13)?
If it were possible to devise such a cryptosystem, then the following benefits would accrue. First of all, as the information necessary for encryption does not, a priori, provide an attacker with enough information to decrypt, there is no longer any reason to keep it secret. Consequently, k_p can be made public to all users of the network. A cryptosystem devised in this way is called a public key cryptosystem (PKC).
Furthermore, the key distribution problem becomes much more manageable. Consider the hypothetical network of 1000 users, as before. The public keys can be listed in a centralized directory available to everyone, because the rest of the key is not used for encryption; the secret key does not have to be distributed but remains with the creator of the key; and finally, both parts, public and secret, must be used for decryption.
Therefore, if we could devise a PKC, it would certainly have most desirable features.
In 1978, Rivest, Shamir, and Adelman described a public-key cryptosystem based on principles of number theory, with the security being dependent on the inherent difficulty of factoring large integers.
Factoring
Factoring large integers is in general a very difficult problem (14). The best known asymptotic running time today is O(e^{(64n/9)^{1/3} (log n)^{2/3}}) for an n-bit number. More concretely, in early 2005, a certain 200-digit number was factored into two 100-digit primes using the equivalent of 75 years of computing time on a 2.2-GHz AMD Opteron processor.
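For numbers far smaller than the 200-digit example above, simple methods already illustrate the task. The following Python sketch of Pollard's rho method (not the number field sieve used for record factorizations; function name ours) finds a nontrivial factor of a composite n:

from math import gcd
import random

def pollard_rho(n):
    """Pollard's rho: iterate x -> x^2 + c (mod n) with two pointers and look
    for a difference sharing a factor with n; expected time roughly n^(1/4)."""
    if n % 2 == 0:
        return 2
    while True:
        c = random.randrange(1, n)
        x = y = random.randrange(2, n)
        d = 1
        while d == 1:
            x = (x * x + c) % n          # tortoise: one step
            y = (y * y + c) % n          # hare: two steps
            y = (y * y + c) % n
            d = gcd(abs(x - y), n)
        if d != n:                       # d == n means this c failed; retry
            return d

n = 10403                                # 101 * 103
d = pollard_rho(n)
print(d, n // d)                         # a nontrivial factorization of n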
Rivest-Shamir-Adelman Public Key Cryptosystem (RSA)
The basic idea of Rivest, Shamir, and Adelman was to take two large prime numbers, p and q (for example, p and q each being ~10^200), and to multiply them together to obtain n = pq. n is published. Furthermore, two other numbers, d and e, are generated, where d is chosen randomly, but relatively prime to the Euler function φ(n), in the interval [max(p,q) + 1, n − 1]. As we have observed, φ(n) = (p − 1)(q − 1) (15).
Key Generation
1. Choose two 200-digit prime numbers randomly from the set of all 200-digit prime numbers. Call these p and q.
2. Compute the product n = pq. n will have approximately 400 digits.
3. Choose d randomly in the interval [max(p,q) + 1, n − 1], such that GCD(d, φ(n)) = 1.
4. Compute e = d^{−1} (mod φ(n)).
5. Publish n and e. Keep p, q, and d secret.
Encryption
1. Divide the message into blocks such that the bit-string of each block can be viewed as a 400-digit number. Call each block m.
2. Compute and send c = m^e (mod n).
Decryption
1. Compute c^d = (m^e)^d = m^{ed} = m^{kφ(n)+1} = m^{kφ(n)}·m = m (mod n).
Note that the result m^{kφ(n)} ≡ 1 (mod n) used in the preceding line is Euler's Theorem (Theorem 17).
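A toy-sized run of the whole scheme in Python may make the arithmetic concrete (the primes are far too small to be secure and are chosen only so the numbers are readable; the variable names are ours):

# Toy RSA with tiny primes (illustration only; real keys use ~100-digit primes or larger).
p, q = 61, 53
n = p * q                      # 3233, published
phi = (p - 1) * (q - 1)        # 3120, kept secret
d = 2753                       # chosen relatively prime to phi(n)
e = pow(d, -1, phi)            # e = d^(-1) mod phi(n) (Python 3.8+); here e = 17
m = 65                         # a message block, 0 <= m < n
c = pow(m, e, n)               # encryption: c = m^e mod n
assert pow(c, d, n) == m       # decryption: c^d = m^(ed) = m (mod n), by Euler's theorem
print(n, e, c)                 # 3233 17 2790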
Although the correctness of RSA is established and its security widely accepted, there are several questions about the computational efficiency of RSA that should be raised.
Is it Possible to Find Prime Numbers of 200 Decimal Digits in a Reasonable Period of Time? The Prime Number Theorem (Theorem 34) assures us that, after a few hundred random selections, we will probably find a prime of 200 digits.
We can never actually be certain that we have a prime without knowing the answer to the Riemann hypothesis; instead we create a test (the Solovay-Strassen test) such that, if the prime candidate passes, we assume the probability that p is not a prime is very low (16). We choose a number (say 100) of integers a_i at random, each relatively prime to p. For each a_i, if the Legendre symbol (a_i/p) ≡ a_i^{(p−1)/2} (mod p), then the chance that p is not a prime and yet passes the test is 1/2 in each case; if p passes all tests, the chances are 1/2^100 that it is not prime.
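A compact Python sketch of the Solovay-Strassen test follows (function names ours). The Jacobi-symbol routine generalizes the Legendre symbol, which is needed because the primality of the candidate p is exactly what is unknown:

import random

def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0, computed via quadratic reciprocity."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:                 # pull out factors of 2
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                       # reciprocity step
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

def solovay_strassen(p, trials=100):
    """Return False if p is certainly composite, True if p is probably prime
    (error probability at most 1/2**trials)."""
    if p < 3 or p % 2 == 0:
        return p == 2
    for _ in range(trials):
        a = random.randrange(2, p)
        j = jacobi(a, p) % p              # represent -1 as p - 1
        if j == 0 or pow(a, (p - 1) // 2, p) != j:
            return False                  # a witnesses compositeness
    return True

print(solovay_strassen(2**127 - 1))       # True: M_127 is a Mersenne prime
print(solovay_strassen(2**11 - 1))        # False: 2047 = 23 * 89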
Is it Possible to Find an e Which is Relatively Prime to φ(n)? Computing GCD(e, φ(n)) is relatively fast, as is computing the inverse of e mod n. Here is an example of a three-row array computation that determines both the GCD and the inverse [in the example, GCD(1024, 243) and 243^{−1} (mod 1024)]. First, create a 3 × 2 array whose first row holds the two given numbers; the second and third rows hold the 2 × 2 identity matrix. In each successive column, subtract the largest multiple m of column k that is less than the first-row entry of column (k − 1) to form the new column (for example, 1024 − 4·243 = 52). When 0 is reached in row A[1,*], say at A[1,n], both the inverse (if it exists) and the GCD have been found. The GCD is the element A[1, n − 1], and if this value is 1, the inverse exists and is the value A[3, n − 1] [here 243·59 ≡ 1 (mod 1024)].

Column    1     2    3    4    5    6    7
m                    4    4    1    2   17
A[1,*]  1024   243   52   35   17    1    0
A[2,*]     1     0    1    4    5   14
A[3,*]     0     1    4   17   21   59
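The same computation is usually coded as the extended Euclidean algorithm. A minimal Python sketch (function names ours) verifies the tableau's result:

def extended_gcd(m, n):
    """Return (g, x, y) with x*m + y*n == g == GCD(m, n) (iterative extended Euclid)."""
    x0, x1, y0, y1 = 1, 0, 0, 1
    while n:
        q, m, n = m // n, n, m % n
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return m, x0, y0

def mod_inverse(a, n):
    """Inverse of a modulo n, when GCD(a, n) = 1."""
    g, x, _ = extended_gcd(a, n)
    if g != 1:
        raise ValueError("no inverse: GCD(a, n) != 1")
    return x % n

print(extended_gcd(1024, 243))   # (1, -14, 59): -14*1024 + 59*243 == 1
print(mod_inverse(243, 1024))    # 59, matching A[3, n-1] in the array above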
Is it Possible to Perform the Computation m^e (mod n) Where e and d are Themselves 200-digit Numbers? Computing m^e (mod n) consists of repeated multiplications and integer divisions. In a software package such as Mathematica, such a computation with 200-digit integers can be performed in 0.031 seconds of time on a Pentium machine.
One shortcut in computing a large exponent is to make use of the fast exponentiation algorithm: Express the exponent in binary, e = b_n b_{n−1} ... b_0. Then compute m^e as follows:
ans = m
for i = n − 1 downto 0 do
    ans = ans · ans
    if b_i = 1 then ans = ans · m
end
The result is ans. Note that the total number of multiplications is proportional to the log of e.
Example: Compute x^123. Here 123 = (1111011) in binary, so n = 6. Each arrow below represents one pass through the loop. All but the fourth pass require squaring and then multiplying by x, since the bits processed in the loop are 1, 1, 1, 0, 1, 1:

ans = x → x^2·x = x^3 → x^6·x = x^7 → x^14·x = x^15 → x^30 → x^60·x = x^61 → x^122·x = x^123
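The same square-and-multiply loop in Python (Python's built-in pow(m, e, n) does this internally; the explicit version is shown only to mirror the pseudocode above):

def fast_exp(m, e, n=None):
    """Left-to-right binary exponentiation: scan the bits of e from the most
    significant down, squaring at each step and multiplying by m on 1-bits."""
    ans = 1
    for bit in bin(e)[2:]:          # e.g. '1111011' for e = 123
        ans = ans * ans
        if bit == '1':
            ans = ans * m
        if n is not None:
            ans %= n                # reduce mod n for modular exponentiation
    return ans

print(fast_exp(2, 123) == 2 ** 123)   # True
print(fast_exp(65, 17, 3233))         # 2790, the toy RSA ciphertext computed earlier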
In a practical version of the RSA algorithm, it is recom-
mended by Rivest, Shamir, and Adelman that the primes p
and q be chosen to be approximately of the same size, and
each containing about 100 digits.
The other calculations necessary in the development of
an RSA cryptosystem have been shown to be relatively
rapid. Except for nding the primes, the key generation
consists of two multiplications, two additions, one selection
of a random number, and the computation of one inverse
modulo another number.
The encryption and decryption each require at most 2 log_2 n multiplications
(in other words, one application of the fast exponentiation algorithm) for
each message block.
DIGITAL SIGNATURES
It seems likely that, in the future, an application similar to public key
cryptology will be even more widely used. With vastly expanded electronic
communications, the need for a secure way of authenticating an electronic
message (a digital signature) will arise far more often than the need for
transmitting information in a secure fashion.
As with public key cryptology, the principles of number
theory have been essential in establishing methods of
authentication.
The authentication problem is the following: Given a message m, is it
possible for a user u to create a signature s_u, dependent on some
information possessed only by u, so that the recipient of the message
(m, s_u) can use some public information for u (a public key) to determine
whether the message is authentic?
Rivest, Shamir, and Adelman showed that their public key encryption method
could also be used for authentication. For example, for A to send a message
to B that B can authenticate, assume an RSA public key cryptosystem has been
established with encryption and decryption keys e_A, d_A, e_B, d_B, and a
message m to authenticate. A computes and sends c = m^(e_B d_A) (mod n). B
both decrypts and authenticates by computing

    c^(e_A d_B) = (m^(e_B d_A))^(e_A d_B) = m^(e_B d_A e_A d_B) = m^(e_B d_B) = m (mod n).
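Stripped of the simultaneous encryption for B, the signing idea is simply to exponentiate with the private key and verify with the public one. A sketch reusing the toy key triple from the earlier RSA example (n = 3233, e = 17, d = 2753; all values hypothetical):

    n, e, d = 3233, 17, 2753

    m = 1234                     # message block to be authenticated
    s = pow(m, d, n)             # A signs with the private exponent d

    # Any recipient verifies with A's public key (n, e):
    assert pow(s, e, n) == m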
However, several other authors, particularly El Gamal (17) and Ong et al.
(18), developed more efficient solutions to the signature problem. More
recently, in 1994, the National Institute of Standards and Technology, an
agency of the United States government, established such a method as a
national standard, now called the DSS or Digital Signature Standard (19).
The DSS specifies an algorithm to be used in cases where a digital
authentication of information is required.
We assume that a message m is received by a user. The objective is to verify
that the message has not been altered and that we can be assured of the
originator's identity. The DSS creates for each user a public and a private
key. The private key is used to generate signatures, and the public key is
used in verifying signatures.
The DSS has two parts. The first part begins with a mechanism for hashing or
collapsing any message down to one of a fixed length, say 160 bits. This
hashing process is determined by a secure hash algorithm (SHA), which is
very similar to a (repeated) DES encryption. The hashed message is combined
with public and private keys derived from two public prime numbers. A
computation is made by the signatory using both public and private keys, and
the result of this computation, also 160 bits, is appended to the original
message. The receiver of the message also performs a separate computation
using only the public key knowledge, and if the receiver's computation
correctly computes the appended signature, the message is accepted. Again,
as with the RSA, the critical number theoretic result for the correctness of
the method is the Little Fermat Theorem.
Elliptic Curve Public Key Cryptosystem (ECC)
One form of the general equation of an elliptic curve, the family of
equations y^2 = x^3 + ax + b (a, b ∈ Z), has proven to be of considerable
interest in its application to cryptology (20). Given two solutions
P = (x_1, y_1) and Q = (x_2, y_2), a third point R in the Euclidean plane
can be found by the following construction: Find a third point of
intersection between the line PQ and the elliptic curve defined by all
solutions to y^2 = x^3 + ax + b; then the point R is the reflection of this
point of intersection in the x-axis. This point R may also be called P + Q,
because the points so found follow the rules for an Abelian group by the
Theorem of Mordell and Weil, with the identity 0 = P + (-P) being the point
at infinity.
As an analogy to RSA, a problem comparable with the difficulty of factoring
is the difficulty of finding k when kP = P + P + ... + P (k times) is given.
Specifically, for a cryptographic elliptic curve application, one selects an
elliptic curve over a Galois field GF(p) or GF(2^m), rather than the
integers. Then, as a public key, one chooses a secret k and a public point
P and computes kP = Q, with Q also becoming public.
One application, for example, is a variation of the widely used
Diffie-Hellman secret sharing scheme (13). This scheme enables two users to
agree on a common key without revealing any information about the key. If
the two parties are A and B, each chooses at random a secret integer, x_A
and x_B, respectively. Then A computes y_A = x_A P in a public elliptic
curve with base point P. Similarly, B computes y_B = x_B P. A sends y_A to B
and vice versa. Then A computes x_A y_B, and the resulting point
Q = x_A y_B = x_B y_A = (x_A x_B)P is the shared secret.

Figure 1. An example of addition of points on the elliptic curve
y^2 = x^3 - 4x (the labeled points are P, Q, and R = P + Q).
For computational reasons, it is advised to choose elliptic curves over
specific fields. For example, a standard called P192 uses GF(p), where
p = 2^192 - 2^64 - 1, for efficient computation. Thus, x_A, x_B, y_A, y_B
must all be less than p.
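A toy version of this exchange can be written directly from the group law; the curve, base point, and secret integers below are small hypothetical values chosen for readability (not the P192 parameters).

    p, a, b = 97, 2, 3                     # curve y^2 = x^3 + 2x + 3 over GF(97)
    P = (3, 6)                             # a point on this curve

    def ec_add(P1, P2):
        # Group law on the curve; None represents the point at infinity.
        if P1 is None: return P2
        if P2 is None: return P1
        (x1, y1), (x2, y2) = P1, P2
        if x1 == x2 and (y1 + y2) % p == 0:
            return None                    # P + (-P) = point at infinity
        if P1 == P2:
            s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope
        else:
            s = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope
        x3 = (s * s - x1 - x2) % p
        return (x3, (s * (x1 - x3) - y1) % p)

    def ec_mul(k, Q):
        # k*Q by double-and-add (the analogue of fast exponentiation)
        R = None
        while k:
            if k & 1:
                R = ec_add(R, Q)
            Q = ec_add(Q, Q)
            k >>= 1
        return R

    xA, xB = 11, 23                        # secret integers of A and B
    yA, yB = ec_mul(xA, P), ec_mul(xB, P)  # the exchanged public points
    assert ec_mul(xA, yB) == ec_mul(xB, yA)   # shared secret (x_A x_B)P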
Advanced Encryption Standard or Rijndael (AES)
After it was generally accepted that a new U.S. standard was needed for
symmetric key encryption, an international solicitation was made, with an
extensive review process, for a new national standard. In late 2001, a
system known as Rijndael, created by Vincent Rijmen and Joan Daemen, was
selected as the U.S. Advanced Encryption Standard, or AES (21).
Although this method has some similarities to the DES, it has been widely
accepted, particularly since the nonlinear S-boxes have, in contrast to the
DES, a very understandable, although complex, structure. The S-box in AES is
a transformation derived from an operator in one of the so-called Galois
fields GF(2^8). It is a well-known result of algebra that every finite field
has a number of elements that is a prime power, that all finite fields with
the same number of elements are isomorphic, and that a finite field with p^n
elements can be represented as the quotient field of the ring of polynomials
whose coefficients are mod p numbers, modulo an irreducible polynomial whose
highest power is n.
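Arithmetic in this field is easy to sketch: elements are bytes, and products are reduced modulo the irreducible polynomial x^8 + x^4 + x^3 + x + 1 used by AES (the AES S-box is built from the multiplicative inverse in this field followed by an affine map).

    def gf256_mul(a, b):
        # Multiply two elements of GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1.
        result = 0
        for _ in range(8):
            if b & 1:
                result ^= a        # add (XOR) the current multiple of a
            b >>= 1
            carry = a & 0x80
            a = (a << 1) & 0xFF
            if carry:
                a ^= 0x1B          # reduce x^8 to x^4 + x^3 + x + 1
        return result

    # Multiplicative inverse of 0x53, found by searching the 255 nonzero elements
    inv = next(x for x in range(1, 256) if gf256_mul(0x53, x) == 1)
    print(hex(inv))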
BIBLIOGRAPHY
1. A. Wiles, Modular elliptic curves and Fermat's Last Theorem, Ann. Math., 141: 443-551, 1995.
2. Y. Wang (ed.), Goldbach Conjecture, Series in Pure Mathematics, volume 4, Singapore: World Scientific, 1984, pp. xi + 311.
3. H. de Coste, La vie du R. P. Marin Mersenne, theologien, philosophe et mathematicien, de l'Ordre des Peres Minimes, Paris, 1649.
4. O. Ore, Number Theory and Its History, New York: Dover Publications, 1976.
5. M. Abdelguerfi, A. Dunham, W. Patterson, MRA: A computational technique for security in high-performance systems, Proc. of IFIP/Sec '93, International Federation of Information Processing Societies, World Symposium on Computer Security, Toronto, Canada, 1993, pp. 381-397.
6. E. W. Weisstein, Quadratic Reciprocity Theorem, from MathWorld, a Wolfram Web Resource, http://mathworld.wolfram.com/QuadraticReciprocityTheorem.html.
7. Wikipedia, Proof of Bertrand's postulate, http://www.absoluteastronomy.com/encyclopedia/p/pr/proof_of_bertrands_postulate.htm.
8. A. Selberg, An elementary proof of the prime number theorem, Ann. Math., 50: 305-313, 1949. MR 10,595b. [HW79, sections 22.15-16, gives a slightly simpler, but less elementary, version of Selberg's proof.]
9. J. Friedlander and H. Iwaniec, Using a parity-sensitive sieve to count prime values of a polynomial, Proc. Nat. Acad. Sci., 94: 1054-1058, 1997.
10. S. Smale, Mathematical problems for the next century, Math. Intelligencer, 20 (2): 7-15, 1998.
11. J. H. Silverman, The Arithmetic of Elliptic Curves, New York: Springer, 1994.
12. National Bureau of Standards, Data Encryption Standard, Federal Information Processing Standards Publication 46, January 1977.
13. W. Diffie and M. E. Hellman, New directions in cryptography, IEEE Trans. on Information Theory, IT-22: 644-654, 1976.
14. C. Pomerance, Analysis and comparison of some integer factoring algorithms, in H. W. Lenstra, Jr., and R. Tijdeman (eds.), Computational Number Theory: Part 1, Math. Centre Tract 154, Amsterdam, The Netherlands: Math. Centre, 1982, pp. 89-139.
15. R. L. Rivest, A. Shamir, and L. Adelman, A method for obtaining digital signatures and public-key cryptosystems, Comm. of the ACM, 1978, pp. 120-126.
16. R. Solovay and V. Strassen, A fast Monte-Carlo test for primality, SIAM Journal on Computing, 6: 84-85, 1977.
17. T. El Gamal, A public key cryptosystem and a signature scheme based on discrete logarithms, Proc. of Crypto '84, New York: Springer, 1985, pp. 10-18.
18. H. Ong, C. P. Schnorr, and A. Shamir, An efficient signature scheme based on polynomial equations, Proc. of Crypto '84, New York: Springer, 1985, pp. 37-46.
19. National Institute of Standards and Technology, Digital Signature Standard, Federal Information Processing Standards Publication 186, May 1994.
20. N. Koblitz, Elliptic curve cryptosystems, Mathematics of Computation, 48: 203-209, 1987.
21. J. Daemen and V. Rijmen, The Design of Rijndael: AES, the Advanced Encryption Standard, Berlin, Germany: Springer-Verlag, 2002.
FURTHER READING
D. M. Burton, Elementary Number Theory, Boston: Allyn and
Bacon, 1980.
H. Cohen, A Second Course in Number Theory, New York: Wiley,
1962.
L. E. Dickson, A History of the Theory of Numbers, 3 vols., Washington,
D.C.: Carnegie Institution, 1919-23.
G. H. Hardy and E. M. Wright, An Introduction to the Theory of
Numbers, 3rd ed., Oxford: Clarendon Press, 1954.
K. Ireland and M. Rosen, A Classical Introduction to Modern
Number Theory, 2nd ed., New York: Springer-Verlag, 1990.
I. Niven and H. S. Zuckerman, An Introduction to the Theory of
Numbers, 4th ed., New York: Wiley, 1980.
W. Patterson, Mathematical Cryptology, Totowa, NJ: Rowman and
Littlefield, 1987.
H. E. Rose, A Course in Number Theory, Oxford: Oxford University
Press, 1988.
H. N. Shapiro, Introduction to the Theory of Numbers, New York:
Wiley-Interscience, 1983.
H. Stark, An Introduction to Number Theory, Chicago, IL:
Markham, 1970.
WAYNE PATTERSON
Howard University
Washington, D.C.
CONVEX OPTIMIZATION
MOTIVATION AND A BRIEF HISTORY
An optimization problem can be stated in the so-called standard form as
follows:

    minimize f(x),   f : R^n -> R
    subject to g(x) ≤ 0,   g : R^n -> R^m          (1)
representing the minimization of a function f of n variables under
constraints specified by inequalities determined by the functions
g = (g_1, g_2, ..., g_m)^T. The functions f and g_i are, in general,
nonlinear functions. Note that ≥ inequalities can be handled under this
paradigm by multiplying each side by -1, and equalities by representing them
as a pair of inequalities. The maximization of an objective function f(x)
can be achieved by minimizing -f(x). The set F = {x | g(x) ≤ 0} that
satisfies the constraints on the nonlinear optimization problem is known as
the feasible set or the feasible region. If F covers all of (a part of) R^n,
then the optimization is said to be unconstrained (constrained).
The above standard form may not directly be applicable to design problems
where multiple conflicting objectives must be optimized. In such a case,
multicriterion optimization techniques and Pareto optimality can be used to
identify noninferior solutions (1). In practice, however, techniques to map
the problem to the form in Equation (1) are often used.
When the objective function is a convex function and the
constraint set is a convex set (both terms will be formally
dened later), the optimization problem is known as a
convex programming problem. This problem has the
remarkable property of unimodality; i.e., any local mini-
mumof the problemis also a global minimum. Therefore, it
does not require special methods to extract the solution out
of local minima in a quest to nd the global minimum.
Although the convex programming problem and its unim-
odality property have been known for a long time, it is only
recently that efcient algorithms for solving of these pro-
blems have been proposed. The genesis of these algorithms
can be traced back to the work of Karmarkar (2) that
proposed a polynomial-time interior-point technique for
linear programming, aspecial case of convex programming.
Unlike the simplex method for linear programming, this
technique was found to be naturally extensible from the
problemof linear programming to general convex program-
ming formulations. It was shown that this method belongs
to the class of interior penalty methods proposed by Fiacco
and McCormick (3) using barrier functions. The work of
Renegar (4) showed that a special version of the method of
centers for linear programming is polynomial. Gonzaga (5)
showed similar results with the barrier function associated
with a linear programming problem, with a proof of
polynomial-time complexity. Nesterov and Nemirovsky
(6) introduced the concept of self-concordance, studying
barrier methods in their context. Further improvements
in the computational complexity using approximate solu-
tions and rank-one updates were shown in the work of
Vaidya (7). The work of Ye (8) used a potential function
to obtain the same complexity as Renegars work without
following the central path too closely.
DEFINITIONS OF CONVEXITY
Convex Sets
Definition. A set C ⊆ R^n is said to be a convex set if, for every
x_1, x_2 ∈ C and every real number α, 0 ≤ α ≤ 1, the point
αx_1 + (1 - α)x_2 ∈ C.
This denition can be interpreted geometrically as stat-
ing that a set is convex if, given two points in the set, every
point on the line segment joining the two points is also a
member of the set. Examples of convex and nonconvex sets
are shown in Fig. 1.
Elementary Convex Sets
Ellipsoids. An ellipsoid E(x, B, r) ⊆ R^n centered at point x ∈ R^n is given
by the equation

    {y | (y - x)^T B (y - x) ≤ r^2}
If B is a scalar multiple of the unit matrix, then the
ellipsoid is called a hypersphere. The axes of the ellipsoid
are given by the eigenvectors, and their lengths are related
to the corresponding eigenvalues of B.
Hyperplanes. A hyperplane in n dimensions is given by
the region
    c^T x = b,   c ∈ R^n, b ∈ R.
Half-Spaces. A half-space in n dimensions is defined by the region that
satisfies the inequality

    c^T x ≤ b,   c ∈ R^n, b ∈ R.
Polyhedra. A (convex) polyhedron is defined as an intersection of
half-spaces and is given by the equation

    P = {x | Ax ≤ b},   A ∈ R^(m×n), b ∈ R^m,

corresponding to a set of m inequalities a_i^T x ≤ b_i, a_i ∈ R^n. If the
polyhedron has a finite volume, it is referred to as a polytope. An example
of a polytope is shown in Fig. 2.
Some Elementary Properties of Convex Sets
Property. The intersection of convex sets is a convex set.
The union of convex sets is not necessarily a convex set.
Property. (Separating hyperplane theorem) Given two nonintersecting convex
sets A and B, a separating hyperplane c^T x = b exists such that A lies
entirely within the half-space c^T x ≤ b and B lies entirely within the
half-space c^T x ≥ b. This is pictorially illustrated in Fig. 3.
Property. (Supporting hyperplane theorem) Given a convex set C and any point
x on its boundary, a supporting hyperplane c^T x = b exists such that the
convex set C lies entirely within the half-space c^T x ≤ b. This is
illustrated in Fig. 4.
Definition. The convex hull of m points, x_1, x_2, ..., x_m ∈ R^n, denoted
co(x_1, x_2, ..., x_m), is defined as the set of points y ∈ R^n such that

    y = Σ_(i=1 to m) α_i x_i,   α_i ≥ 0 ∀ i,   Σ_(i=1 to m) α_i = 1.
The convex hull is thus the smallest convex set that contains the m points.
An example of the convex hull of five points in the plane is shown by the
shaded region in Fig. 5. If the set of points x_i is of finite cardinality
(i.e., m is finite), then the convex hull is a polytope. Hence, a polytope
is also often described as the convex hull of its vertices.
Convex Functions
Definition. A function f defined on a convex set V is said to be a convex
function if, for every x_1, x_2 ∈ V, and every α, 0 ≤ α ≤ 1,

    f(αx_1 + (1 - α)x_2) ≤ αf(x_1) + (1 - α)f(x_2)

Equivalently, for differentiable f, the function f(x) is convex over the
convex set S if and only if

    f(x) ≥ f(x_0) + ∇f(x_0)^T (x - x_0)   ∀ x, x_0 ∈ S,

where ∇f corresponds to the gradient of f with respect to the vector x.
Strict convexity corresponds to the case in which the inequality is strict.
Property. A function f(x) is convex over the convex set S if and only if

    ∇²f(x_0) ≥ 0   ∀ x_0 ∈ S,

i.e., its Hessian matrix is positive semidefinite over S. For strict
convexity, ∇²f(x_0) must be positive definite.
Property. If f(x) and g(x) are two convex functions on the convex set S,
then the functions f + g and max(f, g) are convex over S.
Definition. The level set of a function f(x) is the set defined by
{x | f(x) ≤ c}, where c is a constant.
An example of the level sets of f(x, y) = x^2 + y^2 is shown in Fig. 7.
Observe that the level set for f(x, y) ≤ c_1 is contained in the level set
of f(x, y) ≤ c_2 for c_1 < c_2.
Figure 2. An example convex polytope in two dimensions as an intersection of
five half-spaces.
Figure 3. A separating hyperplane (line) in two dimensions between convex
sets A and B.
Figure 4. A supporting hyperplane (line) in two dimensions at the boundary
point of a convex set S.
Figure 1. Convex and nonconvex sets.
Property. If f is a convex function in the space S, then
the level set of f is a convex set in S. This is a very useful
property, and many convex optimization algorithms
depend on the fact that the constraints are defined by an
intersection of level sets of convex functions.
Definition. A function g defined on a convex set V is said to be a concave
function if the function f = -g is convex. The function g is strictly
concave if -g is strictly convex.
For a fuller treatment on convexity properties, the
reader is referred to Ref. (9).
CONVEX OPTIMIZATION
Convex Programming
Definition. A convex programming problem is an optimization problem that can
be stated as follows:

    minimize f(x)
    such that x ∈ S                      (2)

where f is a convex function and S is a convex set. This problem has the
property that any local minimum of f over S is a global minimum.
Comment. The problem of maximizing a convex function over a convex set does
not have the above property. However, it can be shown (10) that the maximum
value for such a problem lies on the boundary of the convex set.
For a convex programming problem of the type in Equation (2), we may state
without loss of generality that the objective function is linear. To see
this, note that Equation (2) may equivalently be written as
{min t : f(x) ≤ t, g(x) ≤ 0}.
Linear programming is a special case of nonlinear optimization and, more
specifically, a special type of convex programming problem where the
objective and constraints are all linear functions. The problem is stated as

    minimize c^T x
    subject to Ax ≤ b, x ≥ 0
    where A ∈ R^(m×n), b ∈ R^m, c ∈ R^n, x ∈ R^n.        (3)
The feasible region for a linear program corresponds to a
polyhedron in R
n
. It can be shown that an optimal solution
to a linear program must necessarily occur at one of the
vertices of the constraining polyhedron. The most com-
monly used technique for solution of linear programs,
the simplex method (11), is based on this principle and
operates by a systematic search of the vertices of the poly-
hedron. The computational complexity of this method can show an exponential
behavior for pathological cases, but for most practical problems, it has
been observed to grow linearly with the number of variables and sublinearly
with the number of constraints. Algorithms with polynomial time worst-case
complexity do exist; these include Karmarkar's method (2) and the
Shor-Khachiyan ellipsoidal method (12). The computational times of the
latter, however, are often impractical.
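As a small concrete instance, the sketch below solves a two-variable linear program with scipy.optimize.linprog (the data are made up, and whichever default solver the installed SciPy provides is used).

    from scipy.optimize import linprog

    # minimize c^T x  subject to  A x <= b,  x >= 0
    c = [-1.0, -2.0]              # i.e., maximize x1 + 2*x2
    A = [[1.0, 1.0],
         [1.0, 3.0]]
    b = [4.0, 6.0]

    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
    print(res.x, res.fun)         # optimal vertex (3, 1), objective value -5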
In the remainder of this section, we will describe various methods used for
convex programming and for mapping problems to convex programs.
Path-Following Methods
This class of methods proceeds iteratively by solving a
sequence of unconstrained problems that lead to the solu-
tion of the original problem. In each iteration, a technique
based on barrier methods is used to find the optimal solution. If we denote
the optimal solution in the kth iteration as x_k, then the path
x_1, x_2, ... in R^n leads to the optimal solution, and hence, this class of
techniques is known as path-following methods.

Figure 5. An example showing the convex hull of five points.
Figure 6. Convex and nonconvex functions.
Figure 7. Level sets of f(x, y) = x^2 + y^2.
Barrier Methods. The barrier technique of Fiacco and McCormick (3) is a
general technique to solve any constrained nonlinear optimization problem by
solving a sequence of unconstrained nonlinear optimization problems. This
method may be used for the specific case of minimizing a convex function
over a convex set S described by an intersection of convex inequalities

    S = {x | g_i(x) ≤ 0, i = 1, 2, ..., m}

where each g_i(x) is a convex function. The computation required of the
method is dependent on the choice of the barrier function. In this
connection, the logarithmic barrier function (abbreviated as the log barrier
function) for this set of inequalities is defined as

    F(x) = - Σ_(i=1 to m) log(-g_i(x))   for x ∈ S
    F(x) = +∞                            otherwise
Intuitively, the idea of the barrier is that any iterative
gradient-based method that tries to minimize the barrier
function will be forced to remain in the feasible region, due
to the singularity at the boundary of the feasible region. It
can be shown that F(x) is a convex function over S, and its value approaches
∞ as x approaches the boundary of S. Intuitively, it can be seen that F(x)
becomes smallest when x is, in some sense, farthest away from all boundaries
of S. The value of x at which the function F(x) is minimized is called the
analytic center of S.
Example. For a linear programming problem of the type described in
Equation (3), with constraint inequalities described by a_i^T x ≤ b_i, the
barrier function in the feasible region is given by

    F(x) = - Σ_(i=1 to n) log(b_i - a_i^T x)

The value of (b_i - a_i^T x) represents the slack in the ith inequality,
i.e., the distance between the point x and the corresponding constraint
hyperplane. The log barrier function, therefore, is a measure of the product
of the distances from a point x to each hyperplane, as shown in Fig. 8(a).
The value of F(x) is minimized when Π_(i=1 to n) (b_i - a_i^T x) is
maximized. Coarsely speaking, this occurs when the distance to each
constraint hyperplane is sufficiently large.
As a cautionary note, we add that although the analytic
center is a good estimate of the center in the case where all
constraints present an equal contribution to the boundary,
the presence of redundant constraints can shift the analytic center. The
effect on the analytic center of repeating the constraint p five times is
shown in Fig. 8(b).
We will now consider the convex programming problem specified as

    minimize f(x)
    such that g_i(x) ≤ 0,  i = 1, 2, ..., m

where each g_i(x) is a convex function. The traditional barrier method (3)
used to solve this problem would formulate the corresponding unconstrained
optimization problem

    minimize B(α) = α f(x) + F(x)

and solve this problem for a sequence of increasing (constant) values of the
parameter α. When α is zero, the problem reduces to finding the center of
the convex set constraining the problem. As α increases, the twin objectives
of minimizing f(x) and remaining in the interior of the convex set are
balanced. As α approaches ∞, the problem amounts to the minimization of the
original objective function, f.
In solving this sequence of problems, the outer iteration consists of
increasing the values of the parameter α. The inner iteration is used to
minimize the value of B(α) at that value of α, using the result of the
previous outer iteration as an initial guess. The inner iteration can be
solved using Newton's method (10). For positive values of α, it is easy to
see that B(α) is a convex function. The notion of a central path for a
linearly constrained optimization problem is shown in Fig. 9.

Figure 8. (a) Physical meaning of the barrier function for the feasible
region of a linear program. (b) The effect of redundant constraints on the
analytic center.
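The outer/inner structure described above can be sketched in a few lines of numpy. The example below applies it to a small hypothetical linear program, with a damped Newton iteration as the inner minimizer of B(α); it is an illustration only, not a robust solver.

    import numpy as np

    # minimize c^T x  s.t.  a_i^T x <= b_i  (hypothetical data; optimum is (3, 1))
    A = np.array([[1.0, 1.0], [1.0, 3.0], [-1.0, 0.0], [0.0, -1.0]])
    b = np.array([4.0, 6.0, 0.0, 0.0])
    c = np.array([-1.0, -2.0])

    def B(x, alpha):               # B(alpha) = alpha*c^T x + F(x)
        slack = b - A @ x
        return np.inf if np.any(slack <= 0) else alpha * (c @ x) - np.sum(np.log(slack))

    x = np.array([1.0, 1.0])       # a strictly feasible starting point
    alpha = 1.0
    for outer in range(8):         # outer iterations: increase alpha
        for inner in range(50):    # inner iterations: damped Newton steps
            slack = b - A @ x
            grad = alpha * c + A.T @ (1.0 / slack)
            hess = A.T @ np.diag(1.0 / slack**2) @ A
            step = np.linalg.solve(hess, -grad)
            t = 1.0
            # backtrack until the step stays feasible and decreases B
            while B(x + t * step, alpha) > B(x, alpha) - 1e-12:
                t *= 0.5
                if t < 1e-12:
                    break
            x = x + t * step
        alpha *= 10.0
    print(x)                       # approaches the optimum (3, 1) as alpha grows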
Method of Centers. Given a scalar value t > f(x*), the method of centers
finds the analytic center of the set of inequalities f(x) ≤ t and
g_i(x) ≤ 0 by minimizing the function

    -log(t - f(x)) + F(x)

where F(x) is the log barrier function defined earlier. The optimal value x*
associated with finding the analytic center for this barrier function is
found, and the value of t is updated to be a convex combination of t and
f(x*) as

    t = θ t + (1 - θ) f(x*),   θ > 0
The Self-Concordance Property and Step Length. The value of θ above is an
adjustable parameter that affects the number of Newton iterations required
to find the optimum value of the analytic center. Depending on the value of
θ chosen, the technique is classified as a short-step, medium-step, or
long-step (possibly with even θ > 1) method. For a short-step method, one
Newton iteration is enough, whereas for longer steps, further Newton
iterations may be necessary.
Nesterov and Nemirovsky (6) introduced the idea of the self-concordance
condition, defined as follows.
Definition. A convex function w : S -> R is self-concordant with parameter
a ≥ 0 (a-self-concordant) on S if w is three times continuously
differentiable on S and, for all x ∈ S and h ∈ R^m, the following inequality
holds:

    |D^3 w(x)[h, h, h]| ≤ 2 a^(-1/2) (D^2 w(x)[h, h])^(3/2)

where D^k w(x)[h_1, h_2, ..., h_k] denotes the value of the kth differential
of w taken at x along the collection of directions h_1, h_2, ..., h_k.
By formulating logarithmic barrier functions that are
self-concordant, proofs of the polynomial-time complexity
of various interior point methods have been shown. An
analysis of the computational complexity in terms of the
number of outer and inner (Newton) iterations is presented
in Refs. (6) and (13).
Other Interior Point Methods
Affine Scaling Methods. For a linear programming problem, the nonnegativity
constraints x ≥ 0 are replaced by constraints of the type
||X^(-1)(x - x_c)|| ≤ β < 1, representing an ellipsoid centered at x_c. The
linear program is then relaxed to the following form, whose feasible region
is contained in that of the original linear program:

    min{c^T x : Ax = b, ||X^(-1)(x - x_c)|| ≤ β < 1}

Note that the linear inequalities in Equation (3) have been converted to
equalities by the addition of slack variables. This form has the following
closed-form solution:

    x(β) = x - β (X P_AX X c) / ||P_AX X c||

where P_AX = I - X A^T (A X^2 A^T)^(-1) A X. The updated value of x is used
in the next iteration, and so on. The search direction X P_AX X c is called
the primal affine scaling direction and corresponds to the scaled projected
gradient with respect to the objective function, with scaling matrix X.
Depending on the value of β, the method may be a short-step or a long-step
(with β > 1) method, and convergence proofs under various conditions are
derived. For details of the references, the reader is referred to Ref. (13).
For a general convex programming problem of the type (note that the linear
objective function form is used here)

    min{f(y) = b^T y : g_i(y) ≤ 0}

the constraint set is similarly replaced by the ellipsoidal constraint
(y - y_c)^T H (y - y_c) ≤ β^2, where y_c is the center of the current
ellipsoid, y is the variable over which the minimization is being performed,
and H is the Hessian of the log-barrier function - Σ_(i=1 to n) log(-g_i(y)).
The problem now reduces to

    min{b^T y : (y - y_c)^T H (y - y_c) ≤ β^2}

which has a closed-form solution of the form

    y(β) = y - β H^(-1) b / sqrt(b^T H^(-1) b)

This is used as the center of the ellipsoid in the next iteration. The
procedure continues iteratively until convergence.
Potential-Based Methods. These methods formulate a
potential function that provides a coarse estimate of how
far the solution at the current iterate is from the optimal
value. At each step of a potential reduction method, a
direction and step size are prescribed; however, the poten-
tial may be minimized further by the use of a line search
(large steps). This is in contrast with a path-following method that must
maintain proximity to a prescribed central path.

Figure 9. Central path for a linearly constrained convex optimization
problem (the path runs from the analytic center at α = 0, through α = 10 and
α = 100, to the optimum).

An example of a potential-based technique
is one that uses a weighted sum of the gap between the
value of the primal optimization problem and its dual, and
of the log-barrier functionvalue as the potential. For amore
detailed descriptionof this and other potential-based meth-
ods, the reader is referred to Refs. (6) and (13).
Localization Approaches
Polytope Reduction. This method begins with a polytope P ⊆ R^n that contains
the optimal solution x_opt. The polytope P may, for example, be initially
selected to be an n-dimensional box described by the set

    {x | x_min ≤ x_i ≤ x_max}

where x_min and x_max are the minimum and maximum values of each variable,
respectively. In each iteration, the volume of the polytope is shrunk while
keeping x_opt within the polytope, until the polytope becomes sufficiently
small. The algorithm proceeds iteratively as follows.
Step 1. A center x_c deep in the interior of the polytope P is found.
Step 2. The feasibility of the center x_c is determined by verifying whether
all of the constraints of the optimization problem are satisfied at x_c. If
the point x_c is infeasible, it is possible to find a separating hyperplane
passing through x_c that divides P into two parts, such that the feasible
region lies entirely in the part satisfying the constraint

    c^T x ≥ b

where c = -∇g_p(x)^T is the negative of the gradient of a violated
constraint, g_p, and b = c^T x_c. The separating hyperplane above
corresponds to the tangent plane to the violated constraint. If the point
x_c lies within the feasible region, then a hyperplane c^T x ≥ b exists that
divides the polytope into two parts such that x_opt is contained in one of
them, with c = -∇f(x)^T being the negative of the gradient of the objective
function and b being defined as b = c^T x_c once again. This hyperplane is
the supporting hyperplane for the set f(x) ≤ f(x_c) and thus eliminates from
the polytope a set of points whose value is larger than the value at the
current center. In either case, a new constraint c^T x ≥ b is added to the
current polytope to give a new polytope that has roughly half the original
volume.
Step 3. Go to Step 1 and repeat the process until the
polytope is sufciently small.
Note that the center in Step 1 is ideally the center of
gravity of the current polytope because a hyperplane pas-
sing through the center of gravity is guaranteed to reduce
the volume of the polytope by a factor of (1 1=e) in each
iteration. However, because nding the center of gravity is
prohibitively expensive in terms of computation, the ana-
lytic center is an acceptable approximation.
Example. The algorithm is illustrated by using it to solve the following
problem in two dimensions:

    minimize f(x_1, x_2)
    such that (x_1, x_2) ∈ S

where S is a convex set and f is a convex function.

Figure 10. Cutting plane approach.

The shaded region in Fig. 10(a) is the set S, and the dotted lines show the
level curves of f. The point x_opt is the solution to this problem. The
expected solution region is first bounded by a rectangle with center x_c, as
shown in Fig. 10(a). The feasibility of x_c is then determined; in this
case, it can be seen that x_c is infeasible. Hence, the gradient of the
constraint function is used to construct a hyperplane through x_c such that
the polytope is divided into two parts of roughly equal volume, one of which
contains the solution x_opt. This is
illustrated in Fig. 10(b), where the region enclosed in
darkened lines corresponds to the updated polytope. The
process is repeated on the new smaller polytope. Its center
lies inside the feasible region, andhence, the gradient of the
objective function is used to generate a hyperplane that
further shrinks the size of the polytope, as shown in Fig.
10(c). The result of another iteration is illustrated in Fig.
10(d). The process continues until the polytope has been
shrunk sufciently.
Ellipsoidal Method. The ellipsoidal method begins with a sufficiently large
ellipsoid that contains the solution to the convex optimization problem. In
each iteration, the size of the ellipsoid is shrunk, while maintaining the
invariant that the solution lies within the ellipsoid, until the ellipsoid
becomes sufficiently small. The process is illustrated in Fig. 11.
The kth iteration consists of the following steps, starting from the fact
that the center x_k of the ellipsoid E_k is known.
Step 1. In case the center is not in the feasible region, the gradient of
the violated constraint is evaluated; if it is feasible, the gradient of the
objective function is found. In either case, we will refer to the computed
gradient as ∇h(x_k).
Step 2. A new ellipsoid containing the half-ellipsoid given by

    E_k ∩ {x | ∇h(x_k)^T x ≤ ∇h(x_k)^T x_k}

is computed. This new ellipsoid E_(k+1) and its center x_(k+1) are given by
the following relations:

    x_(k+1) = x_k - (1/(n+1)) A_k g~_k
    A_(k+1) = (n^2/(n^2 - 1)) (A_k - (2/(n+1)) A_k g~_k g~_k^T A_k)

where

    g~_k = ∇h(x_k) / sqrt(∇h(x_k)^T A_k ∇h(x_k))
Step 3. Repeat the iterations in Steps 1 and 2 until the
ellipsoid is sufciently small.
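The updates of Step 2 translate directly into code. The numpy sketch below applies them to a small hypothetical problem, minimize (x1 - 1)^2 + (x2 - 2)^2 subject to x1 + x2 <= 2, whose optimum is (0.5, 1.5); the best feasible center encountered is reported.

    import numpy as np

    n = 2
    x = np.zeros(n)                          # center x_0
    Ak = 100.0 * np.eye(n)                   # a sufficiently large initial ellipsoid

    f      = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2
    f_grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] - 2)])
    g      = lambda x: x[0] + x[1] - 2       # constraint g(x) <= 0
    g_grad = np.array([1.0, 1.0])

    best = None
    for _ in range(200):
        if g(x) <= 0 and (best is None or f(x) < f(best)):
            best = x.copy()                  # best feasible center so far
        h  = g_grad if g(x) > 0 else f_grad(x)        # Step 1: choose the gradient
        gt = h / np.sqrt(h @ Ak @ h)                  # normalized gradient g~_k
        x  = x - (1.0 / (n + 1)) * (Ak @ gt)          # Step 2: new center
        Ak = (n * n / (n * n - 1.0)) * (Ak - (2.0 / (n + 1)) * np.outer(Ak @ gt, Ak @ gt))
    print(best)                              # close to (0.5, 1.5)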
RELATED TECHNIQUES
Quasiconvex Optimization
Definition. A function f : S -> R, where S is a convex set, is quasiconvex
if every level set L_a = {x | f(x) ≤ a} is a convex set. A function g is
quasiconcave if -g is quasiconvex over S. A function is quasilinear if it is
both quasiconvex and quasiconcave.
Examples. Clearly, any convex (concave) function is also quasiconvex
(quasiconcave).
Any monotone function f : R -> R is quasilinear.
The linear fractional function f(x) = (a^T x + b)/(c^T x + d), where
a, c, x ∈ R^n, is quasilinear over the half-space {x | c^T x + d > 0}.
Other Characterizations of Quasiconvexity. A function f defined on a convex
set V is quasiconvex if, for every x_1, x_2 ∈ V,
1. For every α, 0 ≤ α ≤ 1, f(αx_1 + (1 - α)x_2) ≤ max[f(x_1), f(x_2)].
2. If f is differentiable, f(x_2) ≤ f(x_1) implies
(x_2 - x_1)^T ∇f(x_1) ≤ 0.
Property. If f, g are quasiconvex over V, then the functions af for a > 0,
and max(f, g), are also quasiconvex over V. The composed function g(f(x)) is
quasiconvex provided g is monotone increasing. In general, the function
f + g is not quasiconvex over V.
As in the case of convex optimization, the gradient of a
quasiconvex objective can be used to eliminate a half-space
from consideration. The work in Ref. (14) presents an
adaptation of the ellipsoidal method to solve quasiconvex
optimization problems.
Semidenite Programming
The problem of semidefinite programming (15) is stated as follows:

    minimize c^T x
    subject to F(x) ≥ 0
    where F(x) ∈ R^(m×m), c, x ∈ R^n            (4)
Here, F(x) = F_0 + F_1 x_1 + ... + F_n x_n is an affine matrix function of
x, and the constraint F(x) ≥ 0 represents the fact that this matrix function
must be positive semidefinite, i.e., z^T F(x) z ≥ 0 for all z ∈ R^m. The
constraint is referred to as a linear matrix inequality. The objective
function is linear, and hence convex, and the feasible region is convex
because if F(x) ≥ 0 and F(y) ≥ 0, then for all 0 ≤ λ ≤ 1, it can be readily
seen that λF(x) + (1 - λ)F(y) ≥ 0.
A linear program is a simple case of a semidefinite program. To see this, we
can rewrite the constraint set Ax ≤ b (note that the ≤ here is a
component-wise inequality and is not related to positive semidefiniteness)
as F(x) = diag(b - Ax) ≥ 0; i.e., F_0 = diag(b), F_j = -diag(a_j),
j = 1, ..., n, where A = [a_1 a_2 ... a_n] ∈ R^(m×n).

Figure 11. The ellipsoidal method.
Semidefinite programs may also be used to represent nonlinear optimization
problems. As an example, consider the problem

    minimize (c^T x)^2 / (d^T x)
    subject to Ax + b ≥ 0

where we assume that d^T x > 0 in the feasible region. Note that the
constraints here represent, as in the case of a linear program,
component-wise inequalities. The problem is first rewritten as minimizing an
auxiliary variable t subject to the original set of constraints and a new
constraint

    (c^T x)^2 / (d^T x) ≤ t

The problem may be cast in the form of a semidefinite program as

    minimize t
    subject to  [ diag(Ax + b)    0        0     ]
                [      0          t      c^T x   ]   ≥ 0
                [      0        c^T x    d^T x   ]

The first constraint row appears in a manner similar to the linear
programming case. The second and third rows use Schur complements to
represent the nonlinear convex constraint above as the 2 × 2 matrix
inequality

    [   t      c^T x ]
    [ c^T x    d^T x ]   ≥ 0

The two tricks shown here, namely, the reformulation of linear inequalities
and the use of Schur complements, are often used to formulate optimization
problems as semidefinite programs.
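The Schur complement step can be checked numerically: for v > 0, the 2 x 2 block [[t, u], [u, v]] is positive semidefinite exactly when u^2/v <= t. The sketch below uses made-up stand-ins for c^T x and d^T x.

    import numpy as np

    def is_psd(M):
        return np.all(np.linalg.eigvalsh(M) >= -1e-9)

    u, v = 3.0, 2.0                      # stand-ins for c^T x and d^T x
    for t in (4.0, 4.5, 5.0):            # u*u/v = 4.5 is the threshold
        M = np.array([[t, u], [u, v]])
        print(t, is_psd(M), u * u / v <= t)   # the two tests always agree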
Geometric Programming
Definition. A posynomial is a function h of a positive variable x ∈ R^m that
has the form

    h(x) = Σ_j γ_j Π_(i=1 to n) x_i^a(i,j)

where the exponents a(i, j) ∈ R and the coefficients γ_j > 0, γ_j ∈ R.
For example, the function f(x, y, z) = 7.4 x^2.6 y^3.18 z^4.2 + sqrt(3) y^2
+ y^1.4 z^sqrt(5) is a posynomial in the variables x, y, and z. Roughly
speaking, a posynomial is a function that is similar to a polynomial, except
that the coefficients γ_j must be positive, and an exponent a(i, j) could be
any real number and not necessarily a positive integer, unlike a polynomial.
A posynomial has the useful property that it can be mapped onto a convex
function through an elementary variable transformation, x_i = e^(z_i) (16).
Such functional forms are useful because, in the case of an optimization
problem where the objective function and the constraints are posynomial, the
problem can easily be mapped onto a convex programming problem.
For some geometric programming problems, simple techniques based on the use
of the arithmetic-geometric inequality may be used to obtain simple
closed-form solutions to the optimization problems (17). The
arithmetic-geometric inequality states that if u_1, u_2, ..., u_n > 0, then
the arithmetic mean is no smaller than the geometric mean; i.e.,

    (u_1 + u_2 + ... + u_n)/n ≥ (u_1 u_2 ... u_n)^(1/n),

with equality occurring if and only if u_1 = u_2 = ... = u_n.
A simple illustration of this technique is in minimizing the outer surface
area of an open box of fixed volume (say, four units) and sides of length
x_1, x_2, x_3. The problem can be stated as

    minimize x_1 x_2 + 2 x_1 x_3 + 2 x_2 x_3
    subject to x_1 x_2 x_3 = 4

By setting u_1 = x_1 x_2, u_2 = 2 x_1 x_3, u_3 = 2 x_2 x_3, and applying the
condition listed above, the minimum value of the objective function is
3 (u_1 u_2 u_3)^(1/3) = 3 (4 x_1^2 x_2^2 x_3^2)^(1/3) = 12. It is easily
verified that this corresponds to the values x_1 = 2, x_2 = 2, and x_3 = 1.
We add a cautionary note that
some, but not all, posynomial programming problems may
be solved using this simple solution technique. For further
details, the reader is referred to Ref. (17).
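A quick numerical check of the example: the surface area at the claimed optimum, and at a few other side lengths with the same volume of four units, can be evaluated directly.

    def area(x1, x2, x3):
        return x1 * x2 + 2 * x1 * x3 + 2 * x2 * x3

    print(area(2, 2, 1))      # 12, matching the AM-GM bound (volume 2*2*1 = 4)
    print(area(1, 4, 1))      # 14
    print(area(4, 1, 1))      # 14
    print(area(2, 1, 2))      # 14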
An extension of posynomials is the idea of generalized posynomials (18),
which are defined recursively as follows: A generalized posynomial of order
0, G_0(x), is simply a posynomial, defined earlier; i.e.,

    G_0(x) = Σ_j γ_j Π_(i=1 to n) x_i^a(i,j)

A generalized posynomial of order k > 1 is defined as

    G_k(x) = Σ_j λ_j Π_(i=1 to n) [G_(k-1,i)(x)]^a(i,j)

where G_(k-1,i)(x) is a generalized posynomial of order less than or equal
to k - 1, each exponent a(i, j) ∈ R, and the coefficients γ_j > 0, γ_j ∈ R.
Like posynomials, generalized posynomials can be transformed to convex
functions under the transform x_i = e^(z_i), but with the additional
restriction that a(i, j) ≥ 0 for each i, j. For a generalized posynomial of
order 1, observe that the term in the innermost bracket, G_(0,i)(x),
represents a posynomial function. Therefore, a generalized posynomial of
order 1 is similar to a posynomial, except that the place of the x_i
variables is taken by posynomials. In general, a generalized posynomial of
order k is similar to a posynomial, but x_i in the posynomial equation is
replaced by a generalized posynomial of a lower order.
8 CONVEX OPTIMIZATION
Optimization Involving Logarithmic Concave Functions
A function f is a logarithmic concave (log-concave) function if log(f) is a
concave function. Log-convex functions are similarly defined. The
maximization of a log-concave function over a convex set is therefore a
unimodal problem; i.e., any local maximum is a global maximum. Log-concave
functional forms are seen among some common probability distributions on
R^n, for example:
(a) The Gaussian or normal distribution

    f(x) = (1 / sqrt((2π)^n det S)) exp(-0.5 (x - x_c)^T S^(-1) (x - x_c)),  S > 0

(b) The exponential distribution

    f(x) = (Π_(i=1 to n) λ_i) exp(-(λ_1 x_1 + λ_2 x_2 + ... + λ_n x_n))

The following properties are true of log-concave functions:
1. If f and g are log-concave, then their product f·g is log-concave.
2. If f(x) is log-concave, then the integral ∫_S f(x) dx is log-concave
provided S is a convex set.
3. If f(x) and g(x) are log-concave, then the convolution
∫_S f(x - y) g(y) dy is log-concave if S is a convex set (this follows from
properties 1 and 2).
Convex Optimization Problems in Computer Science and
Engineering
There has been an enormous amount of recent interest in
applying convex optimization to engineering problems,
particularly as the optimizers have grown more efficient.
The reader is referred to Boyd and Vandenberghe's lecture
notes (19) for a treatment of this subject. In this section, we
present a sampling of computer science and computer
engineering problems that can be posed as convex pro-
grams to illustrate the power of the technique in practice.
Design Centering
While manufacturing any system, it is inevitable that
process variations will cause design parameters, such as
component values, to waver fromtheir nominal values. As a
result, after manufacture, the system may no longer meet
some behavioral specications, such as requirements on
the delay, gain, and bandwidth, which it has been designed
to satisfy. The procedure of design centering attempts to
select the nominal values of design parameters to ensure
that the behavior of the system remains within specica-
tions, with the greatest probability and thus maximize the
manufacturing yield.
The random variations in the values of the design parameters are modeled by
a probability density function, ψ(x, x_c) : R^n -> [0, ∞), with a mean
corresponding to the nominal value of the design parameters. The yield of
the manufacturing process Y as a function of the mean x_c is given by

    Y(x_c) = ∫_F ψ(x, x_c) dx

where F corresponds to the feasible region where all design constraints are
satisfied.
A common assumption made by geometrical design
centering algorithms for integrated circuit applications is
that F is a convex bounded body. Techniques for approx-
imating this body by a polytope P have been presented in
Ref. (20). When the probability density functions that
represent variations in the design parameters are Gaus-
sian in nature, the design centering problem can be posed
as a convex programming problem. The design centering
problem is formulated as (21)
    maximize Y(x_c) = ∫_P ψ(x, x_c) dx

where P is the polytope approximation to the feasible region. As the
integral of a log-concave function over a convex region is also a
log-concave function (22), the yield function Y(x) is log-concave, and the
above problem reduces to a problem of maximizing a log-concave function over
a convex set. Hence, this can be transformed into a convex programming
problem.
VLSI Transistor and Wire Sizing
Convex Optimization Formulation. Circuit delays ininte-
grated circuits often have to be reduced to obtain faster
response times. Given the circuit topology, the delay of a
combinational circuit can be controlled by varying the sizes
of transistors, giving rise to an optimization problem of
nding the appropriate area-delay tradeoff. The formal
statement of the problem is as follows:
    minimize Area
    subject to Delay ≤ T_spec               (5)

The circuit area is estimated as the sum of transistor sizes; i.e.,

    Area = Σ_(i=1 to n) x_i

where x_i is the size of the ith transistor and n is the number of
transistors in the circuit. This is easily seen to be a posynomial function
of the x_i's. The circuit delay is estimated using the Elmore delay estimate
(23), which calculates the delay as a maximum of path delays. Each path
delay is a sum of resistance-capacitance products. Each resistance term is
of the form a/x_i, and each capacitance term is of the type Σ b_i x_i, with
the constants a and b_i being positive. As a result, the delays are
posynomial functions of the x_i's, and the feasible region for the
optimization problem is an intersection of constraints of the form

    (posynomial function in the x_i's) ≤ T_spec

As the objective and constraints are both posynomial functions in the x_i's,
the problem is equivalent to a convex programming problem. Various solutions
to the problem have been proposed, for instance, in Refs. (24) and (25).
Alternative techniques that use curve-fitted delay models have been used to
set up a generalized posynomial formulation for the problem in Ref. (21).
Semidefinite Programming Formulation. In Equation (5), the circuit delay may
alternatively be determined from the dominant eigenvalue of a matrix
G^(-1)C, where G and C are, respectively, matrices representing the
conductances (corresponding to the resistances) and the capacitances
referred to above. The entries in both G and C are affine functions of the
x_i's. The dominant time constant can be calculated as the negative inverse
of the largest zero of the polynomial det(sC + G). It is also possible to
calculate it using the following linear matrix inequality:

    T_dom = min{T | TG - C ≥ 0}

Note that the ≥ 0 here refers to the fact that the matrix must be positive
semidefinite. To ensure that T_dom ≤ T_max for a specified value of T_max,
the linear matrix inequality T_max G(x) - C(x) ≥ 0 must be satisfied. This
sets up the problem in the form of a semidefinite program as follows (26):

    minimize Σ_(i=1 to n) l_i x_i
    subject to T_max G(x) - C(x) ≥ 0
               x_min ≤ x ≤ x_max
Largest Inscribed Ellipsoid in a Polytope
Consider a polytope in R^n given by P = {x | a_i^T x ≤ b_i, i = 1, 2, ..., L}
into which the largest ellipsoid E, described as follows, is to be
inscribed:

    E = {By + d | ||y|| ≤ 1},   B = B^T > 0

The center of this ellipsoid is d, and its volume is proportional to
det(B). The objective here is to find the entries in the matrix B and the
vector d. To ensure that the ellipsoid is contained within the polytope, it
must be ensured that, for all y such that ||y|| ≤ 1,

    a_i^T (By + d) ≤ b_i

Therefore, it must be true that sup_(||y|| ≤ 1) (a_i^T B y + a_i^T d) ≤ b_i,
or in other words, ||B a_i|| ≤ b_i - a_i^T d. The optimization problem may
now be set up as

    maximize log det B
    subject to B = B^T > 0
               ||B a_i|| ≤ b_i - a_i^T d,   i = 1, 2, ..., L

This is a convex optimization problem (6) in the variables B and d, with a
total dimension of n(n + 1)/2 variables corresponding to the entries in B
and n variables corresponding to those in d.
CONCLUSION
This overview has presented an outline of convex programming. The use of
specialized techniques that exploit the convexity properties of the problem
has led to rapid recent advances in efficient solution techniques for convex
programs, which have been outlined here. The applications of convex
optimization to real problems in engineering design have been illustrated.
BIBLIOGRAPHY
1. W. Stadler, Multicriteria Optimization in Engineering and in
the Sciences, New York: Plenum Press, 1988.
2. N. Karmarkar, A new polynomial-time algorithm for linear
programming, Combinatorica, 4: 373395, 1984.
3. A. V. Fiacco and G. P. McCormick, Nonlinear Programming,
New York: Wiley, 1968.
4. J. Renegar, A polynomial-time algorithm, based on Newton's method, for
linear programming, Math. Programming, 40: 59-93, 1988.
5. C. C. Gonzaga, An algorithm for solving linear programming problems in
O(n^3 L) operations, in N. Megiddo (ed.), Progress in Mathematical
Programming: Interior Point and Related Methods, New York: Springer-Verlag,
1988, pp. 1-28.
6. Y. Nesterov and A. Nemirovskii, Interior-Point Polynomial
Algorithms in Convex Programming, SIAMStudies in Applied
Mathematics series, Philadelphia, PA: Society for Industrial
and Applied Mathematics, 1994.
7. P. M. Vaidya, An algorithm for linear programming which requires
O(((m+n)n^2 + (m+n)^1.5 n)L) arithmetic operations, Math. Programming, 47:
175-201, 1990.
8. Y. Ye, An O(n^3 L) potential reduction algorithm for linear programming,
Math. Programming, 50: 239-258, 1991.
9. R. T. Rockafellar, Convex Analysis, Princeton, NJ:Princeton
University Press, 1970.
10. D. G. Luenberger, Linear and Nonlinear Programming, Read-
ing, MA: Addison-Wesley, 1984.
11. P. E. Gill, W. Murray and M. H. Wright, Numerical Linear
Algebra and Optimization, Reading, MA: Addison-Wesley,
1991.
12. A. J. Schrijver, Theory of Linear and Integer Programming,
New York: Wiley, 1986.
13. D. den Hertog, Interior Point Approach to Linear, Quadratic
and Convex Programming, Boston, MA: Kluwer Academic
Publishers, 1994.
14. J. A. dos Santos Gromicho, Quasiconvex Optimization and
Location Theory, Amsterdam, The Netherlands:Thesis Pub-
lishers, 1995.
15. L. Vandenberghe and S. Boyd, Semidenite programming,
SIAM Rev., 38 (1): 4995, 1996.
16. J. G. Ecker, Geometric programming: methods, computations
and applications, SIAM Rev., 22 (3): 338362, 1980.
17. R. J. Dufn and E. L. Peterson, Geometric Programming:
Theory and Application, New York: Wiley, 1967.
18. K. Kasamsetty, M. Ketkar, and S. S. Sapatnekar, A new class of convex
functions for delay modeling and their application to the transistor sizing
problem, IEEE Trans. on Comput.-Aided Design, 19 (7): 779-788, 2000.
19. S. Boyd and L. Vandenberghe, Introduction to Convex Optimi-
zationwithEngineeringApplications, Lecture notes, Electrical
Engineering Department, Stanford University, CA, 1995.
Available http://www-isl.stanford.edu/~boyd.
20. S. W. Director and G. D. Hachtel, The simplicial approximation approach
to design centering, IEEE Trans. Circuits Syst., CAS-24 (7): 363-372, 1977.
21. S. S. Sapatnekar, P. M. Vaidya, and S. M. Kang, Convexity-based
algorithms for design centering, IEEE Trans. Comput.-Aided Des. Integr.
Circuits Syst., 13 (12): 1536-1549, 1994.
22. A. Prekopa, Logarithmic concave measures and other topics, in M.
Dempster (ed.), Stochastic Programming, London, England: Academic Press,
1980, pp. 63-82.
23. S. S. Sapatnekar and S. M. Kang, Design Automation for
Timing-driven Layout Synthesis, Boston, MA:Kluwer Aca-
demic Publishers, 1993.
24. J. Fishburn and A. E. Dunlop, TILOS: a posynomial programming approach
to transistor sizing, Proc. IEEE Int. Conf. Comput.-Aided Des., 1985, pp.
326-328.
25. S. S. Sapatnekar, V. B. Rao, P. M. Vaidya, and S. M. Kang, An exact
solution to the transistor sizing problem using convex optimization, IEEE
Trans. Comput.-Aided Des. Integr. Circuits Syst., 12 (11): 1621-1632, 1993.
26. L. Vandenberghe, S. Boyd, and A. El Gamal, Optimal wire and transistor
sizing for circuits with non-tree topology, Proc. IEEE Int. Conf.
Comput.-Aided Des., 1997, pp. 252-259.
SACHIN S. SAPATNEKAR
University of Minnesota
Minneapolis, Minnesota
FRACTALS
Even though no consensus exists on a mathematical definition of what
constitutes a fractal, it is usually clear what one means by a fractal. A
fractal object has strong scaling behavior. That is, some relation exists
between the behavior at one scale and at (some) finer scales.
Benoit Mandelbrot suggested defining a fractal as an object that has a
fractional dimension (see the section on Fractal Dimension). This definition
captures the idea that a fractal has complex behavior at all size scales.
Figure 1 illustrates some geometric fractals (that is, the behavior is
geometrical). The first two images are exactly self-similar fractals. Both
of them consist of unions of shrunken copies of themselves. In the first
fractal (the one that looks like a complicated plus sign), five copies of
the large figure combine together, and in the second fractal (the
twindragon tile), it only requires two copies. The third fractal in the
figure is not exactly self-similar. It is made up of two distorted copies of
the whole. The fourth fractal is an image of a Romanesco Broccoli, which is
an amazingly beautiful vegetable! The fractal structure of this plant is
evident, but it is clearly not an exact fractal.
In Fig. 2, the well-known Mandelbrot Set is explored by successively
magnifying a small portion from one frame into the next frame. The first
three images in the series indicate the region that has been magnified for
the next image. These images illustrate the fact that the Mandelbrot Set has
infinitely fine details that look similar (but not exactly the same) at all
scales.
One amazing feature of many fractal objects is that the process by which
they are defined is not overly complicated. The fact that these seemingly
complex systems can arise from simple rules is one reason that fractal and
chaotic models have been used in many areas of science. These models provide
the possibility to describe complex interactions among many simple agents.
Fractal models cannot efficiently describe all complex systems, but they are
incredibly useful in describing some of these systems.
THE MANDELBROT SET
The Mandelbrot Set (illustrated in the first frame of Fig. 2), named after
Benoit Mandelbrot (who coined the term fractal and introduced fractals as a
subject), is one of the most famous of all fractal objects. The Mandelbrot
Set is definitely not an exactly self-similar fractal. Zooming in on the set
continues to reveal details at every scale, but this detail is never exactly
the same as at previous scales. It is conformally distorted, so it has many
features that are the same, but not exactly the same.
Although the definition of the Mandelbrot Set might seem complicated, it is
(in some sense) much less complicated than the Mandelbrot Set itself.
Consider the polynomial with complex coefficients Q_c(x) = x^2 + c, where
c = a + bi is some (fixed) complex number. We are interested in what happens
when we iterate this polynomial starting at x = 0. It turns out that one of
two things happens, depending on c. Either Q_c^n(0) tends to infinity or it
stays bounded for all time. That is, either the iterates get larger and
larger, eventually approaching infinitely large, or they stay bounded and
never get larger than a certain number.
The Mandelbrot Set is defined as

    M = {c : Q_c^n(0) stays bounded as n grows}

that is, it is those c in the complex plane for which the iteration does not
tend to infinity. Most images of the Mandelbrot Set are brilliantly colored.
This coloration is really just an artifact, but it indicates how fast those
points that are not in the Mandelbrot Set tend to infinity. The Mandelbrot
Set itself is usually colored black. In our gray-level images of the
Mandelbrot Set (Fig. 2), the Mandelbrot Set is in black.
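A minimal escape-time sketch makes the definition concrete: iterate Q_c(x) = x^2 + c from x = 0 and report how long it takes |x| to exceed 2 (once the modulus exceeds 2 the iterates must diverge; the cap of 100 iterations is an arbitrary choice).

    def escape_time(c, max_iter=100):
        x = 0j
        for k in range(max_iter):
            x = x * x + c
            if abs(x) > 2:
                return k          # escaped: c is not in the Mandelbrot Set
        return None               # no escape observed: c is taken to be in the set

    print(escape_time(0.0 + 0.0j))    # None: 0 is in the set
    print(escape_time(-1.0 + 0.0j))   # None: the iterates cycle between -1 and 0
    print(escape_time(1.0 + 0.0j))    # a small number: 1 is not in the set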
FRACTAL DIMENSION
Several parameters can be associated with a fractal object. Of these
parameters, fractal dimension (of which there are several variants) is one
of the most widely used. Roughly speaking, this dimension measures the
scaling behavior of the object by comparing it to a power law.
An example will make it more clear. Clearly the line segment L = [0,1] has
dimension equal to one. One way to think about this is that, if we scale L
by a factor of s, the size of L changes by a factor of s^1. That is, if we
reduce it by a factor of 1/2, the new copy of L has 1/2 the length of the
original L.
Similarly, the square S = [0,1] × [0,1] has dimension equal to two because,
if we scale it by a factor of s, the size (in this case area) scales by a
factor of s^2. How do we know that the size scales by a factor of s^2? If
s = 1/3, say, then we see that we can tile S by exactly 9 = 3^2 reduced
copies of S, which means that each copy has size (1/3)^2 times the size of
the original.
For a fractal example, we rst discuss a simple method
for constructing geometric fractals using an initiator and
a generator. The idea is that we start with the initiator
and replace each part of it with the generator at each
iteration. We see this in Figs. 3 and 4. Figure 3 illustrates
the initiator and generator for the von Koch curve, and Fig.
4 illustrates the iterative stages in the construction, where
at each stage we replace each line segment with the gen-
erator. The limiting object is called the von Koch curve.
In this case, the von Koch curve K is covered by four smaller copies of itself, each copy having been reduced in size by a factor of three. Thus, for s = 1/3, we need four reduced copies to tile K. This gives

size of original copy = 4 × size of smaller copy = 4 × (1/3)^D × size of original copy
so 4 × (1/3)^D = 1, or D = log(4)/log(3). That is, for the von Koch curve, if we shrink it by a factor of 1/3, then the size gets reduced by a factor of (1/3)^(log(4)/log(3)) = 1/4. In this sense, the von Koch curve has dimension log(4)/log(3) ≈ 1.2619, so it has a dimension that is fractional and is a fractal.
This way of computing a fractal dimension is very intuitive, but it only works for sets that are exactly self-similar (and thus is called the similarity dimension). We need a different method/definition for the fractal dimension of objects that are not exactly self-similar. The diffusion-limited aggregation (DLA) fractal shown later in Fig. 16 is a fractal, but it is not strictly self-similar.
One very common way to estimate the dimension of such an object is the box counting method. To do this, cover the image with a square grid (of side length e) and count how many of these boxes are occupied by a point in the image. Let N(e) be this number. Now repeat this for a sequence of finer and finer grids (letting e tend to 0). We want to fit the relation N(e) = a·e^(−D), so we take logarithms of both sides to get log(N) = log(a) − D·log(e). To estimate D, we plot log(N(e)) versus log(e) and find the slope of the least-squares line.
For the object in Fig. 16, we have the data in Table 1,
which gives a fractal dimension of 1.60.
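For a concrete, purely illustrative version of this procedure, the following Python sketch estimates the box-counting dimension of a binary image; it assumes numpy, a square image, and box sizes that divide the image side, none of which come from the article.

import numpy as np

def box_counting_dimension(img, scales=(1, 2, 4, 8, 16, 32)):
    # img    : square 2-D 0/1 array whose nonzero pixels form the object
    # scales : box side lengths e (in pixels)
    # Returns the slope of the least-squares line of log N(e) versus log(1/e),
    # which is the box-counting estimate of the dimension D.
    counts = []
    n = img.shape[0]
    for e in scales:
        # Partition the image into e x e boxes and count the occupied boxes.
        crop = img[:n - n % e, :n - n % e]
        boxes = crop.reshape(crop.shape[0] // e, e, -1, e)
        counts.append(boxes.any(axis=(1, 3)).sum())
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope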
It is interesting to apply this method to exactly self-similar objects, such as the von Koch curve. For this curve, it is convenient to use boxes of size e = 3^(−n) and to align them so that they form a nice covering of the von Koch curve (that is, we do not necessarily have to have the boxes form a grid). Assume that the initial lengths of the line segments in the generator for the von Koch curve are all equal to 1/3. In this case, it is easy to see that we require 4^n boxes of side length 1/3^n to cover the curve, so we solve

4^n = a·(1/3^n)^(−D), which gives a = 1 and D = log(4)/log(3)

as before.
Figure 1. Some fractal shapes.
Figure 2. Zooming into the Mandelbrot Set.
Figure 3. The initiator and generator for the von Koch curve.
Figure 4. Construction of the von Koch curve.
Table 1.
  e        N(e)
  1/2      4
  1/4      16
  1/8      52
  2^-4     174
  2^-5     580
  2^-6     1893
  2^-7     6037
  2^-8     17556
  2^-9     44399
  2^-10    95432
IFS FRACTALS
The Iterated Function Systems (IFS, for short) are a mathematical way of formalizing the notion of self-similarity. An IFS is simply a collection of functions from some space to itself, w_i : X → X. Usually it is required that these functions are contractive, in that the distance between the images of any two points is strictly less than the distance between the original points. Mathematically, w is a contraction if there is some number s with 0 ≤ s < 1 and d(w(x), w(y)) ≤ s·d(x, y) for any x and y, where d(x, y) somehow measures the distance between the points x and y. In this case, it is well known (by the Banach Contraction Theorem) that there is a unique fixed point x with w(x) = x.
Given an IFS {w_i}, the attractor of the IFS is the unique set A (A should be nonempty and compact, to be technically precise) which satisfies the self-tiling (or fixed point) condition

A = w_1(A) ∪ w_2(A) ∪ ... ∪ w_n(A)

so that A is made up of smaller copies of itself (smaller by the individual w_i, that is). Under some general conditions this attractor always exists and is uniquely specified by the functions w_i in the IFS. Furthermore, one can recover the attractor A by starting with any set B and iterating the IFS. The iteration proceeds by first generating the n distorted and smaller copies of B (distorted and shrunk by the individual w_i's) and combining them to get a new set B_1

B_1 = w_1(B) ∪ w_2(B) ∪ ... ∪ w_n(B)

Repeating this, we get a sequence of sets B_2, B_3, ..., B_n, ..., which will converge to the attractor A.
To illustrate, take the three contractions w_1, w_2, w_3 given by

w_1(x, y) = (x/2, y/2);  w_2(x, y) = (x/2 + 1/2, y/2);  w_3(x, y) = (x/2, y/2 + 1/2)

Each of these three maps shrinks all distances by a factor of two (because we are reducing in both the horizontal and the vertical direction by a factor of 2). These maps are examples of similarities, because they preserve all angles and lines but only reduce lengths; geometric relationships within an object are (mostly) preserved.
We think of these maps as acting on the unit square, that is, all points (x, y) with 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. In this case, the action of each map is rather simple and is illustrated in Fig. 5. For instance, map w_2 takes whatever is in the square, shrinks it by a factor of two in each direction, and then places it in the lower right-hand corner of the square.
The attractor of this 3-map IFS is the Sierpinski Gasket, which is illustrated in Fig. 6. The self-tiling property is clear, because the Sierpinski Gasket is made up of three smaller copies of itself. That is, S = w_1(S) ∪ w_2(S) ∪ w_3(S).
Now, one amazing thing about IFS is that iterating the IFS on any initial set will yield the same attractor in the limit. We illustrate this in Fig. 7 with two different starting images. The attractor of the IFS is completely encoded by the IFS maps themselves; no additional information is required.
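The deterministic iteration just described is easy to carry out numerically; the following Python sketch (ours, with an arbitrary starting point and iteration count) applies the three Sierpinski Gasket maps defined above to a finite point set.

def iterate_ifs(points, maps, iterations=8):
    # Deterministic IFS iteration: repeatedly replace the current point set
    # by the union of its images under every map, approximating the attractor.
    for _ in range(iterations):
        points = [w(x, y) for (x, y) in points for w in maps]
    return points

# The three similarities whose attractor is the Sierpinski Gasket.
maps = [
    lambda x, y: (x / 2, y / 2),
    lambda x, y: (x / 2 + 0.5, y / 2),
    lambda x, y: (x / 2, y / 2 + 0.5),
]

# Starting from a single (arbitrary) point, 8 iterations already give
# 3**8 = 6561 points that closely outline the gasket.
approx = iterate_ifs([(0.3, 0.7)], maps, iterations=8)
print(len(approx))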
Figure 5. Action of the three maps.
Figure 6. The Sierpinski Gasket.
Figure 7. Two different deterministic iterations.
The Collage Theorem
Given an IFS it is easy to compute the attractor of the IFS by iteration. The inverse problem is more interesting, however. Given a geometric object, how does one come up with an IFS whose attractor is close to the given object?
The Collage Theorem provides one answer to this question. This theorem is a simple inequality but has strong practical applications.
Collage Theorem. Suppose that T is a contraction with contraction factor s < 1 and A is the attractor of T (so that T(A) = A). Then for any B we have

d(A, B) ≤ d(T(B), B) / (1 − s)
What does this mean? In practical terms, it says that, if we want to find an IFS whose attractor is close to a given set B, then what we do is look for an IFS that does not move the set B very much. Consider the maple leaf in Fig. 8. We see that we can collage the leaf with four transformed copies of itself. Thus, the IFS that is represented by these four transformations should have an attractor that is close to the maple leaf. The second image in Fig. 8 is the attractor of the given IFS, and it is clearly very similar to a maple leaf.
As another example, it is easy to find the collage to generate the attractor in Fig. 9. It requires 11 maps in the IFS.
However, the real power of the Collage Theorem comes in when one wants to find an IFS for more complicated self-affine fractals or for fractal objects that are not self-similar. One such application comes when using IFS for image representation (see the section on Fractals and Image Compression).
Fractal Functions
Geometric fractals are interesting and useful in many applications as models of physical objects, but many times one needs a functional model. It is easy to extend the IFS framework to construct fractal functions.
There are several different IFS frameworks for constructing fractal functions, but all of them have a common core, so we concentrate on this core. We illustrate the ideas by constructing fractal functions on the unit interval; that is, we construct functions f : [0, 1] → R. Take the three mappings w_1(x) = x/4, w_2(x) = x/2 + 1/4, and w_3(x) = x/4 + 3/4, and notice that [0, 1] = w_1([0, 1]) ∪ w_2([0, 1]) ∪ w_3([0, 1]), so that the images of [0, 1] under the w_i's tile [0, 1].
Choose three numbers a_1, a_2, a_3 and three other numbers b_1, b_2, b_3 and define the operator T by

T(f)(x) = a_i · f(w_i^(−1)(x)) + b_i   if x ∈ w_i([0, 1])

where f : [0, 1] → R is a function. Then clearly T(f) : [0, 1] → R is also a function, so T takes functions to functions.
There are various conditions under which T is a contraction. For instance, if |a_i| < 1 for each i, then T is contractive in the supremum norm given by

||f||_sup = sup_{x ∈ [0, 1]} |f(x)|

so T^n(f) converges uniformly to a unique fixed point for any starting function f.
Figure 10 illustrates the limiting fractal functions in the case where a_1 = a_2 = a_3 = 0.3 and b_1 = b_2 = b_3 = 1.
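The following Python sketch (not from the article) iterates the operator T on a sampled function; the inverse maps follow from w_1, w_2, w_3 defined above, while the grid size, the iteration count, and the particular a and b values are arbitrary illustrative choices.

def apply_T(f, xs, a=(0.3, 0.3, 0.3), b=(0.0, 0.6, 0.2)):
    # One application of T to samples of f over the grid xs in [0, 1]:
    # find which w_i([0,1]) contains x, pull x back through w_i^{-1}, and
    # evaluate a_i * f(w_i^{-1}(x)) + b_i, interpolating f between samples.
    def interp(t):
        idx = min(int(t * (len(xs) - 1)), len(xs) - 2)
        frac = t * (len(xs) - 1) - idx
        return f[idx] * (1 - frac) + f[idx + 1] * frac

    out = []
    for x in xs:
        if x < 0.25:            # x in w_1([0,1]); w_1^{-1}(x) = 4x
            i, t = 0, 4 * x
        elif x < 0.75:          # x in w_2([0,1]); w_2^{-1}(x) = 2x - 1/2
            i, t = 1, 2 * x - 0.5
        else:                   # x in w_3([0,1]); w_3^{-1}(x) = 4x - 3
            i, t = 2, 4 * x - 3
        out.append(a[i] * interp(t) + b[i])
    return out

xs = [k / 512 for k in range(513)]
f = [0.0] * len(xs)
for _ in range(30):             # T^n(f) converges uniformly since |a_i| < 1
    f = apply_T(f, xs)
print(round(min(f), 3), round(max(f), 3))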
It is also possible to formulate contractivity conditions in other norms, such as the L^p norms. These conditions tend to be weaker, so they apply in more situations. However, the type of convergence is clearly different (the functions need not converge pointwise anywhere, for instance, and may be unbounded).
The same type of construction can be applied for func-
tions of two variables. In this case, we can think of such
functions as grayscale images on a screen. For example,
Figure 8. Collage for maple leaf and attractor of the IFS.
Figure 9. An easy collage to nd.
Figure 10. The attractor of an IFS on functions.
using the four geometric mappings

w_1(x, y) = (x/2, y/2),   w_2(x, y) = ((x + 1)/2, y/2),
w_3(x, y) = (x/2, (y + 1)/2),   w_4(x, y) = ((x + 1)/2, (y + 1)/2)

and the a and b values given by

      w_1   w_2   w_3   w_4
  a   0.5   0.5   0.4   0.5
  b   80    40    132   0
we get the attractor function in Fig. 11. The value 255
represents white and 0 represents black.
FRACTALS AND IMAGE COMPRESSION
The idea of using IFS for compressing images occurred to Barnsley and his co-workers at the Georgia Institute of Technology in the mid-1980s. It seems that the idea arose from the amazing ability of IFS to encode rather complicated images with just a few parameters. As an example, the fern leaf in Fig. 12 is defined by four affine maps, so it is encoded by 24 parameters.
Of course, this fern leaf is an exact fractal and most images are not. The idea is to try to find approximate self-similarities, because it is unlikely there will be many exact self-similarities in a generic image. Furthermore, it is also highly unlikely that there will be small parts of the image that look like reduced copies of the entire image. Thus, the idea is to look for small parts of the image that are similar to large parts of the same image. There are many different ways to do this, so we describe the most basic here.
Given an image, we form two partitions of the image. First, we partition the image into large blocks; we will call these the parent blocks. Then we partition the image into small blocks; here we take them to be one half the size of the large blocks. Call these blocks child blocks. Figure 13 illustrates this situation. The blocks in this figure are made large enough to be clearly observed and are much too large to be useful in an actual fractal compression algorithm.
Given these two block partitions of the image, the fractal block encoding algorithm works by scanning through all the small blocks and, for each such small block, searching among the large blocks for the best match. The likelihood of finding a good match for all of the small blocks is not very high. To compensate, we are allowed to modify the large block. Inspired by the idea of an IFS on functions, the algorithm is:
1. for SB in small blocks do
2. for LB in large blocks do
Figure 11. The attractor of an IFS on functions f : R^2 → R.
Figure 12. The fern leaf.
Figure 13. The basic block decomposition.
3. Downsample LB to the same size as SB
4. Use least squares to find the best parameters a and b for this combination of LB and SB. That is, to make SB ≈ a·LB + b.
5. Compute the error for these parameters. If the error is smaller than for any other LB, remember this pair along with the a and b.
6. end for
7. end for
At the end of this procedure, for each small block, we have found an optimally matching large block along with the a and b parameters for the match. This list of triples (large block, a, b) forms the encoding of the image. It is remarkable that this simple algorithm works! Figure 14 illustrates the first, second, third, fourth, and tenth iterations of the reconstruction. One can see that the first stage of the reconstruction is basically a downsampling of the original image to the small block partition. The scheme essentially uses the b parameters to store this coarse version of the image and then uses the a parameters along with the choice of which parent block matched a given child block to extrapolate the fine detail in the image from this coarse version.
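A hypothetical Python rendering of the inner matching step (step 4 of the algorithm above): downsample the large block and solve the one-dimensional least-squares problem for a and b. The function name, the use of numpy, and the averaging downsampler are our own choices; the text only prescribes that child blocks are half the size of parent blocks.

import numpy as np

def match_block(child, parent):
    # child  : 2-D array (the small block SB)
    # parent : 2-D array with twice the side length of child (the large block LB)
    # Returns (a, b, squared_error); the encoder keeps, for each child block,
    # the parent block that minimizes this error.
    h, w = child.shape
    # Downsample the parent by averaging 2x2 pixel groups to the child's size.
    down = parent.reshape(h, 2, w, 2).mean(axis=(1, 3))
    x, y = down.ravel(), child.ravel()
    a, b = np.polyfit(x, y, 1)            # least squares for y ~ a*x + b
    err = np.sum((a * x + b - y) ** 2)
    return a, b, err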
STATISTICALLY SELF-SIMILAR FRACTALS
Many times a fractal object is not self-similar in the IFS sense, but it is self-similar in a statistical sense. That is, either it is created by some random self-scaling process (so the steps from one scale to the next are random) or it exhibits similar statistics from one scale to another. In fact, most naturally occurring fractals are of this statistical type.
The object in Fig. 15 is an example of this type of fractal. It was created by taking random steps from one scale to the next. In this case, it is a subset of an exactly self-similar fractal.
These types of fractals are well modeled by random IFS models, where there is some randomness in the choice of the maps at each stage.
MORE GENERAL TYPES OF FRACTALS
IFS-type models are useful as approximations for self-similar or almost self-similar objects. However, often these models are too hard to fit to a given situation. For these cases we have a choice: we can either build some other type of model governing the growth of the object, or we can give up on finding a model of the interscale behavior and just measure aspects of this behavior.
An example of the first instance is a simple model for DLA, which is illustrated in Fig. 16. In this growth model, we start with a seed and successively allow particles to drift
Figure 14. Reconstruction: 1, 2, 3, 4, and 10 iterations and the original image.
Figure 15. A statistically self-similar fractal.
Figure 16. A DLA fractal.
around until they stick to the developing object. Clearly the resulting figure is fractal, but our model has no explicit interscale dependence. However, these models allow one to do simulation experiments to fit data observed in the laboratory. They also allow one to measure more global aspects of the model (like the fractal dimension).
Fractal objects also frequently arise as so-called strange attractors in chaotic dynamical systems. One particularly famous example is the butterfly-shaped attractor in the Lorenz system of differential equations, seen in Fig. 17. These differential equations are a toy model of a weather system that exhibits chaotic behavior. The attractor of this system is an incredibly intricate filigree structure of curves. The fractal nature of this attractor is evident from the fact that, as you zoom in on the attractor, more and more detail appears in an approximately self-similar fashion.
FRACTAL RANDOM PROCESSES
Since the introduction of fractional Brownian motion (fBm) in 1968 by Benoit Mandelbrot and John van Ness, self-similar stochastic processes have been used to model a variety of physical phenomena (including computer network traffic and turbulence). These processes have power spectral densities that decay like 1/f^α.
A fBm is a Gaussian process x(t) with zero mean and covariance

E[x(t)x(s)] = (σ²/2) [ |t|^(2H) + |s|^(2H) − |t − s|^(2H) ]

and is completely characterized by the Hurst exponent H and the variance E[x(1)²] = σ². fBm is statistically self-similar in the sense that, for any scaling a > 0, we have

x(at) ≐ a^H x(t)

where by equality we mean equality in distribution. (As an aside and an indication of one meaning of H, the sample paths of fBm with parameter H are almost surely Hölder continuous with parameter H, so the larger the value of H, the smoother the sample paths of the fBm.) Because of this scaling behavior, fBm exhibits very strong long-time dependence. It is also clearly not a stationary (time-invariant) process, which causes many problems with traditional methods of signal synthesis, signal estimation, and parameter estimation. However, wavelet-based methods work rather well, as the scaling and finite time behavior of wavelets matches the scaling and nonstationarity of the fBm. With an appropriately chosen (i.e., sufficiently smooth) wavelet, the wavelet coefficients of an fBm become a stationary sequence without the long-range dependence properties. This aids in the estimation of the Hurst parameter H.
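As a rough, illustrative Python sketch (not from the article): draw a discrete fBm path by factoring the covariance given above, then estimate H from the scaling of Haar wavelet detail variances, which grow roughly like 2^(j(2H+1)) across levels j. Haar is used only for simplicity (the text asks for a sufficiently smooth wavelet), and all names, sizes, and parameters here are ours.

import numpy as np

def fbm_sample(n, H, sigma=1.0, seed=0):
    # One fBm path at times t = 1..n via Cholesky factorization of the
    # covariance E[x(t)x(s)] = (sigma^2/2)(|t|^2H + |s|^2H - |t-s|^2H).
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1, dtype=float)
    cov = 0.5 * sigma**2 * (t[:, None]**(2*H) + t[None, :]**(2*H)
                            - np.abs(t[:, None] - t[None, :])**(2*H))
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)

def estimate_hurst(x, levels=6):
    # log2 Var(d_j) of Haar details grows roughly linearly in j with
    # slope 2H + 1, so H is recovered from a least-squares fit.
    a = np.asarray(x, dtype=float)
    js, log_vars = [], []
    for j in range(1, levels + 1):
        m = len(a) // 2
        d = (a[0:2*m:2] - a[1:2*m:2]) / np.sqrt(2)   # Haar details, level j
        a = (a[0:2*m:2] + a[1:2*m:2]) / np.sqrt(2)   # Haar approximations
        js.append(j)
        log_vars.append(np.log2(np.var(d)))
    slope, _ = np.polyfit(js, log_vars, 1)
    return (slope - 1) / 2

x = fbm_sample(2048, H=0.7)
print(estimate_hurst(x))    # should come out roughly near 0.7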
Several generalizations of fBm have been defined, including multifractional Brownian motion (mBm). This particular generalization allows the Hurst parameter to be a changing function of time H(t) in a continuous way. Since the sample paths of fBm have Hölder continuity H (almost surely), this is particularly interesting for modeling situations where one expects the smoothness of the sample paths to vary over time.
FRACTALS AND WAVELETS
We briefly mention the connection between fractals and wavelets. Wavelet analysis has become a very useful part of any data analyst's toolkit. In many ways, wavelet analysis is a supplement (and, sometimes, replacement) for Fourier analysis; the wavelet functions replace the usual sine and cosine basis functions.
The connection between wavelets and fractals comes about because wavelet functions are nearly self-similar functions. The so-called scaling function is a fractal function, and the mother wavelet is simply a linear combination of copies of this scaling function. This scaling behavior of wavelets makes them particularly nice for examining fractal data, especially if the scaling in the data matches the scaling in the wavelet functions. The coefficients that come from a wavelet analysis are naturally organized in a hierarchy of information from different scales, and hence doing a wavelet analysis can help one to find scaling relations in data, if such relations exist.
Of course, wavelet analysis is much more than just an analysis to find scaling relations. There are many different wavelet bases. This freedom in choice of basis gives greater flexibility than Fourier analysis.
FURTHER READING
M. G. Barnsley, Fractals Everywhere, New York: Academic Press,
1988.
M. G. Barnsley and L. Hurd, Fractal Image Compression,
Wellesley, Mass: A.K. Peters, 1993.
M. Dekking, J. Levy Vehel, E. Lutton, and C. Tricot (eds). Fractals:
Theory and Applications in Engineering, London: Springer,
1999.
K. J. Falconer, The Geometry of Fractal Sets, Cambridge, UK:
Cambridge University Press, 1986.
K. J. Falconer, Fractal Geometry: Mathematical Foundations and Applications, Toronto, Canada: Wiley, 1990.
J. Feder, Fractals, New York: Plenum Press, 1988.
Y. Fisher, Fractal Image Compression, Theory and Applications,
New York: Springer, 1995.
Figure 17. The Lorenz attractor.
J. E. Hutchinson, Fractals and self-similarity, Indiana Univ. Math. J. 30: 713–747, 1981.
J. Levy Vehel, E. Lutton, and C. Tricot (eds). Fractals in Enginee-
ring, London: Springer, 1997.
J. Levy Vehel and E. Lutton (eds). Fractals in Engineering: New
Trends in Theory and Applications, London: Springer, 2005.
S. Mallat, A Wavelet Tour of Signal Processing, San Diego, CA:
Academic Press, 1999.
B. Mandelbrot, The Fractal Geometry of Nature, San Francisco,
CA: W. H. Freeman, 1983.
B. Mandelbrot and J. van Ness, Fractional Brownian motions, fractional noises and applications, SIAM Review 10: 422–437, 1968.
H. Peitgen and D. Saupe (eds). The Science of Fractal Images,
New York: Springer, 1998.
H. Peitgen, H. Jürgens, and D. Saupe, Chaos and Fractals: New Frontiers of Science, New York: Springer, 2004.
D. Ruelle, Chaotic Evolution and Strange Attractors: The Statis-
tical Analysis of Time Series for Deterministic Nonlinear Systems,
Cambridge, UK: Cambridge University Press, 1988.
F. MENDIVIL
Acadia University
Wolfville, Nova Scotia, Canada
GRAPH THEORY AND ALGORITHMS
GRAPH THEORY FUNDAMENTALS
A graph G(V, E) consists of a set V of nodes (or vertices) and a set E of pairs of nodes from V, referred to as edges. An edge may have associated with it a direction, in which case the graph is called directed (as opposed to undirected), or a weight, in which case the graph is called weighted. Given an undirected graph G(V, E), two nodes u, v ∈ V for which an edge e = (u, v) exists in E are said to be adjacent, and edge e is said to be incident on them. The degree of a node is the number of edges incident on it. A (simple) path is a sequence of distinct nodes (a_0, a_1, ..., a_k) of V such that every two consecutive nodes in the sequence are adjacent. A (simple) cycle is a sequence of nodes (a_0, a_1, ..., a_k, a_0) such that (a_0, a_1, ..., a_k) is a path and a_k, a_0 are adjacent. A graph is connected if there is a path between every pair of nodes. A graph G'(V', E') is a subgraph of graph G(V, E) if V' ⊆ V and E' ⊆ E. A spanning tree of a connected graph G(V, E) is a subgraph of G that comprises all nodes of G and has no cycles. An independent set of a graph G(V, E) is a subset V' ⊆ V of nodes such that no two nodes in V' are adjacent in G. A clique of a graph G(V, E) is a subset V' ⊆ V of nodes such that any two nodes in V' are adjacent in G. Given a subset V' ⊆ V, the induced subgraph of G(V, E) by V' is a subgraph G'(V', E'), where E' comprises all edges (u, v) in E with u, v ∈ V'.
In a directed graph, each edge (sometimes referred to as an arc) is an ordered pair of nodes, and the graph is denoted by G(V, A). For an edge (u, v) ∈ A, v is called the head and u the tail of the edge. The number of edges for which u is a tail is called the out-degree of u, and the number of edges for which u is a head is called the in-degree of u. A (simple) directed path is a sequence of distinct nodes (a_0, a_1, ..., a_k) of V such that (a_i, a_{i+1}), for all i with 0 ≤ i ≤ k − 1, is an edge of the graph. A (simple) directed cycle is a sequence of nodes (a_0, a_1, ..., a_k, a_0) such that (a_0, a_1, ..., a_k) is a directed path and (a_k, a_0) ∈ A. A directed graph is strongly connected if there is a directed path between every pair of nodes. A directed graph is weakly connected if there is an undirected path between every pair of nodes.
Graphs (1–5) are a very important modeling tool that can be used to model a great variety of problems in areas such as operations research (6–8), architectural level synthesis (9), computer-aided design (CAD) for very large-scale integration (VLSI) (9,10), and computer and communications networks (11).
In operations research, a graph can be used to model the
assignment of workers to tasks, the distribution of goods
from warehouses to customers, etc. In architectural level
synthesis, graphs are used for the design of the data path
and the control unit. The architecture is specified by a
graph, a set of resources (such as adders and multipliers),
and a set of constraints. Each node corresponds to an
operation that must be executed by an appropriate
resource, and a directed edge from u to v corresponds to
a constraint and indicates that task u must be executed
before task v. Automated data path synthesis is a schedul-
ing problem where each task must be executed for some
time at an appropriate resource. The goal is to minimize the
total time to execute all tasks in the graph subject to the
edge constraints. In computer and communications net-
works, a graph can be used to represent any given inter-
connection, with nodes representing host computers or
routers and edges representing communication links.
In CAD for VLSI, a graph can be used to represent a digital circuit at any abstract level of representation. CAD for VLSI consists of logic synthesis and optimization, verification, analysis and simulation, automated layout, and testing of the manufactured circuits (12). Graphs are used in multilevel logic synthesis and optimization for combinational circuits, and each node represents a function that is used for the implementation of more than one output (9). They model synchronous systems by means of finite state machines and are used extensively in sequential logic optimization (9). Binary decision diagrams are graphical representations of Boolean functions and are used extensively in verification. Algorithms for simulation and timing analysis consider graphs where the nodes typically correspond to gates or flip-flops. Test pattern generation and fault grading algorithms also consider graphs where faults may exist on either the nodes or the edges. In the automated layout process (10), the nodes of a graph may represent modules of logic (partitioning and floor-planning stages), gates or transistors (technology mapping, placement, and compaction stages), or simply pins (routing, pin assignment, and via minimization stages).
There are many special cases of graphs. Some of the most common ones are listed below. A tree is a connected graph that contains no cycles. A rooted tree is a tree with a distinguished node called the root. All nodes of a tree that are distinct from the root and are adjacent to only one node are called leaves. Trees are used extensively in the automated layout process of integrated circuits. The clock distribution network is a rooted tree whose root corresponds to the clock generator circuit and whose leaves are the flip-flops. The power and ground supply networks are rooted trees whose root is either the power or the ground supply mechanism and whose leaves are the transistors in the layout. Many problem formulations in operations research, architectural level synthesis, CAD for VLSI, and networking allow for fast solutions when restricted to trees. For example, the simplified version of the data path scheduling problem where all the tasks are executed on the same type of resource is solvable in polynomial time if the task graph is a rooted tree. This restricted scheduling problem is also common in multiprocessor designs (9).
A bipartite graph is a graph with the property that its node set V can be partitioned into two disjoint subsets V_1 and V_2, with V_1 ∪ V_2 = V, such that every edge in E comprises
one node from V_1 and one node from V_2. Bipartite graphs appear often in operations research (6–8) and in VLSI CAD applications (9,10).
A directed acyclic graph is a directed graph that contains no directed cycles. Directed acyclic graphs can be used to represent combinational circuits in VLSI CAD. Timing analysis algorithms in CAD for VLSI or software timing analysis algorithms of synthesizable hardware description language code operate on directed acyclic graphs.
A transitive graph is a directed graph with the property that for any nodes u, v, w ∈ V for which there exist edges (u, v), (v, w) ∈ A, edge (u, w) also belongs to A. Transitive graphs find applications in resource binding of tasks in architectural level synthesis (9) and in fault collapsing for digital systems testing (12).
A planar graph is a graph with the property that its edges can be drawn on the plane so as not to cross each other. Many problems in the automated layout for VLSI are modeled using planar graphs. Many over-the-cell routing algorithms require the identification of circuit lines to be routed on a single layer. Single-layer routing is also desirable for detailed routing of the interconnects within each routing layer (10).
A cycle graph is a graph that is obtained from a cycle with chords as follows: For every chord (a, b) of the cycle, there is a node v_(a,b) in the cycle graph. There is an edge (v_(a,b), v_(c,d)) in the cycle graph if and only if the respective chords (a, b) and (c, d) intersect. Cycle graphs find application in VLSI CAD, as a channel with two-terminal nets or a switchbox with two-terminal nets can be represented as a cycle graph. Then the problem of finding the maximum number of nets in the channel (or switchbox) that can be routed on the plane amounts to finding a maximum independent set in the respective cycle graph (10).
A permutation graph is a special case of a cycle graph. It is based on the notion of a permutation diagram. A permutation diagram is simply a sequence of N integers in the range from 1 to N (but not necessarily ordered). Given an ordering, there is a node for every integer in the diagram, and there is an edge (u, v) if and only if the integers u, v are not in the correct order in the permutation diagram. A permutation diagram can be used to represent a special case of a permutable channel in a VLSI layout, where all nets have two terminals that belong to opposite channel sides. The problem of finding the maximum number of nets in the permutable channel that can be routed on the plane amounts to finding a maximum independent set in the respective permutation graph. This, in turn, amounts to finding the maximum increasing (or decreasing) subsequence of integers in the permutation diagram (10).
Graphs have also been linked with random processes, yielding what is known as random graphs. Random graphs consist of a given set of nodes, while their edges are added according to some random process. For example, G(n, p) denotes a random graph with n nodes whose edges are chosen independently with probability p. Typical questions that arise in the study of random graphs concern their probabilistic behavior. For example, given n and p, we may want to find the probability that G(n, p) is connected (which has direct application to the reliability of a network whose links fail probabilistically). Or we may want to determine whether there is a threshold on some parameter (such as the average node degree) of the random graph after which the probability of some attribute (such as connectivity, colorability, etc.) changes significantly.
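As a quick illustration (not from the article), a small Python sketch that samples G(n, p) and estimates the probability of connectivity by simulation; the function names and the chosen parameters are ours.

import random

def gnp(n, p, seed=0):
    # Sample a G(n, p) random graph as an adjacency list: each of the
    # C(n, 2) possible edges is included independently with probability p.
    rng = random.Random(seed)
    adj = {u: [] for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def is_connected(adj):
    # Connectivity check by a simple depth-first traversal from node 0.
    seen, stack = {0}, [0]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

# Estimate Pr[G(n, p) is connected]; the probability changes sharply around
# p = ln(n)/n, illustrating the threshold behavior mentioned above.
n, p, trials = 50, 0.1, 200
hits = sum(is_connected(gnp(n, p, seed=t)) for t in range(trials))
print(hits / trials)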
ALGORITHMS AND TIME COMPLEXITY
An algorithm is an unambiguous description of a finite set of operations for solving a computational problem in a finite amount of time. The set of allowable operations corresponds to the operations supported by a specific computing machine (computer) or a model of that machine.
A computational problem comprises a set of parameters that have to satisfy a set of well-defined mathematical constraints. A specific assignment of values to these parameters constitutes an instance of the problem. For some computational problems, there is no algorithm as defined above to find a solution. For example, the problem of determining whether an arbitrary computer program terminates in a finite amount of time given a set of input data cannot be solved (it is undecidable) (13). For the computational problems for which there does exist an algorithm, the point of concern is how efficient that algorithm is. The efficiency of an algorithm is primarily defined in terms of how much time the algorithm takes to terminate. (Sometimes, other considerations such as the space requirement in terms of the physical information storage capacity of the computing machine are also taken into account, but in this exposition we concentrate exclusively on time.)
To formally define the efficiency of an algorithm, the following notions are introduced.
The size of an instance of a problem is defined as the total number of symbols for the complete specification of the instance under a finite set of symbols and a succinct encoding scheme. A succinct encoding scheme is considered to be a logarithmic encoding scheme, in contrast to a unary encoding scheme. The time requirement (time complexity) of an algorithm is then expressed as a function f(n) of the size n of an instance of the problem and gives the total number of basic steps that the algorithm needs to go through to solve that instance. Most of the time, the number of steps is taken with regard to the worst case, although alternative measures like the average number of steps can also be considered. What constitutes a basic step is purposely left unspecified, provided that the time the basic step takes to be completed is bounded from above by a constant, that is, a value not dependent on the instance. This hides implementation details and machine-dependent timings and provides the required degree of general applicability.
An algorithm with time complexity f(n) is said to be of the order of g(n) [denoted as O(g(n))], where g(n) is another function, if there is a constant c such that f(n) ≤ c·g(n) for all n ≥ 0. For example, an algorithm for finding the minimum element of a list of size n takes time O(n), an algorithm for finding a given element in a sorted list takes time O(log n), and algorithms for sorting a list of elements can take time O(n²), O(n log n), or O(n) (the latter when additional information about the range of the elements is known). If,
moreover, there are constants c_L and c_H such that c_L·g(n) ≤ f(n) ≤ c_H·g(n) for all n ≥ 0, then f(n) is said to be Θ(g(n)).
The smaller the order-of function, the more efficient an algorithm is generally taken to be, but in the analysis of algorithms, the term efficient is applied liberally to any algorithm whose order-of function is a polynomial p(n). The latter includes time complexities like O(n log n) or O(n√n), which are clearly bounded by a polynomial. Any algorithm with a nonpolynomial time complexity is not considered to be efficient. All nonpolynomial-time algorithms are referred to as exponential and include algorithms with such time complexities as O(2^n), O(n!), O(n^n), and O(n^(log n)) (the latter is sometimes referred to as subexponential). Of course, in practice, for an algorithm of polynomial-time complexity O(p(n)) to be actually efficient, the degree of the polynomial p(n) as well as the constant of proportionality in the expression O(p(n)) should be small.
In addition, because of the worst-case nature of the O() formulation, an exponential algorithm might exhibit exponential behavior in practice only in rare cases (the latter seems to be the case with the simplex method for linear programming). However, the fact on the one hand that most polynomial-time algorithms for the problems that occur in practice tend indeed to have small polynomial degrees and small constants of proportionality, and on the other that most nonpolynomial algorithms for the problems that occur in practice eventually resort to the trivial approach of exhaustively searching (enumerating) all candidate solutions, justifies the use of the term efficient for only the polynomial-time algorithms.
Given a new computational problem to be solved, it is of course desirable to find a polynomial-time algorithm to solve it. The determination of whether such a polynomial-time algorithm actually exists for that problem is a subject of primary importance. To this end, a whole discipline dealing with the classification of the computational problems and their interrelations has been developed.
P, NP, and NP-Complete Problems
The classification starts technically with a special class of computational problems known as decision problems. A computational problem is a decision problem if its solution can actually take the form of a yes or no answer. For example, the problem of determining whether a given graph contains a simple cycle that passes through every node is a decision problem (known as the Hamiltonian cycle problem). In contrast, if the graph has weights on the edges and the goal is to find a simple cycle that passes through every node and has a minimum sum of edge weights, this is not a decision problem but an optimization problem (the corresponding problem in the example above is known as the Traveling Salesman problem). An optimization problem (sometimes also referred to as a combinatorial optimization problem) seeks to find the best solution, in terms of a well-defined objective function Q(), over a set of feasible solutions. Interestingly, every optimization problem has a decision version in which the goal of minimizing (or maximizing) the objective function Q() in the optimization problem corresponds to the question of whether a solution exists with Q() ≤ k (or Q() ≥ k) in the decision problem, where k is now an additional input parameter to the decision problem. For example, the decision version of the Traveling Salesman problem is, given a weighted graph and an integer K, to determine whether there is a simple cycle that passes through every node and whose sum of edge weights is no more than K.
All decision problems that can be solved in polynomial time comprise the so-called class P (for Polynomial). Another established class of decision problems is the NP class, which consists of all decision problems for which a polynomial-time algorithm can verify whether a candidate solution (which has polynomial size with respect to the original instance) yields a yes or no answer. The initials NP stand for Nondeterministic Polynomial, in that if a yes answer exists for an instance of an NP problem, that answer can be obtained nondeterministically (in effect, guessed) and then verified in polynomial time. Every problem in class P clearly belongs to NP, but the question of whether class NP strictly contains P is a famous unresolved problem. It is conjectured that NP ≠ P, but there has been no actual proof until now. Notice that to simulate the nondeterministic guess in the statement above, an obvious deterministic algorithm would have to enumerate all possible cases, which is an exponential-time task. It is in fact the question of whether such an obvious algorithm is actually the best one can do that has not been resolved.
Showing that an NP decision problem actually belongs to P is equivalent to establishing a polynomial-time algorithm to solve that problem. In the investigation of the relations between problems in P and in NP, the notion of polynomial reducibility plays a fundamental role. A problem R is said to be polynomially reducible to another problem S if the existence of a polynomial-time algorithm for S implies the existence of a polynomial-time algorithm for R. That is, in more practical terms, if the assumed polynomial-time algorithm for problem S is viewed as a subroutine, then an algorithm that solves R by making a polynomially bounded number of calls to that subroutine and taking a polynomial amount of time for some extra work would constitute a polynomial-time algorithm for R.
There is a special class of NP problems with the property that if any one of those problems could be solved polynomially, then so could all NP problems (i.e., NP would be equal to P). These NP problems are known as NP-complete. An NP-complete problem is an NP problem to which every other NP problem reduces polynomially. The first problem to be shown NP-complete was the Satisfiability problem (13). This problem concerns the existence of a truth assignment to a given set of Boolean variables so that the conjunction of a given set of disjunctive clauses formed from these variables and their complements becomes true. The proof (given by Stephen Cook in 1971) was done by showing that every NP problem reduces polynomially to the Satisfiability problem. After the establishment of the first NP-complete case, an extensive and ongoing list of NP-complete problems has been established (13).
Some representative NP-complete problems on graphs
that occur in various areas in digital design automation,
computer networks, and other areas in electrical and com-
puter engineering are listed as follows:
• Longest path: Given a graph G(V, E) and an integer K ≤ |V|, is there a simple path in G with at least K edges?
• Vertex cover: Given a graph G(V, E) and an integer K ≤ |V|, is there a subset V' ⊆ V such that |V'| ≤ K and, for each edge (u, v) ∈ E, at least one of u, v belongs to V'?
• Clique: Given a graph G(V, E) and an integer K ≤ |V|, is there a clique C among the nodes in V such that |C| ≥ K?
• Independent set: Given a graph G(V, E) and an integer K ≤ |V|, is there a subset V' ⊆ V such that |V'| ≥ K and no two nodes in V' are joined by an edge in E?
• Feedback node set: Given a directed graph G(V, A) and an integer K ≤ |V|, is there a subset V' ⊆ V such that |V'| ≤ K and every directed cycle in G has at least one node in V'?
• Hamiltonian cycle: Given a graph G(V, E), does G contain a simple cycle that includes all nodes of G?
• Traveling Salesman: Given a weighted graph G(V, E) and a bound K, is there a cycle that visits each node exactly once and has a total sum of weights no more than K?
• Graph colorability: Given a graph G(V, E) and an integer K ≤ |V|, is there a coloring function f : V → {1, 2, ..., K} such that for every edge (u, v) ∈ E, f(u) ≠ f(v)?
• Graph bandwidth: Given a graph G(V, E) and an integer K ≤ |V|, is there a one-to-one function f : V → {1, 2, ..., |V|} such that for every edge (u, v) ∈ E, |f(u) − f(v)| ≤ K?
• Induced bipartite subgraph: Given a graph G(V, E) and an integer K ≤ |V|, is there a subset V' ⊆ V such that |V'| ≥ K and the subgraph induced by V' is bipartite?
• Planar subgraph: Given a graph G(V, E) and an integer K ≤ |E|, is there a subset E' ⊆ E such that |E'| ≥ K and the subgraph G'(V, E') is planar?
• Steiner tree: Given a weighted graph G(V, E), a subset V' ⊆ V, and a positive integer bound B, is there a subgraph of G that is a tree, comprises at least all nodes in V', and has a total sum of weights no more than B?
• Graph partitioning: Given a graph G(V, E) and two positive integers K and J, is there a partition of V into disjoint subsets V_1, V_2, ..., V_m such that each subset contains no more than K nodes and the total number of edges that are incident on nodes in two different subsets is no more than J?
• Subgraph isomorphism: Given two graphs G(V_G, E_G) and H(V_H, E_H), is there a subgraph H'(V_H', E_H') of H such that G and H' are isomorphic (i.e., there is a one-to-one function f : V_G → V_H' such that (u, v) ∈ E_G if and only if (f(u), f(v)) ∈ E_H')?
The interest in showing that a particular problem R is NP-complete lies exactly in the fact that if it finally turns out that NP strictly contains P, then R cannot be solved polynomially (or, from another point of view, if a polynomial-time algorithm happens to be discovered for R, then NP = P). The process of showing that a decision problem is NP-complete involves showing that the problem belongs to NP and that some known NP-complete problem reduces polynomially to it. The difficulty of this task lies in the choice of an appropriate NP-complete problem to reduce from as well as in the mechanics of the polynomial reduction. An example of an NP-completeness proof is given below.
Theorem 1. The Clique problem is NP-complete.
Proof. The problem belongs clearly to NP, because once a clique C of size |C| ≥ K has been guessed, the verification can be done in polynomial (actually O(|C|²)) time. The reduction is made from a known NP-complete problem, the 3-Satisfiability problem, or 3SAT (13). The latter problem is a special case of the Satisfiability problem in which each disjunctive clause comprises exactly three literals.
Let φ be a 3SAT instance with n variables x_1, ..., x_n and m clauses C_1, ..., C_m. For each clause C_j, 1 ≤ j ≤ m, seven new nodes are considered, each node corresponding to one of the seven minterms that make C_j true. The total size of the set V of nodes thus introduced is 7m. For every pair of nodes u, v in V, an edge is introduced if and only if u and v correspond to different clauses and the minterms corresponding to u and v are compatible (i.e., there is no variable in the intersection of u and v that appears as both a negated and an unnegated literal in u and v).
The graph G(V, E) thus constructed has O(m) nodes and O(m²) edges. Let the lower bound K in the Clique problem be K = m. Suppose first that φ is satisfiable; that is, there is a true-false assignment to the variables x_1, ..., x_n that makes each clause true. This means that, in each clause C_j, exactly one minterm is true and all such minterms are compatible. The set of the nodes corresponding to these minterms has size m and constitutes a clique, because all compatible nodes have edges among themselves.
Conversely, suppose that there is a clique C in G(V, E) of size |C| ≥ m. As there are no edges among any of the seven nodes corresponding to clause C_j, 1 ≤ j ≤ m, the size of the clique must be exactly |C| = m; that is, the clique contains exactly one node from each clause. Moreover, these nodes are compatible, as indicated by the edges among them. Therefore, the union of the minterms corresponding to these nodes yields a satisfying assignment for the 3SAT instance. ∎
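To make the construction concrete, here is a small Python sketch (ours, not from the article) that builds the graph of the reduction from a 3SAT instance; it assumes each clause mentions three distinct variables and encodes the literal x_i as +i and its negation as -i.

from itertools import product

def clique_graph_from_3sat(clauses):
    # Nodes are (clause_index, minterm) pairs, one for each of the 7 truth
    # assignments to a clause's three variables that satisfy it; edges join
    # compatible minterms coming from different clauses.
    nodes = []
    for j, clause in enumerate(clauses):
        variables = [abs(lit) for lit in clause]
        for values in product((False, True), repeat=3):
            assignment = dict(zip(variables, values))
            if any(assignment[abs(l)] == (l > 0) for l in clause):
                nodes.append((j, tuple(sorted(assignment.items()))))
    edges = []
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            (j1, m1), (j2, m2) = nodes[a], nodes[b]
            d1, d2 = dict(m1), dict(m2)
            compatible = all(d1[v] == d2[v] for v in d1 if v in d2)
            if j1 != j2 and compatible:
                edges.append((a, b))
    return nodes, edges

# The formula is satisfiable iff the resulting graph has a clique of size m.
nodes, edges = clique_graph_from_3sat([(1, 2, 3), (-1, 2, -3)])
print(len(nodes), len(edges))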
It is interesting to note that seemingly related problems and/or special cases of many NP-complete problems exhibit different behavior. For example, the Shortest Path problem, related to the Longest Path problem, is polynomially solvable. The Longest Path problem
restricted to directed acyclic graphs is polynomially solvable. The Graph Colorability problem for planar graphs and for K = 4 is also polynomially solvable.
On the other hand, the Graph Isomorphism problem, related to the Subgraph Isomorphism problem but now asking whether the two given graphs G and H are isomorphic, is thought to be neither an NP-complete problem nor a polynomially solvable problem, although this has not been proved. There is a complexity class called Isomorphic-complete, which is thought to be entirely disjoint from both NP-complete and P. However, polynomial-time algorithms exist for Graph Isomorphism when restricted to graphs with node degree bounded by a constant (14). In practice, Graph Isomorphism is easy except for pathologically difficult graphs.
NP-Hard Problems
A generalization of the NP-complete class is the NP-hard class. The NP-hard class is extended to comprise optimization problems and decision problems that do not seem to belong to NP. All that is required for a problem to be NP-hard is that some NP-complete problem reduces polynomially to it. For example, the optimization version of
the Traveling Salesman problem is an NP-hard problem,
because if it were polynomially solvable, the decision
version of the problem (which is NP-complete) would be
trivially solved in polynomial time.
If the decision version of an optimization problem is
NP-complete, then the optimization problem is NP-hard,
because the yes or no answer sought in the decision
version can readily be given in polynomial time once the
optimum solution in the optimization version has been
obtained. But it is also the case that for most NP-hard
optimization problems, a reverse relation holds; that is,
these NP-hard optimization problems can reduce polyno-
mially to their NP-complete decision versions. The strategy
is to use a binary search procedure that establishes the
optimal value after a logarithmically bounded number of
calls to the decision version subroutine. Such NP-hard
problems are sometimes referred to as NP-equivalent.
The latter fact is another motivation for the study of
NP-complete problems: A polynomial-time algorithm for
anyNP-complete (decision) problemwouldactuallyprovide
a polynomial-time algorithm for all such NP-equivalent
optimization problems.
Algorithms for NP-Hard Problems
Once a new problem for which an algorithm is sought is
proven to be NP-complete or NP-hard, the search for a
polynomial-time algorithm is abandoned (unless one seeks
to prove that NP = P), and the following basic four
approaches remain to be followed:
(1) Try to improve as much as possible over the straightforward exhaustive (exponential) search by using techniques like branch-and-bound, dynamic programming, cutting plane methods, and Lagrangian techniques.
(2) For optimization problems, try to obtain a polynomial-time algorithm that finds a solution that is provably close to the optimal. Such an algorithm is known as an approximation algorithm and is generally the next best thing one can hope for to solve the problem.
(3) For problems that involve numerical bounds, try to obtain an algorithm that is polynomial in terms of the instance size and the size of the maximum number occurring in the encoding of the instance. Such an algorithm is known as a pseudopolynomial-time algorithm and becomes practical if the numbers involved in a particular instance are not too large. An NP-complete problem for which a pseudopolynomial-time algorithm exists is referred to as weakly NP-complete (as opposed to strongly NP-complete).
(4) Use a polynomial-time algorithm to find a good solution based on rules of thumb and insight. Such an algorithm is known as a heuristic. No proof is provided about how good the solution is, but well-justified arguments and empirical studies justify the use of these algorithms in practice.
In some cases, before any of these approaches is examined, one should check whether the problem of concern is actually a polynomially or pseudopolynomially solvable special case of the general NP-complete or NP-hard problem. Examples include polynomial algorithms for finding a longest path in a directed acyclic graph and finding a maximum independent set in a transitive graph. Also, some graph problems that are parameterized with an integer k exhibit the fixed-parameter tractability property; that is, for each fixed value of k, they are solvable in time bounded by a polynomial whose degree is independent of k. For example, the problem of whether a graph G has a cycle of length at least k, although NP-complete in general, is solvable in time O(n) for fixed k, where n is the number of nodes in G. [For other parameterized problems, however, such as the Dominating Set, where the question is whether a graph G has a subset of k nodes such that every node not in the subset has a neighbor in the subset, the only solution known has time complexity O(n^(k+1)).]
In the next section, we give some more information about approximation and pseudopolynomial-time algorithms.
POLYNOMIAL-TIME ALGORITHMS
Graph Representations and Traversals
There are two basic schemes for representing graphs in a computer program. Without loss of generality, we assume that the graph is directed [represented by G(V, A)]. Undirected graphs can always be considered as bidirected. In the first scheme, known as the Adjacency Matrix representation, a |V| × |V| matrix M is used, where every row and column of the matrix corresponds to a node of the graph, and entry M(a, b) is 1 if and only if (a, b) ∈ A. This simple representation requires O(|V|²) time and space.
In the second scheme, known as the Adjacency List representation, an array L[1..|V|] of linked lists is used. The linked list starting at entry L[i] contains the set of all nodes that are the heads of all edges with tail node i. The time and space complexity of this scheme is Θ(|V| + |E|).
Both schemes are widely used as part of polynomial-time algorithms for working with graphs (15). The Adjacency List representation is more economical to construct, but locating an edge using the Adjacency Matrix representation is very fast [it takes O(1) time, compared with the O(|V|) time required using the Adjacency List representation]. The choice between the two depends on the way the algorithm needs to access the information on the graph.
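A minimal Python sketch (ours) of the two representations for a small directed graph; it mirrors the trade-off described above between O(1) edge lookup in the matrix and the more compact adjacency list.

def build_representations(n, edges):
    # Build both standard representations of a directed graph with nodes
    # 0..n-1 from an edge list of (tail, head) pairs.
    matrix = [[0] * n for _ in range(n)]   # adjacency matrix: O(|V|^2) space
    adj = [[] for _ in range(n)]           # adjacency list: O(|V| + |E|) space
    for u, v in edges:
        matrix[u][v] = 1
        adj[u].append(v)
    return matrix, adj

matrix, adj = build_representations(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
print(matrix[0][2], adj[0])    # 1 and [1, 2]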
A basic operation on graphs is the graph traversal, where the goal is to systematically visit all nodes of the graph. There are three popular graph traversal methods: depth-first search (DFS), breadth-first search (BFS), and topological search. The last applies only to directed acyclic graphs. Assume that all nodes are marked initially as unvisited and that the graph is represented using an adjacency list L.
Depth-first search traverses a graph following the deepest (forward) direction possible. The algorithm starts by selecting the lowest numbered node v and marking it as visited. DFS selects an edge (v, u), where u is still unvisited, marks u as visited, and starts a new search from node u. After completing the search along all paths starting at u, DFS returns to v. The process is continued until all nodes reachable from v have been marked as visited. If there are still unvisited nodes, the next unvisited node w is selected and the same process is repeated until all nodes of the graph are visited.
The following is a recursive implementation of the subroutine DFS(v) that determines all nodes reachable from a selected node v. L[v] represents the list of all nodes that are the heads of edges with tail v, and array M[u] contains the visited or unvisited status of every node u.

Procedure DFS(v)
  M[v] := visited;
  FOR each node u ∈ L[v] DO
    IF M[u] = unvisited THEN Call DFS(u);
END DFS
The time complexity of DFS(v) is O(|V_v| + |E_v|), where |V_v| and |E_v| are the numbers of nodes and edges that have been visited by DFS(v). The total time for traversing the graph using DFS is O(|V| + |E|).
Breadth-first search visits all nodes at distance k from the lowest numbered node v before visiting any nodes at distance k + 1. Breadth-first search constructs a breadth-first search tree, initially containing only the lowest numbered node. Whenever an unvisited node w is visited in the course of scanning the adjacency list of an already visited node u, node w and edge (u, w) are added to the tree. The traversal terminates when all nodes have been visited. The approach can be implemented using queues so that it terminates in O(|V| + |E|) time.
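A queue-based Python sketch of BFS (ours, not from the article) that returns the breadth-first search tree edges in discovery order:

from collections import deque

def bfs_tree(adj, source=0):
    # Breadth-first search over an adjacency-list graph; nodes at distance k
    # are all discovered before any node at distance k + 1.
    visited = [False] * len(adj)
    visited[source] = True
    queue = deque([source])
    tree_edges = []
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if not visited[w]:
                visited[w] = True
                tree_edges.append((u, w))
                queue.append(w)
    return tree_edges

print(bfs_tree([[1, 2], [3], [3], []]))    # [(0, 1), (0, 2), (1, 3)]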
The final graph traversal method is the topological search, which applies only to directed acyclic graphs. In a directed acyclic graph, some nodes have no incoming edges and some nodes have no outgoing edges. Topological search visits a node only if it has no incoming edges or all its incoming edges have been explored. The approach can also be implemented to run in O(|V| + |E|) time.
Design Techniques for Polynomial-Time Algorithms
We describe below three main frameworks that can be used to obtain polynomial-time algorithms for combinatorial optimization (or decision) problems: (1) greedy algorithms; (2) divide-and-conquer algorithms; and (3) dynamic programming algorithms. Additional frameworks such as linear programming are available, but these are discussed in other parts of this book.
Greedy Algorithms. These algorithms use a greedy (straightforward) approach. As an example of a greedy algorithm, we consider the problem of finding the chromatic number (minimum number of independent sets) of an interval graph. An interval is a horizontal line segment, and each interval i is represented by its left and right endpoints, denoted by l_i and r_i, respectively. This representation of the intervals is also referred to as the interval diagram.
In an interval graph, each node corresponds to an interval, and two nodes are connected by an edge if the two intervals overlap. This problem finds immediate application in VLSI routing (10). The chromatic number corresponds to the minimum number of tracks needed so that all intervals allocated to a track do not overlap with each other.
The greedy algorithm is better described on the interval diagram instead of the interval graph; i.e., it operates on the intervals rather than on the respective nodes of the interval graph. It is also known as the left-edge algorithm because it first sorts the intervals according to their left-edge values l_i and then allocates them to tracks in ascending order of l_i values.
To allocate a track to the currently examined interval i, the algorithm serially examines whether any of the existing tracks can accommodate it. (The existing tracks have been generated to accommodate all intervals j such that l_j ≤ l_i.) Interval i is greedily assigned to the first existing track that can accommodate it. A new track is generated, and interval i is assigned to that track, only if i has a conflict with each existing track.
It can be shown that the algorithm results in the minimum number of tracks; i.e., it achieves the chromatic number of the interval graph. This happens because, if more than one track can accommodate interval i, then any assignment will lead to an optimal solution, because all intervals k with l_k ≥ l_i can be assigned to the unselected tracks.
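A compact Python sketch of the left-edge algorithm described above (our own rendering; it treats intervals that merely touch at an endpoint as non-overlapping):

def left_edge(intervals):
    # Greedy left-edge track assignment for intervals given as (l, r) pairs.
    # Intervals are processed in ascending order of left endpoint and placed
    # on the first track whose last interval ends at or before the new left
    # endpoint; the number of tracks used equals the chromatic number of the
    # corresponding interval graph.
    tracks = []          # tracks[t] = right endpoint of the last interval on track t
    assignment = {}
    for l, r in sorted(intervals):
        for t, last_r in enumerate(tracks):
            if last_r <= l:              # no overlap on track t
                tracks[t] = r
                assignment[(l, r)] = t
                break
        else:                            # conflicts with every track: open a new one
            tracks.append(r)
            assignment[(l, r)] = len(tracks) - 1
    return assignment, len(tracks)

print(left_edge([(0, 3), (2, 5), (4, 7), (6, 9), (1, 8)]))   # uses 3 tracks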
Divide-and-Conquer. This methodology is based on a systematic partition of the input instance in a top-down manner into smaller instances, until small enough instances are obtained for which the solution of the problem
degenerates to trivial computations. The overall optimal solution, i.e., the optimal solution on the original input instance, is then calculated by appropriately combining the already calculated results on the subinstances. The recursive nature of the methodology necessitates the solution of one or more recurrence relations to determine the execution time.
As an example, consider how divide-and-conquer can
be applied to transform a weighted binary tree into a heap.
A weighted binary tree is a rooted directed tree in which
every nonleaf node has out-degree 2, and there is a value
(weight) associated with each node. (Each nonleaf node is
also referred to as parent, and the two nodes it points to
are referred to as its children.) A heap is a weighted binary
tree in which the weight of every node is no smaller than the
weight of either of its children.
The idea is to recursively separate the binary tree into
subtrees starting from the root and considering the sub-
trees rooted at its children, until the leaf nodes are encoun-
tered. The leaf nodes constitute trivially a heap. Then,
inductively, given two subtrees of the original tree that
are rooted at the children of the same parent and have been
made to have the heap property, the subtree rooted at the
parent can be made to have the heap property too by simply
finding which child has the larger weight and exchanging
that weight with the weight of the parent in case the latter
is smaller than the former (repeating the exchange down
that child's subtree, if needed, to restore its heap property).
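The construction can be sketched in Python as follows; the Node class and the function names are illustrative assumptions.

class Node:
    def __init__(self, weight, left=None, right=None):
        self.weight, self.left, self.right = weight, left, right

def make_heap(node):
    """Recursively give the subtree rooted at node the heap property."""
    if node is None or node.left is None:    # a leaf is trivially a heap
        return node
    make_heap(node.left)                     # conquer the two subtrees first
    make_heap(node.right)
    sift_down(node)                          # then fix the root of this subtree
    return node

def sift_down(node):
    """Exchange the root weight with its larger child, repeating downward as needed."""
    while node.left is not None:             # every nonleaf node has two children
        big = node.left if node.left.weight >= node.right.weight else node.right
        if node.weight >= big.weight:
            return
        node.weight, big.weight = big.weight, node.weight
        node = big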
Dynamic Programming. In dynamic programming, the
optimal solution is calculated by starting from the simplest
subinstances and combining the solutions of the smaller
subinstances to solve larger subinstances, in a bottom-up
manner. To guarantee a polynomial-time algorithm, the
total number of subinstances that have to be solved must be
polynomially bounded. Once a subinstance has been solved,
any larger subinstance that needs that subinstance's solu-
tion does not recompute it, but looks it up in a table where it
has been stored. Dynamic programming is applicable only
to problems that obey the principle of optimality. This
principle holds whenever, in an optimal sequence of choices,
each subsequence is also optimal. The difficulty in this
approach is to come up with a decomposition of the problem
into a sequence of subproblems for which the principle of
optimality holds and can be applied in polynomial time.
We illustrate the approach by finding the maximum
independent set in a cycle graph in O(n^2) time, where n
is the number of chords in the cycle (10). Note that the
maximum independent set problem is NP-hard on general
graphs.
Let G(V, E) be a cycle graph, and let v_ab ∈ V correspond to
a chord in the cycle. We assume that no two chords share an
endpoint, and that the endpoints are labeled from 0 to
2n − 1, where n is the number of chords, clockwise around
the cycle. Let G_{i,j} be the subgraph induced by the set of nodes
v_ab ∈ V such that i ≤ a, b ≤ j.

Let M(i, j) denote a maximum independent set of G_{i,j}.
M(i, j) is computed for every pair of chords, but M(i, a) must
be computed before M(i, b) if a < b. Observe that, if i ≥ j,
M(i, j) = ∅ because G_{i,j} has no chords. In general, to com-
pute M(i, j), the endpoint k of the chord with endpoint
j must be found. If k is not in the range [i, j − 1], then
M(i, j) = M(i, j − 1) because graph G_{i,j} is identical to
graph G_{i,j−1}. Otherwise we consider two cases: (1) If
v_{kj} ∈ M(i, j), then M(i, j) does not have any node v_ab with
a ∈ [i, k − 1] and b ∈ [k + 1, j − 1]. In this case,
M(i, j) = M(i, k − 1) ∪ M(k + 1, j − 1) ∪ {v_{kj}}. (2) If
v_{kj} ∉ M(i, j), then M(i, j) = M(i, j − 1). Either of these two
cases may apply, but the larger of the two independent sets
is assigned to M(i, j). The pseudocode of the algorithm is
given as follows:
Procedure MIS(V)
  FOR j = 0 TO 2n − 1 DO
    Let (j, k) be the chord whose one endpoint is j;
    FOR i = 0 TO j − 1 DO
      IF i ≤ k ≤ j − 1 AND |M(i, k − 1)| + 1 + |M(k + 1, j − 1)| > |M(i, j − 1)|
        THEN M(i, j) = M(i, k − 1) ∪ {v_kj} ∪ M(k + 1, j − 1);
        ELSE M(i, j) = M(i, j − 1);
END MIS
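A Python sketch of the same dynamic program is given below; it assumes the chords are supplied as pairs of distinct endpoint labels in {0, ..., 2n − 1}, and all names are illustrative.

def max_independent_chords(chords):
    """chords: list of (a, b) endpoint pairs; no two chords share an endpoint.
    Returns a maximum set of pairwise noncrossing chords (a maximum
    independent set of the corresponding graph)."""
    partner = {}
    for a, b in chords:
        partner[a], partner[b] = b, a
    m = 2 * len(chords)                     # endpoints are labeled 0 .. 2n-1
    empty = frozenset()
    # M[i][j] = a maximum independent set among chords with both endpoints in [i, j]
    M = [[empty] * m for _ in range(m)]
    for j in range(m):
        k = partner[j]                      # the chord with endpoint j is (k, j)
        for i in range(j):
            best = M[i][j - 1]              # case: chord (k, j) is not used
            if i <= k <= j - 1:             # chord (k, j) lies inside [i, j]
                left = M[i][k - 1] if k - 1 >= i else empty
                mid = M[k + 1][j - 1]       # empty whenever k + 1 > j - 1
                cand = left | mid | {(min(k, j), max(k, j))}
                if len(cand) > len(best):
                    best = cand
            M[i][j] = best
    return M[0][m - 1] if m else empty

# Example: max_independent_chords([(0, 1), (2, 3)]) returns both chords;
#          max_independent_chords([(0, 2), (1, 3)]) returns only one (they cross).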
Basic Graph Problems
In this section, we discuss more analytically four problems
that are widely used in VLSI CAD, computer networking,
and other areas, in the sense that many problems are
reduced to solving these basic graph problems. They are
the shortest path problem, the flow problem, the graph
coloring problem, and the graph matching problem.
Shortest Paths. The instance consists of a graph G(V, E)
with lengths l(u, v) on its edges (u, v), a given source s ∈ V,
and a target t ∈ V. We assume without loss of generality
that the graph is directed. The goal is to find a shortest
length path from s to t. The weights can be positive or
negative numbers, but there is no cycle for which the sum of
the weights on its edges is negative. (If negative length
cycles are allowed, the problem is NP-hard.) Variations of
the problem include the all-pairs shortest paths and the
m shortest paths calculation in a graph.
We present here a dynamic programming algorithm
for the shortest path problem that is known as the
Bellman–Ford algorithm. The algorithm has O(n^3) time
complexity, but faster algorithms exist when all the weights
are positive [e.g., the Dijkstra algorithm with complexity
O(n min(log|E|, |V|))] or when the graph is acyclic (based
on topological search and with linear time complexity). All
existing algorithms for the shortest path problem are based
on dynamic programming. The Bellman–Ford algorithm
works as follows.

Let l(i, j) be the length of edge (i, j) if directed edge (i, j)
exists and ∞ otherwise. Let s(j) denote the length of the
shortest path from the source s to node j. Assume that
the source has label 1 and that the target has label n = |V|.
We have that s(1) = 0. We also know that in a shortest path
to any node j there must exist a node k, k ≠ j, such that
s(j) = s(k) + l(k, j). Therefore,

  s(j) = min_{k≠j} [ s(k) + l(k, j) ],   j ≥ 2
The Bellman–Ford algorithm, which eventually computes
all s(j), 1 ≤ j ≤ n, calculates optimally the quantity
s(j)^{m+1}, defined as the length of the shortest path to node
j subject to the condition that the path does not contain
more than m + 1 edges, 0 ≤ m ≤ |V| − 2. To be able to
calculate the quantity s(j)^{m+1} for some value m + 1, the
s(j)^m values for all nodes j have to be calculated.
Given the initialization s(1)^1 = 0, s(j)^1 = l(1, j) for j ≠ 1,
the computation of s(j)^{m+1} for any values of j and m can
be recursively computed using the formula

  s(j)^{m+1} = min{ s(j)^m, min_k [ s(k)^m + l(k, j) ] }

The computation terminates when m = |V| − 1, because
no shortest path has more than |V| − 1 edges.
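A compact Python sketch of the Bellman–Ford relaxation follows; the edge-list representation, node labels 0..n−1, and all names are illustrative assumptions.

import math

def bellman_ford(n, edges, source=0):
    """n: number of nodes labeled 0..n-1; edges: list of (i, j, length) for
    directed edges; assumes no negative-length cycle reachable from source.
    Returns s[j], the shortest path length from source to every node j."""
    s = [math.inf] * n
    s[source] = 0                            # s(source) = 0
    for _ in range(n - 1):                   # no shortest path has more than n-1 edges
        updated = False
        for i, j, length in edges:           # relax: s(j) = min(s(j), s(i) + l(i, j))
            if s[i] + length < s[j]:
                s[j] = s[i] + length
                updated = True
        if not updated:                      # early exit once the values stabilize
            break
    return s

# Example: bellman_ford(3, [(0, 1, 4), (0, 2, 7), (1, 2, -2)]) returns [0, 4, 2].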
Flows. All flow problem formulations consider a directed
or undirected graph G = (V, E), a designated source s, a
designated target t, and a nonnegative integer capacity
c(i, j) on every edge (i, j). Such a graph is sometimes
referred to as a network. We assume that the graph is
directed. A flow from s to t is an assignment F of num-
bers f(i, j) on the edges, called the amount of flow through
edge (i, j), subject to the following conditions:

  0 ≤ f(i, j) ≤ c(i, j)    (1)

Every node i, except s and t, must satisfy the conservation
of flow. That is,

  Σ_j f(j, i) − Σ_j f(i, j) = 0    (2)
Nodes s and t satisfy, respectively,

  Σ_j f(j, i) − Σ_j f(i, j) = −v,  if i = s
  Σ_j f(j, i) − Σ_j f(i, j) = v,   if i = t    (3)

where v = Σ_j f(s, j) = Σ_j f(j, t) is called the value of the
flow.
A flow F that satisfies Equations (1)–(3) is called feasible.
In the Max Flow problem, the goal is to find a feasible flow F
for which v is maximized. Such a flow is called a maximum
flow. There is a problem variation, called the Minimum
Flow problem, where Equation (1) is substituted by
f(i, j) ≥ c(i, j) and the goal is to find a flow F for which v
is minimized. The minimum flow problem can be solved
by modifying algorithms that compute the maximum flow
in a graph.
Finally, another flow problem formulation is the Mini-
mum Cost Flow problem. Here each edge has, in addition
to its capacity c(i, j), a cost p(i, j). If f(i, j) is the flow through
the edge, then the cost of the flow through the edge is
p(i, j) f(i, j), and the overall cost C for a flow F of value
v is Σ_{i,j} p(i, j) f(i, j). The problem is to find a minimum
cost flow F for a given value v.
Many problems in VLSI CAD, computer networking,
scheduling, and so on can be modeled as or reduced to one of
these three flow problem variations. All three problems can
be solved in polynomial time using shortest path calcula-
tions as subroutines. We describe below, for illustration
purposes, an O(|V|^3) algorithm for the maximum flow pro-
blem. However, faster algorithms exist in the literature. We
first give some definitions and theorems.
Let P be an undirected path from s to t; i.e., the direction
of the edges is ignored. An edge (i, j) ∈ P is said to be a
forward edge if it is directed from s to t and a backward
edge otherwise. P is said to be an augmenting path with respect
to a given flow F if f(i, j) < c(i, j) for each forward edge and
f(i, j) > 0 for each backward edge in P.

Observe that if the flow in each forward edge of the
augmenting path is increased by one unit and the flow in
each backward edge is decreased by one unit, the flow
is feasible and its value has been increased by one unit.
We will show that a flow has maximum value if and only if
there is no augmenting path in the graph. Then the max-
imum flow algorithm is simply a series of calls to a sub-
routine that finds an augmenting path and increments the
value of the flow as described earlier.
Let S ⊂ V be a subset of the nodes. The pair (S, T) is
called a cutset if T = V − S. If s ∈ S and t ∈ T, then (S, T)
is called an (s, t) cutset. The capacity of the cutset (S, T) is
defined as c(S, T) = Σ_{i∈S} Σ_{j∈T} c(i, j), i.e., the sum of the
capacities of all edges from S to T. We note that many
problems in networking, operations research, scheduling,
and VLSI CAD (physical design, synthesis, and testing)
are formulated as minimum capacity (s, t) cutset problems.
We show below that the minimum capacity (s, t) cutset problem
can be solved with a maximum flow algorithm.
Lemma 1. The value of any (s, t) flow cannot exceed the
capacity of any (s, t) cutset.

Proof. Let F be an (s, t) flow with value v. Let (S, T) be
an (s, t) cutset. From Equation (3), the value of the flow is

  v = Σ_{i∈S} [ Σ_j f(i, j) − Σ_j f(j, i) ]
    = Σ_{i∈S} Σ_{j∈S} [ f(i, j) − f(j, i) ] + Σ_{i∈S} Σ_{j∈T} [ f(i, j) − f(j, i) ]
    = Σ_{i∈S} Σ_{j∈T} [ f(i, j) − f(j, i) ],

because Σ_{i∈S} Σ_{j∈S} [ f(i, j) − f(j, i) ] is 0. But
f(i, j) ≤ c(i, j) and f(j, i) ≥ 0. Therefore, v ≤ Σ_{i∈S} Σ_{j∈T}
c(i, j) = c(S, T). ∎
Theorem 2. A flow F has maximum value v if and only if
there is no augmenting path from s to t.

Proof. If there is an augmenting path, then we can
modify the flow to get a larger value flow. This contradicts
the assumption that the original flow has a maximum
value.

Suppose on the other hand that F is a flow such that
there is no augmenting path from s to t. We want to show
that F has the maximum flow value. Let S be the set of all
nodes j (including s) for which there is an augmenting path
from s to j. By the assumption that there is no augmenting
path from s to t, we must have that t ∉ S. Let T = V − S
(so that t ∈ T). From the definition of S and T, it follows
that f(i, j) = c(i, j) and f(j, i) = 0 for all i ∈ S, j ∈ T.

Now

  v = Σ_{i∈S} [ Σ_j f(i, j) − Σ_j f(j, i) ]
    = Σ_{i∈S} Σ_{j∈S} [ f(i, j) − f(j, i) ] + Σ_{i∈S} Σ_{j∈T} [ f(i, j) − f(j, i) ]
    = Σ_{i∈S} Σ_{j∈T} [ f(i, j) − f(j, i) ]
    = Σ_{i∈S} Σ_{j∈T} c(i, j) = c(S, T),

because c(i, j) = f(i, j) and f(j, i) = 0 for all i ∈ S, j ∈ T. By
Lemma 1, the flow has the maximum value. ∎
Next, we state two theorems whose proof is straightfor-
ward.
Theorem 3. If all capacities are integers, then a max-
imum flow F exists, where all f(i, j) are integers.
Theorem 4. The maximum value of an (s, t) flow is equal
to the minimum capacity of an (s, t) cutset.
Finding an augmenting path in a graph can be done by a
systematic graph traversal in linear time. Thus, a straight-
forward implementation of the maximum flow algorithm
repeatedly finds an augmenting path and increments the
amount of the (s, t) flow. This is a pseudopolynomial-time
algorithm (see the next section), whose worst-case time
complexity is O(v·|E|). In many cases, such an algorithm
may turn out to be very efficient. For example, when all
capacities are uniform, the overall complexity
becomes O(|E|^2).

In general, the approach needs to be modified using the
Edmonds–Karp modification (6), so that each flow aug-
mentation is made along an augmenting path with a mini-
mum number of edges. With this modification, it has
been proven that a maximum flow is obtained after no
more than |E|·|V|/2 augmentations, and the approach
becomes fully polynomial. Faster algorithms for maximum
flow computation rely on capacity scaling techniques and
are described in Refs. 8 and 15, among others.
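The augmenting-path scheme with the Edmonds–Karp rule (shortest augmenting paths found by breadth-first search) can be sketched in Python as follows; the dictionary-of-dictionaries capacity representation and all names are illustrative assumptions.

from collections import deque

def max_flow(capacity, s, t):
    """capacity: dict of dicts, capacity[u][v] = c(u, v) >= 0.
    Returns the maximum (s, t) flow value."""
    # Residual capacities; include reverse edges with capacity 0.
    nodes = set(capacity) | {v for u in capacity for v in capacity[u]}
    residual = {u: {} for u in nodes}
    for u in capacity:
        for v, c in capacity[u].items():
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    value = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                  # no augmenting path: the flow is maximum
            return value
        # Bottleneck capacity along the path, then update of the residual capacities.
        bottleneck, v = float('inf'), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck     # forward edge loses residual capacity
            residual[v][u] += bottleneck     # backward edge gains residual capacity
            v = u
        value += bottleneck

# Example: max_flow({'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'t': 3}}, 's', 't') == 4.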
Graph Coloring. Given a graph G(V, E), a proper k-
coloring of G is a function f from V to a set of integers
from 1 to k (referred to as colors) such that f(u) ≠ f(v) if
(u, v) ∈ E. The minimum k for which a proper k-coloring
exists for graph G is known as the chromatic number of G.
Finding the chromatic number of a general graph and
producing the corresponding coloring is an NP-hard pro-
blem. The decision problem of graph k-colorability is NP-
complete in the general case for fixed k ≥ 3. For k = 2, it is
polynomially solvable (it amounts to testing whether the
graph is bipartite). For planar graphs and for k = 4, it is
also polynomially solvable.

The graph coloring problem finds numerous applica-
tions in computer networks and communications, architec-
tural level synthesis, and other areas. As an example, the
Federal Communications Commission (FCC) monitors
radio stations (modeled as nodes of a graph) to make
sure that their signals do not interfere with each other.
They prevent interference by assigning appropriate fre-
quencies (each frequency is a color) to each station. It is
desirable to use the smallest possible number of frequen-
cies. As another example, several resource binding algo-
rithms for data path synthesis at the architectural level are
based on graph coloring formulations (9).
There is also the version of edge coloring: A proper
k-edge-coloring of G is a function f from E to a set of integers
from 1 to k such that f(e_1) ≠ f(e_2) if edges e_1 and e_2 share a
node. In this case, each color class corresponds to a match-
ing in G, that is, a set of pairwise disjoint edges of G. The
minimum k for which a proper k-edge-coloring exists for
graph G is known as the edge-chromatic number of G. It
is known (Vizing's theorem) that for any graph with max-
imum node degree d, there is a (d + 1)-edge coloring.
Finding such a coloring can be done in O(n^2) time. Notice
that the edge-chromatic number of a graph with maximum
node degree d is at least d. It is NP-complete to determine
whether the edge-chromatic number is d, but given Vizing's
algorithm, the problem is considered solved in practice.
Graph Matching. A matching in a graph is a set M ⊆ E
such that no two edges in M are incident to the same node.

The Maximum Cardinality Matching problem is the
most common version of the matching problem. Here the
goal is to obtain a matching so that the size (cardinality) of
M is maximized.

In the Maximum Weighted Matching version, each edge
(i, j) ∈ E has a nonnegative integer weight, and the goal is
to find a matching M so that Σ_{e∈M} w(e) is maximized.

In the Min-Max Matching problem, the goal is to find a
maximum cardinality matching M in which the minimum
weight on an edge of M is maximized. The Max-Min Match-
ing problem is defined in an analogous manner.

All of the above matching variations are solvable in poly-
nomial time and find important applications. For example,
a variation of the min-cut graph partitioning problem,
which is central in physical design automation for VLSI,
asks for partitioning the nodes of a graph into sets of size
at most two so that the sum of the weights on all edges
with endpoints in different sets is minimized. It is easy to
see that this partitioning problem reduces to the maximum
weighted matching problem.
Matching problems often occur on bipartite graphs
G(V_1 ∪ V_2, E). The maximum cardinality matching pro-
blem amounts to the maximum assignment of elements
of V_1 (workers) onto the elements of V_2 (tasks) so that no
worker in V_1 is assigned more than one task. This finds
various applications in operations research.

The maximum cardinality matching problem on a
bipartite graph G(V_1 ∪ V_2, E) can be solved by a maximum
flow formulation. Simply, each node v ∈ V_1 is connected to
a new node s by an edge (s, v), and each node u ∈ V_2 is
connected to a new node t by an edge (u, t). In the resulting
graph, every edge is assigned unit capacity. The maximum
flow value v corresponds to the cardinality of the maximum
matching in the original bipartite graph G.
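Reusing the max_flow sketch given above (that helper and all names are illustrative assumptions of this article's examples, not part of the source), the reduction can be written as follows.

def max_bipartite_matching(v1, v2, edges):
    """v1, v2: the two node sets; edges: list of (u, w) pairs with u in v1, w in v2.
    Builds the unit-capacity flow network described above and reuses the
    max_flow sketch from the previous example."""
    capacity = {'source': {u: 1 for u in v1}, 'sink': {}}
    for u in v1:
        capacity.setdefault(u, {})
    for u, w in edges:
        capacity[u][w] = 1                       # edge from V1 to V2 with unit capacity
        capacity.setdefault(w, {})['sink'] = 1   # edge from V2 to the new target node
    return max_flow(capacity, 'source', 'sink')

# Example: max_bipartite_matching(['w1', 'w2'], ['t1', 't2'],
#              [('w1', 't1'), ('w2', 't1'), ('w2', 't2')]) == 2.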
Although the matching problem variations on bipar-
tite graphs are amenable to easily described polynomial-
time algorithms, such as the one given above, the existing
polynomial-time algorithms for matchings on general
graphs are more complex (6).
Approximation and Pseudopolynomial Algorithms
Approximation and pseudopolynomial-time algorithms
concern mainly the solution of problems that are proven
to be NP-hard, although they can sometimes be used on
problems that are solvable in polynomial time but for
which the corresponding polynomial-time algorithm
involves large constants.

An a-approximation algorithm A for an optimization
problem R is a polynomial-time algorithm such that, for
any instance I of R, |S_A(I) − S_OPT(I)| / S_OPT(I) ≤ a·c, where S_OPT(I) is the
cost of the optimal solution for instance I, S_A(I) is the cost
of the solution found by algorithm A, and c is a constant.
As an example, consider a special but practical version of
the Traveling Salesman problem that obeys the triangular
inequality for all city distances. Given a weighted graph
G(V, E) of the cities, the algorithm first finds a minimum
spanning tree T of G (that is, a spanning tree that has
minimum sum of edge weights). Then it finds a minimum
weight matching M among all nodes that have odd degree in
T. Lastly, it forms the subgraph G'(V, E'), where E' is the set
of all edges in T and M, and finds a path that starts from and
terminates at the same node and passes through every edge
exactly once (such a path is known as an Eulerian tour). The
Eulerian tour is turned into a tour of the cities by skipping
nodes that have already been visited, which, by the trian-
gular inequality, does not increase the total length.

Every step in this algorithm takes polynomial time. It
has been shown that |S_A(I) − S_OPT(I)| / S_OPT(I) ≤ 1/2.
Unfortunately, obtaining a polynomial-time approxi-
mation algorithm for an NP-hard optimization problem
can be very difficult. In fact, it has been shown that this
may be impossible for some cases. For example, it has been
shown that unless NP = P, there is no a-approximation
algorithm for the general Traveling Salesman problem for
any a>0.
A pseudopolynomial-time algorithm for a problem R
is an algorithm with time complexity O(p(n, m)), where
p() is a polynomial of two variables, n is the size of the
instance, and m is the magnitude of the largest number
occurring in the instance. Only problems involving numbers
that are not bounded by a polynomial on the size of the
instance are applicable for solution by a pseudopolynomial-
time algorithm. In principle, a pseudopolynomial-time
algorithm is exponential given that the magnitude of a
number is exponential to the size of its logarithmic encod-
ing in the problem instance, but in practice, such an algo-
rithm may be useful in cases where the numbers involved
are not large.
NP-complete problems for which a pseudopolynomial-
time algorithm exists are referred to as weakly NP-
complete, whereas NP-complete problems for which no
pseudopolynomial-time algorithm exists (unless NP = P)
are referred to as strongly NP-complete. As an example, the
Network Inhibition problem, where the goal is to find
the most cost-effective way to reduce the ability of a net-
work to transport a commodity, has been shown to be
strongly NP-complete for general graphs and weakly NP-
complete for series-parallel graphs (16).
Probabilistic and Randomized Algorithms
Probabilistic algorithms are a class of algorithms that do
not depend exclusively on their input to carry out the
computation. Instead, at one or more points in the course
of the algorithm where a choice has to be made, they use a
pseudorandom number generator to select randomly one
out of a finite set of alternatives for arriving at a solution.
Probabilistic algorithms are fully programmable (i.e., they
are not nondeterministic), but in contrast with the deter-
ministic algorithms, they may give different results for the
same input instance if the initial state of the pseudorandom
number generator differs each time.
Probabilistic algorithms try to reduce the computation
time by allowing a small probability of error in the com-
puted answer (Monte Carlo type) or by making sure that
the running time to compute the correct answer is small for
the large majority of input instances (Las Vegas type).
That is, algorithms of the Monte Carlo type always run fast,
but the answer they compute has a small probability of
being erroneous, whereas algorithms of the Las Vegas
type always compute the correct answer but they occasion-
ally may take a long time to terminate. If there is an
imposed time limit, Las Vegas-type algorithms can alter-
natively be viewed as always producing either a correct
answer or no answer at all within that time limit.
An example of a probabilistic algorithm on graphs is
the computation of a minimum spanning tree in a weighted
undirected graph by the algorithm of (17). This algorithm
computes a minimum spanning tree in O(m) time, where m
is the number of edges, with probability 1 − e^{−Ω(m)}.
Randomization is also very useful in the design of heur-
istic algorithms. A typical example of a randomized heur-
istic algorithm is the simulated annealing method, which is
very effective for many graph theoretic formulations in
VLSI design automation (10). In the following we outline
the use of simulated annealing in the graph balanced
bipartitioning problem, where the goal is to partition the
nodes of a graph into two equal-size sets so that the number of
edges that connect nodes in two different sets is minimized.
This problem formulation is central in the process of
obtaining a compact layout for the integrated circuit whose
components (gates or modules) are represented by the
graph nodes and its interconnects by the edges.

The optimization in balanced bipartitioning with a very
large number of nodes is analogous to the physical process
of annealing, where a material is melted and subsequently
cooled down, under a specific schedule, so that it will
crystallize. The cooling must be slow so that thermal equi-
librium is reached at each temperature and the atoms
are arranged in a pattern that resembles the global energy
minimum of the perfect crystal.
While simulating the process, the energy within the
material corresponds to a partitioning score. The process
starts with a random initial partitioning. An alternative
partitioning is obtained by exchanging nodes that are in
opposite sets. If the change in the score d is negative,
then the exchange is accepted because this represents a
reduction in the energy. Otherwise, the exchange is
accepted with probability e^{−d/t}; i.e., the probability of accep-
tance decreases as the temperature t decreases. This
method allows the simulated annealing algorithm to climb
out of local optima in the search for a global optimum.

The quality of the solution depends on the initial value of
the temperature and on the cooling schedule. Such
parameters are determined experimentally. The quality
of the obtained solutions is very good, although the method
(being a heuristic) cannot guarantee optimality or even a
provably good solution.
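A schematic Python sketch of this procedure is given below; the score function, the move, the cooling schedule, and every parameter value are illustrative assumptions rather than prescriptions of the source.

import math
import random

def cut_size(adj, side):
    """Number of edges crossing the cut; adj is a dict of adjacency lists in which
    each undirected edge appears in both directions."""
    return sum(1 for u in adj for v in adj[u] if u < v and side[u] != side[v])

def anneal_bipartition(adj, t_start=10.0, t_end=0.01, cooling=0.95, moves_per_t=100):
    nodes = list(adj)
    random.shuffle(nodes)
    half = len(nodes) // 2
    side = {u: (0 if i < half else 1) for i, u in enumerate(nodes)}   # random balanced start
    score = cut_size(adj, side)
    t = t_start
    while t > t_end:                                    # experimentally chosen cooling schedule
        for _ in range(moves_per_t):
            u = random.choice([x for x in nodes if side[x] == 0])
            v = random.choice([x for x in nodes if side[x] == 1])
            side[u], side[v] = side[v], side[u]         # exchange two nodes across the cut
            new_score = cut_size(adj, side)
            d = new_score - score
            if d <= 0 or random.random() < math.exp(-d / t):
                score = new_score                       # accept the exchange
            else:
                side[u], side[v] = side[v], side[u]     # reject: undo the exchange
        t *= cooling
    return side, score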
FURTHER READING
E. Horowitz and S. Sahni, Fundamentals of Computer Algorithms.
New York: Computer Science Press, 1984.
BIBLIOGRAPHY
1. B. Bollobas, Modern Graph Theory. New York: Springer
Verlag, 1988.
2. S. Even, Graph Algorithms. New York: Computer Science
Press, 1979.
3. J. L. Gross and J. Yellen, Graph Theory and its Applications.
CRC Press, 1998.
4. R. Sedgewick, Algorithms in Java: Graph Algorithms.
Reading, MA: Addison-Wesley, 2003.
5. K. Thulasiraman and M. N. S. Swamy, Graphs: Theory and
Algorithms. New York: Wiley, 1992.
6. E. L. Lawler, Combinatorial Optimization: Networks and
Matroids. New York: Holt, Rinehart and Winston, 1976.
7. C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimi-
zation: Algorithms and Complexity. Englewood Cliffs, NJ:
Prentice Hall, 1982.
8. R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows.
Englewood Cliffs, NJ: Prentice Hall, 1993.
9. G. De Micheli, Synthesis and Optimization of Digital Circuits.
New York: McGraw-Hill, 1994.
10. N. A. Sherwani, Algorithms for VLSI Physical Design Auto-
mation. New York: Kluwer Academic, 1993.
11. D. Bertsekas and R. Gallagher, Data Networks. Upper Saddle
River, NJ: Prentice Hall, 1992.
12. M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital
Systems Testing and Testable Design. New York: Computer
Science Press, 1990.
13. M. R. Garey and D. S. Johnson, Computers and Intractability:
A Guide to the Theory of NP-Completeness. New York: W. H.
Freeman, 1979.
14. S. Skiena, Graph isomorphism, in Implementing Discrete
Mathematics: Combinatorics and Graph Theory with Mathe-
matica. Reading, MA: Addison-Wesley, 1990, pp. 181–187.
15. T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
Algorithms. Cambridge, MA: MIT Press, 1990.
16. C. A. Phillips, The network inhibition problem, Proc. 25th
Annual ACM Symposium on the Theory of Computing, 1993,
pp. 776–785.
17. D. R. Karger, P. N. Klein, and R. E. Tarjan, A randomized linear-
time algorithm to find minimum spanning trees, J. ACM, 42(2):
321–328, 1995.
DIMITRI KAGARIS
SPYROS TRAGOUDAS
Southern Illinois University
Carbondale, Illinois
LINEAR AND NONLINEAR PROGRAMMING
INTRODUCTION
An optimization problem is a mathematical problem in
which data are used to find the values of n variables so
as to minimize or maximize an overall objective while
satisfying a finite number of imposed restrictions on the
values of those variables. For example:
Linear Program (LP1):
  maximize    x_1 + x_2
  subject to  x_1 + 3x_2 ≤ 6    (a)
              2x_1 + x_2 ≤ 4    (b)
              x_1 ≥ 0           (c)
              x_2 ≥ 0           (d)

Nonlinear Program (NLP1):
  minimize    3x_1 + 2x_2
  subject to  x_1^2 + x_2^2 ≤ 9          (a)
              x_1^2 + 6x_1 + x_2^2 ≥ 6   (b)
              x_1 unrestricted
              x_2 ≥ 0                    (c)
The expression being maximized or minimized is called the
objective function, whereas the remaining restrictions
that the variables must satisfy are called the constraints.
The problem on the left is a linear program (LP) because the
objective function and all constraints are linear and all
variables are continuous; that is, the variables can assume
any values, including fractions, within a given (possibly
infinite) interval. In contrast, the problem on the right is a
nonlinear program (NLP) because the variables are contin-
uous but the objective function or at least one constraint is not
linear. Applications of such problems in computer science,
mathematics, business, economics, statistics, engineering,
operations research, and the sciences can be found in Ref. 1.
LINEAR PROGRAMMING
The geometry of LP1 is illustrated in Fig. 1(a). A feasible
solution consists of values for the variables that simulta-
neously satisfy all constraints, and the feasible region is
the set of all feasible solutions. The status of every LP is
always one of the following three types:
- Infeasible, which means that the feasible region is
  empty; that is, there are no values for the variables
  that simultaneously satisfy all of the constraints.
- Optimal, which means that there is an optimal
  solution; that is, there is a feasible solution that
  also provides the best possible value of the objective
  function among all feasible solutions. (The optimal
  solution for LP1 shown in Fig. 1(a) is x_1 = 1.2 and
  x_2 = 1.6.)
- Unbounded, which means that there are feasible
  solutions that can make the objective function as large
  (if maximizing) or as small (if minimizing) as desired.
Note that the feasible region in Fig. 1(a) has a finite
number of extreme points (the black dots), which are fea-
sible points where at least n of the constraints hold with
equality. Based on this observation, in 1951, George
Dantzig (2) developed the simplex algorithm (SA) that, in
a finite number of arithmetic operations, will determine the
status of any LP and, if optimal, will produce an optimal
solution. The SA is the most commonly used method for
solving an LP and works geometrically as follows:

Step 0 (Initialization). Find an initial extreme point,
say a vector x = (x_1, ..., x_n), or determine that none exists;
in which case, stop, the LP is infeasible.
in which case, stop, the LP is infeasible.
Step 1 (Test for Optimality). Perform a relatively
simple computation to determine whether x is an
optimal solution and, if so, stop.
Step 2 (Move to a Better Point). If x is not optimal,
use that fact to
(a) (Direction of Movement). Find a direction in
the form of a vector d = (d
1
; . . . d
n
) that points
from x along an edge of the feasible region so
that the objective function value improves as
you move in this direction.
(b) (Amount of Movement). Determine areal num-
ber t that represents the maximum amount you
can move from x in the direction d and stay
feasible. If t = , stop, the LP is unbounded.
(c) (Move). Move from x to the new extreme point
x td, and return to Step 1.
The translation of these steps to algebra constitutes
the SA. Because the SA works with algebra, all inequality
constraints are first converted to equality constraints. To
do so, a nonnegative slack variable is added to (subtracted
from) each ≤ (≥) constraint. For given feasible values of
the original variables, the value of the slack variable
represents the difference between the left and the right
sides of the inequality constraint. For example, the cons-
traint x_1 + 3x_2 ≤ 6 in LP1 is converted to x_1 + 3x_2 + s_1 = 6,
and if x_1 = 2 and x_2 = 0, then the value of the slack variable
is s_1 = 4, which is the difference between the left side,
x_1 + 3x_2, and the right side, 6, of the original constraint.
The algebraic analog of an extreme point is called a basic
feasible solution.
Step 0 of the SA is referred to as phase 1, whereas Steps 1
and 2 are called phase 2. Phase 2 is finite because there
are a finite number of extreme points and no such point
visited by the SA is repeated, because the objective function
strictly improves at each iteration, provided that t > 0 in
Step 2(b) (even in the degenerate case when t = 0, the
algorithm can be made finite). Also, a finite procedure for
phase 1 is to use the SA to solve a special phase 1 LP that
involves only phase 2. In summary, the SA is finite.
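For illustration, LP1 can also be handed to an off-the-shelf LP solver; the following sketch assumes SciPy is available and negates the objective because scipy.optimize.linprog minimizes.

from scipy.optimize import linprog

# LP1: maximize x1 + x2  subject to  x1 + 3*x2 <= 6,  2*x1 + x2 <= 4,  x1, x2 >= 0.
result = linprog(c=[-1, -1],                     # negate to turn maximization into minimization
                 A_ub=[[1, 3], [2, 1]],
                 b_ub=[6, 4],
                 bounds=[(0, None), (0, None)])
print(result.x)       # approximately [1.2, 1.6], the optimal extreme point
print(-result.fun)    # the optimal objective value, 2.8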
See Ref. 3 for improvements in the computational effi-
ciency and numerical accuracy of the SA. Many commercial
software packages use the SA, from Solver, an add-in to a
Microsoft (Redmond, WA) Excel spreadsheet, to CPLEX
(ILOG, Mountain View, CA) and LINDO (LINDO Systems,
Inc., Chicago, IL), stand-alone professional codes for sol-
ving large real-world problems with many thousands of
variables and constraints. These packages provide the
optimal solution and other economic information used in
postoptimality and sensitivity analysis to answer questions
of the form, "What happens to the optimal solution and
objective function value if some of the data change?" Such
questions are asked because the data are often estimated
rather than known precisely. Although you can always
change the data and resolve the problem, you can sometimes
answer these questions without having to use a computer
when only one objective function coefficient or one value on
the right-hand side (rhs) of a constraint is changed.
Much postoptimality analysis is based on duality theory,
which derives from the fact that associated with every
original primal LP having n variables and m constraints
is another dual LP that has m variables and n constraints.
Where the primal problem is to be maximized (minimized),
the dual problem is to be minimized (maximized), for
example:

Primal LP:
  min   c_1 x_1 + ... + c_n x_n
  s.t.  a_11 x_1 + ... + a_1n x_n ≥ b_1
        ...
        a_m1 x_1 + ... + a_mn x_n ≥ b_m
        x_1, ..., x_n ≥ 0

Dual LP:
  max   b_1 u_1 + ... + b_m u_m
  s.t.  a_11 u_1 + ... + a_m1 u_m ≤ c_1
        ...
        a_1n u_1 + ... + a_mn u_m ≤ c_n
        u_1, ..., u_m ≥ 0
Duality theory is used in the following ways:
1. As a test for optimality in Step 1 of the SA. (If x is
feasible for the primal LP and u is feasible for the dual
LP, and the objective function values of the primal
and dual at x and u are equal, then x and u are
optimal solutions for their respective problems.)
2. To determine the status of an LP without having to
use a computer. (For example, if the primal LP is
feasible and the dual LP is infeasible, then the primal
LP is unbounded.)
3. To provide economic information about the primal
LP. (For example, the optimal value of a dual vari-
able, also called the shadow price of the associated
primal constraint, represents the change in the opti-
mal objective function value of the primal LP per unit
of increase in the right-hand side of that constraint,
within a certain range of values around that right-
hand side, if all other data remain unchanged.)
4. To develop finite algorithms for solving the dual LP
and, in so doing, provide a solution to the primal LP,
assuming that both problems have optimal solutions.
Although efficient for solving most real-world problems,
many versions of the SA can, in the worst case, visit an
exponential (in terms of n) number of extreme points (see
Ref. 4 for the first such example). It is an open question
as to whether there is a version of the SA that, in the worst
case, is polynomial (in the size of the data), although
Borgwardt (5) proved that a version of the SA is polynomial
on average. In 1979, Khachian (6) proposed the first poly-
nomial algorithm for solving every LP. Karmarkar (7)
subsequently developed the first polynomial algorithm
that was also efficient in practice. These interior point
methods (IPM) go through the inside of the feasible region
[see Fig. 1(a)] rather than along the edges like the SA
does. Computational experience with current versions of
these IPMs indicates that they can be more efficient than
the SA for solving specially structured large LPs but
generally are not more efficient than the SA for other
LPs. See Refs. 8 and 9 for a discussion of IPMs and
Ref. 10 for linear programming and the SA.
NONLINEAR PROGRAMMING
An NLP differs from an LP in both the problem structure
and the ability to obtain solutions. Although an LP always
has constraints (otherwise, the LP is unbounded), an NLP
can be unconstrained (that is, have no constraints) or
constrained; for example, find x = (x_1, ..., x_n) to:

Unconstrained Problem (NLP2):
  minimize    f(x)

Constrained Problem (NLP3):
  minimize    f(x)
  subject to  g_1(x) ≤ 0
              ...
              g_m(x) ≤ 0
[Figure 1. The geometry of LP1 and NLP1: (a) the geometry of LP1, showing the feasible region, the extreme points, and the optimal solution (an extreme point); (b) the geometry of NLP1, showing the feasible region and the optimal solution.]

Other differences between an LP and an NLP arise due
to nonlinearities in the objective function and constraints
of an NLP. Where an LP must be either infeasible,
optimal, or unbounded, with a nonlinear objective function
an NLP can additionally approach, but never attain, its
smallest possible objective function value. For example,
when minimizing f(x) = e^x, the value of f approaches 0
as x approaches −∞ but never attains the value 0.
The linear objective function and constraints in an
LP allow for a computationally efficient test for opti-
mality to determine whether a given feasible solution
x = (x_1, ..., x_n) is optimal. In contrast, given a feasible point
x for an NLP with general nonlinear functions, there is no
known efficient test to determine whether x is a global
minimum of an NLP, that is, a point such that for every
feasible point y = (y_1, ..., y_n), f(x) ≤ f(y). There is also no
known efficient test to determine whether x is a local
minimum, that is, a feasible point such that there is a
real number d > 0 for which every feasible point y with
||x − y|| < d satisfies f(x) ≤ f(y) [where ||x − y|| =
√( Σ_{i=1}^n (x_i − y_i)^2 )].
In the absence of an efficient test for optimality, many
NLP algorithms instead use a stopping rule to determine
whether the algorithm should terminate at the current
feasible point x. As described in the next two sections, these
stopping rules are typically necessary conditions for x
to be a local minimum and usually have the following
desirable properties:

- It is computationally practical to determine whether
  the current feasible solution x satisfies the stopping
  rule.
- If x does not satisfy the stopping rule, then it is
  computationally practical to find a direction in which
  to move from x, stay feasible at least for a small amount
  of movement in that direction, and simultaneously
  improve the objective function value.
Be aware that if x is a feasible point for an NLP that
satises the stopping rule, this does not mean that x is a
global minimum (or even a local minimum) of the NLP.
Additional conditions on the nonlinear functions, such as
convexity, are required to ensure that x is a global mini-
mum. Indeed, it is often the case that nonconvex functions
make it challenging to solve an NLP because there can be
many local minima. In such cases, depending on the initial
values chosen for the variables, algorithms often terminate
at a local minimum that is not a global minimum. In
contrast, due to the (convex) linear functions in an LP,
the simplex algorithm always terminates at an optimal
solution if one exists, regardless of the initial starting
feasible solution.
A final difficulty caused by nonlinear functions is that
algorithms for attempting to solve such problems are
usually not finite (unlike the simplex algorithm for LP).
That is, NLP algorithms usually generate an infinite
sequence of feasible values for the variables, say
X = (x^0, x^1, x^2, ...). Under fortunate circumstances, these
points will converge, that is, get closer to specific values for
the variables, say x*, where x* is a local or global minimum
for the NLP, or at least satisfies the stopping rule. In
practice, these algorithms are made to stop after a finite
number of iterations with an approximation to x*.
In the following discussion, details pertaining to
developing such algorithms are presented for the uncon-
strained NLP2 and the constrained NLP3, respectively.
From here on, it is assumed that the objective function f
and all constraint functions g_1, ..., g_m are differentiable.
See Refs. 11 and 12 for a discussion of nondifferentiable
optimization, also called nonsmooth optimization.
Unconstrained Nonlinear Programs
The unconstrained problem NLP2 differs from an LP
in that (1) the objective function of NLP2 is not linear
and (2) every point is feasible for NLP2. These differences
lead to the following algorithm for attempting to solve
NLP2 by modifying the simplex algorithm described
previously:
Step 0 (Initialization). Start with any values, say
x = (x_1, ..., x_n), for NLP2 because NLP2 has no con-
straints that must be satisfied.

Step 1 (Stopping Rule). Stop if the current point x =
(x_1, ..., x_n) for NLP2 satisfies the following stopping
rule, which uses the partial derivatives of f and is a
necessary condition for x to be a local minimum:

  ∇f(x) = ( ∂f/∂x_1 (x), ..., ∂f/∂x_n (x) ) = (0, ..., 0).

Note that finding a point x where ∇f(x) = (0, ..., 0) is
a problem of solving a system of n nonlinear equations
in the n unknowns x_1, ..., x_n, which can possibly be
done with Newton's method or some other algorithm
for solving a system of nonlinear equations.

Step 2 (Move to a Better Point). If ∇f(x) ≠ (0, ..., 0),
use that fact to

(a) (Direction of Movement). Find a direction of
descent d so that all small amounts of movement
from x in the direction d result in points that have
smaller objective function values than f(x). (One
such direction is d = −∇f(x), but other directions
that are usually computationally more efficient
exist.)

(b) (Amount of Movement). Perform a possibly
infinite algorithm, called a line search, that
may or may not be successful, to find a value of
a real number t > 0 that provides the smallest
value of f(x + td). If t = ∞, stop without finding
an optimal solution to NLP2.

(c) (Move). Move from x to the new point x + td, and
return to Step 1.

The foregoing algorithm may run forever, generating a
sequence of values for the variables, say X = (x^0, x^1, x^2, ...).
Under fortunate circumstances, these points will con-
verge to specific values for the variables, say x*, where x*
is a local or global minimum for NLP2 or at least satisfies
∇f(x*) = (0, ..., 0).
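A minimal Python sketch of this scheme, using the steepest-descent direction d = −∇f(x) and a simple backtracking line search, is shown below; the example function and all parameter values are illustrative assumptions.

def gradient_descent(grad_f, f, x, tol=1e-8, max_iter=10_000):
    """Attempt to find a point where the gradient (nearly) vanishes."""
    for _ in range(max_iter):
        g = grad_f(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:      # stopping rule: gradient about (0, ..., 0)
            break
        d = [-gi for gi in g]                          # a direction of descent
        t = 1.0
        while f([xi + t * di for xi, di in zip(x, d)]) >= f(x):
            t *= 0.5                                   # backtracking line search
            if t < 1e-16:
                return x                               # no further improvement found
        x = [xi + t * di for xi, di in zip(x, d)]      # move to the better point
    return x

# Example: minimize f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2.
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
grad_f = lambda x: [2 * (x[0] - 1), 4 * (x[1] + 3)]
print(gradient_descent(grad_f, f, [0.0, 0.0]))         # converges toward (1, -3)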
Constrained Nonlinear Programs
Turning now to constrained NLPs, as seen by comparing
Fig. 1(a) and (b), the feasible region of a constrained NLP
can be different from that of an LP, and the optimal solution
need not be one of a finite number of extreme points. These
differences lead to the following algorithm for attempting to
solve NLP3 by modifying the simplex algorithm described
previously:

Step 0 (Initialization). There is no known finite pro-
cedure that will find an initial feasible solution for
NLP3 or determine that NLP3 is infeasible. Thus, a
possibly infinite algorithm is needed for this step and,
under favorable circumstances, produces a feasible
solution.
Step 1 (Stopping Rule). Stop if the current feasible
solution x = (x_1, ..., x_n) for NLP3 satisfies the follow-
ing stopping rule: Real numbers u_1, ..., u_m exist such
that the following Karush–Kuhn–Tucker (KKT)
conditions hold at the point x (additional conditions
on f and g_i, such as convexity, are required to ensure
that x is a global minimum):

(a) (feasibility) For each i = 1, ..., m: g_i(x) ≤ 0 and
u_i ≥ 0.

(b) (complementarity) For each i = 1, ..., m:
u_i g_i(x) = 0.

(c) (gradient condition) ∇f(x) + Σ_{i=1}^m u_i ∇g_i(x) =
(0, ..., 0).
Note that when there are no constraints, the KKT
conditions reduce to ∇f(x) = (0, ..., 0). Also, when f
and the g_i are linear, and hence NLP3 is an LP, the values
u = (u_1, ..., u_m) that satisfy the KKT conditions are
optimal for the dual LP. That is, conditions (a) and (c)
of the KKT conditions ensure that x is feasible for the
primal LP and u is feasible for the dual LP, whereas
conditions (b) and (c) ensure that the objective func-
tion value of the primal LP at x is equal to that of the
dual LP at u. Hence, as stated in the first use of
duality theory in the Linear Programming section,
x is optimal for the primal LP and u is optimal for the
dual LP.
Step 2 (Move to a Better Point). If x is not a KKT
point, use that fact to

(a) (Direction of Movement). Find a feasible direc-
tion of improvement d so that all small amounts of
movement from x in the direction d result in
feasible solutions that have smaller objective
function values than f(x).

(b) (Amount of Movement). Determine the max-
imum amount t* you can move from x in the
direction d and stay feasible. Then perform a
possibly infinite algorithm, called a constrained
line search, in an attempt to find a value of t that
minimizes f(x + td) over the interval [0, t*]. If
t = ∞, stop without finding an optimal solution
to NLP3.

(c) (Move). Move from x to the new feasible point
x + td, and return to Step 1.
The foregoing algorithm may run forever, generating a
sequence of values for the variables, say X = (x^0, x^1, x^2, ...).
Under fortunate circumstances, these points will converge
to specific values for the variables, say x*, where x* is a local
or global minimum for NLP3 or a KKT point.
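In practice one often hands NLP3-type problems to a general-purpose solver that searches for a KKT point. The following sketch assumes SciPy is available; because scipy.optimize.minimize expects inequality constraints written as functions that must be nonnegative, the constraints g_i(x) ≤ 0 are passed with their signs flipped. The example problem itself is an illustrative assumption.

from scipy.optimize import minimize

# Illustrative problem: minimize (x1 - 2)^2 + (x2 - 1)^2
# subject to g1(x) = x1 + x2 - 2 <= 0 and g2(x) = -x1 <= 0.
f = lambda x: (x[0] - 2) ** 2 + (x[1] - 1) ** 2
constraints = [
    {'type': 'ineq', 'fun': lambda x: -(x[0] + x[1] - 2)},   # -g1(x) >= 0, i.e. g1(x) <= 0
    {'type': 'ineq', 'fun': lambda x: x[0]},                 # -g2(x) >= 0, i.e. x1 >= 0
]
result = minimize(f, x0=[0.0, 0.0], method='SLSQP', constraints=constraints)
print(result.x)    # a point satisfying the KKT conditions (here about [1.5, 0.5])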
In addition to the foregoing methods of feasible direc-
tions, many other approaches have been developed for
attempting to solve NLP2 and NLP3, including conjugate
gradient methods, penalty and barrier methods, sequen-
tial quadratic approximation algorithms, and fixed-point
algorithms (see Ref. 13 for details). A discussion of interior
point methods for solving NLPs can be found in Ref. 14.
Commercial software packages from Solver in Excel and
GRG exist for attempting to solve such problems and can
currently handle up to about 100 variables and 100 con-
straints. However, if an NLP has special structure, it may
be possible to develop an efficient algorithm that can solve
substantially larger problems.
Topics related to LP and NLP include integer program-
ming (in which the values of one or more variables must be
integer), network programming, combinatorial optimiza-
tion, large-scale optimization, fixed-point computation, and
solving systems of linear and nonlinear equations.
BIBLIOGRAPHY
1. H. P. Williams, Model Building in Mathematical Program-
ming, 4th ed., New York: Wiley, 1999.
2. G. B. Dantzig, Maximization of a linear function of variables
subject to linear inequalities, in T. C. Koopmans (ed.), Activity
Analysis of Production and Allocation. NewYork: Wiley, 1951.
3. R. E. Bixby, Solving real-world linear programs: A decade and
more of progress, Oper. Res., 50(1): 3–15, 2002.
4. V. Klee and G. J. Minty, How good is the simplex algorithm?
in O. Shisha (ed.), Inequalities III, New York: Academic Press,
1972.
5. K. H. Borgwardt, Some distribution independent results about
the asymptotic order of the average number of pivot steps in
the simplex algorithm, Math. Oper. Res., 7: 441–462, 1982.
6. L. G. Khachian, Polynomial algorithms in linear programming
(in Russian), Doklady Akademiia Nauk SSSR, 244: 1093–
1096, 1979; English translation: Soviet Mathematics Doklady,
20: 191–194.
7. N. Karmarkar, A new polynomial-time algorithm for linear
programming, Combinatorica, 4: 373–395, 1984.
8. C. Roos, T. Terlaky, and J. P. Vial, Theory and Algorithms for
Linear Optimization: An Interior Point Approach, New York:
Wiley, 1997.
9. Stephen J. Wright, Primal-dual Interior-Point Methods,
Society for Industrial and Applied Mathematics, Philadelphia,
1997.
10. C. M. S. Bazaraa, J. J. Jarvis, and H. Sherali, Linear Pro-
gramming and Network Flows, 3rd ed., New York: Wiley, 2004.
11. G. Giorgi, A. Guerraggio, and J. Thierfelder, Mathematics
of Optimization: Smooth and Nonsmooth Case, 1st ed.,
Amsterdam: Elsevier, 2004.
12. D. Klatte and B. Kummer, Nonsmooth Equations in Optimiza-
tionRegularity, Calculus, Methods and Applications, Dor-
drecht: Kluwer Academic, 2002.
13. C. M. S. Bazaraa, H. Sherali, and C. M. Shetty, Nonlinear
Programming: Theory and Algorithms, 2nd ed., New York:
Wiley, 1993.
14. T. Terlaky (ed.), Interior Point Methods of Mathematical Pro-
gramming, Dordrecht, The Netherlands: Kluwer Academic,
1996.
DANIEL SOLOW
Case Western Reserve
University
Cleveland, Ohio
MARKOV CHAIN MONTE CARLO
SIMULATIONS
Markov Chain Monte Carlo (MCMC) simulations are
widely used in many branches of science. They are nearly
as old as computers themselves, since they started in earn-
est with a 1953 paper by Nicholas Metropolis, Arianna
Rosenbluth, Marshall Rosenbluth, Augusta Teller, and
Edward Teller (1) at the Los Alamos National Laboratory,
New Mexico. These authors invented what is nowadays
called the Metropolis algorithm. Various applications
and the history of the basic ideas are reviewed in
the proceedings (2) of the 2003 Los Alamos conference,
which celebrated the fiftieth anniversary of the Metropolis
algorithm.
OVERVIEW
The Monte Carlo method to compute integrals (3) amounts
to approximate an integral
R
V
dmXOX, inside a volume
V, where dmX is some integration measure and
R
V
dmX <1, by sampling. Without loss of generality,
one can assume that
R
V
dmX 1, and consider accord-
ingly dmX as a probability measure. Drawing indepen-
dently N
sample
values of X 2V according to this probability
measure, one has
Z
V
dmXOX
1
N
samples
X
N
samples
s1
OX
s
(1)
where X
s
is the sthrandomvariable drawn. The right hand
side of Equation 1 is an unbiased estimator of the integral,
namely it is exact in average, for any N
sample
. It converges
toward the correct value when N
samples
!1, with a
1=
N
samples
p
leading correction. This method is of practical
use if one can easily draw, in a computer program, random
values X
s
2V according to the measure dmX. This is the
case inparticular if one integrates inside anN- dimensional
hypercube with the at measure dm
flat
X /
Q
N
k1
dX
k
,
where the X
k
s are the Cartesian components of X, or
when the integral reduces to a nite sum. This is not the
case in the situation where this simple measure is multi-
plied by a nontrivial function vX of the components X
k
.
If w(X) is a smooth function in V with a limited range of
variations, one can still draw values of X according to the
simple measure dμ_flat and write (assuming without loss of
generality that ∫_V dμ(X) = 1 and ∫_V dμ_flat(X) = 1)

  ∫_V dμ(X) O(X) = ∫_V dμ_flat(X) w(X) O(X)
                 ≈ (1/N_samples) Σ_{s=1}^{N_samples} w(X_s) O(X_s)    (2)
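A minimal Python sketch of the estimators in Equations (1) and (2) over the unit hypercube follows; the integrand, the weight function, and the sample size are illustrative assumptions.

import random

def mc_flat_average(observable, weight, dim, n_samples):
    """Estimate the integral of weight(X)*observable(X) over the unit hypercube
    [0, 1]^dim by sampling X uniformly (the flat measure), as in Equation (2)."""
    total = 0.0
    for _ in range(n_samples):
        x = [random.random() for _ in range(dim)]   # one sample X_s from the flat measure
        total += weight(x) * observable(x)
    return total / n_samples

# Example: average of O(X) = x1 + x2 under the flat measure (weight identically 1,
# i.e., Equation (1)); the exact value is 1, and the statistical error is of order
# 1/sqrt(n_samples).
estimate = mc_flat_average(lambda x: x[0] + x[1], lambda x: 1.0, dim=2, n_samples=100_000)
print(estimate)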
If the function w(X) has a very large range of variations
with one or several sharp peaks, the sum in Equation (2) is
dominated by a few rare configurations, and the Monte
Carlo method does not work (most samples are drawn in
vain). This is the case of the problem considered in Ref. 1,
where w(X) is a Boltzmann weight, the exponential of N
times a function of order one, with N the number of mole-
cules in a box, which is a notoriously large number.
The Markov chain Monte Carlo method allows overcom-
ing this problem by generating a set of random X ∈ V,
distributed according to the full measure dμ(X), using
an auxiliary Markov chain. Note that often, in particular
in the physics literature, Monte Carlo is used as a short
name for Markov chain Monte Carlo (MCMC). MCMC is
sometimes called dynamic Monte Carlo, in order to distin-
guish it from the usual, static, Monte Carlo. Many prob-
ability distributions, which cannot be sampled directly,
allow for MCMC sampling. From now on we write formulas
for a discrete space V with K_st states, although the results
can be generalized. A Markov chain (4–7) is a sequence of
random variables X_1, X_2, X_3, ..., that can be viewed as
the successive states of a system as a function of a discrete
time t, with a transition probability P(X_{t+1} = r | X_t = s) =
W_{r,s} that is a function of r and s only. The next future state
X_{t+1} is a function of the current state X_t alone. Andrey
Markov (8) was the first to analyze these processes. In order
to analyze Markov chains, one introduces an ensemble of
chains with an initial probability distribution P(X_0 = s). By
multiplying this vector by the transition matrix repeatedly,
one obtains P(X_1 = s), and then P(X_2 = s), ..., successively.
The natural question is whether this sequence converges.
One says that a probability distribution w ≡ {w_s}_{s∈V} is an
equilibrium distribution of a Markov chain if it is left
invariant by the chain, namely if

  Σ_{s=1}^{K_st} W_{r,s} w_s = w_r    (3)
This condition is called balance in the physics literature.
A Markov chain is said to be ergodic (irreducible and
aperiodic in the mathematical literature) if for all states
r, s ∈ V there is an N_{r,s} such that for all t > N_{r,s} the probability
W^t_{s,r} to go from r to s in t steps is nonzero. If an equilibrium
distribution exists and if the chain is irreducible and aper-
iodic, one can show (4–7) that, starting from any distribu-
tion P(X_0 = s), the distribution after t steps, P(X_t = s),
converges when t → ∞ toward the equilibrium distribu-
tion w_s.
The Metropolis algorithm (see the next section) offers a
practical way to generate a Markov chain with a desired
equilibrium distribution, on a computer using a pseudo-
random number generator. Starting from an initial config-
uration X_0, successive configurations are generated. In
most cases, the convergence of P(X_t = s) toward w_s is expo-
nential in t, and one can safely assume that, after some
number of steps t_eq, equilibrium is reached and that the
configurations X_{t_eq+1}, X_{t_eq+2}, ... are distributed according to
w.
Whereas the random samples used in a conventional
Monte Carlo (MC) integration are obviously statistically
independent, those used in MCMC are correlated. The
effective number of independent events generated by an
equilibrated Markov chain is given by the number of
steps done divided by a quantity called the integrated auto-
correlation time t_int. In most cases, t_int does not depend on
the quantity O measured, but there are exceptions like, for
example, the vicinity of a second-order (continuous) phase
transition. The algorithm will fail if t_int is too big. In
practice it can be quite difficult to estimate t_int reliably,
and there can also be a residual effect of the starting
position. More sophisticated MCMC-based algorithms
such as coupling from the past (9,10) produce indepen-
dent samples rigorously, but at the cost of additional
computation and an unbounded (although finite on aver-
age) running time.
The method was originally developed to investigate a
statistical physics problem. Suitable applications arise as
well in computational biology, chemistry, physics, econom-
ics, and other sciences. MCMC calculations have also revo-
lutionized the eld of Bayesian statistics.
Most concepts of modern MCMC simulation were ori-
ginally developed by physicists and chemists, who still
seem to be at the cutting edge of new innovative develop-
ments into which mathematician, computer scientists, and
statisticians have joined. Their interest developed mainly
after a paper by Hastings (11), who generalized the accept/
reject step of the Metropolis method. Unfortunately a lan-
guage barrier developed that inhibits cross-fertilization.
For instance the heat bath algorithm, which may be
applied when the conditional distributions can be sampled
exactly, was introduced by physicist Refs. (12) and (13).
Thenit was rediscoveredinthe context of Bayesianrestora-
tion of images under the name Gibbs sampler (14).
Another example is the umbrella sampling algorithm
(15) that was invented in chemical physics and later redis-
covered and improved by physicists. This is just two of
many examples of how different names for the same
method emerged. The reader should be aware that this
article was written by physicists, so that their notations
and views are dominant in this article. The book by Liu (16)
tries to some extent to bridge the language barrier. Other
textbooks include those by Robert and Casella (17) for the
more mathematically minded reader, Landau and Binder
(18) for statistical physicists, Berg (19) from a physicist
point of view, but also covering statistics and providing
extensive computer code (in Fortran). Kendall, et al. (20)
edited a book that combines expositions from physicists,
statisticians, and computer scientists.
In the following discussion, we will first explain the basic
method and illustrate it for a simple statistical physics
system, the Ising ferromagnet in two dimensions, followed
by a few remarks on autocorrelations and cluster algo-
rithms. An overview of the MCMC updating scheme is
subsequently given. The final section of this article focuses
on so-called generalized ensemble algorithms.
MCMC AND STATISTICAL PHYSICS
As mentioned, the MCMC algorithm was invented to inves-
tigate problems in statistical physics. The aim of statistical
physics is to predict the average macroscopic properties of
systems made of many constituents, which can be, for
example, molecules in a box (as in the original article of
Metropolis et al.), magnetic moments (called spins) at fixed
locations in a piece of material, or polymer chains in a
solvent. If one considers the so-called canonical ensemble,
where the system has a fixed temperature T, one knows that
the probability to observe a given microscopic configuration
(or microstate) s (defined in the three examples given above
by the positions and velocities of all molecules; the orienta-
tions of all the spins; and the exact configuration of all the
polymer chains, respectively) is proportional to the Boltz-
mann weight exp(−E_s/k_B T) = exp(−βE_s), where E_s is the
energy of the configuration s, k_B is the Boltzmann constant,
and T is the temperature. (In the following discussion, we
will use a unit of temperature such that k_B = 1.) Let O_s be
the value of O computed in configuration s. The mean value
of any macroscopic observable O (e.g., the energy) is given by
the average of O_s over all possible configurations s, weighted
by the Boltzmann weight of the configuration. This sum
(or integral if one has continuous variables) is to be normal-
ized by the sum over all configurations of the Boltzmann
weight [the so-called partition function Z(T)], namely

  Ô = Ô(T) = ⟨O⟩ = Z(T)^{−1} Σ_{s=1}^{K_st} O_s e^{−E_s/T}    (4)

where Z(T) = Σ_{s=1}^{K_st} exp(−E_s/T). The index s = 1, ..., K_st
labels the configurations (states) of the system.
A particularly simple system is the Ising model, for which
the energy is given by

  E = − Σ_{<ij>} s_i s_j    (5)

Here the sum is over the nearest-neighbor sites of a hyper-
cubic D-dimensional lattice, and the Ising spins take the
values s_i = ±1, i = 1, ..., N, for a system of N = Π_{i=1}^D L_i
spins. Periodic boundary conditions are used in most simu-
lations. The energy per spin is e = E/N. The model
describes a ferromagnet for which the magnetic moments
are simplified to ±1 spins at the sites of the lattice. In the
N → ∞ limit (and for D > 1), this model has two phases
separated by a phase transition (a singularity) at the
critical temperature T = T_c. This is a second-order phase
transition, which is continuous in the energy, but not in the
specific heat. This model can be solved analytically when
D = 2. (This means that one can obtain exact analytical
expressions for the thermodynamic quantities, e.g., the
energy per spin, as a function of temperature, in the
N → ∞ limit.) This makes the two-dimensional (2-D) Ising
model an ideal testbed for MCMC algorithms.
The number of configurations of the Ising model is K_st = 2^N, because each spin can occur in two states (up or down). Already for a moderately sized lattice, say of linear dimension L_1 = L_2 = 100 in D = 2, K_st is a tremendously large number. With the exception of very small systems, it is therefore impossible to do the sums in Equation (4) explicitly. Instead, the large value of K_st suggests a statistical evaluation.
IMPORTANCE SAMPLING THROUGH MCMC
As explained in the previous section, one needs a procedure that generates configurations s with the Boltzmann probability

P_{B,s} = c_B w_{B,s} = c_B e^{-β E_s}    (6)

where the constant c_B is determined by the condition Σ_s P_{B,s} = 1. The vector P_B := {P_{B,s}} is called the Boltzmann state. When configurations are generated with the probability P_{B,s}, the expectation values [Equation (4)] are estimated by the arithmetic averages:

⟨O⟩ = lim_{N_samples → ∞} (1/N_samples) Σ_{s=1}^{N_samples} O_s    (7)
With the MCMC algorithm, this is obtained from a Markov chain with the following properties:

1. Irreducibility and aperiodicity (ergodicity in the physics literature): For any two configurations r and s, an integer N_{r,s} exists such that for all n > N_{r,s}, the probability W^n_{r,s} is nonzero.

2. Normalization: Σ_r W_{r,s} = 1.

3. Stationarity of the Boltzmann distribution (balance in the physics literature): The Boltzmann state [Equation (6)] is an equilibrium distribution, namely a right eigenvector of the matrix W with eigenvalue one, i.e., Σ_s W_{r,s} e^{-β E_s} = e^{-β E_r} holds.

There are many ways to construct a Markov process satisfying 1, 2, and 3. In practice, MCMC algorithms are often based on a stronger condition than 3, namely,

3'. Detailed balance:

W_{r,s} e^{-β E_s} = W_{s,r} e^{-β E_r}  for all r, s.    (8)
METROPOLIS ALGORITHM AND ILLUSTRATION FOR THE
ISING MODEL
Detailed balance obviously does not uniquely fix the transition probabilities W_{r,s}. The original Metropolis algorithm (1) has remained a popular choice because of its generality (it can be applied straightforwardly to any statistical physics model) and its computational simplicity. The original formulation generates configurations with the Boltzmann weights. It generalizes immediately to arbitrary weights (see the last section of this article). Given a configuration s, the algorithm proposes a new configuration r with a priori probabilities f(r, s). This new configuration r is accepted with probability

a_{r,s} = min(1, P_{B,r}/P_{B,s}) = 1 for E_r < E_s, and = exp[-β(E_r - E_s)] for E_r > E_s    (9)

If the new configuration is rejected, the old configuration is kept and counted again. For such decisions one normally uses a pseudorandom number generator, which delivers uniformly distributed pseudorandom numbers in the range [0, 1); see Refs. (18) and (19) for examples and Ref. (21) for the mathematics involved. The Metropolis algorithm then gives rise to the transition probabilities

W_{r,s} = f(r, s) a_{r,s}  for r ≠ s    (10)

and

W_{s,s} = f(s, s) + Σ_{r≠s} f(r, s) (1 - a_{r,s})    (11)

Therefore, the ratio W_{r,s}/W_{s,r} satisfies the detailed balance condition [Equation (8)] if

f(r, s) = f(s, r)    (12)

holds. This condition can be generalized when also the acceptance probability is changed (11). In particular, such biased Metropolis algorithms allow for approaching the heat-bath updating scheme (22).
For the Ising model, the new putative configuration differs from the old one by the flip of a single spin. Such a single-spin update requires a number of operations of order one and leads to the ratio P_{B,r}/P_{B,s} in Equation (9). This is fundamental in order to have an efficient MCMC algorithm. The spin itself may be chosen at random, although some sequential procedure is normally more efficient (19). The latter procedure violates the detailed balance condition but fulfills the balance condition, and thus it does lead to the probability of states approaching the Boltzmann distribution [Equation (6)]. The unit of Monte Carlo time is usually defined by N single-spin updates, also known as a lattice sweep or an update per spin. Many ways to generate initial configurations exist. Two easy-to-implement choices are as follows (a minimal code sketch follows the list):

1. Use random sampling to generate a configuration of ±1 spins.

2. Generate a completely ordered configuration, either all spins +1 or all -1.
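As an illustration of the update just described, the following is a minimal Python sketch of a Metropolis lattice sweep for the 2-D Ising model with periodic boundary conditions. The lattice size, seed, and variable names are illustrative choices, not taken from the article; the article's own reference implementation is the Fortran code accompanying Ref. 19.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_sweep(spins, beta):
    """One lattice sweep: N single-spin Metropolis updates with periodic boundaries."""
    L = spins.shape[0]
    for i in range(L):
        for j in range(L):
            # Sum of the four nearest neighbors (periodic boundary conditions).
            nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                  + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            # Energy change if spin (i, j) is flipped: dE = 2 * s_ij * nn.
            dE = 2.0 * spins[i, j] * nn
            # Metropolis acceptance, Equation (9).
            if dE <= 0.0 or rng.random() < np.exp(-beta * dE):
                spins[i, j] = -spins[i, j]
    return spins

# Small illustrative run; choice 1 (disordered start). An ordered start (choice 2)
# would be spins = np.ones((L, L), dtype=int).
L, beta = 32, 0.44
spins = rng.choice([-1, 1], size=(L, L))
for sweep in range(200):
    spins = metropolis_sweep(spins, beta)
# Energy per spin, each bond counted once.
e = -np.sum(spins * (np.roll(spins, 1, axis=0) + np.roll(spins, 1, axis=1))) / spins.size
print("energy per spin:", e)
```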
Figure 1 shows two Metropolis energy time series (the successive values of the energy as a function of the discrete Monte Carlo time) of 6000 updates per spin for a 2-D Ising model on a 100 × 100 lattice at β = 0.44, which is close to the (infinite volume) phase transition temperature of this model (β_c = ln(1 + √2)/2 = 0.44068...). One can compare with the exact value for e (23) on this system size, which is indicated by the straight line in Fig. 1. The time series corresponding to the ordered start begins at e = -2 and approaches the exact value from below, whereas the other time series begins (up to statistical fluctuation) at e = 0 and approaches the exact value from above. It takes a rather long MCMC time of about 3000 to 4000 updates per spin until the two time series start to mix. For estimating equilibrium expectation values, measurements should only be counted from there on. A serious error analysis for the subsequent equilibrium simulation finds an integrated autocorrelation time of τ_int ≈ 1700 updates per spin. This long autocorrelation time is related to the proximity of the phase transition (this is the so-called critical slowing down of the dynamics). One can show that, at the transition point, τ_int diverges like a power of N.
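The integrated autocorrelation time quoted above can be estimated from a measured time series. The sketch below uses a simple self-consistent truncation window; the windowing rule and the factor c are common illustrative choices, not the specific error analysis of Ref. 19.

```python
import numpy as np

def integrated_autocorrelation_time(series, c=6.0):
    """Estimate tau_int = 1/2 + sum_t rho(t), truncating the sum once the lag
    exceeds c * tau_int. Result is in units of the measurement interval
    (e.g., updates per spin if one measurement is taken per sweep)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    tau = 0.5
    for t in range(1, n // 2):
        rho = np.dot(x[:-t], x[t:]) / ((n - t) * var)  # normalized autocorrelation at lag t
        tau += rho
        if t >= c * tau:   # self-consistent window: stop once t exceeds c * tau_int
            break
    return tau
```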
For this model and a number of other systems with second-order phase transitions, cluster algorithms (24,25) are known, which have much shorter autocorrelation times in simulations close to the phase transition point. For the simulation of Fig. 1, τ_int ≈ 5 updates per spin instead of 1700. Furthermore, for cluster algorithms, τ_int grows much slower with the system size N than for the Metropolis algorithm. This happens because these algorithms allow for the instantaneous flip of large clusters of spins (still with order one acceptance probability), in contrast to the local updates of single spins done in the Metropolis and heat-bath-type algorithms. Unfortunately, such nonlocal updating algorithms have remained confined to special situations. The interested reader can reproduce the Ising model simulations discussed here using the code that comes with Ref. 19.
UPDATING SCHEMES
We give an overview of MCMC updating schemes that is, because of space limitations, far from complete.

1. We have already discussed the Metropolis scheme (1) and its generalization by Hastings (11). Variables of the system are locally updated.

2. Within such local schemes, the heat-bath method (12-14) is usually the most efficient (see the sketch after this list). In practice it can be approximated by tabulating Metropolis–Hastings probabilities (22).

3. For simulations at (very) low temperatures, event-driven simulations (26,27), also known as the N-fold way, are most efficient. They are based on Metropolis or heat-bath schemes.

4. As mentioned before, for a number of models with second-order phase transitions the MCMC efficiency is greatly improved by using nonlocal cluster updating (24).

5. Molecular dynamics (MD) moves can be used as proposals in MCMC updating (28,29), a scheme called hybrid MC; see Ref. (30) for a review of MD simulations.

More examples can be found in the contributions by W. Krauth in Ref. (31) and A. D. Sokal in Ref. (32).
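For concreteness, here is a hedged sketch of the heat-bath update mentioned in item 2, written for the 2-D Ising model: each spin is redrawn from its exact conditional Boltzmann distribution given its four neighbors. Names and the lattice setup are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def heat_bath_sweep(spins, beta):
    """One heat-bath sweep for the 2-D Ising model with periodic boundaries."""
    L = spins.shape[0]
    for i in range(L):
        for j in range(L):
            # Local field from the four nearest neighbors.
            h = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                 + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            # Conditional probability P(s_ij = +1 | neighbors) = 1 / (1 + exp(-2 beta h)).
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
            spins[i, j] = 1 if rng.random() < p_up else -1
    return spins
```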
GENERALIZED ENSEMBLES FOR MCMC SIMULATIONS
The MCMC method, which we discussed for the Ising model, generates configurations distributed according to the Boltzmann–Gibbs canonical ensemble, with weights P_{B,s}. Mean values of physical observables at the chosen temperature are obtained as arithmetic averages of the measurements made [Equation (7)]. There are, however, circumstances in statistical physics where another ensemble (other weights P_{NB,s}) can be more convenient. One case is the computation of the partition function Z(T) as a function of temperature. Another is the investigation of configurations of physical interest that are rare in the canonical ensemble. Finally, the efficiency of the Markov process, i.e., the computer time needed to obtain a desired accuracy, can depend greatly on the ensemble in which the simulations are performed. This is the case, for example, when, with the Boltzmann weights, the phase space separates, loosely speaking, into several populated regions separated by mostly vacant regions, creating so-called free energy barriers for the Markov chain.
A first attempt to calculate the partition function by MCMC simulations dates back to a 1959 paper by Salsburg et al. (33). As noticed by the authors, their method is restricted to very small lattices. The reason is that their approach relies on what is called in modern language reweighting: It evaluates results at a given temperature from data taken at another temperature. The reweighting method has a long history. McDonald and Singer (34) were the first to use it to evaluate physical quantities over a range of temperatures from a simulation done at a single temperature. Thereafter dormant, the method was rediscovered in an article (35) focused on calculating complex zeros of the partition function. It remained to Ferrenberg and Swendsen (36) to formulate a crystal-clear picture of what the method is particularly good for, and what it is not: The reweighting method allows for focusing on maxima of appropriate observables, but it does not allow for covering a finite temperature range in the infinite volume limit.
Figure 1. Two-dimensional Ising model: two initial Metropolis time series for the energy per spin e.

To estimate the partition function over a finite energy density range Δe (i.e., ΔE = N Δe), one can patch the histograms from simulations at several temperatures. Such multi-histogram methods also have a long tradition. In 1972 Valleau
and Card (37) proposed the use of overlapping bridging distributions and called their method multistage sampling. Free energy and entropy calculations become possible when one can link the temperature region of interest with a range for which exact values of these quantities are known. Modern work (38,39) developed efficient techniques to combine the overlapping distributions into one estimate of the spectral density ρ(E) [with Z(T) = ∫ dE ρ(E) exp(-βE)] and to control the statistical errors of the estimate. However, the patching of histograms from canonical simulations faces several limitations:

1. The number of canonical simulations needed diverges like √N when one wants to cover a finite, noncritical range of the energy density.

2. At a first-order phase transition point, the canonical probability of configurations with an interface decreases exponentially with N.
One can cope with the difficulties of multi-histogram methods by allowing arbitrary sampling distributions instead of just the Boltzmann–Gibbs ensemble. This was first recognized by Torrie and Valleau (15) when they introduced umbrella sampling. However, for the next 13 years, the potentially very broad range of applications of the basic idea remained unrecognized. A major barrier, which prevented researchers from trying such extensions, was certainly the apparent lack of direct and straightforward ways of determining suitable weighting functions P_{NB,s} for problems at hand. In the words of Li and Scheraga (40): "The difficulty of finding such weighting factors has prevented wide applications of the umbrella sampling method to many physical systems."
This changed with the introduction of the multicanonical ensemble (41), which focuses on well-defined weight functions and offers a variety of methods to find a working approximation. Here a working approximation is defined as being accurate enough that the desired energy range will indeed be covered after the weight factors are fixed. A similar approach can also be constructed for cluster algorithms (42). A typical simulation then consists of three parts (a sketch of the reweighting step follows the list):

1. Construct a working approximation of the weight function P_muca. The Wang–Landau recursion (43) is an efficient approach. See Ref. (44) for a comparison with other methods.

2. Perform a conventional MCMC simulation with these weight factors.

3. Reweight the data to the desired ensemble: ⟨O⟩ = lim_{N_samples → ∞} (1/N_samples) Σ_{s=1}^{N_samples} O_s P_{B,s}/P_{muca,s}. Details can be found in Ref. 19.
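Step 3 can be carried out with a normalized ratio estimator. The following sketch is one possible implementation under the assumption that the simulation recorded, for each measurement, the energy, the observable, and the logarithm of the multicanonical weight; function and argument names are illustrative, not taken from the article.

```python
import numpy as np

def muca_to_canonical(energies, observables, log_w_muca, beta):
    """Reweight measurements taken with multicanonical weights w_muca(E)
    back to the canonical ensemble at inverse temperature beta."""
    E = np.asarray(energies, dtype=float)
    O = np.asarray(observables, dtype=float)
    # Relative weight of each sample: exp(-beta E) / w_muca(E), computed in log space.
    logw = -beta * E - np.asarray(log_w_muca, dtype=float)
    logw -= logw.max()          # stabilize the exponentials
    w = np.exp(logw)
    return np.sum(w * O) / np.sum(w)
```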
Another class of algorithms appeared within a couple of years around 1991 in several papers (45-50), which all aimed at improving MCMC calculations by extending the confines of the canonical ensemble. Practically most important has been the replica exchange method, which is also known under the names parallel tempering and multiple Markov chains. In the context of spin glass simulations, an exchange of partial lattice configurations at different temperatures was proposed by Swendsen and Wang (45). But it was only recognized later (46,50) that the special case in which entire configurations are exchanged is of utmost importance; see Ref. 19 for more details. Closely related is the Jump Walker (J-Walker) approach (51), which feeds replicas from a higher temperature into a simulation at a lower temperature instead of exchanging them. But in contrast to the replica exchange method, this procedure disturbs equilibrium to some extent. Finally, and perhaps most importantly, from about 1992 on, applications of generalized ensemble methods diversified tremendously, as documented in a number of reviews (52-54).
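A minimal sketch of the replica exchange (parallel tempering) swap move is given below. The acceptance probability min(1, exp[(β_i - β_j)(E_i - E_j)]) for exchanging the configurations of two adjacent temperatures is standard, while the data layout and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def swap_step(betas, energies, configs):
    """One replica-exchange step: attempt to swap the configurations of each
    pair of adjacent temperatures."""
    for i in range(len(betas) - 1):
        # Log of the acceptance ratio for swapping replicas i and i+1.
        delta = (betas[i] - betas[i + 1]) * (energies[i] - energies[i + 1])
        if delta >= 0.0 or rng.random() < np.exp(delta):
            configs[i], configs[i + 1] = configs[i + 1], configs[i]
            energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return energies, configs
```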
Figure 2. Multicanonical P_muca(E) together with the canonical P(E) energy distribution as obtained in Ref. 41 for the 2d 10-state Potts model on a 70 × 70 lattice. In the multicanonical ensemble, the gap between the two peaks present in P(E) is filled up, considerably accelerating the dynamics of the Markov chain.

Figure 3. Canonical energy distributions P(E) from a parallel tempering simulation with eight processes for the 2d 10-state Potts model on 20 × 20 lattices (Fig. 6.2 in Ref. 19).
The basic mechanisms for overcoming energy barriers with the multicanonical algorithm are best illustrated for first-order phase transitions (namely, a transition with a nonzero latent heat, like the ice-water transition), where one deals with a single barrier. For a finite system the temperature can be fine-tuned to a pseudocritical value, which is defined so that the energy density exhibits two peaks of equal heights. To give an example, Fig. 2 shows for the 2d 10-state Potts model (55) the canonical energy histogram at a pseudocritical temperature versus the energy histogram of a multicanonical simulation (41). The same barrier can also be overcome by a parallel tempering simulation, but in a quite different way. Figure 3 shows the histograms from a parallel tempering simulation with eight processes on 20 × 20 lattices. The barrier can be jumped when there are, on both sides, temperatures in the ensemble that are sufficiently close to a pseudocritical temperature for which the two peaks of the histogram are of competitive height. In complex systems with a rugged free energy landscape (spin glasses, biomolecules, ...), the barriers can no longer be explicitly controlled. Nevertheless, it has turned out that switching to the discussed ensembles can greatly enhance the MCMC efficiency (53,54). For a recent discussion of ensemble optimization techniques, see Ref. (56) and references given therein.
BIBLIOGRAPHY
1. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, Equation of state calculations by fast computing machines. J. Chem. Phys., 21: 1087-1092, 1953.

2. J. Gubernatis (Editor), The Monte Carlo Method in the Physical Sciences: Celebrating the 50th Anniversary of the Metropolis Algorithm, AIP Conference Proc. Vol. 690, Melville, NY, 2003.

3. N. Metropolis and S. Ulam, The Monte Carlo method. J. Am. Stat. Assoc., 44: 335-341, 1949.

4. J. G. Kemeny and J. L. Snell, Finite Markov Chains. New York: Springer, 1976.

5. M. Iosifescu, Finite Markov Processes and Their Applications. Chichester: Wiley, 1980.

6. K. L. Chung, Markov Chains with Stationary Transition Probabilities, 2nd ed. New York: Springer, 1967.

7. E. Nummelin, General Irreducible Markov Chains and Non-Negative Operators. Cambridge: Cambridge Univ. Press, 1984.

8. A. A. Markov, Rasprostranenie zakona bol'shih chisel na velichiny, zavisyaschie drug ot druga. Izvestiya Fiziko-matematicheskogo obschestva pri Kazanskom Universitete, 2-ya seriya, tom 15: 135-156, 1906.

9. J. G. Propp and D. B. Wilson, Exact sampling with coupled Markov chains and applications in statistical mechanics. Random Structures and Algorithms, 9: 223-252, 1996.

10. J. G. Propp and D. B. Wilson, Coupling from the past: A user's guide. DIMACS Series in Discrete Mathematics and Theoretical Computer Science (AMS), 41: 181-192, 1998.

11. W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57: 97-109, 1970.

12. R. J. Glauber, Time-dependent statistics of the Ising model. J. Math. Phys., 4: 294-307, 1963.

13. M. Creutz, Monte Carlo study of quantized SU(2) gauge theory. Phys. Rev. D, 21: 2308-2315, 1980.

14. S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6: 721-741, 1984.

15. G. M. Torrie and J. P. Valleau, Nonphysical sampling distributions in Monte Carlo free energy estimation: Umbrella sampling. J. Comp. Phys., 23: 187-199, 1977.

16. J. S. Liu, Monte Carlo Strategies in Scientific Computing. New York: Springer, 2001.

17. C. P. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed. New York: Springer, 2005.

18. D. P. Landau and K. Binder, A Guide to Monte Carlo Simulations in Statistical Physics. Cambridge: Cambridge University Press, 2000.

19. B. A. Berg, Markov Chain Monte Carlo Simulations and Their Statistical Analysis. Singapore: World Scientific, 2004.

20. W. S. Kendall, F. Liang, and J.-S. Wang (Eds.), Markov Chain Monte Carlo: Innovations and Applications (Lecture Notes Series, Institute for Mathematical Sciences, National University of Singapore). Singapore: World Scientific, 2005.
21. D. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed. Reading, MA: Addison-Wesley, 1997, pp. 1-193.

22. A. Bazavov and B. A. Berg, Heat bath efficiency with Metropolis-type updating. Phys. Rev. D, 71: 114506, 2005.

23. A. E. Ferdinand and M. E. Fisher, Bounded and inhomogeneous Ising models. I. Specific-heat anomaly of a finite lattice. Phys. Rev., 185: 832-846, 1969.

24. R. H. Swendsen and J.-S. Wang, Non-universal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett., 58: 86-88, 1987.

25. U. Wolff, Collective Monte Carlo updating for spin systems. Phys. Rev. Lett., 62: 361-363, 1989.

26. A. B. Bortz, M. H. Kalos, and J. L. Lebowitz, A new algorithm for Monte Carlo simulation of Ising spin systems. J. Comp. Phys., 17: 10-18, 1975.

27. M. A. Novotny, A tutorial on advanced dynamic Monte Carlo methods for systems with discrete state spaces. Ann. Rev. Comp. Phys., 9: 153-210, 2001.

28. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, Hybrid Monte Carlo. Phys. Lett. B, 195: 216-222, 1987.

29. B. Mehlig, D. W. Heermann, and B. M. Forrest, Hybrid Monte Carlo methods for condensed-matter systems. Phys. Rev. B, 45: 679-685, 1992.

30. D. Frenkel and B. Smit, Understanding Molecular Simulation. San Diego, CA: Academic Press, 1996.

31. J. Kertesz and I. Kondor (Eds.), Advances in Computer Simulation. Lecture Notes in Physics, Heidelberg: Springer Verlag, 1998.

32. K. Binder (Ed.), Monte Carlo and Molecular Dynamics Simulations in Polymer Science. Oxford: Oxford University Press, 1996.

33. Z. W. Salsburg, J. D. Jacobson, W. S. Fickett, and W. W. Wood, Applications of the Monte Carlo method to the lattice-gas model. I. Two-dimensional triangular lattice. J. Chem. Phys., 30: 65-72, 1959.

34. I. R. McDonald and K. Singer, Calculation of thermodynamic properties of liquid argon from Lennard-Jones parameters by a Monte Carlo method. Discussions Faraday Soc., 43: 40-49, 1967.

35. M. Falcioni, E. Marinari, L. Paciello, G. Parisi, and B. Taglienti, Complex zeros in the partition function of the four-dimensional SU(2) lattice gauge model. Phys. Lett. B, 108: 331-332, 1982.
36. A. M. Ferrenberg and R. H. Swendsen, New Monte Carlo technique for studying phase transitions. Phys. Rev. Lett., 61: 2635-2638, 1988; 63: 1658, 1989.

37. J. P. Valleau and D. N. Card, Monte Carlo estimation of the free energy by multistage sampling. J. Chem. Phys., 57: 5457-5462, 1972.

38. A. M. Ferrenberg and R. H. Swendsen, Optimized Monte Carlo data analysis. Phys. Rev. Lett., 63: 1195-1198, 1989.

39. N. A. Alves, B. A. Berg, and R. Villanova, Ising-model Monte Carlo simulations: Density of states and mass gap. Phys. Rev. B, 41: 383-394, 1990.

40. Z. Li and H. A. Scheraga, Structure and free energy of complex thermodynamic systems. J. Mol. Struct. (Theochem), 179: 333-352, 1988.

41. B. A. Berg and T. Neuhaus, Multicanonical ensemble: A new approach to simulate first-order phase transitions. Phys. Rev. Lett., 68: 9-12, 1992.

42. W. Janke and S. Kappler, Multibondic cluster algorithm for Monte Carlo simulations of first-order phase transitions. Phys. Rev. Lett., 74: 212-215, 1995.

43. F. Wang and D. P. Landau, Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett., 86: 2050-2053, 2001.

44. Y. Okamoto, Metropolis algorithms in generalized ensemble. In Ref. (2), pp. 248-260. On the web at http://arxiv.org/abs/cond-mat/0308119.

45. R. H. Swendsen and J.-S. Wang, Replica Monte Carlo simulations of spin glasses. Phys. Rev. Lett., 57: 2607-2609, 1986.

46. C. J. Geyer, Markov chain Monte Carlo maximum likelihood, in E. M. Keramidas (ed.), Computing Science and Statistics, Proc. of the 23rd Symposium on the Interface. Fairfax, VA: Interface Foundation, 1991, pp. 156-163.

47. C. J. Geyer and E. A. Thompson, Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc., 90: 909-920, 1995.

48. E. Marinari and G. Parisi, Simulated tempering: A new Monte Carlo scheme. Europhys. Lett., 19: 451-458, 1992.

49. A. P. Lyubartsev, A. A. Martsinovski, S. V. Shevkanov, and P. N. Vorontsov-Velyaminov, New approach to Monte Carlo calculation of the free energy: Method of expanded ensembles. J. Chem. Phys., 96: 1776-1783, 1992.

50. K. Hukushima and K. Nemoto, Exchange Monte Carlo method and applications to spin glass simulations. J. Phys. Soc. Japan, 65: 1604-1608, 1996.

51. D. D. Frantz, D. L. Freeman, and J. D. Doll, Reducing quasi-ergodic behavior in Monte Carlo simulations by J-walking: Applications to atomic clusters. J. Chem. Phys., 93: 2769-2784, 1990.

52. W. Janke, Multicanonical Monte Carlo simulations. Physica A, 254: 164-178, 1998.

53. U. H. Hansmann and Y. Okamoto, The generalized-ensemble approach for protein folding simulations. Ann. Rev. Comp. Phys., 6: 129-157, 1999.

54. A. Mitsutake, Y. Sugita, and Y. Okamoto, Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers (Peptide Science), 60: 96-123, 2001.

55. F. Y. Wu, The Potts model. Rev. Mod. Phys., 54: 235-268, 1982.

56. S. Trebst, D. A. Huse, E. Gull, H. G. Katzgraber, U. H. E. Hansmann, and M. Troyer, Ensemble optimization techniques for the simulation of slowly equilibrating systems. Invited talk at the 19th Annual Workshop on Computer Simulation Studies in Condensed Matter Physics, D. P. Landau, S. P. Lewis, and H.-B. Schuettler (eds.), Athens, GA, 20-24 February 2006; Springer Proceedings in Physics, Vol. 115, 2007. Available on the internet at: http://arxiv.org/abs/cond-mat/0606006.
BERND A. BERG
Florida State University
Tallahassee, Florida
ALAIN BILLOIRE
Service de Physique Theorique
CEA Saclay
Gif-sur-Yvette, France
MARKOV CHAINS
INTRODUCTION
A Markov chain is, roughly speaking, a collection of random variables with a temporal ordering that have the property that, conditional upon the present, the future does not depend on the past. This concept, which can be viewed as a form of something known as the Markov property, will be made precise below, but the principal point is that such collections lie somewhere between a collection of independent random variables and a completely general collection, which could be extremely complex to deal with.
Andrei Andreevich Markov commenced the analysis of such collections of random variables in 1907, and their analysis remains an active area of research to this day. The study of Markov chains is one of the great achievements of probability theory. In his seminal work (1), Andrei Nikolaevich Kolmogorov remarked, "Historically, the independence of experiments and random variables represents the very mathematical concept that has given probability its peculiar stamp."

However, there are many situations in which it is necessary to consider sequences of random variables that cannot be considered to be independent. Kolmogorov went on to observe that although "[Markov et al.] frequently fail to assume complete independence, they nevertheless reveal the importance of assuming analogous, weaker conditions, in order to obtain significant results." The aforementioned Markov property, the defining feature of the Markov chain, is such an analogous, weaker condition, and it has proved both strong enough to allow many powerful results to be obtained and weak enough to encompass a great many interesting cases.
Much development in probability theory during the latter part of the last century consisted of the study of sequences of random variables that are not entirely independent. Two weaker, but related, conditions proved to be especially useful: the Markov property, which defines the Markov chain, and the martingale property. Loosely speaking, a martingale is a sequence of random variables whose expectation at any point in the future, conditional on the past and the present, is equal to its current value. A broad and deep literature exists on the subject of martingales, which will not be discussed in this article. A great many people have worked on the theory of Markov chains, as well as their application to problems in a diverse range of areas, over the past century, and it is not possible to enumerate them all here.
There are two principal reasons that Markov chains play such a prominent role in modern probability theory. The first reason is that they provide a powerful yet tractable framework in which to describe, characterize, and analyze a broad class of sequences of random variables that find applications in numerous areas, from particle transport through finite state machines and even the theory of gene expression. The second reason is that a collection of powerful computational algorithms have been developed to provide samples from complicated probability distributions via the simulation of particular Markov chains: These Markov chain Monte Carlo methods are now ubiquitous in all fields in which it is necessary to obtain samples from complex probability distributions, and this has driven much of the recent research in the field of Markov chains.
The areas in which Markov chains occur are far too numerous to list here, but here are some typical examples:

• Any collection of independent random variables forms a Markov chain: In this case, given the present, the future is independent of the past and the present.

• The celebrated symmetric random walk over the integers provides a classic example: The next value taken by the chain is one more or less than the current value with equal probability, regardless of the route by which the current value was reached. Despite its simplicity, this example, and some simple generalizations, can exhibit a great many interesting properties.

• Many popular board games have a Markov chain representation, for example, Snakes and Ladders, in which there are 100 possible states for each counter (actually, there are somewhat fewer, as it is not possible to end a turn at the top of a snake or the bottom of a ladder), and the next state occupied by any particular counter is one of the six states that can be reached from the current one, each with equal probability. So, the next state is a function of the current state and an external, independent random variable that corresponds to the roll of a die.

• More practically, the current amount of water held in a reservoir can be viewed as a Markov chain: The volume of water stored after a particular time interval will depend only on the volume of water stored now and two random quantities: the amount of water leaving the reservoir and the amount of water entering the reservoir. More sophisticated variants of this model are used in numerous areas, particularly within the field of queueing theory (where water volume is replaced by customers awaiting service).

• The evolution of a finite state machine can be viewed as the evolution of a (usually deterministic) Markov chain.
It is common to think of Markov chains as describing the trajectories of dynamic objects. In some circumstances, a natural dynamic system can be associated with a collection of random variables with the right conditional independence structure; the random walk example discussed previously, for example, can be interpreted as moving from one position to the next, with the nth element of the associated Markov chain corresponding to its position at discrete time index n. As the distribution of each random variable in the sequence depends only on the value of the previous element of the sequence, one can endow any such collection (assuming that one can order the elements of the collection, which the definition of a Markov chain employed here ensures is always possible) with a dynamic structure. One simply views the distribution of each element, conditional on the value of the previous one, as being the probability of moving between those states at that time. This interpretation provides no great insight, but it can allow for simpler interpretations and descriptions of the behavior of collections of random variables of the sort described here. Indeed, it is the image of a chain of states, with each one leading to the next, that suggests the term Markov chain.
STOCHASTIC PROCESSES
To proceed to the formal definition of a Markov chain, it is first necessary to make precise what is meant by a collection of random variables with some temporal ordering. Such a collection of random variables may be best characterized as a stochastic process. An E-valued process is a function X : I → E that maps values in some index set I to some other space E. The evolution of the process is described by considering the variation of X_i := X(i) with i. An E-valued stochastic process (or random process) can be viewed as a process in which, for each i ∈ I, X_i is a random variable taking values in E.

Although a rich literature on more general situations exists, this article will consider only discrete time stochastic processes in which the index set I is the natural numbers, N (of course, any index set isomorphic to N can be used in the same framework by simple relabeling). The notation X_i is used to indicate the value of the process at time i (note that there need be no connection between the index set and real time, but this terminology is both convenient and standard). Note that the Markov property may be extended to continuous time processes in which the index set is the positive real numbers, and this leads to a collection of processes known as either Markov processes or continuous time Markov chains. Such processes are not considered here in more detail, as they are of somewhat lesser importance in computer science and engineering applications. A rich literature on these processes does exist, and many of the results available in the discrete time case have continuous time analogs; indeed, some results may be obtained considerably more naturally in the continuous time setting.
At this point, a note on terminology is necessary. Originally, the term Markov chain was used to describe any stochastic process with the Markov property and a finite state space. Some references still use this definition today. However, in computer science, engineering, and computational statistics, it has become more usual to use the term to refer to any discrete time stochastic process with the Markov property, regardless of the state space, and this is the definition used here. Continuous time processes with the Markov property will be termed Markov processes, and little reference will be made to them. This usage is motivated by considerations developing from Markov chain Monte Carlo methods and is standard in more recent literature.
Filtrations and Stopping Times
This section consists of some technical details that, although not essential to a basic understanding of stochastic processes or Markov chains in particular, are fundamental and will be encountered in any work dealing with these subjects.

A little more technical structure is generally required to deal with stochastic processes than with simple random variables. Although technical details are avoided as far as possible in this article, the following concept will be needed to understand much of the literature on Markov chains.

To deal with simple random variables, it suffices to consider a probability space (Ω, F, P) in which Ω is the set of events, F is the σ-algebra corresponding to the collection of measurable outcomes (i.e., the collection of subsets of Ω to which it is possible to assign a probability; typically the collection of all subsets of Ω in the discrete case), and P is the probability measure, which tells us the probability that any element of F contains the event that occurs: P : F → [0, 1]. To deal with stochastic processes, it is convenient to define a filtered probability space (Ω, F, {F_i}_{i ∈ N}, P). The collection of sub-σ-algebras, {F_i}_{i ∈ N}, which is termed a filtration, has a particular structure:

F_1 ⊆ F_2 ⊆ ... ⊆ F_n ⊆ F_{n+1} ⊆ ... ⊆ F

and its most important property is that, for any n, the collection of variables X_1, X_2, ..., X_n must be measurable with respect to F_n. Although much more generality is possible, it is usually sufficient to consider the natural filtration of a process: that is, the one generated by the process itself. Given any collection of random variables on a common probability space, a smallest σ-algebra exists with respect to which those random variables are jointly measurable. The natural filtration is the filtration generated by setting each F_n equal to the smallest σ-algebra with respect to which X_1, ..., X_n are measurable. Only this filtration will be considered in the current article. An intuitive interpretation of this filtration, which provides increasingly fine subdivisions of the probability space, is that F_n tells us how much information can be provided by knowledge of the values of the first n random variables: It tells us which events can be distinguished given knowledge of X_1, ..., X_n.
It is natural when considering a process of this sort to ask questions about random times: Is there anything to stop us from defining additional random variables that have an interpretation as the index that identifies a particular time in the evolution of the process? In general, some care is required if these random times are to be useful: If the temporal structure is real, then it is necessary for us to be able to determine whether the time that has been reached so far is the time of interest, given some realization of the process up to that time. Informally, one might require that {τ = n} can be ascribed a probability of zero or one, given knowledge of the first n states, for any n. In fact, this is a little stronger than the actual requirement, but it provides a simple interpretation that suffices for many purposes. Formally, if τ : Ω → I is a random time, and the event {ω : τ(ω) = n} ∈ F_n for all n, then τ is known as a stopping time. Note that this condition amounts to requiring that the event {ω : τ(ω) = n} is independent of all subsequent states of the chain, X_{n+1}, X_{n+2}, ..., conditional upon X_1, ..., X_n. The most common example of a stopping time is the hitting time, τ_A, of a set A:

τ_A := inf{n : X_n ∈ A}

which corresponds to the first time that the process enters the set A. Note that the apparently similar

τ'_A := inf{n : X_{n+1} ∈ A}

is not a stopping time (in any degree of generality), as the state of the chain at time n + 1 is not necessarily known in terms of the first n states.
Note that this distinction is not an artificial or frivolous one. Consider the chain produced by setting X_n = X_{n-1} + W_n, where the W_n are a collection of independent random variables that correspond to the value of a gambler's winnings, in dollars, in the nth independent game that he plays. If A = [10,000, ∞), then τ_A would correspond to the event of having won $10,000, and, indeed, it would be possible to stop when this occurred. Conversely, if A = (-∞, -10,000], then τ'_A would correspond to the last time before $10,000 have been lost. Although many people would like to be able to stop betting immediately before losing money, it is not possible to know that one will lose the next one of a sequence of independent games.

Given a stopping time, τ, it is possible to define the stopped process, X^τ_1, X^τ_2, ..., associated with the process X_1, X_2, ..., which has the expected definition; writing m ∧ n for the smaller of m and n, define X^τ_n = X_{τ ∧ n}. That is, the stopped process corresponds to the process itself at all times up to the random stopping time, after which it takes the value it had at that stopping time: It stops. In the case of τ_A, for example, the stopped process mirrors the original process until it enters A, and then it retains the value it had upon entry to A for all subsequent times.
MARKOV CHAINS ON DISCRETE STATE SPACES
Markov chains that take values in a discrete state space, such as the positive integers or the set of colors with elements red, green, and blue, are relatively easy to define and use. Note that this class of Markov chains includes those whose state space is countably infinite: As is often the case with probability, little additional difficulty is introduced by the transition from finite to countable spaces, but considerably more care is needed to deal rigorously with uncountable spaces.
To specify the distribution of a Markov chain on a discrete state space, it is intuitively sufficient to provide an initial distribution, the marginal distribution of its first element, and the conditional distributions of each element given the previous one. To formalize this notion, and precisely what the Markov property referred to previously means, it is useful to consider the joint probability distribution of the first n elements of the Markov chain. Using the definition of conditional probability, it is possible to write the joint distribution of n random variables, X_1, ..., X_n, in the following form, using X_{1:n} to denote the vector (X_1, ..., X_n):

P(X_{1:n} = x_{1:n}) = P(X_1 = x_1) Π_{i=2}^{n} P(X_i = x_i | X_{1:i-1} = x_{1:i-1})

The probability that each of the first n elements takes particular values can be decomposed recursively as the probability that all but one of those elements take the appropriate values and the conditional probability that the remaining element takes the specified value given that the other elements take the specified values.
This decomposition could be employed to describe the finite-dimensional distributions (that is, the distributions of the random variables associated with finite subsets of I) of any stochastic process. In the case of a Markov chain, the distribution of any element is influenced only by the previous state if the entire history is known: This is what is meant by the statement that, conditional upon the present, the future is independent of the past. This property may be written formally as

P(X_n = x_n | X_{1:n-1} = x_{1:n-1}) = P(X_n = x_n | X_{n-1} = x_{n-1})

and so for any discrete state space Markov chain:

P(X_{1:n} = x_{1:n}) = P(X_1 = x_1) Π_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})
As an aside, it is worthwhile to notice that Markov chains encompass a much broader class of stochastic processes than is immediately apparent. Given any stochastic process in which, for all n > L and x_{1:n-1},

P(X_n = x_n | X_{1:n-1} = x_{1:n-1}) = P(X_n = x_n | X_{n-L:n-1} = x_{n-L:n-1})

it suffices to consider a process Y on the larger space E^L defined as

Y_n = (X_{n-L+1}, ..., X_n)

Note that (X_{1-L}, ..., X_0) can be considered arbitrary without affecting the argument. Now, it is straightforward to determine that the distribution of Y_{n+1} depends only on Y_n. In this way, any stochastic process with a finite memory may be cast into the form of a Markov chain on an extended space.
The Markov property, as introduced above, is more correctly known as the weak Markov property, and in the case of Markov chains in which the transition probability is not explicitly dependent on the time index, it is normally written in terms of expectations of integrable test functions φ : E^m → R, where m may be any positive integer. The weak Markov property in fact tells us that the expected value of any integrable test function of the next m states of a Markov chain depends only on the value of the current state, so for any n and any x_{1:n}:

E[φ(X_{n+1}, ..., X_{n+m}) | X_{1:n}] = E[φ(X_{n+1}, ..., X_{n+m}) | X_n]

It is natural to attempt to generalize this by considering random times rather than deterministic ones. The strong Markov property requires that, for any stopping time τ, the following holds:

E[φ(X_{τ+1}, ..., X_{τ+m}) | X_{1:τ}] = E[φ(X_{τ+1}, ..., X_{τ+m}) | X_τ]

In continuous time settings, these two properties allow us to distinguish between weak and strong Markov processes (the latter is a strict subset of the former, because τ = n is a stopping time). However, in the discrete time setting, the weak and strong Markov properties are equivalent and are possessed by Markov chains as defined above.
It is conventional to view a Markov chain as describing the path of a dynamic object, which moves from one state to another as time passes. Many physical systems that can be described by Markov chains have precisely this property, for example, the motion of a particle in an absorbing medium. The position of the particle, together with an indication as to whether it has been absorbed or not, may be described by a Markov chain whose states contain coordinates and an absorbed/not-absorbed flag. It is then natural to think of the initial state as having a particular distribution, say, μ(x_1) = P(X_1 = x_1), and, furthermore, for there to be some transition kernel that describes the distribution of moves from a state x_{n-1} to a state x_n at time n, say, K_n(x_{n-1}, x_n) = P(X_n = x_n | X_{n-1} = x_{n-1}). This allows us to write the distribution of the first n elements of the chain in the compact form:

P(X_{1:n} = x_{1:n}) = μ(x_1) Π_{i=2}^{n} K_i(x_{i-1}, x_i)

Nothing prevents these transition kernels from being explicitly dependent on the time index; for example, in the reservoir example presented above, one might expect both water usage and rainfall to have a substantial seasonal variation, and so the volume of water stored tomorrow would be influenced by the date as well as by the volume stored today. However, it is not surprising that for a great many systems of interest (and most of those used in computer simulation) the transition kernel has no dependence on the time. Markov chains that have the same transition kernel at all times are termed time homogeneous (or sometimes simply homogeneous) and will be the main focus of this article.
In the time homogeneous context, the n-step transition kernels, denoted K^n, which have the property that P(X_{m+n} = x_{m+n} | X_m = x_m) = K^n(x_m, x_{m+n}), may be obtained inductively as

K^n(x_m, x_{m+n}) = Σ_{x_{m+1}} K(x_m, x_{m+1}) K^{n-1}(x_{m+1}, x_{m+n})

for any n > 1, whereas K^1(x_m, x_{m+1}) = K(x_m, x_{m+1}).
A Matrix Representation
The functional notation above is convenient, as it generalizes to Markov chains on state spaces that are not discrete. However, discrete state space Markov chains exist in abundance in engineering and particularly in computer science. It is convenient to represent probability distributions on finite spaces as a row vector of probability values. To define such a vector μ, simply set μ_i = P(X = i) (where X is some random variable distributed according to μ). It is also possible to define a Markov kernel on this space by setting the elements of a matrix K equal to the probability of moving from a state i to a state j; i.e.:

K_{ij} = P(X_n = j | X_{n-1} = i)

Although this may appear little more than a notational nicety, it has some properties that make manipulations particularly straightforward; for example, if X_1 ~ μ, then:

P(X_2 = j) = Σ_i P(X_1 = i) P(X_2 = j | X_1 = i) = Σ_i μ_i K_{ij} = (μK)_j

where μK denotes the usual vector-matrix product and (μK)_j denotes the jth element of the resulting row vector. In fact, it can be shown inductively that P(X_n = j) = (μK^{n-1})_j, where K^{n-1} is the usual matrix power of K. Even more generally, the conditional distributions may be written in terms of the transition matrix, K:

P(X_{n+m} = j | X_n = i) = (K^m)_{ij}

and so a great many calculations can be performed via simple matrix algebra.
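As a small illustration of these matrix manipulations, the following NumPy sketch computes a marginal and a conditional distribution; the two-state kernel and initial distribution are invented for the example, not taken from the article.

```python
import numpy as np

# Illustrative two-state transition matrix K and initial distribution mu (row vector).
K = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu = np.array([1.0, 0.0])          # the chain starts in state 0

n = 5
marginal_n = mu @ np.linalg.matrix_power(K, n - 1)   # P(X_n = j) = (mu K^{n-1})_j
m_step = np.linalg.matrix_power(K, 3)                # P(X_{n+3} = j | X_n = i) = (K^3)_{ij}
print(marginal_n)
print(m_step[0, 1])
```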
A Graphical Representation
It is common to represent a homogeneous, finite-state space Markov chain graphically. A single directed graph with labeled edges suffices to describe completely the transition matrix of such a Markov chain. Together with the distribution of the initial state, this completely characterizes the Markov chain. The vertices of the graph correspond to the states, and those edges that exist illustrate the moves that it is possible to make. It is usual to label the edges with the probability associated with the move that they represent, unless all possible moves are equally probable.

A simple example, which also shows that the matrix representation can be difficult to interpret, consists of the Markov chain obtained on the space {0, 1, ..., 9} in which the next state is obtained by taking the number rolled on an unbiased die and adding it, modulo 10, to the current state, unless a 6 is rolled when the state is 9, in which case the chain retains its current value. This has a straightforward,
but cumbersome, matrix representation, in which:

K = (1/6) ×
    [ 0 1 1 1 1 1 1 0 0 0 ]
    [ 0 0 1 1 1 1 1 1 0 0 ]
    [ 0 0 0 1 1 1 1 1 1 0 ]
    [ 0 0 0 0 1 1 1 1 1 1 ]
    [ 1 0 0 0 0 1 1 1 1 1 ]
    [ 1 1 0 0 0 0 1 1 1 1 ]
    [ 1 1 1 0 0 0 0 1 1 1 ]
    [ 1 1 1 1 0 0 0 0 1 1 ]
    [ 1 1 1 1 1 0 0 0 0 1 ]
    [ 1 1 1 1 1 0 0 0 0 1 ]
Figure 1 shows a graphical illustration of the same Markov transition kernel; transition probabilities are omitted in this case, as they are all equal. Although it may initially seem no simpler to interpret than the matrix, on closer inspection it becomes apparent that one can easily determine, without performing any calculations, from which states it is possible to reach any selected state, which states it is possible to reach from it, and between which states it is possible to move in a particular number of moves. These properties make this representation easy to interpret even in the case of Markov chains with large state spaces, for which the matrix representation rapidly becomes very difficult to manipulate. Note the loop in the graph showing the possibility of remaining in state 9; this is equivalent to the presence of a nonzero diagonal element in the transition matrix.
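The same kernel can also be built programmatically, which makes it easy to check the normalization property discussed earlier; this short Python sketch is illustrative:

```python
import numpy as np

# Transition matrix of the dice example above: the next state is (current + roll) mod 10,
# except that a roll of 6 in state 9 leaves the chain at 9.
K = np.zeros((10, 10))
for i in range(10):
    for roll in range(1, 7):
        j = i if (i == 9 and roll == 6) else (i + roll) % 10
        K[i, j] += 1.0 / 6.0

assert np.allclose(K.sum(axis=1), 1.0)   # each row is a probability distribution
print(K[9])                              # row 9 has the nonzero diagonal entry noted above
```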
MARKOV CHAINS ON GENERAL STATE SPACES
In general, more subtle measure theoretic constructions are required to define or study Markov chains on uncountable state spaces, such as the real numbers or the points in three-dimensional space. To deal with a fully general state space, a degree of measure theoretic probability beyond that which can be introduced in this article is required. Only Markov chains on some subset of d-dimensional Euclidean space, R^d, with distributions and transition kernels that admit a density (for definiteness, with respect to Lebesgue measure, that which attributes to any interval over that space a mass corresponding to its length) will be considered here. K_n(x, y) [or K(x, y) in the time homogeneous case] denotes a density with the property that

P(X_n ∈ A | X_{n-1} = x_{n-1}) = ∫_A K_n(x_{n-1}, y) dy

This approach has the great advantage that many concepts may be written for the discrete and continuous state space cases in precisely the same manner, with the understanding that the notation refers to probabilities in the discrete case and densities in the continuous setting. To generalize things, it is necessary to consider Lebesgue integrals with respect to the measures of interest, but essentially, one can replace equalities of densities with those of integrals over any measurable set, and the definitions and results presented below will continue to hold. For a rigorous and concise introduction to general state space Markov chains, see Ref. 2. For a much more detailed exposition, Ref. 3 is recommended highly.
STATIONARY DISTRIBUTIONS AND ERGODICITY
The ergodic hypothesis of statistical mechanics claims, loosely, that given a thermal system at equilibrium, the long-term average occurrence of any given system configuration corresponds precisely to the average over an infinite ensemble of identically prepared systems at the same temperature.

Figure 1. A graphical representation of a Markov chain.

One area of great interest in the analysis of Markov chains is that of establishing conditions under which a (mathematically refined form of) this assertion can be shown to be true: When are averages obtained by considering those states occupied by a Markov chain over a long period of its evolution close to those that would be obtained by calculating the average under some distribution associated with that chain? Throughout this section, integrals over the state space are used with the understanding that in the discrete case these integrals should be replaced by sums. This minimizes the amount of duplication required to deal with both discrete and continuous state spaces, while allowing the significant differences to be emphasized when they develop.
One of the most important properties of homogeneous Markov chains, particularly within the field of simulation, is that they can admit a stationary (or invariant) distribution. A transition kernel K is π-stationary if

∫ π(x) K(x, y) dx = π(y)

That is, given a sample X = x from π, the distribution of a random variable Y drawn from K(x, ·) is the same as that of X, although the two variables are, of course, not independent. In the discrete case, this becomes

Σ_i π(i) K(i, j) = π(j)

or, more succinctly, in the matrix representation, πK = π. The last of these reveals a convenient characterization of the stationary distributions, where they exist, of a transition kernel: They are the left eigenvectors (or eigenfunctions in the general state space case) of the transition kernel with an associated eigenvalue of 1. Viewing the transition kernel as an operator on the space of distributions, the same interpretation is valid in the general state space case.
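In the finite case, this characterization can be exploited numerically: a stationary distribution is a left eigenvector of K with eigenvalue 1. A minimal NumPy sketch follows; the two-state kernel is an invented example.

```python
import numpy as np

def stationary_distribution(K):
    """Return a stationary distribution of a transition matrix K as the left
    eigenvector of K with eigenvalue 1, normalized to sum to one."""
    w, v = np.linalg.eig(K.T)          # left eigenvectors of K = right eigenvectors of K.T
    idx = np.argmin(np.abs(w - 1.0))   # locate the eigenvalue closest to 1
    pi = np.real(v[:, idx])
    return pi / pi.sum()

K = np.array([[0.9, 0.1],
              [0.4, 0.6]])
print(stationary_distribution(K))      # for this K: [0.8, 0.2]
```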
It is often of interest to simulate evolutions of Markov chains with particular stationary distributions. Doing so is the basis of Markov chain Monte Carlo methods and is beyond the scope of this article. However, several theoretical concepts are required to determine when these distributions exist, when they are unique, and when their existence is enough to ensure that a long enough sample path will have similar statistical properties to a collection of independent, identically distributed random variables from the stationary distribution. The remainder of this section is dedicated to the introduction of such concepts and to the presentation of two results that are of great importance in this area.
One property useful in the construction of Markov chains with a particular invariant distribution is that of reversibility. A stochastic process is termed reversible if the statistical properties of its time reversal are the same as those of the process itself. To make this concept more formal, it is useful to cast things in terms of certain joint probability distributions. A stationary process is reversible if, for any n, m, the following equality holds for all measurable sets A_n, ..., A_{n+m}:

P(X_n ∈ A_n, ..., X_{n+m} ∈ A_{n+m}) = P(X_n ∈ A_{n+m}, ..., X_{n+m} ∈ A_n)

It is simple to verify that, in the context of a Markov chain, this is equivalent to the detailed balance condition:

P(X_n ∈ A_n, X_{n+1} ∈ A_{n+1}) = P(X_n ∈ A_{n+1}, X_{n+1} ∈ A_n)

A Markov chain with kernel K is said to satisfy detailed balance for a distribution π if

π(x) K(x, y) = π(y) K(y, x)

It is straightforward to verify that, if K is π-reversible, then π is a stationary distribution of K:

∫ π(x) K(x, y) dy = ∫ π(y) K(y, x) dy
π(x) = ∫ π(y) K(y, x) dy

This is particularly useful, as the detailed balance condition is straightforward to verify.
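In the discrete case, detailed balance can be checked directly by comparing the probability flows π(x)K(x, y) and π(y)K(y, x); the following small sketch is illustrative, and the kernel shown is an invented example.

```python
import numpy as np

def satisfies_detailed_balance(K, pi, tol=1e-12):
    """Check pi(x) K(x, y) == pi(y) K(y, x) for all pairs of states (discrete case)."""
    F = pi[:, None] * K            # flow matrix F[x, y] = pi(x) K(x, y)
    return np.allclose(F, F.T, atol=tol)

K = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])
print(satisfies_detailed_balance(K, pi))   # True: any stationary two-state chain is reversible
```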
Given a Markov chain with a particular stationary distribution, it is important to be able to determine whether, over a long enough period of time, the chain will explore all of the space that has a positive probability under that distribution. This leads to the concepts of accessibility, communication structure, and irreducibility.

In the discrete case, a state j is said to be accessible from another state i, written as i → j, if for some n, K^n(i, j) > 0. That is, a state that is accessible from some starting point is one that can be reached with positive probability in some number of steps. If i is accessible from j and j is accessible from i, then the two states are said to communicate, and this is written as i ↔ j. Given the Markov chain on the space E = {0, 1, 2} with transition matrix:

K = [ 1    0    0  ]
    [ 0   1/2  1/2 ]
    [ 0   1/2  1/2 ]

it is not difficult to verify that the uniform distribution μ = (1/3, 1/3, 1/3) is invariant under the action of K. However, if X_1 = 0, then X_n = 0 for all n: The chain will never reach either of the other states, whereas starting from X_1 ∈ {1, 2}, the chain will never reach 0. This chain is reducible: Some disjoint regions of the state space do not communicate. Furthermore, it has multiple stationary distributions; (1, 0, 0) and (0, 1/2, 1/2) are both invariant under the action of K. In the discrete setting, a chain is irreducible if all states communicate: Starting from any point in the state space, any other point may be reached with positive probability in some finite number of steps.
Although these concepts are adequate for dealing with discrete state spaces, a little more subtlety is required in more general settings: As ever, when dealing with probability on continuous spaces, the probability associated with individual states is generally zero, and it is necessary to consider integrals over finite regions. The property that is captured by irreducibility is that, wherever the chain starts from, it has a positive probability of reaching anywhere in the space. To generalize this to continuous state spaces, it suffices to reduce the strength of this statement very slightly: From most starting points, the chain has a positive probability of reaching any region of the space that itself has positive probability. To make this precise, a Markov chain with stationary distribution π is said to be π-irreducible if, for all x (except for those lying in a set of exceptional points that has probability 0 under π) and all sets A with the property that ∫_A π(x) dx > 0,

∃ n : ∫_A K^n(x, y) dy > 0

The terms strongly irreducible and strongly π-irreducible are sometimes used when the irreducibility or π-irreducibility condition, respectively, holds for n = 1. Notice that any irreducible Markov chain is π-irreducible with respect to any measure π.
These concepts allow us to determine whether a Markov chain has a joined-up state space: whether it is possible to move around the entire space (or at least that part of the space that has mass under π). However, they tell us nothing about when it is possible to reach these points. Consider the difference between the following two transition matrices on the space {0, 1}, for example,

K = [ 1/2  1/2 ]          L = [ 0  1 ]
    [ 1/2  1/2 ]              [ 1  0 ]

Both matrices admit π = (1/2, 1/2) as a stationary distribution, and both are irreducible. However, consider their respective marginal distributions after several iterations:

K^n = [ 1/2  1/2 ]    whereas    L^n = [ 0  1 ]  for n odd,    L^n = [ 1  0 ]  for n even.
      [ 1/2  1/2 ]                     [ 1  0 ]                      [ 0  1 ]
In other words, if μ = (μ_1, μ_2), then the Markov chain associated with K has distribution μK^n = (1/2, 1/2) after n iterations, whereas that associated with L has distribution (μ_2, μ_1) after any odd number of iterations and distribution (μ_1, μ_2) after any even number. L, then, never forgets its initial conditions, and it is periodic.
Although this example is contrived, it is clear that such periodic behavior is significant and that a precise characterization is needed. This is straightforward in the case of discrete state space Markov chains. The period of any state i in the space is defined, using gcd to refer to the greatest common divisor (i.e., the largest common factor), as

d = gcd{n : K^n(i, i) > 0}

Thus, in the case of L above, both states have period d = 2. In fact, it can be shown easily that any pair of states that communicate must have the same period. Thus, irreducible Markov chains have a single period: 2 in the case of L above, and 1 in the case of K. Irreducible Markov chains may be said to have a period themselves, and when this period is 1, they are termed aperiodic.
Again, more subtlety is required in the general case. It is clear that something is needed to fill the role that individual states play in the discrete state space case and that individual states are not appropriate in the continuous case. A set of events that is small enough to be, in some sense, homogeneous and large enough to have positive probability under the stationary distribution is required. A set C is termed small if some integer n, some probability distribution ν, and some ε > 0 exist such that the following condition holds:

inf_{x ∈ C} K^n(x, y) ≥ ε ν(y)

This condition tells us that, for any point in C, with probability ε the distribution of the next state the chain enters is independent of where in C it is. In that sense, C is small, and these sets are precisely what is necessary to extend much of the theory of Markov chains from the discrete state space case to a more general setting. In particular, it is now possible to extend the notion of period from the discrete state space setting to a more general one. Note that in the case of irreducible aperiodic Markov chains on a discrete state space, the entire state space is small.
A Markov chain has a cycle of length d if a small set C exists such that the greatest common divisor of the lengths of paths from C to a measurable set of positive probability B is d. If the largest cycle possessed by a Markov chain has length 1, then that chain is aperiodic. In the case of π-irreducible chains, every state has a common period (except a set of events of probability 0 under π), and the above definition is equivalent to the more intuitive (but more difficult to verify) condition in which a partition of the state space E into d disjoint subsets E_1, ..., E_d exists with the property that P(X_{n+1} ∉ E_j | X_n ∈ E_i) = 0 if j = i + 1 mod d.
Thus far, concepts that allow us to characterize those
Markov chains that can reach every important part of the
space and that exhibit no periodic structure have been
introduced. Nothing has been said about how often a given
region of the space might be visited. This point is particu-
larly important: A qualitative difference exists between
chains that have a positive probability of returning to a
set innitely often and those that can only visit it nitely
many times. Let h
A
denote the number of times that a set A
is visitedby aMarkov chain; that is, h
A
= [X
n
A : nN[.
A p-irreducible Markov chain is recurrent if E[h
A
[ = for
everyAwithpositive probabilityunder p. Thus, arecurrent
Markovchainis one withpositive probabilityof visitingany
signicant (with respect to p) part of the state space in-
nitely often: It does not always escape to innity. Aslightly
stronger condition is termed Harris recurrence; it requires
that every signicant state is visitedinnitely often(rather
than this event having positive probability); i.e., P(h
A
=
) = 1 for every set A for which
R
A
p(x)dx >0. A Markov
chain that is not recurrent is termed transient.
The following example illustrates the problems that can originate if a Markov chain is π-recurrent but not Harris recurrent. Consider the Markov chain over the positive integers with the transition kernel defined by

K(x, y) = x^(-2) δ_1(y) + (1 − x^(-2)) δ_{x+1}(y)

where, for any state x, δ_x denotes the probability distribution that places all of its mass at x. This kernel is clearly δ_1-recurrent: if the chain is started from 1, it stays there deterministically. However, as the sum

Σ_{k=2}^∞ 1/k² < ∞,

the Borel-Cantelli lemma ensures that whenever the chain is started from any x greater than 1, a positive probability exists that the chain will never visit state 1: the chain is π-recurrent, but it is not Harris recurrent. Although this example is somewhat contrived, it illustrates an important phenomenon, and one that often cannot be detected easily in more sophisticated situations. It has been suggested that Harris recurrence can be interpreted as a guarantee that no such pathological system trajectories exist: no parts of the space exist from which the chain will escape to infinity rather than returning to the support of the stationary distribution.
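The escape-to-infinity behavior is easy to see by simulation. The following minimal Python sketch (the function name is ours) runs the kernel above from the starting state 2 and estimates the probability of ever returning to state 1, which equals 1 − ∏_{k≥2}(1 − k^(-2)) = 1/2:

```python
import random

def hits_one(start, n_steps=1000):
    """Simulate the kernel K(x, 1) = x**-2, K(x, x+1) = 1 - x**-2."""
    x = start
    for _ in range(n_steps):
        if random.random() < x ** -2:
            return True          # the chain reached state 1
        x += 1                   # otherwise it moves one step further out
    return False                 # (in effect) escapes to infinity

trials = 10_000
freq = sum(hits_one(2) for _ in range(trials)) / trials
print(freq)   # approximately 0.5: about half of the paths never return to 1
```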
It is common to refer to a π-irreducible, aperiodic, recurrent Markov chain as being ergodic and to an ergodic Markov chain that is also Harris recurrent as being Harris ergodic. These properties suffice to ensure that the Markov chain will, on average, visit every part of the state space in proportion to its probability under π, that it exhibits no periodic behavior in doing so, and that it might (or will, in the Harris case) visit regions of the state space with positive probability infinitely often. Actually, ergodicity tells us that a Markov chain eventually forgets its initial conditions: after a sufficiently long time has elapsed, the current state provides arbitrarily little information about the initial state. Many stronger forms of ergodicity provide information about the rate at which the initial conditions are forgotten; these are covered in great detail by Meyn and Tweedie (3). Intuitively, if a sequence of random variables forgets where it has been but has some stationary distribution, then one would expect the distribution of sufficiently widely separated samples to approximate that of independent samples from that stationary distribution. This intuition can be made rigorous and is strong enough to tell us a lot about the distribution of large samples obtained by iterative application of the Markov kernel and the sense in which approximations of integrals obtained by using the empirical average of samples taken from the chain might converge to their integral under the stationary measure. This section is concluded with two of the most important results in the theory of Markov chains.
The ergodic theorem provides an analog of the law of large numbers for independent random variables: it tells us that, under suitable regularity conditions, the averages obtained from the sample path of a Markov chain will converge to the expectation under the stationary distribution of the transition kernel. This mathematically refined, rigorously proved form of the ergodic hypothesis was alluded to at the start of this section. Many variants of this theorem are available; one particularly simple form is the following: if X_n is a Harris ergodic Markov chain of invariant distribution π, then the following strong law of large numbers holds for any π-integrable function f : E → R (convergence is with probability one):

lim_{n→∞} (1/n) Σ_{i=1}^n f(X_i) = ∫ f(x)π(x) dx

This is a particular case of Ref. 5 (p. 241, Theorem 6.63), and a proof of the general theorem is given there. The same theorem is also presented with proof in Ref. 3 (p. 433, Theorem 17.3.2).
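As an illustration of the ergodic theorem, the following Python sketch (a toy example of ours, not taken from the article) estimates a stationary expectation from a single sample path of a small two-state chain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state chain on {0, 1} with P(0 -> 1) = 0.3 and P(1 -> 0) = 0.1;
# its stationary distribution is pi = (0.25, 0.75).
P = np.array([[0.7, 0.3],
              [0.1, 0.9]])

x, n, visits_to_1 = 0, 200_000, 0
for _ in range(n):
    x = rng.choice(2, p=P[x])
    visits_to_1 += x

# Ergodic average of f(x) = x; it converges to E_pi[f] = 0.75.
print(visits_to_1 / n)
```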
A central limit theorem also exists and tells us something about the rate of convergence of averages under the sample path of the Markov chain. Under technical regularity conditions (see Ref. 4 for a summary of various combinations of conditions), it is possible to obtain a central limit theorem for the ergodic averages of a Harris recurrent, π-invariant Markov chain and a function f : E → R that has at least two finite moments (E[|f|] < ∞ and E[|f|²] < ∞):

√n [ (1/n) Σ_{i=1}^n f(X_i) − ∫ f(x)π(x) dx ] →_d N(0, σ²(f))

σ²(f) = E[(f(X_1) − f̄)²] + 2 Σ_{k=2}^∞ E[(f(X_1) − f̄)(f(X_k) − f̄)]

where f̄ = ∫ f(x)π(x) dx and →_d denotes convergence in distribution.
min Σ_{j=1}^n f_j y_j + Σ_{i=1}^m Σ_{j=1}^n c_ij x_ij   (1)

Σ_{j=1}^n x_ij = a_i   for i = 1, ..., m   (2)

Σ_{i=1}^m x_ij ≤ b_j y_j   for j = 1, ..., n   (3)

x ∈ R_+^{mn}, y ∈ {0, 1}^n   (4)
where the objective function in Equation (1) includes terms for the fixed cost of opening the depots and for the variable shipping costs, the constraints in Equation (2) ensure that the demand of each client is satisfied, the constraints in Equation (3) ensure that, if depot j is closed, nothing is shipped from that depot and otherwise at most b_j is shipped, and finally Equation (4) indicates the variable types and bounds.
A Multi-Item Production Planning Problem
Given m items to be produced over a time horizon of n periods with known demands d_t^i for 1 ≤ i ≤ m, 1 ≤ t ≤ n, all the items have to be processed on a single machine. The capacity of the machine in period t is L_t, and the production costs are as follows: a fixed set-up cost q_t^i is incurred if item i is produced in period t, as well as a per unit variable cost p_t^i; a storage cost h_t^i is incurred for each unit of item i in stock at the end of period t. The problem is to satisfy all the demands and minimize the total cost.

Introducing the variables

x_t^i, the amount of item i produced in period t,
s_t^i, the amount of item i in stock at the end of period t,
y_t^i = 1 if there is production of item i in period t, and y_t^i = 0 otherwise,
a possible MIP formulation of the problem is:
min Σ_{i=1}^m Σ_{t=1}^n ( p_t^i x_t^i + q_t^i y_t^i + h_t^i s_t^i )

s_{t-1}^i + x_t^i = d_t^i + s_t^i   for i = 1, ..., m; t = 1, ..., n   (5)

x_t^i ≤ L_t y_t^i   for i = 1, ..., m; t = 1, ..., n   (6)

Σ_{i=1}^m x_t^i ≤ L_t   for t = 1, ..., n   (7)

s, x ∈ R_+^{mn}, y ∈ {0, 1}^{mn}
where the constraints in Equation (5) impose conservation of product from one period to the next (namely, end-stock of item i in t-1 + production of i in t = demand for i in t + end-stock of i in t), the constraints in Equation (6) force the set-up variable y_t^i to 1 if there is production (x_t^i > 0), and Equation (7) ensures that the total amount produced in period t does not exceed the capacity.
Traveling Salesman Problem with Time Windows
Suppose that a truck (or salesperson) must leave the
depot, visit a set of n clients, and then return to the depot.
The travel times between clients (including the depot, node 0) are given in an (n+1) × (n+1) matrix c. Each client i has a time window [a_i, b_i] during which the truck must make its delivery. The delivery time is assumed to be negligible. The goal is to complete the tour and return to the depot as soon as possible while satisfying the time window constraints of each client. Two possible sets of variables are

y_ij = 1 if the truck travels directly from i to j, for i, j ∈ {0, 1, ..., n},
t_j, the time of delivery to client j, for j ∈ {0, ..., n},
t, the time of return to the depot.
The problem now can be formulated as the following MIP:

min t − t_0   (8)

Σ_{j=0}^n y_ij = 1,   i ∈ {0, ..., n}   (9)

Σ_{i=0}^n y_ij = 1,   j ∈ {0, ..., n}   (10)

Σ_{i ∈ S, j ∉ S} y_ij ≥ 1,   ∅ ≠ S ⊆ {1, ..., n}   (11)

t_j ≥ t_i + c_ij y_ij − M(1 − y_ij),   i ∈ {0, ..., n}, j ∈ {1, ..., n}   (12)

t ≥ t_i + c_i0 y_i0 − M(1 − y_i0),   i ∈ {1, ..., n}   (13)

a_i ≤ t_i ≤ b_i,   i ∈ {1, ..., n}

t ∈ R_+, (t_0, ..., t_n) ∈ R_+^{n+1}, y ∈ {0, 1}^{(n+1)(n+2)/2}   (14)

where M is a large value exceeding the total travel time, the objective function in Equation (8) measures the difference between the times of leaving and returning to the depot, Equations (9) and (10) ensure that the truck leaves/arrives exactly once at each site, and Equation (11) ensures that the tour of the truck is connected. The constraints in Equations (12) and (13) ensure that if the truck goes from i to j, the arrival time in j is at least the arrival time in i plus the travel time from i to j. Finally, Equation (14) ensures that the time window constraints are satisfied.
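Given a candidate tour, the time-window logic behind the constraints in Equations (12)-(14) can be evaluated directly. The following Python sketch (ours; the data layout is an assumption made for illustration) checks one tour against the travel times and windows, allowing the truck to wait when it arrives at a client before a_j:

```python
def tour_completion_time(tour, c, a, b):
    """Evaluate a tour (list of client indices, depot 0 excluded) against travel
    times c[i][j] and time windows [a[j], b[j]]; returns the time of return to the
    depot, or None if some window is violated.  This only checks feasibility of a
    given tour; it does not optimize."""
    t, prev = 0.0, 0
    for j in tour:
        t = max(t + c[prev][j], a[j])   # wait at client j if arriving early
        if t > b[j]:
            return None                 # time window of client j violated
        prev = j
    return t + c[prev][0]               # time of return to the depot
```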
BASIC PROPERTIES OF MIPS
Given the MIP

z = min{ cx + hy : Ax + Gy ≥ b, x ∈ R_+^p, y ∈ Z_+^n },

the set X_MIP = { (x, y) ∈ R_+^p × Z_+^n : Ax + Gy ≥ b } is known as the feasible region of the MIP. When A, G, b are rational matrices, then

(i) conv(X_MIP) is a polyhedron, namely conv(X_MIP) = { (x, y) ∈ R_+^{p+n} : A'x + G'y ≥ b' } for some A', G', b', and

(ii) the linear program min{ cx + hy : (x, y) ∈ conv(X_MIP) } solves the MIP. In Fig. 1 one sees that an optimal vertex of conv(X_MIP) lies in X_MIP.
The last observation suggests that it is easy to solve an MIP: namely, it suffices to find the convex hull of the set of feasible solutions and then solve a linear program. Unfortunately, it is rarely this simple. Finding conv(X_MIP) is difficult, and usually an enormous number of inequalities are needed to describe the resulting polyhedron.
Thus, one typically must be less ambitious and examine

(i) whether certain simple classes of MIPs exist for which one can find an exact description of conv(X_MIP), and

(ii) whether one can find a good approximation of conv(X_MIP) by linear inequalities in a reasonable amount of time.

Below, in looking at ways to find such a description of X_MIP and in using it in solving an MIP, we often will mix two distinct viewpoints:

(i) Find sets X_1, ..., X_k such that X_MIP = ∪_{i=1}^k X_i, where optimizing over X_i is easier than optimizing over X_MIP for i = 1, ..., k, and where possibly good descriptions of the sets conv(X_i) are known. This decomposition forms the basis of the branch-and-bound approach explained in the next section.

(ii) Find sets X_1, ..., X_k such that X_MIP = ∩_{i=1}^k X_i, where a good or exact description of conv(X_i) is known for i = 1, ..., k. Then, a potentially effective approximation to conv(X_MIP) is given by the set ∩_{i=1}^k conv(X_i). This decomposition forms the basis of the preprocessing and cut generation steps used in the branch-and-cut approach.
THE BRANCH-AND-CUT ALGORITHM FOR MIPs
Below we will examine the main steps contributing to a branch-and-cut algorithm. The first step is the underlying branch-and-bound algorithm. This algorithm then can be improved by (1) a priori reformulation, (2) preprocessing, (3) heuristics to obtain good feasible solutions quickly, and finally (4) cutting planes or dynamic reformulation, in which case we talk of a branch-and-cut algorithm.
Branch-and-Bound
First, we discuss the general ideas, and then we discuss how they typically are implemented. Suppose the MIP to be solved is z = min{ cx + hy : (x, y) ∈ X_MIP } with

X_MIP = ∪_{i=1}^k X_i

where X_i = { (x, y) ∈ R_+^p × Z_+^n : A^i x + G^i y ≥ b^i } for i = 1, ..., k. In addition, suppose that we know the value of each linear program

z_LP^i = min{ cx + hy : A^i x + G^i y ≥ b^i, x ∈ R_+^p, y ∈ R_+^n }
A set X^t taken from the list L of active subproblems can be pruned when its linear program is infeasible, when its optimal solution (x^t, y^t) is integer feasible (in which case the incumbent value z̄ can be updated), or when its value z_LP^t is no better than z̄. Otherwise (z̄ > z_LP^t and y^t is fractional), we have not succeeded in finding the optimal solution in X^t, so we branch (i.e., break the set X^t into two or more pieces). As the linear programming solution was not integral, some variable y_j takes a fractional value y_j^t. The simplest and most common branching rule is to replace X^t by two new sets

X^t_− = X^t ∩ { (x, y) : y_j ≤ ⌊y_j^t⌋ },   X^t_+ = X^t ∩ { (x, y) : y_j ≥ ⌈y_j^t⌉ }

whose union is X^t. The two new sets are added to the list L, and the algorithm continues.
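The branching rule just described is easy to see on a tiny special case. The following self-contained Python sketch (ours, for illustration only) applies branch-and-bound to a 0-1 knapsack problem, using the greedy fractional solution as the LP-relaxation bound and branching on the single fractional variable:

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Minimal branch-and-bound for max sum(values[i]*y[i]) s.t. sum(weights[i]*y[i]) <= capacity."""
    order = sorted(range(len(values)), key=lambda i: values[i] / weights[i],
                   reverse=True)

    def lp_bound(fixed):
        """Greedy fractional bound under a partial fixing; also returns the index
        of the fractional variable (None if the relaxation is already integral)."""
        cap = capacity - sum(weights[i] for i, v in fixed.items() if v == 1)
        if cap < 0:
            return float("-inf"), None          # infeasible node
        val = sum(values[i] for i, v in fixed.items() if v == 1)
        for i in order:
            if i in fixed:
                continue
            if weights[i] <= cap:
                cap -= weights[i]
                val += values[i]
            else:
                return val + values[i] * cap / weights[i], i
        return val, None

    incumbent, stack = 0, [{}]                  # the list L of subproblems X_t
    while stack:
        fixed = stack.pop()
        bound, frac = lp_bound(fixed)
        if bound <= incumbent:
            continue                            # prune by bound or infeasibility
        if frac is None:
            incumbent = bound                   # integral relaxation: new incumbent
        else:
            stack.append({**fixed, frac: 0})    # branch: y_frac fixed to 0
            stack.append({**fixed, frac: 1})    # branch: y_frac fixed to 1
    return incumbent

print(knapsack_branch_and_bound([10, 13, 7], [3, 4, 2], 6))   # prints 20
```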
Obvious questions that are important in practice are the choice of branching variable and the order of selection/removal of the sets from the list L. Good choices of branching variable can reduce significantly the size of the enumeration tree. Pseudo-costs or approximate dual variables are used to estimate the costs of different variable choices. Strong branching is very effective: this involves selecting a subset of the potential variables and temporarily branching and carrying out a considerable number of dual pivots with each of them to decide which is the most significant variable on which to finally branch. The order in which nodes/subproblems are removed from the list L is a compromise between different goals. At certain moments one may use a depth-first strategy to descend rapidly in the tree so as to find feasible solutions quickly; however, at other moments one may choose the node with the best bound so as not to waste time on nodes that will be cut off anyway once a better feasible solution is found.
The complexity or running time of the branch-and-bound algorithm obviously depends on the number of subsets X^t that have to be examined. In the worst case, one might need to examine 2^n such sets just for a 0-1 MIP. Therefore, it is crucial to improve the formulation so that the value of the linear programming relaxation gives better bounds and more nodes are pruned, and/or to find a good feasible solution as quickly as possible.
Ways to improve the formulation, including preprocessing, cutting planes, and a priori reformulation, are discussed in the next three subsections. The first two typically are carried out as part of the algorithm, whereas the MIP formulation given to the algorithm is the responsibility of the user.

Preprocessing

Preprocessing can be carried out on the initial MIP, as well as on each problem X^t taken from the list L. The idea is to improve the formulation of the selected set X^t. This action typically involves reducing the number of constraints and variables so that the linear programming relaxation is solved much faster, and tightening the bounds so as to increase the value of the lower bound z_LP^t, thereby increasing the chances of pruning X^t or its descendants.

A variable can be eliminated if its value is fixed. Also, if a variable is unrestricted in value, it can be eliminated by substitution. A constraint can be eliminated if it is shown to be redundant. Also, if a constraint only involves one variable, then it can be replaced by a simple bound constraint. These observations and similar, slightly less trivial observations often allow really dramatic decreases in the size of the formulations.
Now we give four simple examples of the bound tightening and other calculations that are carried out very rapidly in preprocessing (a small sketch of items (i) and (ii) in code follows this list):

(i) (Linear Programming) Suppose that one constraint of X^t is

Σ_{j ∈ N_1} a_j x_j − Σ_{j ∈ N_2} a_j x_j ≥ b,   a_j > 0 for all j ∈ N_1 ∪ N_2,

and the variables have bounds l_j ≤ x_j ≤ u_j.

If Σ_{j ∈ N_1} a_j u_j − Σ_{j ∈ N_2} a_j l_j < b, then the MIP is infeasible.

If Σ_{j ∈ N_1} a_j l_j − Σ_{j ∈ N_2} a_j u_j ≥ b, then the constraint is redundant and can be dropped.

For a variable t ∈ N_1, we have a_t x_t ≥ b + Σ_{j ∈ N_2} a_j x_j − Σ_{j ∈ N_1∖{t}} a_j x_j ≥ b + Σ_{j ∈ N_2} a_j l_j − Σ_{j ∈ N_1∖{t}} a_j u_j. Thus, we have the possibly improved bound on x_t:

x_t ≥ max[ l_t, ( b + Σ_{j ∈ N_2} a_j l_j − Σ_{j ∈ N_1∖{t}} a_j u_j ) / a_t ]

One also can possibly improve the upper bounds on x_j for j ∈ N_2 in a similar fashion.
(ii) (Integer Rounding) Suppose that the bounds on an integer variable, l_j ≤ y_j ≤ u_j, just have been updated by preprocessing. If l_j, u_j ∉ Z, then these bounds can be tightened immediately to

⌈l_j⌉ ≤ y_j ≤ ⌊u_j⌋
(iii) (0-1 Logical Deductions) Suppose that one of the constraints can be put in the form Σ_{j ∈ N} a_j y_j ≤ b, y_j ∈ {0, 1} for j ∈ N, with a_j > 0 for j ∈ N.

If b < 0, then the MIP is infeasible.

If a_j > b ≥ 0, then one has y_j = 0 for all points of X_MIP.

If a_j + a_k > b ≥ max{a_j, a_k}, then one obtains the simple valid inequality y_j + y_k ≤ 1 for all points of X_MIP.
(iv) (Reduced Cost Fixing) Given an incumbent value z̄ from the best feasible solution, and a representation of the objective function in the form z_LP^t + Σ_j c̄_j x_j + Σ_j c̃_j y_j with c̄_j ≥ 0 and c̃_j ≥ 0 obtained by linear programming, any better feasible solution in X^t must satisfy

Σ_j c̄_j x_j + Σ_j c̃_j y_j < z̄ − z_LP^t

Thus, any such solution satisfies the bounds x_j ≤ (z̄ − z_LP^t)/c̄_j and y_j ≤ ⌊(z̄ − z_LP^t)/c̃_j⌋. (Note that reductions such as the one in item (iv), which take into account the objective function, actually modify the feasible region X_MIP.)
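The following short Python sketch (ours; the data structures are assumptions made for illustration) implements the bound-tightening tests of item (i) and the integer rounding of item (ii) for a single constraint Σ_{j ∈ N_1} a_j x_j − Σ_{j ∈ N_2} a_j x_j ≥ b:

```python
import math

def tighten_bounds(a_plus, a_minus, b, l, u, integer_vars=()):
    """a_plus / a_minus: dicts j -> a_j for j in N1 / N2 (all a_j > 0);
    l, u: dicts of bounds l[j] <= x[j] <= u[j] (l is updated in place)."""
    hi = sum(a * u[j] for j, a in a_plus.items()) - sum(a * l[j] for j, a in a_minus.items())
    lo = sum(a * l[j] for j, a in a_plus.items()) - sum(a * u[j] for j, a in a_minus.items())
    if hi < b:
        return "infeasible"                    # even the largest LHS value is below b
    if lo >= b:
        return "redundant"                     # constraint can be dropped
    for t, a_t in a_plus.items():              # try to raise the lower bound of each N1 variable
        rest = (sum(a * u[j] for j, a in a_plus.items() if j != t)
                - sum(a * l[j] for j, a in a_minus.items()))
        l[t] = max(l[t], (b - rest) / a_t)
        if t in integer_vars:                  # item (ii): round the new bound up
            l[t] = math.ceil(l[t])
    return "tightened"
```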
Valid Inequalities and Cutting Planes
Definition 5. An inequality Σ_{j=1}^p π_j x_j + Σ_{j=1}^n μ_j y_j ≥ π_0 is a valid inequality (VI) for X_MIP if it is satisfied by every point of X_MIP.
The inequalities added in preprocessing typically are very simple. Here, we consider all possible valid inequalities, but because infinitely many of them exist, we restrict our attention to the potentially interesting inequalities. In Figure 1, one sees that only a finite number of inequalities (known as facet-defining inequalities) are needed to describe conv(X_MIP). Ideally, we would select a facet-defining inequality among those cutting off the present linear programming solution (x*, y*). Formally, we need to solve the

Separation Problem: Given X_MIP and a point (x*, y*) ∈ R_+^p × R_+^n, find a valid inequality for X_MIP that cuts off (x*, y*), or show that no such inequality exists.

Several basic families of valid inequalities follow.

(i) (Basic Mixed Integer Rounding) Consider the simple set X = { (x, y) ∈ R_+^1 × Z^1 : y ≤ b + x }. It can be shown that every point of X satisfies the valid inequality

y ≤ ⌊b⌋ + x/(1 − f)

where f = b − ⌊b⌋ is the fractional part of b.
(ii) (Mixed Integer Rounding) Consider an arbitrary row or combination of rows of X_MIP of the form

Σ_{j ∈ P} a_j x_j + Σ_{j ∈ N} g_j y_j ≤ b,   x ∈ R_+^p, y ∈ Z_+^n

The corresponding mixed integer rounding (MIR) inequality is

Σ_{j ∈ P: a_j < 0} [a_j/(1 − f)] x_j + Σ_{j ∈ N} ( ⌊g_j⌋ + (f_j − f)^+/(1 − f) ) y_j ≤ ⌊b⌋

where f = b − ⌊b⌋ and f_j = g_j − ⌊g_j⌋. The Separation Problem for the complete family of MIR inequalities derived from all possible combinations of the initial constraints, namely all single row sets of the form

uAx + uGy ≤ ub,   x ∈ R_+^p, y ∈ Z_+^n,   u ≥ 0,

is NP-hard.
(iii) (Gomory Mixed Integer Cut) Take a row of the optimal basis representation of the linear programming relaxation in which the basic integer variable y_u is fractional:

Σ_{j ∈ P_u} a_j x_j + Σ_{j ∈ N_u} g_j y_j + y_u = a_0

with y*_u = a_0 ∉ Z^1, x*_j = 0 for j ∈ P_u, and y*_j = 0 for j ∈ N_u. Applying the MIR inequality and then eliminating the variable y_u by substitution, we obtain the valid inequality

Σ_{j ∈ P_u: a_j > 0} a_j x_j − Σ_{j ∈ P_u: a_j < 0} [f_0/(1 − f_0)] a_j x_j + Σ_{j ∈ N_u: f_j ≤ f_0} f_j y_j + Σ_{j ∈ N_u: f_j > f_0} [f_0(1 − f_j)/(1 − f_0)] y_j ≥ f_0

called the Gomory mixed integer cut, where f_j = g_j − ⌊g_j⌋ for j ∈ N_u and f_0 = a_0 − ⌊a_0⌋. This inequality cuts off the LP solution (x*, y*). Here, the Separation Problem is trivially solved by inspection, and finding a cut is guaranteed. However, in practice the cuts may be judged ineffective for a variety of reasons, such as very small violations or very dense constraints that slow down the solution of the LPs.
(iv) (0-1 MIPs and Cover Inequalities) Consider a 0-1 MIP with a constraint of the form

Σ_{j ∈ N} g_j y_j ≤ b + x,   x ∈ R_+^1, y ∈ {0, 1}^{|N|}

with g_j > 0 for j ∈ N. A set C ⊆ N is a cover if Σ_{j ∈ C} g_j = b + λ with λ > 0. The MIP cover inequality is

Σ_{j ∈ C} y_j ≤ |C| − 1 + x/λ
Using appropriate multiples of the constraints y_j ≥ 0 and y_j ≤ 1, the cover inequality can be obtained as a weakening of an MIR inequality. When x = 0, the Separation Problem for such cover inequalities can be shown to be an NP-hard, single row 0-1 integer program.
(v) (Lot Sizing) Consider the single item uncapacitated lot-sizing set

X_LSU = { (x, s, y) ∈ R_+^n × R_+^n × {0, 1}^n : s_{t-1} + x_t = d_t + s_t, 1 ≤ t ≤ n; x_t ≤ (Σ_{u=t}^n d_u) y_t, 1 ≤ t ≤ n }

Note that the feasible region X_PP of the production planning problem formulated in the Multi-Item Production Planning Problem section can be written as X_PP = ∩_{i=1}^m X_LSU^i ∩ Y, where the set Y ⊆ R_+^{mn} × R_+^{mn} × {0, 1}^{mn} contains the joint machine constraints in Equation (7) and the possibly tighter bound constraints in Equation (6).

Select an interval {k, k+1, ..., l} with 1 ≤ k ≤ l ≤ n and some subset T ⊆ {k, ..., l}. Note that if k ≤ u ≤ l and there is no production in any period in {k, ..., u}∖T (i.e., Σ_{j ∈ {k,...,u}∖T} y_j = 0), then the demand d_u in period u must either be part of the stock s_{k-1} or be produced in some period in T ∩ {k, ..., u}. This establishes the validity of the inequality

s_{k-1} + Σ_{j ∈ T} x_j ≥ Σ_{u=k}^{l} d_u ( 1 − Σ_{j ∈ {k,...,u}∖T} y_j )   (15)

Taking l as above, L = {1, ..., l}, and S = {1, ..., k-1} ∪ T, the above inequality can be rewritten as a so-called (l, S)-inequality:

Σ_{j ∈ S} x_j + Σ_{j ∈ L∖S} ( Σ_{u=j}^{l} d_u ) y_j ≥ Σ_{u=1}^{l} d_u

This family of inequalities suffices to describe conv(X_LSU). Now, given a point (x*, s*, y*), the Separation Problem for the (l, S)-inequalities is solved easily by checking whether

Σ_{j=1}^{l} min[ x*_j, ( Σ_{u=j}^{l} d_u ) y*_j ] < Σ_{u=1}^{l} d_u

for some l (a small sketch of this check in code follows this list). If it does not, the point lies in conv(X_LSU); otherwise, a violated (l, S)-inequality is found by taking S = { j ∈ {1, ..., l} : x*_j < (Σ_{u=j}^{l} d_u) y*_j }.
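The (l, S) separation test just described translates almost literally into code. The following Python sketch (ours; the list-based data layout is an assumption) returns a violated (l, S)-inequality if one exists:

```python
def separate_l_S(x_star, y_star, d, eps=1e-9):
    """d = [d_1, ..., d_n]; x_star, y_star are the LP values.  For each l, compare
    sum_j min(x*_j, d_{jl} y*_j) with d_1 + ... + d_l; on a violation return (l, S)
    with S = {j <= l : x*_j < d_{jl} y*_j}, otherwise return None."""
    n = len(d)
    for l in range(1, n + 1):
        tail = [sum(d[j - 1:l]) for j in range(1, l + 1)]   # d_{jl} = d_j + ... + d_l
        lhs = sum(min(x_star[j - 1], tail[j - 1] * y_star[j - 1])
                  for j in range(1, l + 1))
        if lhs < sum(d[:l]) - eps:
            S = [j for j in range(1, l + 1)
                 if x_star[j - 1] < tail[j - 1] * y_star[j - 1]]
            return l, S
    return None   # the point satisfies all (l, S)-inequalities
```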
A Priori Modeling or Reformulation
Below we present four examples of modeling or a priori reformulation in which we add either a small number of constraints or new variables and constraints, called extended formulations, with the goal of obtaining tighter linear programming relaxations and, thus, much more effective solution of the corresponding MIPs.

(i) (Capacitated Facility Location: Adding a Redundant Constraint) The constraints in Equations (2) through (4) obviously imply the validity of the constraint

Σ_{j=1}^n b_j y_j ≥ Σ_{i=1}^m a_i

which states that the capacity of the open depots must be at least equal to the sum of all the demands of the clients. As y ∈ {0, 1}^n, the resulting set is a 0-1 knapsack problem for which cutting planes are derived readily.
(ii) (Lot Sizing: Adding a Few Valid Inequalities) Consider again the single item, uncapacitated lot-sizing set X_LSU. In item (v) of the Valid Inequalities and Cutting Planes section, we described the inequalities that give the convex hull. In practice, the most effective inequalities are those that cover only a few periods. Thus, a simple a priori strengthening is given by adding the inequalities in Equation (15) with T = ∅ and l ≤ k + κ:

s_{k-1} ≥ Σ_{u=k}^{l} d_u ( 1 − Σ_{j=k}^{u} y_j )

for some small value of κ.
(iii) (Lot Sizing: An Extended Formulation) Consider again the single item, uncapacitated lot-sizing set X_LSU. Define the new variables z_ut with u ≤ t as the amount of demand for period t produced in period u. Now one obtains the extended formulation

Σ_{u=1}^{t} z_ut = d_t,   1 ≤ t ≤ n
z_ut ≤ d_t y_u,   1 ≤ u ≤ t ≤ n
x_u = Σ_{t=u}^{n} z_ut,   1 ≤ u ≤ n
s_{t-1} + x_t = d_t + s_t,   1 ≤ t ≤ n
x, s ∈ R_+^n, z ∈ R_+^{n(n+1)/2}, y ∈ [0, 1]^n

whose (x, s, y) solutions are just the points of conv(X_LSU). Thus, the linear program over this set solves the lot-sizing problem, whereas the original description of X_LSU provides a much weaker formulation.
(iv) (Modeling Disjunctive or "Or" Constraints: An Extended Formulation) Numerous problems involve disjunctions. For instance, given two jobs i and j to be processed on a machine with processing times p_i and p_j, respectively, suppose that either job i must be completed before job j or vice versa. If t_i, t_j are variables representing the start times, we have the constraint EITHER job i precedes job j OR job j precedes job i, which can be written more formally as

t_i + p_i ≤ t_j   or   t_j + p_j ≤ t_i
More generally, one often encounters the situation where one must select a point (a solution) from one of k sets or polyhedra (a polyhedron is a set described by a finite number of linear inequalities):

x ∈ ∪_{i=1}^k P_i,   where P_i = { x : A^i x ≥ b^i } ⊆ R^n

When each of the sets P_i is nonempty and bounded, the set ∪_{i=1}^k P_i can be formulated as the MIP:

x = Σ_{i=1}^k z^i   (16)

A^i z^i ≥ b^i y_i   for i = 1, ..., k   (17)

Σ_{i=1}^k y_i = 1   (18)

x ∈ R^n, z ∈ R^{nk}, y ∈ {0, 1}^k   (19)
where y_i = 1 indicates that the point lies in P_i. Given a solution with y_i = 1, the constraint in Equation (18) then forces y_j = 0 for j ≠ i, and the constraint in Equation (17) then forces z^i ∈ P_i and z^j = 0 for j ≠ i. Finally, Equation (16) shows that x ∈ P_i if and only if y_i = 1, as required, and it follows that the MIP models ∪_{i=1}^k P_i. What is more, it has been shown (6) that the linear programming relaxation of this set describes conv(∪_{i=1}^k P_i), so this again is an interesting extended formulation.

Extended formulations can be very effective in giving better bounds, and they have the important advantage that they can be added a priori to the MIP problem, avoiding the need to solve a separation problem whenever one wishes to generate cutting planes involving just the original (x, y) variables. The potential disadvantage is that the problem size can increase significantly and, thus, the time to solve the linear programming relaxations also may increase.
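The convex-hull property of this extended formulation can be checked numerically on a tiny instance. The following Python sketch (ours, assuming scipy is available) takes the two one-dimensional "polyhedra" P_1 = [0, 1] and P_2 = [3, 4] and verifies that the LP relaxation of Equations (16)-(19) projects onto x exactly as conv(P_1 ∪ P_2) = [0, 4]:

```python
from scipy.optimize import linprog

# Variables ordered as (x, z1, z2, y1, y2).
A_eq = [[1, -1, -1, 0, 0],        # x = z1 + z2                       (16)
        [0,  0,  0, 1, 1]]        # y1 + y2 = 1                       (18)
b_eq = [0, 1]
A_ub = [[0, -1, 0, 0, 0],         # z1 >= 0*y1  }
        [0,  1, 0, -1, 0],        # z1 <= 1*y1  }  instances of (17)
        [0,  0, -1, 0, 3],        # z2 >= 3*y2  }
        [0,  0,  1, 0, -4]]       # z2 <= 4*y2  }
b_ub = [0, 0, 0, 0]
bounds = [(None, None)] * 3 + [(0, 1), (0, 1)]   # y relaxed to [0, 1]   (19)

for sense in (+1, -1):            # minimize x, then maximize x
    res = linprog(c=[sense, 0, 0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    print(sense * res.fun)        # prints 0.0, then 4.0
```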
Heuristics
In practice, the MIP user often is interested in finding a good feasible solution quickly. In addition, pruning by optimality in branch-and-bound depends crucially on the value of the best known solution value z̄. We now describe several MIP heuristics, that is, procedures designed to find feasible, and hopefully good, solutions quickly.

In general, finding a feasible solution to an MIP is an NP-hard problem, so devising effective heuristics is far from simple. The heuristics we now describe are all based on the solution of one or more MIPs that hopefully are much simpler to solve than the original problem.

We distinguish between construction heuristics, in which one attempts to find a (good) feasible solution from scratch, and improvement heuristics, which start from a feasible solution and attempt to find a better one. We start with construction heuristics.

Rounding. The first idea that comes to mind is to take the solution of the linear program and to round the values of the integer variables to the nearest integer. Unfortunately, the result is rarely feasible in X_MIP.
A Diving Heuristic. This heuristic solves a series of linear programs. At each iteration, one or more integer variables that take fractional values in the current linear programming solution are fixed at nearby integer values, and the linear program is re-solved; the dive ends when the linear programming solution is integer feasible (a heuristic solution has been found) or the linear program becomes infeasible.

Relax-and-Fix. Here the integer variables are partitioned into K blocks I_1, ..., I_K, and a sequence of K simpler MIPs is solved. At the kth iteration, one solves

min{ cx + hy : Ax + Gy ≥ b, x ∈ R_+^p, y_j = ȳ_j for j ∈ ∪_{t=1}^{k-1} I_t, y_j ∈ Z_+^1 for j ∈ I_k, y_j ∈ R_+^1 for j ∈ ∪_{t=k+1}^{K} I_t }

If (x̄, ȳ) is an optimal solution, then one fixes ȳ_j for j ∈ I_k at these values and sets k ← k + 1. If the Kth MIP is feasible, then a heuristic solution has been found; otherwise, the heuristic fails.
Now we describe three iterative improvement heuristics. For simplicity, we suppose that the integer variables are binary. In each case, one solves an easier MIP by restricting the values taken by the integer variables to some neighborhood N(y*) of the best-known feasible solution (x*, y*), and one iterates. Also, let (x_LP, y_LP) be the current LP solution. In each case, we solve the MIP

min{ cx + hy : Ax + Gy ≥ b, x ∈ R_+^p, y ∈ Z_+^n, y ∈ N(y*) }

with a different choice of N(y*).
The Local Branching Heuristic. This heuristic restricts attention to solutions at a (Hamming) distance at most k from y*:

N(y*) = { y ∈ {0, 1}^n : Σ_j |y_j − y*_j| ≤ k }

This neighborhood can be represented by a linear constraint:

N(y*) = { y ∈ {0, 1}^n : Σ_{j: y*_j = 0} y_j + Σ_{j: y*_j = 1} (1 − y_j) ≤ k }
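In practice, the local branching restriction is added to the MIP as a single linear cut. A minimal Python sketch of its construction (ours; y_star is the incumbent 0-1 vector):

```python
def local_branching_cut(y_star, k):
    """Return (coeff, rhs) such that sum_j coeff[j]*y_j <= rhs encodes
    sum_{j: y*_j=0} y_j + sum_{j: y*_j=1} (1 - y_j) <= k."""
    coeff = [1 if yj == 0 else -1 for yj in y_star]
    rhs = k - sum(1 for yj in y_star if yj == 1)   # constants moved to the right-hand side
    return coeff, rhs
```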
The Relaxation Induced Neighborhood Search Heuristic (RINS). This heuristic fixes all integer variables that have the same value in the incumbent and LP solutions and leaves the others free. Let A = { j ∈ N : y_LP,j = y*_j }. Then

N(y*) = { y : y_j = y*_j for j ∈ A }
The Exchange Heuristic. This heuristic allows the user to choose the set A of variables that are fixed. As for the Relax-and-Fix heuristic, if a natural ordering of the variables exists, then a possible choice is to fix all the variables except for those in one interval I_k. Now if A = N∖I_k, the neighborhood can again be taken as

N(y*) = { y : y_j = y*_j for j ∈ A }

One possibility then is to iterate over k = 1, ..., K and repeat as long as additional improvements are found.
The Branch-and-Cut Algorithm
The branch-and-cut algorithm is the same as the branch-and-bound algorithm except for one major difference. Previously, one just selected a subset of solutions X^t from the list L that was described by the initial problem representation together with the branching constraints:

P^t = { (x, y) : Ax + Gy ≥ b, A^t x + G^t y ≥ b^t, l^t ≤ y ≤ u^t },   X^t = P^t ∩ (R_+^p × Z_+^n)

Now the steps, once X^t is taken from the list L, are
(i) Preprocess to tighten the formulation P^t.
(ii) Solve the linear program z_LP^t = min{ cx + hy : (x, y) ∈ P^t }.
(iii) Prune the node, if possible, as in branch-and-bound.
(iv) Call one or more heuristics. If a better feasible solution is obtained, then update the incumbent value z̄.
(v) Look for violated valid inequalities. If one or more satisfactory cuts are found, then add them as cuts, modify P^t, and repeat step (ii).
(vi) If no more interesting violated inequalities are found, then branch as in the branch-and-bound algorithm and add the two new sets X^t_− and X^t_+ to the list L, along with their latest formulations P^t.
Then one returns to the branch-and-bound step of selecting a new set from the list, and so forth.

In practice, preprocessing and cut generation always are carried out on the original set X_MIP and then on selected sets drawn from the list (for example, sets obtained after a certain number of branches or every kth set drawn from the list). Often, the valid inequalities added for set X^t are valid for the original set X_MIP, in which case the inequalities can be added to each set P^t. All the major branch-and-cut systems for MIP use preprocessing and heuristics, such as diving and RINS, and the valid inequalities generated include MIR inequalities, Gomory mixed integer cuts, 0-1 cover inequalities, and path inequalities generalizing the (l, S) inequalities. A flowchart of a branch-and-cut algorithm is shown in Fig. 2.
REFERENCES AND ADDITIONAL TOPICS
Formulations of Problems as Mixed Integer Programs
Many examples of MIP models from numerous areas, including air and ground transport, telecommunications, cutting and loading, and finance, can be found in Heipcke (2) and Williams (3), as well as in operations research journals such as Operations Research, Management Science, Mathematical Programming, INFORMS Journal on Computing, European Journal of Operational Research, and more specialized journals such as Transportation Science, Networks, Journal of Chemical Engineering, and so forth.

[Figure 2. Branch-and-cut schema: take a subproblem X^t with formulation P^t from the list L; preprocess; solve the LP; call heuristics and update the incumbent; prune by infeasibility, optimality, or bound; otherwise separate cuts (and resolve) or branch, adding two new nodes to L and updating P^t. The algorithm stops when the list is empty, at which point the incumbent is optimal.]
Basic References
Two basic texts on integer and mixed integer programming are Wolsey (4) and Part I of Pochet and Wolsey (5). More advanced texts are Schrijver (6) and Nemhauser and Wolsey (7). Recent surveys on integer and mixed integer programming with an emphasis on cutting planes include Marchand et al. (8), Fügenschuh and Martin (9), Cornuéjols (10), and Wolsey (11).

Preprocessing is discussed in Savelsbergh (12) and Andersen and Andersen (13), and branching rules are discussed in Achterberg et al. (14). Much fundamental work on cutting planes is due to Gomory (15, 16). The related mixed integer rounding inequality appears in Chapter II.1 of Nemhauser and Wolsey (7), and cover inequalities for 0-1 knapsack constraints are discussed in Balas (17), Hammer et al. (18), and Wolsey (19). The local branching heuristic appears in Fischetti and Lodi (29); RINS and diving appear in Danna et al. (21).
Decomposition Algorithms
Significant classes of MIP problems cannot be solved directly by the branch-and-cut approach outlined above. At least three important algorithmic approaches use the problem structure to decompose a problem into a sequence of smaller/easier problems. One such class, known as branch-and-price or column generation (see, for instance, Barnhart et al. (22)), extends the well-known Dantzig-Wolfe algorithm for linear programming (23) to IPs and MIPs. Essentially, the problem is reformulated with a huge number of columns/variables; then dual variables or prices from linear programming are used to select/generate interesting columns until optimality is reached, and the whole is embedded into a branch-and-bound approach. Very many problems in the areas of airlines, road and rail transport, and staff scheduling are treated in this way. A related approach, Lagrangian relaxation (24), uses the prices to transfer complicating constraints into the objective function. The resulting, easier problem provides a lower bound on the optimal value, and the prices then are optimized to generate as good a lower bound as possible.

An alternative decomposition strategy, known as Benders decomposition (25), takes a different approach. If the values of the integer variables are fixed, then the remaining problem is a linear program, f(y) = min{ cx : Ax ≥ b − Gy, x ∈ R_+^p }.
−( ∂²u/∂x₁² + ∂²u/∂x₂² ) = f   in Ω = (0, 1) × (0, 1)   (8)

with the homogeneous Dirichlet BCs

u = 0 on ∂Ω   (9)

In Equation (8) we adopted, for simplicity, the unit square domain Ω = (0, 1) × (0, 1) ⊂ R². Central finite-difference approximation of the second derivatives in Equation (8), defined on a grid of uniformly spaced points ((x₁)_i, (x₂)_j) = (ih, jh), i, j = 1, ..., n_h − 1, results in a linear system of the form of Equation (4), in which the coefficient matrix A_h can be represented in stencil notation as

A_h = (1/h²) [ 0 −1 0; −1 4 −1; 0 −1 0 ]   (10)

We remark that the discretization in Equation (10) of the two-dimensional model problem on a uniform Cartesian grid is obtained as a tensor product of two one-dimensional discretizations in the x₁ and x₂ coordinate directions.
The Smoothing Property of Standard Iterative Methods
When an approximate solution ũ_h of the system in Equation (4) is computed, it is important to know how close it is to the true discrete solution u_h. Two quantitative measures commonly are used in this context. The error is defined as e_h = u_h − ũ_h and the residual as r_h = f_h − A_h ũ_h. Note that if the residual is small, this does not automatically imply that the approximation ũ_h is close to the solution u_h. The error e_h and the residual r_h are connected by the residual equation

A_h e_h = r_h   (11)

The importance of Equation (11) can be seen from the fact that if the approximation ũ_h to the solution u_h of Equation (4) is computed by some iterative scheme, it can be improved as u_h = ũ_h + e_h.
The simplest iterative schemes that can be deployed for the solution of sparse linear systems belong to the class of splitting iterations (see Refs. 2, 4, 7, 8, and 10 for more details). In algorithmic terms, each splitting iteration starts from the decomposition of the coefficient matrix A_h as A_h = M_h − N_h, with M_h being a regular matrix (det(M_h) ≠ 0) such that linear systems of the form M_h u_h = f_h are easy to solve. Then, for a suitably chosen initial approximation of the solution ũ^(0)_h, an iteration is formed:

ũ^(k+1)_h = M_h^(−1) N_h ũ^(k)_h + M_h^(−1) f_h,   k = 0, 1, ...   (12)

Note that Equation (12) also can be rewritten to include the residual vector r^(k)_h = f_h − A_h ũ^(k)_h:

ũ^(k+1)_h = ũ^(k)_h + M_h^(−1) r^(k)_h,   k = 0, 1, ...   (13)
Some well-known methods that belong to this category are the fixed-point iterations, or relaxation methods, such as the Jacobi method, the Gauss-Seidel method, and its generalizations SOR and SSOR (4). To introduce these methods, we start from the splitting of the coefficient matrix A_h in Equation (4) in the form A_h = D_h − L_h − U_h, where D_h = diag(A_h) and L_h and U_h are the strictly lower and upper triangular parts of A_h, respectively. In this way, the system in Equation (4) becomes (D_h − L_h − U_h) u_h = f_h, and we can form a variety of iterative methods of the form in Equation (12) by taking suitable choices for the matrices M_h and N_h.

In the Jacobi method, the simplest choice for M_h is taken, that is, M_h = D_h, N_h = L_h + U_h, and the iteration can be written as

ũ^(k+1)_h = D_h^(−1) (L_h + U_h) ũ^(k)_h + D_h^(−1) f_h,   k = 0, 1, ...   (14)
A slight modification of the original method in Equation (14) is to take the weighted average between ũ^(k+1)_h and ũ^(k)_h to form a new iteration:

ũ^(k+1)_h = [ (1 − ω)I + ωD_h^(−1)(L_h + U_h) ] ũ^(k)_h + ωD_h^(−1) f_h,   k = 0, 1, ...   (15)

where ω ∈ R. Equation (15) represents the weighted or damped Jacobi method. For ω = 1, Equation (15) reduces to the standard Jacobi method in Equation (14).
The Gauss-Seidel method involves the choice M_h = D_h − L_h, N_h = U_h and can be written as

ũ^(k+1)_h = (D_h − L_h)^(−1) U_h ũ^(k)_h + (D_h − L_h)^(−1) f_h,   k = 0, 1, ...   (16)

The main advantage of the Gauss-Seidel method is that the components of the new approximation ũ^(k+1)_h can be used as soon as they are computed. Several modifications of the standard Gauss-Seidel method have been developed with the aim of improving the convergence characteristics or parallelizability of the original algorithm in Equation (16) (symmetric, red-black, etc.; see Refs. 4, 7, and 8).
When applied to the solution of linear systems in Equation (7) that arise from the discretization of the model problem in Equations (5) and (6), the convergence of splitting iterations is initially rapid, only to slow down significantly after a few iterations. More careful examination of the convergence characteristics using Fourier analysis (7-10) reveals different speeds of convergence for different Fourier modes. That is, if different Fourier modes (vectors of the form (v_l)_i = sin(ilπ/n_h), i = 1, ..., n_h − 1, where l is the wave number) are used as the exact solutions of the residual Equation (11) with a zero initial guess, the convergence speed of the splitting iterations improves with increasing wave number l. This means that the convergence is faster when l is larger, that is, when the error in the solution u_h contains the high-frequency (highly oscillatory) components. Thus, when a system in Equation (7) with an arbitrary right-hand side is solved using a simple splitting iteration with an arbitrary initial guess ũ^(0)_h, the initial fast convergence is due to the rapid elimination of the high-frequency components in the error e_h. A slow decrease in the error at the later stages of iteration indicates the presence of the low-frequency components. The Fourier modes in the lower half of the discrete spectrum (with the wave numbers 1 ≤ l < n_h/2) are referred to as the low-frequency (smooth) modes, whereas the modes in the upper half of the discrete spectrum (with the wave numbers n_h/2 ≤ l ≤ n_h − 1) are referred to as the high-frequency (oscillatory) modes.
If we rewrite a general splitting iteration in Equation (12) as ũ^(k+1)_h = G_h ũ^(k)_h + g_h, then the matrix G_h is referred to as the iteration matrix. The error e^(k)_h = u_h − ũ^(k)_h after k iterations satisfies the relation e^(k)_h = G_h^k e^(0)_h. A sufficient and necessary condition for a splitting iteration to converge to the solution (ũ^(k)_h → u_h) is that the spectral radius of the iteration matrix G_h is less than 1 (2, 4). The spectral radius is defined as ρ(G_h) = max_j |λ_j(G_h)|, where λ_j(G_h) are the eigenvalues of G_h [recall that for a symmetric and positive definite (SPD) matrix, all the eigenvalues are real and positive (2, 4)]. The speed of convergence of a splitting iteration is determined by the asymptotic convergence rate (4)

τ = −ln [ lim_{k→∞} ( ‖e^(k)_h‖ / ‖e^(0)_h‖ )^{1/k} ]   (17)
For the case of the linear system in Equation (7) obtained from the discretization of the model problem in Equations (5) and (6), the eigenvalues of the iteration matrix G^J_h = I − (ωh²/2) A_h for the damped Jacobi method are given by λ_j(G^J_h) = 1 − 2ω sin²(jπ/(2n_h)), j = 1, ..., n_h − 1. Thus, |λ_j(G^J_h)| < 1 for each j if 0 < ω < 1, and the method is convergent. However, different choices of the damping parameter ω have a crucial effect on the amount by which different Fourier components of the error are reduced. One particular choice is to find the value of the parameter ω that maximizes the effectiveness of the damped Jacobi method in reducing the oscillatory components of the error (the components with the wave numbers n_h/2 ≤ l ≤ n_h − 1). The optimal value for the linear system in Equation (7) is ω = 2/3 (7), and for this value we have |λ_j| < 1/3 for n_h/2 ≤ j ≤ n_h − 1. This means that each iteration of the damped Jacobi method reduces the magnitude of the oscillatory components of the error by at least a factor of 3. For the linear system obtained from the FEM discretization of the two-dimensional model problem in Equation (8), the optimal value of ω is 4/5 for the case of linear approximation and ω = 8/9 for the bilinear case (2, p. 100; 4). For the Gauss-Seidel method, the eigenvalues of the iteration matrix for the model problem in Equation (7) are given by λ_j(G^GS_h) = cos²(jπ/n_h), j = 1, ..., n_h − 1. As in the case of the damped Jacobi method, the oscillatory modes in the error are reduced rapidly, whereas the smooth modes persist.
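The smoothing property is easy to observe numerically. The following Python/numpy sketch (ours) applies ten damped Jacobi sweeps with ω = 2/3 to the one-dimensional model problem with f = 0, so that the iterate equals the error, and compares a smooth mode with an oscillatory one:

```python
import numpy as np

n = 64
h = 1.0 / n
# A_h = (1/h^2) tridiag(-1, 2, -1) for the 1-D model problem in Equations (5)-(6).
A = (np.diag(2.0 * np.ones(n - 1)) - np.diag(np.ones(n - 2), 1)
     - np.diag(np.ones(n - 2), -1)) / h**2
D_inv = np.diag(1.0 / np.diag(A))
omega = 2.0 / 3.0

x = np.arange(1, n) / n
for l in (3, 48):                            # smooth mode (l=3) vs oscillatory mode (l=48)
    e = np.sin(l * np.pi * x)                # initial error; the exact solution is zero
    for _ in range(10):                      # ten damped Jacobi sweeps
        e = e - omega * D_inv @ (A @ e)
    print(l, np.linalg.norm(e, np.inf))      # l=48 is reduced by many orders of magnitude,
                                             # while l=3 barely changes
```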
The property of the fixed-point iteration schemes to reduce rapidly the high-frequency components in the error is known as the smoothing property, and such schemes commonly are known as smoothers. This property is, at the same time, the main factor that impairs the applicability of these methods as stand-alone solvers for linear systems that arise in FDM/FEM discretizations of PDEs.
Two-Grid Correction Method
Having introduced the smoothing property of standard iterative methods, we investigate possible modifications of such iterative procedures that would enable the efficient reduction of all frequency components of the solution error e_h. Again, we study the model problem in Equations (5) and (6) discretized on the uniform grid Ω_h = { x_i = ih : h = 1/n_h, i = 1, ..., n_h − 1 }, yielding a linear system in Equation (7). The mesh Ω_h may be regarded as a fine mesh obtained by the uniform refinement of a coarse mesh Ω_H = { x_i = iH : H = 2h, i = 1, ..., n_H − 1 }. The coarse mesh contains only the points of the fine mesh with even numbers. After applying several steps of a fixed-point method to the h-system in Equation (7), only the smooth components of the error remain. The questions that naturally arise in this setting concern the properties of the smooth error components from the grid Ω_h when represented on the coarse grid Ω_H. These components seem to be more oscillatory on Ω_H than on Ω_h (see Ref. 7). Notice that on the coarse grid Ω_H, we have only half as many Fourier modes compared with the fine grid Ω_h. The fact that the smooth error components from the fine grid seem less smooth on the coarse grid offers a potential remedy for the situation when a fixed-point iteration loses its effectiveness: we need to move to a coarse grid Ω_H, where a fixed-point iteration will be more effective in reducing the remaining error components. This idea forms the basis of the two-grid method, which is summarized in Algorithm 1.
Algorithm 1. Two-grid correction scheme.

1: ũ_h ← G_h^{ν1}(ũ^(0)_h, f_h)
2: r_h = f_h − A_h ũ_h
3: r_H = R_h^H r_h
4: Solve approximately A_H e_H = r_H
5: e_h = P_H^h e_H
6: ũ_h = ũ_h + e_h
7: ũ_h ← G_h^{ν2}(ũ_h, f_h)
In Algorithm 1, G_h^ν(ũ_h, f_h) denotes the application of ν iterations of a fixed-point iteration method to the linear system in Equation (7) with the initial guess ũ_h. At Step 4 of Algorithm 1, the coarse-grid problem A_H e_H = r_H needs to be solved. The coarse-grid discrete operator A_H can be obtained either by the direct discretization of the continuous problem on a coarse mesh Ω_H or from the fine-grid discrete operator A_h by applying the Galerkin projection A_H = R_h^H A_h P_H^h. After solving the coarse grid problem (Step 4), we need to add the correction e_H, defined on the coarse grid Ω_H, to the current approximation of the solution ũ_h, which is defined on the fine grid Ω_h. It is obvious that these two vectors do not match dimensionally. Thus, before the correction, we need to transform the vector e_H into the vector e_h. The numerical procedure that implements the transfer of information from a coarse to a fine grid is referred to as interpolation or prolongation. The interpolation can be presented in operator form as v_h = P_H^h v_H, where v_H ∈ R^{n_H}, v_h ∈ R^{n_h}, and P_H^h ∈ R^{n_h × n_H}. Here, n_h denotes the size of a discrete problem on the fine grid and n_H denotes the size of a discrete problem on the coarse grid. Many different strategies exist for doing interpolation when MG is considered in the FDM setting (see Refs. 4, 7, 8, and 10). The most commonly used is linear or bilinear interpolation. In a FEM setting, the prolongation operator P_H^h is connected naturally with the FE basis functions associated with the coarse grid (see Refs. 1, 7, and 8). In the case of the one-dimensional problem in Equations (5) and (6), the elements of the interpolation matrix are given by (P_H^h)_{g(j), g(l)} = φ_l^H(x_j), where g(l) is the global number of the node l on the coarse grid Ω_H and g(j) is the global number of the node j on the fine grid Ω_h. φ_l^H(x_j) is the value of the FE basis function associated with the node l from the coarse grid Ω_H at the point j of the fine grid with the coordinate x_j.
Before solving the coarse grid problem A_H e_H = r_H, we need to transfer the information about the fine grid residual r_h to the coarse grid Ω_H, thus getting r_H. This operation is the reverse of prolongation and is referred to as restriction. The restriction operator can be represented as v_H = R_h^H v_h, where R_h^H ∈ R^{n_H × n_h}. The simplest restriction operator is injection, defined in one dimension as (v_H)_j = (v_h)_{2j}, j = 1, ..., n_H − 1; that is, the coarse grid vector takes the values at the corresponding fine grid points. Some more sophisticated restriction techniques include half injection and full weighting (see Refs. 4, 7, and 8). The important property of the full weighting restriction operator in the FDM setting is that it is the transpose of the linear interpolation operator up to a constant that depends on the spatial dimension d, that is, P_H^h = 2^d (R_h^H)^T. In the FEM setting, the restriction and the interpolation operators simply are selected as the transpose of each other, that is, P_H^h = (R_h^H)^T. The spatial dimension-dependent factor 2^d that appears in the relation between the restriction R_h^H and the interpolation P_H^h is the consequence of the residuals being taken pointwise in the FDM case and being element-weighted in the FEM case.
If the coarse grid problem is solved with sufficient accuracy, the two-grid correction scheme should work efficiently, provided that the interpolation of the error from the coarse to the fine grid is sufficiently accurate. This happens in cases where the error e_H is smooth. As the fixed-point iteration scheme applied to the fine-grid problem smooths the error, it forms a complementary pair with the interpolation, and the pair of these numerical procedures works very efficiently.
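A compact, self-contained implementation of the two-grid cycle for the one-dimensional model problem is given below (a Python sketch of ours: weighted Jacobi smoothing, full weighting restriction, linear interpolation, and an exact coarse solve; it assumes n is even and homogeneous Dirichlet conditions):

```python
import numpy as np

def poisson_matrix(n):
    """A = (1/h^2) tridiag(-1, 2, -1), h = 1/n, acting on the n-1 interior points."""
    h = 1.0 / n
    return (np.diag(2.0 * np.ones(n - 1)) - np.diag(np.ones(n - 2), 1)
            - np.diag(np.ones(n - 2), -1)) / h**2

def jacobi(A, u, f, omega=2/3, sweeps=3):
    D_inv = 1.0 / np.diag(A)
    for _ in range(sweeps):
        u = u + omega * D_inv * (f - A @ u)
    return u

def restrict(r):                        # full weighting: fine (n-1) -> coarse (n/2-1)
    return 0.25 * (r[0:-2:2] + 2 * r[1:-1:2] + r[2::2])

def interpolate(e):                     # linear interpolation: coarse -> fine
    fine = np.zeros(2 * len(e) + 1)
    fine[1::2] = e                      # coarse-grid values carry over
    fine[2:-1:2] = 0.5 * (e[:-1] + e[1:])
    fine[0], fine[-1] = 0.5 * e[0], 0.5 * e[-1]
    return fine

def two_grid(A_h, f_h, u0, n):
    """One cycle of Algorithm 1 with an exact coarse solve (re-discretized A_H)."""
    u = jacobi(A_h, u0, f_h)                        # pre-smoothing
    e_H = np.linalg.solve(poisson_matrix(n // 2),   # coarse-grid problem
                          restrict(f_h - A_h @ u))
    u = u + interpolate(e_H)                        # prolongate and correct
    return jacobi(A_h, u, f_h)                      # post-smoothing

n = 64
A = poisson_matrix(n)
x = np.arange(1, n) / n
f = np.pi**2 * np.sin(np.pi * x)                    # exact solution u(x) = sin(pi x)
u = np.zeros(n - 1)
for _ in range(5):
    u = two_grid(A, f, u, n)
print(np.max(np.abs(u - np.sin(np.pi * x))))        # algebraic error is driven below
                                                    # the O(h^2) discretization error
```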
V-Cycle Multigrid Scheme and Full Multigrid
If the coarse-grid problem A_H e_H = r_H in Algorithm 1 is solved approximately, presumably by using the fixed-point iteration, the question is how to eliminate successfully the outstanding low-frequency modes on Ω_H. The answer lies in the recursive application of the two-grid scheme. Such a scheme requires a sequence of nested grids Ω_0 ⊂ Ω_1 ⊂ ... ⊂ Ω_L, where Ω_0 is a sufficiently coarse grid (typically consisting of only a few nodes) to allow efficient exact solution of the residual equation, presumably by a direct solver. This scheme defines the V-cycle of MG, which is summarized in Algorithm 2.
Algorithm 2. V-cycle multigrid (recursive definition): ũ_L = MG(A_L, f_L, ũ^(0)_L, ν_1, ν_2, L).

1: function MG(A_l, f_l, ũ^(0)_l, ν_1, ν_2, l)
2:   ũ_l ← G_l^{ν1}(ũ^(0)_l, f_l)
3:   r_l = f_l − A_l ũ_l
4:   r_{l−1} = R_l^{l−1} r_l
5:   if l − 1 = 0 then
6:     Solve exactly A_{l−1} e_{l−1} = r_{l−1}
7:   else
8:     e_{l−1} = MG(A_{l−1}, r_{l−1}, 0, ν_1, ν_2, l − 1)
9:   end if
10:  e_l = P_{l−1}^l e_{l−1}
11:  ũ_l = ũ_l + e_l
12:  ũ_l ← G_l^{ν2}(ũ_l, f_l)
13:  return ũ_l
14: end function
A number of modifications of the basic V-cycle exist. The simplest modification is to vary the number of recursive calls γ of the MG function in Step 8 of Algorithm 2. For γ = 1, we have the so-called V-cycle, whereas γ = 2 produces the so-called W-cycle (7).

Until now, we have assumed that the relaxation on the fine grid is done with an arbitrary initial guess ũ^(0)_L, most commonly taken to be the zero vector. A natural question in this context would be whether it is possible to obtain an improved
initial guess for the relaxation method. Such an approximation can be obtained naturally by applying a recursive procedure referred to as nested iteration (7, 8, 10). Assume that the model problem in Equation (7) is discretized using a sequence of nested grids Ω_0 ⊂ Ω_1 ⊂ ... ⊂ Ω_L. Then we can solve the problem on the coarsest level Ω_0 exactly, interpolate the solution to the next finer level, and use this value as the initial guess for the relaxation (i.e., ũ^(0)_1 = P_0^1 ũ_0). This procedure can be continued until we reach the finest level L. Under certain assumptions, the error in the initial guess ũ^(0)_L = P_{L−1}^L ũ_{L−1} on the finest level is of the order of the discretization error, and only a small number of MG V-cycles is needed to achieve the desired level of accuracy. The combination of the nested iteration and the MG V-cycle leads to the full multigrid (FMG) algorithm, summarized in Algorithm 3.
in Algorithm 3:
Algorithm 3. Full multigrid: ~ u
L
= FMG(A
L
; f
L
; v
1
; v
2
; L):
1: function
~ u
L
= FMG(A
L
; f
L
; v
1
; v
2
; l)
2: Solve
A
0
~ u
0
= f
0
with sufcient accuracy
3: for l=1, L do
4:
~ u
(0)
l
=
^
P
l
l1
~ u
l1
5:
~ u
l
= MG(A
l
; f
l
; ~ u
(0)
l
; v
1
; v
2
; l) % Algorithm 2
6: end for
7: return
~ u
L
8: end function
The interpolation operator P̂_{l−1}^l in the FMG scheme can be different, in general, from the interpolation operator P_{l−1}^l used in the MG V-cycle. An FMG cycle can be viewed as a sequence of V-cycles on progressively finer grids, where each V-cycle on the grid l is preceded by a V-cycle on the grid l − 1.
Computational Complexity of Multigrid Algorithms
We conclude this section with an analysis of algorithmic complexity for MG, both in terms of the execution time and the storage requirements. For simplicity, we assume that the model problem in d spatial dimensions is discretized on a sequence of uniformly refined grids (in one and two spatial dimensions, the model problem is given by Equations (5) and (6) and Equations (8) and (9), respectively). Denote the storage required to fit the discrete problem on the finest grid by M_L. The total memory requirement for the MG scheme then is M_MG = μM_L, where μ depends on d but not on L. This cost does not take into account the storage needed for the restriction/prolongation operators, although in some implementations of MG these operators do not need to be stored explicitly. For the case of uniformly refined grids, the upper bound for the constant μ is 2 in one dimension and 4/3 in two dimensions (by virtue of μ = Σ_{j=0}^{L} 2^{−jd} < Σ_{j=0}^{∞} 2^{−jd}). For nonuniformly refined grids, μ is larger, but M_MG still is a linear function of M_L, with the constant of proportionality independent of L.
Computational complexity of MG usually is expressed in terms of work units (WU). One WU is the arithmetic cost of one step of a splitting iteration applied to the discrete problem on the finest grid. The computational cost of a V(1,1) cycle of MG (a V-cycle with one pre-smoothing and one post-smoothing iteration at each level) is $\nu$ times larger than one WU, with $\nu = 4$ for $d = 1$, $\nu = 8/3$ for $d = 2$, and $\nu = 16/7$ for $d = 3$ (7). These estimates do not take into account the application of the restriction/interpolation operators. The arithmetic cost of the FMG algorithm is higher than that of a V-cycle of MG. For the FMG with one relaxation step at each level, we have $\nu = 8$ for $d = 1$, $\nu = 7/2$ for $d = 2$, and $\nu = 5/2$ for $d = 3$ (7). Now we may ask how many V-cycles of MG or FMG are needed to achieve an iteration error commensurate with the discretization error of the FDM/FEM. To answer this question, one needs to know more about the convergence characteristics of MG. For this, a deeper mathematical analysis of the spectral properties of the two-grid correction scheme is essential. This analysis falls beyond the scope of our presentation; for further details see Refs. 7-10. If a model problem in $d$ spatial dimensions is discretized by a uniform square grid, the discrete problem has $O(m^d)$ unknowns, where $m$ is the number of grid lines in each spatial dimension. If the $V(\nu_1, \nu_2)$ cycle of MG is applied to the solution of the discrete problem with fixed parameters $\nu_1$ and $\nu_2$, the convergence factor $\tau$ in Equation (17) is bounded independently of the discretization parameter $h$ (for a rigorous mathematical proof, see Refs. 8-10). For the linear systems obtained from the discretization of second-order, elliptic, self-adjoint PDEs, the convergence factor $\tau$ of a V-cycle typically is of order 0.1. To reduce the error $e_h = u_h - \tilde u_h$ to the level of the discretization error, we need to apply $O(\log m)$ V-cycles. This means that the total cost of the V-cycle scheme is $O(m^d \log m)$. In the case of the FMG scheme, the problems discretized on grids $V_l$, $l = 0, 1, \ldots, L-1$, already are solved to the level of discretization error before proceeding with the solution of the problem on the finest grid $V_L$. In this way, we need to perform only $O(1)$ V-cycles to solve the problem on the finest grid. This makes the computational cost of the FMG scheme $O(m^d)$; that is, it is an optimal algorithm.
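These bounds are easy to verify numerically. The short Python sketch below is illustrative only: it assumes uniformly refined grids in which level $j$ below the finest carries $2^{-jd}$ times as many unknowns, and it charges two smoothing sweeps per level for a V(1,1) cycle, which reproduces the storage and work-unit factors quoted above.

    # Illustrative check of the storage and work-unit bounds quoted in the text.
    # Assumes uniformly refined grids: level j below the finest holds 2^(-j*d)
    # times as many unknowns as the finest level.

    def storage_factor(d, levels):
        """Partial geometric sum m = sum_{j=0..levels} 2^(-j*d), bounding M_MG / M_L."""
        return sum(2.0 ** (-j * d) for j in range(levels + 1))

    for d in (1, 2, 3):
        m = storage_factor(d, levels=20)          # close to the limit 1 / (1 - 2^(-d))
        # two smoothing sweeps per level -> V(1,1) cost is roughly 2*m work units
        print(f"d={d}: storage factor m ~ {m:.4f}, V(1,1) cost ~ {2*m:.4f} WU")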
ADVANCED TOPICS
The basic MG ideas were introduced in the previous section for the case of a linear, scalar, and stationary DE with a simple grid hierarchy. This section discusses some important issues that arise when MG is applied to more realistic problems (nonlinear and/or time-dependent DEs and systems of DEs), with complex grid structures in several spatial dimensions, and when the implementation is done on modern computer architectures.
Nonlinear Problems
The coarse grid correction step of the MG algorithm is not directly applicable to discrete nonlinear problems, as the superposition principle does not apply in this case. Two basic approaches exist for extending the application of MG methods to nonlinear problems. In the indirect approach, the MG algorithm is employed to solve a sequence of linear problems that result from an iterative global linearization procedure, such as Newton's method. Alternately, with a slight modification of the grid transfer and coarse grid operators, the MG algorithm can be transformed into the Brandt Full Approximation Scheme (FAS) algorithm (6,7), which can be applied directly to the nonlinear discrete equations.
In the FAS algorithm, the nonlinear discrete problem defined on a fine grid $V_h$ as $A_h(u_h) = f_h$ is replaced by

$A_h(\tilde u_h + e_h) - A_h(\tilde u_h) = f_h - A_h(\tilde u_h)$    (18)
Equation (18) reduces in the linear case to the residual Equation (11). Define a coarse-grid approximation of Equation (18):

$A_H\big(\hat R_h^H \tilde u_h + e_H\big) - A_H\big(\hat R_h^H \tilde u_h\big) = \hat R_h^H \big(f_h - A_h(\tilde u_h)\big)$    (19)
where $\hat R_h^H$ is some restriction operator. Note that if the discrete nonlinear problem in Equation (3) comes from the FEM approximation, the restriction operators $\hat R_h^H$ in Equation (19) should differ by a factor of $2^d$, where $d$ is the spatial dimension (see the discussion on p. 11). Introducing the approximate coarse grid discrete solution $\tilde u_H = \hat R_h^H \tilde u_h + e_H$ and the fine-to-coarse correction $\tau_H^h = A_H(\hat R_h^H \tilde u_h) - \hat R_h^H A_h(\tilde u_h)$, the coarse grid problem in the FAS MG algorithm becomes

$A_H(\tilde u_H) = R_h^H f_h + \tau_H^h$    (20)
Then, the coarse grid correction

$\tilde u_h \leftarrow \tilde u_h + \hat P_H^h e_H = \tilde u_h + \hat P_H^h \big(\tilde u_H - \hat R_h^H \tilde u_h\big)$    (21)

can be applied directly to nonlinear problems. In Equation (21), $\hat R_h^H \tilde u_h$ represents the solution of Equation (20) on the coarse grid $V_H$. This means that, once the solution of the fine grid problem is obtained, the coarse grid correction does not introduce any changes through the interpolation. The fine-to-coarse correction $\tau_H^h$ is a measure of how close the approximation properties of the coarse grid equations are to those of the fine grid equations.
When the FAS MG approach is used, the global Newton linearization is not needed, thus avoiding the storage of large Jacobian matrices. However, the linear smoothing algorithms need to be replaced by nonlinear variants of the relaxation schemes. Within a nonlinear relaxation method, we need to solve a nonlinear equation for each component of the solution. To facilitate this, local variants of the Newton method commonly are used. The FAS MG algorithm structure and its implementation are almost the same as in the linear case and require only small modifications. For nonlinear problems, a proper selection of the initial approximation is necessary to guarantee convergence. To this end, it is recommended to use the FMG method described in the Basic Concepts section above.
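To make the structure of the scheme concrete, here is a minimal two-grid FAS sketch in Python for the 1-D problem $-u'' + u^3 = f$ with homogeneous Dirichlet boundary conditions. The model problem, the nonlinear Gauss-Seidel smoother with local Newton steps, the full-weighting and linear-interpolation transfer operators, and the crude coarse-grid solve are all illustrative assumptions rather than details taken from the text; a practical FAS code would recurse to coarser levels instead.

    import numpy as np

    def A(u, h):
        """Discrete nonlinear operator for -u'' + u^3 acting on interior values u."""
        up = np.concatenate(([0.0], u, [0.0]))              # pad with boundary zeros
        return (2*up[1:-1] - up[:-2] - up[2:]) / h**2 + up[1:-1]**3

    def smooth(u, f, h, sweeps=2, newton_steps=2):
        """Nonlinear Gauss-Seidel: each scalar equation is relaxed by local Newton."""
        up = np.concatenate(([0.0], u, [0.0]))
        for _ in range(sweeps):
            for i in range(1, up.size - 1):
                for _ in range(newton_steps):
                    r = (2*up[i] - up[i-1] - up[i+1]) / h**2 + up[i]**3 - f[i-1]
                    up[i] -= r / (2.0 / h**2 + 3.0 * up[i]**2)
        return up[1:-1]

    def restrict(v):
        """Full weighting onto a coarse grid with (n - 1) / 2 interior points."""
        return 0.25*v[0:-2:2] + 0.5*v[1:-1:2] + 0.25*v[2::2]

    def prolong(v, n_fine):
        """Linear interpolation of coarse interior values back to the fine grid."""
        up = np.concatenate(([0.0], v, [0.0]))
        w = np.zeros(n_fine)
        w[1::2] = up[1:-1]                       # points coinciding with coarse nodes
        w[0::2] = 0.5 * (up[:-1] + up[1:])       # fine points between coarse nodes
        return w

    def fas_two_grid(u, f, h):
        u = smooth(u, f, h)                                  # pre-smoothing
        uH = restrict(u)                                     # restricted approximation
        fH = A(uH, 2*h) + restrict(f - A(u, h))              # coarse right-hand side, Eq. (20)
        uH_new = uH.copy()
        for _ in range(100):                                 # crude coarse solve (real codes recurse)
            uH_new = smooth(uH_new, fH, 2*h)
        u = u + prolong(uH_new - uH, u.size)                 # coarse grid correction, Eq. (21)
        return smooth(u, f, h)                               # post-smoothing

    if __name__ == "__main__":
        n, h = 63, 1.0 / 64
        x = np.linspace(h, 1.0 - h, n)
        f = A(np.sin(np.pi * x), h)                          # manufactured right-hand side
        u = np.zeros(n)
        for k in range(8):
            u = fas_two_grid(u, f, h)
            print(k, np.linalg.norm(f - A(u, h), np.inf))    # residual decreases per cycle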
Systems of Partial Differential Equations
Many complex problems in physics and engineering cannot be described by simple, scalar DEs. These problems, such as DEs with unknown vector fields or multiphysics problems, usually are described by systems of DEs. The solution of systems of DEs is usually a challenging task. Although no fundamental obstacles exist to applying standard MG algorithms to systems of DEs, one needs to construct the grid transfer operators and to create the coarse grid discrete operators. Moreover, the smoothing schemes need to be selected carefully.
Problems in structural and fluid mechanics frequently are discretized using staggered grids (10). In such cases, different physical quantities are associated with different nodal positions within the grid. The main reason for such a distribution of variables is the numerical stability of the resulting schemes. However, this approach involves some restrictions and difficulties in the application of standard MG methods. For instance, using simple injection as a restriction operator may not be possible. An alternative construction of the restriction operator is based on averaging the fine grid values. Moreover, the non-matching positions of fine and coarse grid points considerably complicate the interpolation near the domain boundaries. An alternative approach to the staggered grid discretization is to use non-structured grids with the same nodal positions for the different types of unknowns. Numerical stabilization techniques then need to be applied (2,8).
The effectiveness of standard MG methods, when applied to the solution of systems of DEs, is determined to a large extent by the effectiveness of the relaxation procedure. For scalar elliptic problems, it is possible to create nearly optimal relaxation schemes based on standard fixed-point iterations. The straightforward extension of this scalar approach to systems of DEs is to group the discrete equations either with respect to the grid points or with respect to the discrete representations of each DE in the system (the latter corresponds to the blocking of the linear system coefficient matrix). For effective smoothing, the order in which the particular grid points/equations are accessed is very important and should be adjusted for the problem at hand. Two main approaches exist for the decoupling of a system: global decoupling, where for each DE all the grid points are accessed simultaneously, and local decoupling, where all DEs are accessed for each grid point.
The local splitting naturally leads to collective or block relaxation schemes, where all the discrete unknowns at a given grid node, or group of nodes, are relaxed simultaneously (see the sketch below). The collective relaxation methods are very attractive within the FAS MG framework for nonlinear problems. Another very general class of smoothing procedures that are applicable in such situations are distributive relaxations (20). The idea is to triangulate locally the discrete operator for all discrete equations and variables within the framework of the smoother. The resulting smoothers also are called transforming smoothers (25).
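The following Python sketch illustrates the collective (point-block) relaxation idea: at every node, the small diagonal block coupling that node's unknowns is solved exactly, so all of them are updated together. The 2x2 coupled model system assembled in the example is invented for illustration; only the block-relaxation idea itself comes from the text.

    import numpy as np

    def collective_gauss_seidel(A, b, x, nodes, m=2, sweeps=1):
        """Point-block Gauss-Seidel: A uses point-block ordering, i.e. the m
        unknowns of node i occupy rows/columns m*i .. m*i+m-1.  All unknowns of
        a node are updated together by solving its m-by-m diagonal block."""
        for _ in range(sweeps):
            for i in range(nodes):
                sl = slice(m*i, m*i + m)
                r = b[sl] - A[sl, :] @ x            # residual of this node's equations
                x[sl] += np.linalg.solve(A[sl, sl], r)
        return x

    if __name__ == "__main__":
        n, c = 20, 0.5                              # nodes and coupling strength (made up)
        A = np.zeros((2*n, 2*n))
        for i in range(n):
            A[2*i, 2*i] = A[2*i+1, 2*i+1] = 4.0     # diffusion plus reaction on the diagonal
            A[2*i, 2*i+1] = A[2*i+1, 2*i] = c       # coupling between the two unknowns
            if i > 0:
                A[2*i, 2*(i-1)] = A[2*i+1, 2*(i-1)+1] = -1.0
            if i < n - 1:
                A[2*i, 2*(i+1)] = A[2*i+1, 2*(i+1)+1] = -1.0
        b = np.ones(2*n)
        x = collective_gauss_seidel(A, b, np.zeros(2*n), n, sweeps=100)
        print("residual norm:", np.linalg.norm(b - A @ x))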
In the section on grid-coarsening techniques, an
advanced technique based on the algebraic multigrid
(AMG) for the solution of linear systems obtained by the
discretization of systems of PDEs will be presented.
Time-Dependent Problems
A common approach in solving time-dependent problems is to separate the discretization of the spatial unknowns in the problem from the discretization of the time derivatives.
Discretization of all spatial variables by some standard numerical procedure results in such cases in a differential-algebraic (DAE) system. A number of numerical procedures, such as the method of lines, have been developed for the solution of such problems. Numerical procedures aimed at the solution of algebraic-differential equations usually are modifications and generalizations of the formal methods developed for the solution of systems of ODEs [e.g., the Runge-Kutta methods, the linear multistep methods, and the Newmark method (26,27)]. In a system of DAEs, each degree of freedom within the spatial discretization produces a single ODE in time. All methods for the integration of DAE systems can be classified into two groups: explicit and implicit. Explicit methods are computationally cheap (requiring only sparse matrix-vector multiplication at each time step) and easy to implement. Their main drawback lies in stability: they are not stable for all step sizes. The region of stability of these methods is linked closely to the mesh size used in the spatial discretization (the so-called CFL criterion). However, if a particular application requires the use of small time steps, then solution algorithms based on explicit time-stepping schemes can be effective. The implicit time-stepping schemes have no restrictions with respect to the time step size; that is, they are unconditionally stable for all step sizes. The price to pay for this stability is that at each time step one needs to solve a system of linear or nonlinear equations. If sufficiently small time steps are used with the implicit schemes, standard iterative solvers based on fixed-point iterations or Krylov subspace solvers (4) can be more effective than the use of MG solvers. This particularly applies if the time extrapolation method or an explicit predictor method (within a predictor-corrector scheme) is used to provide a good initial solution for the discrete problem at each time step. However, if sufficiently large time steps are used in the time-stepping algorithm, the discrete solution will contain a significant proportion of low-frequency errors introduced by the presence of diffusion in the system (20,27). The application of MG in such situations represents a feasible alternative. In this context, one can use the so-called smart time-stepping algorithms (for their application in fluid mechanics see Ref. 27). In these algorithms (based on the predictor-corrector schemes), the time step size is adjusted adaptively to the physics of the problem. If a system that needs to be solved within the corrector is nonlinear, the explicit predictor method is used to provide a good initial guess for the solution, thus reducing the overall computational cost. MG can be used as a building block for an effective solver within the corrector (see the section on preconditioning for an example of such a solver in fluid mechanics).
Multigrid with Locally Refined Meshes
Many practically important applications require the resolution of small-scale physical phenomena, which are localized in areas much smaller than the simulation domain. Examples include shocks, singularities, boundary layers, or non-smooth boundaries. Using uniformly refined grids with the mesh size adjusted to the small-scale phenomena is costly. This problem is addressed using adaptive mesh refinement, which represents a process of dynamic introduction of local fine grid resolution in response to unresolved error in the solution. Adaptive mesh refinement techniques were introduced first by Brandt (6). The criteria for adaptive grid refinement are provided by a posteriori error estimation (2).
The connectivity among mesh points is generally specified by the small subgroups of nodes that form the mesh cells. Typically, each cell is a simplex or another simple element (for example, lines in one dimension; triangles or quadrilaterals in two dimensions; and tetrahedra, brick elements, or hexahedra in three dimensions). The refinement of a single grid cell is achieved by placing one or more new points on the surface of, or inside, each grid cell and connecting newly created and existing mesh points to create a new set of finer cells. A union of all refined cells at the given discretization level forms a new, locally refined, grid patch. Because adaptive local refinement and MG both deal with grids of varying mesh size, the two methods naturally fit together. However, it is necessary to perform some adaptations on both sides to make the most effective use of both procedures. To apply MG methods, the local grid refinement should be performed with the possibility of accessing locally refined grids at different levels.
The adaptive mesh refinement (AMR) procedure starts from a basic coarse grid covering the whole computational domain. As the solution phase proceeds, the regions requiring a finer grid resolution are identified by an error estimator, which produces an estimate of the discretization error, one specific example being $e_h = \hat R_h u - u_h$ (28). Locally refined grid patches are created in these regions. This adaptive solution procedure is repeated recursively until either a maximal number of refinement levels is reached or the estimated error is below the user-specified tolerance. Such a procedure is compatible with the FMG algorithm. Notice that a locally refined coarse grid should contain both the information on the correction of the discrete solution in the part covered by it and the discrete solution itself in the remainder of the grid. The simplest and most natural way to achieve this goal is to employ a slightly modified FAS MG algorithm. The main difference between the FAS MG algorithm and the AMR MG algorithm is the additional interpolation that is needed at the interior boundary of each locally refined grid patch. For unstructured grids, it is possible to refine the grid cells close to the interior boundary in such a way as to avoid interpolation at the interior boundaries altogether. For structured meshes, the internal boundaries of locally refined patches contain the so-called hanging nodes that require interpolation from the coarse grid to preserve the FE solution continuity across the element boundaries. This interpolation may need to be defined in a recursive fashion if more than one level of refinement is introduced on a grid patch (this procedure resembles the long-range interpolation from the algebraic multigrid). The interpolation operator at the internal boundaries of the locally refined grids could be the same one used in the standard FMG algorithm. Figure 1 shows an example of
the adaptive grid structure comprising several locally refined grid patches dynamically created in the simulation of dopant redistribution during semiconductor fabrication (20). Other possibilities exist in the context of the FAS MG scheme; for example, one could allow the existence of hanging nodes and project the non-conforming solution at each stage of the FAS to the space where the solution values at the hanging nodes are the interpolants of their coarse grid parent values (29).
Various error estimators have been proposed to support the process of local grid refinement. The multilevel locally refined grid structure and the MG algorithm provide an additional, reliable, and numerically effective evaluation of discretization errors. Namely, the grid correction operator $\tau_H^h$ [introduced in Equation (20)] represents the local discretization error at the coarse grid level $V_H$ (up to a factor that depends on the ratio H/h). It is information inherent to the FAS MG scheme and can be used directly as a part of the local grid refinement process. The other possibility for error estimation is to compare discrete solutions obtained on different grid levels of FMG algorithms to extrapolate the global discretization error. In this case, the actual discrete solutions are used, rather than those obtained after fine-to-coarse correction.
Another class of adaptive multilevel algorithms are the fast adaptive composite-grid (FAC) methods developed in the 1980s (30). The main strength of the FAC is the use of existing single-grid solvers defined on uniform meshes to solve the different refinement levels. Another important advantage of the FAC is that the discrete systems on locally refined grids are given in conservative form. The FAC allows concurrent processing of grids at given refinement levels, and its convergence rate is bounded independently of the number of refinement levels. One potential pitfall of both AMR MG and the FAC is the multiplicative way in which the various refinement levels are treated, which implies sequential processing of these levels.
Grid-Coarsening Techniques and Algebraic Multigrid (AMG)
A multilevel grid hierarchy can be created in a straightforward way by successively adding finer discretization levels to an initial coarse grid. To this end, nested global and local mesh refinement as well as non-nested global mesh generation steps can be employed. However, many practical problems are defined in domains with complex geometries. In such cases, an unstructured mesh or a set of composite block-structured meshes is required to resolve all the geometric features of the domain. The resulting grid could contain a large number of nodes to be used as the coarsest grid level in the MG algorithms. To take full advantage of MG methods in such circumstances, a variety of techniques have been developed to provide a multilevel grid hierarchy and generate intergrid transfer operators by coarsening a given fine grid.
The first task in the grid-coarsening procedure is to choose the coarse-level variables. In practice, the quality of the selected coarse-level variables is assessed using heuristic principles (see Refs. 4, 7, and 8). The aim is to achieve both good quality of interpolation and a significant reduction in the dimension of the discrete problem on the coarse grids. These two requirements are contradictory, and in practice some tradeoffs are needed to meet them as closely as possible. The coarse-level variables commonly are identified as a subset of the fine grid variables based on the mesh connectivity and algebraic dependencies. The mesh connectivity can be employed in a graph-based approach, by selecting coarse-level variables at the fine mesh points that form a maximal independent set (MIS) of the fine mesh points. For the construction of effective coarse grids and intergrid transfer operators, it often is necessary to include algebraic information in the coarse node selection procedure. One basic algebraic principle is to select coarse-level variables with a strong connection to the neighboring fine-level variables (strength of dependence principle) (31). This approach leads to a class of MG methods referred to as algebraic multigrid (AMG) methods. Whereas geometric MG methods operate on a sequence of nested grids, AMG operates on a hierarchy of progressively smaller linear systems, which are constructed in an automatic coarsening process based on the algebraic information contained in the coefficient matrix.
Another class of methods identifies a coarse grid variable as a combination of several fine-grid variables. It typically is used in combination with a finite-volume discretization method (10), where the grid variables are associated with the corresponding mesh control volumes. The volume agglomeration method simply aggregates the fine control volumes into larger agglomerates to form the coarse grid space. The agglomeration can be performed by a greedy algorithm. Alternatively, the fine grid variables can be clustered directly (aggregated) into coarse-level variables based on the algebraic principle of strongly or weakly coupled neighborhoods. The aggregation method is introduced in Ref. 32, and some similar methods can be found in Refs. 33 and 34. The agglomeration method works also in the finite element setting (35).
The most common approach in creating the intergrid transfer operators within the grid-coarsening procedure is to formulate first the prolongation operator $P_H^h$ and to use $R_h^H = (P_H^h)^T$ as a restriction operator. One way of constructing the prolongation operator from a subset of fine grid variables is to apply the Delaunay triangulation procedure to the selected coarse mesh points. A finite element space associated with this triangulation can be used to create the interpolation operator $P_H^h$.

Figure 1. Composite multilevel adaptive mesh for the MG solution of a dopant diffusion equation in semiconductor fabrication.

The prolongation operators for the agglomeration and aggregation methods are defined in such a way that each fine grid variable is represented by a single (agglomerated or aggregated) variable of the coarse grid.
The interpolation operator also can be derived using purely algebraic information. We start from the linear discrete problem in Equation (4) and introduce the concept of an algebraically smooth error $e_h$. An algebraically smooth error is an error that cannot be reduced effectively by a fixed-point iteration. Note that the graph of an algebraically smooth error may not necessarily be a smooth function (7). Smooth errors are characterized by small residuals, that is, componentwise $|r_i| \ll a_{ii}|e_i|$. This condition can be interpreted broadly as

$A e \approx 0$    (22)

For the cases of the model problems in Equations (5) and (6) and Equations (8) and (9), where the coefficient matrices are characterized by zero row sums, Equation (22) means that we have component-wise

$a_{ii} e_i \approx -\sum_{j \neq i} a_{ij} e_j$    (23)
The relation in Equation (23) means that the smooth error can be approximated efficiently at a particular point if one knows its values at the neighboring points. This fact is the starting point for the development of the interpolation operator $P_H^h$ for AMG. Another important feature in constructing the AMG interpolation is the fact that the smooth error varies slowly along the strong connections (associated with large negative off-diagonal elements in the coefficient matrix). Each equation in the Galerkin system describes the dependence between the neighboring unknowns. The $i$-th equation in the system determines which unknowns $u_j$ affect the unknown $u_i$ the most. If the value $u_i$ needs to be interpolated accurately, the best choice would be to adopt the interpolation points $u_j$ with large coefficients $a_{ij}$ in the Galerkin matrix. Such points $u_j$ are, in turn, good candidates for coarse grid points. The quantitative expression that $u_i$ depends strongly on $u_j$ is given by

$-a_{ij} \geq \theta \max_{1 \leq k \leq n}\,(-a_{ik}), \qquad 0 < \theta \leq 1$    (24)
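As a concrete illustration of the test in Equation (24), the following Python sketch marks, for every row of a given matrix, the columns on which that row depends strongly. The dense matrix storage, the default threshold of 0.25, and the 1-D Laplacian used in the example are assumptions made only for illustration.

    import numpy as np

    def strong_connections(A, theta=0.25):
        """Return S[i] = set of columns j on which row i depends strongly, Eq. (24)."""
        n = A.shape[0]
        S = [set() for _ in range(n)]
        for i in range(n):
            offdiag = [-A[i, j] for j in range(n) if j != i]
            threshold = theta * max(offdiag) if offdiag else 0.0
            if threshold <= 0.0:
                continue                      # no negative off-diagonal couplings in this row
            for j in range(n):
                if j != i and -A[i, j] >= threshold:
                    S[i].add(j)
        return S

    if __name__ == "__main__":
        # 1-D Laplacian stencil [-1, 2, -1]: every interior point depends strongly
        # on both of its neighbors for any theta <= 1.
        n = 6
        A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        print(strong_connections(A))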
Assume that we have partitioned the grid points into the subset of coarse grid points C and fine grid points F. For each fine grid point $i$, its neighboring points, defined by the nonzero off-diagonal entries in the $i$-th equation, can be subdivided into three different categories with respect to the strength of dependence: coarse grid points with strong influence on $i$ ($C_i$), fine grid points with strong influence on $i$ ($F_i^s$), and the points (both coarse and fine) that weakly influence $i$ ($D_i$). Then, the relation in Equation (23) for an algebraically smooth error becomes (for more details, see Ref. 31)

$a_{ii} e_i \approx -\sum_{j \in C_i} a_{ij} e_j - \sum_{j \in F_i^s} a_{ij} e_j - \sum_{j \in D_i} a_{ij} e_j$    (25)
The sum over the weakly connected points $D_i$ can be lumped with the diagonal term, as $a_{ij}$ is relatively small compared with $a_{ii}$. Also, the nodes $j \in F_i^s$ should be distributed to the nodes in the set $C_i$. In deriving this relation, we must ensure that the resulting interpolation formula works correctly for constant functions. The action of the interpolation operator $P_H^h$ obtained in this way can be represented as

$(P_H^h e)_i = \begin{cases} e_i & i \in C \\ \sum_{j \in C_i} \omega_{ij} e_j & i \in F \end{cases}$    (26)
where

$\omega_{ij} = -\dfrac{a_{ij} + \sum_{k \in F_i^s} \dfrac{a_{ik} a_{kj}}{\sum_{l \in C_i} a_{kl}}}{a_{ii} + \sum_{m \in D_i} a_{im}}$    (27)
Finally, the coarse grid discrete equations should be determined. The volume agglomeration methods allow us, in some cases, to combine the fine grid discretization equations in the formulation of the coarse grid ones. However, in most cases, the explicit discretization procedure is not possible and the coarse grid equations should be constructed algebraically. An efficient and popular method to obtain a coarse-level operator is through the Galerkin form

$A_H = R_h^H A_h P_H^h$    (28)
For other, more sophisticated methods based on con-
strained optimization schemes see Ref. 36.
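A minimal sketch of the Galerkin construction in Equation (28) is given below, assuming the 1-D Laplacian as the fine grid operator, linear interpolation as the prolongation, and its transpose as the restriction; all of these concrete choices are illustrative.

    import numpy as np
    import scipy.sparse as sp

    def laplacian_1d(n):
        """Standard [-1, 2, -1] operator on n interior points (unit spacing)."""
        return sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

    def prolongation_1d(n_coarse):
        """Linear interpolation from n_coarse to 2*n_coarse + 1 interior points."""
        n_fine = 2*n_coarse + 1
        P = sp.lil_matrix((n_fine, n_coarse))
        for j in range(n_coarse):
            i = 2*j + 1                       # fine index coinciding with coarse point j
            P[i, j] = 1.0
            P[i - 1, j] = 0.5
            P[i + 1, j] = 0.5
        return P.tocsr()

    n_coarse = 7
    A_h = laplacian_1d(2*n_coarse + 1)
    P = prolongation_1d(n_coarse)
    R = P.T                                   # restriction as the transpose of P
    A_H = (R @ A_h @ P).toarray()             # Galerkin coarse operator, Eq. (28)
    print(np.round(A_H, 2))                   # a (scaled) coarse-grid Laplacian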
The AMG approach that was introduced previously is suitable for the approximate solution of linear algebraic systems that arise in discretizations of scalar elliptic PDEs. The most robust performance of AMG is observed for linear systems with coefficient matrices that are SPD and M-matrices (the example being the discrete Laplace operator) (24). A reasonably good performance also can be expected for the discrete systems obtained from discretizations of perturbations of elliptic operators (such as the convection-diffusion-reaction equation). However, when the AMG solver, based on the standard coarsening approach, is applied to the linear systems obtained from the discretization of non-scalar PDEs or systems of PDEs (where the coefficient matrices are substantially different from SPD M-matrices), its performance usually deteriorates significantly (if it converges at all). The reason for this is that the classical AMG uses the variable-based approach. This approach treats all the unknowns in the system equally. Thus, such an approach cannot work effectively for systems of PDEs, unless a very weak coupling exists between the different unknowns. If the different types of unknowns in a PDE system are coupled tightly, some modifications and extensions of the classical approach are needed to improve the AMG convergence characteristics and robustness (see Ref. 37).
An initial idea of generalizing the AMG concept to systems of PDEs resulted in the unknown-based approach. In
this approach, different types of unknowns are treated separately. Then, the coarsening of the set of variables that corresponds to each of the unknowns in a PDE system is based on the connectivity structure of the submatrix that corresponds to these variables (the diagonal block of the overall coefficient matrix, assuming that the variables corresponding to each of the unknowns are enumerated consecutively), and interpolation is based on the matrix entries in each submatrix separately. In contrast, the coarse-level Galerkin matrices are assembled using the whole fine-level matrices. The unknown-based approach is the simplest way of generalizing the AMG for systems of PDEs. The additional information that is needed for the unknown-based approach (compared with the classical approach) is the correspondence between the variables and the unknowns. The best performance of AMG with the unknown-based approach is expected if the diagonal submatrices that correspond to all the unknowns are close to M-matrices. This approach has proved to be effective for systems involving non-scalar PDEs [such as linear elasticity (38)]. However, this approach loses its effectiveness if the different unknowns in the system are coupled tightly.
The latest development in generalizations of AMG for PDE systems is referred to as the point-based approach. In this approach, the coarsening is based on a set of points, and all the unknowns share the same grid hierarchy. In a PDE setting, points can be regarded as the physical grid points in space, but they also can be considered in an abstract setting. The coarsening process is based on an auxiliary matrix, referred to as the primary matrix. This matrix should be defined to be representative of the connectivity patterns for all the unknowns in the system, as the same coarse levels are associated with all the unknowns. The primary matrix frequently is defined based on the distances between the grid points. Then, the associated coarsening procedure is similar to that of geometric MG. The interpolation procedure can be defined as block interpolation (by approximating the block equations), or one can use the variable-based approaches, which are the same (s-interpolation) or different (u-interpolation) for each unknown. The interpolation weights are computed based on the coefficients in the Galerkin matrix, the coefficients of the primary matrix, or the distances and positions of the points. The interpolation schemes deployed in classical AMG [such as direct, standard, or multi-pass interpolation (8)] generalize to the point-based AMG concept.
Although great progress has been made in developing AMG for systems of PDEs, no unique technique works well for all systems of PDEs. Concrete instances of this concept are found to work well for certain important applications, such as CFD (segregated approaches in solving the Navier-Stokes equations), multiphase flow in porous media (39), structural mechanics, semiconductor process and device simulation (40,41), and so forth. Finally, it should be pointed out that the code SAMG (42), based on the point approach, was developed at the Fraunhofer Institute in Germany.
Multigrid as a Preconditioner
In the Basic Concepts section, the fixed-point iteration methods were introduced. Their ineffectiveness when dealing with the low-frequency error components motivated the design of MG techniques. Fixed-point iterations are just one member of a broad class of methods for the solution of sparse linear systems that are based on projection. A projection process enables the extraction of an approximate solution of a linear system from a given vector subspace. The simplest case of the projection methods are one-dimensional projection schemes, where the search space is spanned by a single vector. Representatives of one-dimensional projection schemes are the steepest descent algorithm and the minimum residual iteration (4).
One way of improving the convergence characteristics of the projection methods is to increase the dimension of the vector subspace. Such an approach is adopted in the Krylov subspace methods (2,4). Krylov methods are based on the projection process onto the Krylov subspace defined as

$K_m(A, y) = \mathrm{span}(y, Ay, \ldots, A^{m-1}y)$    (29)

In Equation (29), $A$ is the coefficient matrix and the vector $y$ usually is taken to be the initial residual $r^{[0]} = f - Ax^{[0]}$. The main property of the Krylov subspaces is that their dimension increases by 1 at each iteration. From the definition in Equation (29) of the Krylov subspace, it follows that the solution of a linear system $Ax = f$ is approximated by $x = A^{-1}f \approx q_{m-1}(A)\, r^{[0]}$, where $q_{m-1}(A)$ is a matrix polynomial of degree $m-1$. All Krylov subspace methods work on the same principle of polynomial approximation. Different types of Krylov methods are obtained from the different types of orthogonality constraints imposed to extract the approximate solution from the Krylov subspace (4). The well-known representative of the Krylov methods aimed at linear systems with SPD coefficient matrices is the conjugate gradient (CG) algorithm.
The convergence characteristics of Krylov iterative solvers depend crucially on the spectral properties of the coefficient matrix. In the case of the CG algorithm, the error norm reduction at the $k$-th iteration can be represented as (2,4)

$\dfrac{\|e^{[k]}\|_A}{\|e^{[0]}\|_A} \leq \min_{q_k \in \Pi_k^0}\; \max_j\, |q_k(\lambda_j)|$    (30)
where $q_k$ is a polynomial of degree $k$ such that $q_k(0) = 1$, $\Pi_k^0$ denotes the set of such polynomials, and $\|v\|_A^2 = (Av, v)$. The estimate in Equation (30) implies that the contraction rate of the CG algorithm at iteration $k$ is determined by the maximum value of the characteristic polynomial $q_k$ evaluated at the eigenvalues $\lambda_j$ of the coefficient matrix. This implies that rapid convergence of the CG algorithm requires the construction of a characteristic polynomial $q$ of relatively low degree, which has small values at all eigenvalues of $A$. This construction is possible if the eigenvalues of $A$ are clustered tightly together. Unfortunately, the coefficient matrices that arise in various discretizations of PDEs have eigenvalues spread over large areas of the real axis (or the complex plane). Moreover, the spectra of such coefficient matrices change when the discretization parameter or other problem parameters vary. The prerequisite for rapid and robust convergence of Krylov solvers is the
redistribution of the coefficient matrix eigenvalues. Ideally, the eigenvalues of the transformed coefficient matrix should be clustered tightly, and the bounds of the clusters should be independent of any problem or discretization parameters. As a consequence, the Krylov solver should converge to a prescribed tolerance within a small, fixed number of iterations. The numerical technique that transforms the original linear system $Ax = f$ to an equivalent system, with the same solution but a new coefficient matrix that has more favorable spectral properties (such as a tightly clustered spectrum), is referred to as preconditioning (2,4). Designing sophisticated preconditioners is problem-dependent and represents a tradeoff between accuracy and efficiency.
Preconditioning assumes the transformation of the original linear system $Ax = f$ to an equivalent linear system $M^{-1}Ax = M^{-1}f$. In practice, the preconditioning matrix (or the preconditioner) $M$ is selected to be spectrally close to the matrix $A$. The application of a preconditioner requires, at each iteration of a Krylov solver, that one solve (exactly or approximately) a linear system $Mz = r$, where $r$ is the current residual. This operation potentially can represent a considerable computational overhead. This overhead is the main reason why the design of a good preconditioner assumes the construction of a matrix $M$ that can be assembled, and the action of its inverse on a vector computed, at optimal cost. Moreover, for preconditioning to be effective, the overall reduction in the number of iterations of the preconditioned Krylov solver and, more importantly, the reduction in the execution time should be considerable when compared with a non-preconditioned Krylov solver (43). To design a (nearly) optimal Krylov solver for a particular application, the preconditioner needs to take into account the specific structure and spectral properties of the coefficient matrix.
In previous sections, we established MG as an optimal iterative solver for the linear systems that arise in discretizations of second-order, elliptic PDEs. As an alternative to using MG as a solver, one may consider using it as a preconditioner for Krylov subspace solvers. When MG is used as a preconditioner, one applies a small number (typically 1 to 2) of V-cycles, with a small number of pre- and post-smoothing iterations, at each Krylov iteration to approximate the action of $M^{-1}$ on a residual vector. This approach works well for SPD linear systems arising from the discretization of second-order, elliptic PDEs. A number of similar MG-based approaches work well in the same context.
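The pattern can be sketched in a few lines of Python, assuming the pyamg package is available to supply the multigrid hierarchy: one AMG V-cycle approximates the action of $M^{-1}$ inside SciPy's conjugate gradient solver. The 2-D Poisson test matrix from pyamg's gallery is purely illustrative.

    import numpy as np
    import scipy.sparse.linalg as spla
    import pyamg                               # assumes the pyamg package is installed

    A = pyamg.gallery.poisson((64, 64), format="csr")   # 2-D Laplacian, 4096 unknowns
    b = np.ones(A.shape[0])

    ml = pyamg.ruge_stuben_solver(A)                     # classical AMG hierarchy
    M = ml.aspreconditioner(cycle="V")                   # one V-cycle per application

    iters = []
    x, info = spla.cg(A, b, M=M, callback=lambda xk: iters.append(1))
    print("converged:", info == 0, "in", len(iters), "iterations")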
One group of MG-based preconditioners creates explicitly the transformed system matrix $M_L^{-1} A M_R^{-1}$, where $M_L$ and $M_R$ are the left and the right preconditioning matrices, respectively. A typical representative is the hierarchical basis MG (HBMG) method (44). The HBMG method is introduced in the framework of FE approximations based on a nodal basis. In the HBMG method, the right preconditioning matrix $M_R$ is constructed in such a way that the solution $x$ is contained in the space of hierarchical basis functions. Such a space is obtained by replacing, in a recursive fashion, the basis functions associated with the fine grid nodes that also exist at the coarser grid level by the corresponding coarse grid nodal basis functions. The process is presented in Fig. 2, where the fine grid basis functions that are replaced by the coarse grid basis functions are plotted with dashed lines. The HBMG may be interpreted as the standard MG method, with the smoother at each level applied to the unknowns existing only at this level (and not present at coarser levels). The common choice $M_L = M_R^t$ extends the important properties of the original coefficient matrix, such as symmetry and positive definiteness, to the preconditioned matrix. The HBMG method is suitable for implementation on adaptive, locally refined grids. Good results are obtained for the HBMG in two dimensions (with uniformly/locally refined meshes), but the efficiency deteriorates in 3-D.
Another important MG-like preconditioning method is the so-called BPX preconditioner (45). It belongs to a class of parallel multilevel preconditioners targeted at the linear systems arising in discretizations of second-order, elliptic PDEs. The main feature of this preconditioner is that it is expressed as a sum of independent operators defined on a sequence of nested subspaces of the finest approximation space. The nested subspaces are associated with a sequence of uniformly or adaptively refined triangulations and are spanned by the standard FEM basis sets associated with them. Given a sequence of nested subspaces $S_{h_1} \subset S_{h_2} \subset \cdots \subset S_{h_L}$ with $S_{h_l} = \mathrm{span}\{\varphi_k^l\}_{k=1}^{N_l}$, the action of the BPX preconditioner on a vector can be represented by

$P^{-1} v = \sum_{l=1}^{L} \sum_{k=1}^{N_l} (v, \varphi_k^l)\, \varphi_k^l$    (31)
The condition number of the preconditioned discrete operator $P^{-1}A$ is at most $O(L^2)$, and this result is independent of the spatial dimension and holds for both quasi-uniform and adaptively refined grids. The preconditioner in Equation (31) also can be represented in the operator form as

$P^{-1} = \sum_{l=1}^{L} R_l Q_l$    (32)

where $R_l$ is an SPD operator $S_{h_l} \to S_{h_l}$ and $Q_l: S_{h_L} \to S_{h_l}$ is the projection operator defined by $(Q_l u, v) = (u, v)$ for $u \in S_{h_L}$ and $v \in S_{h_l}$ [$(\cdot\,,\cdot)$ denotes the $L_2$-inner product]. The cost of applying the preconditioner $P$ will depend on the choice of $R_l$.
Figure 2. The principles of the hierarchical basis MG methods.
The preconditioner in Equation (32) is linked closely to the MG V-cycle. The operator $R_l$ has the role of a smoother; however, in the BPX preconditioner, smoothing at every level is applied to a fine grid residual (which can be done concurrently). In MG, alternately, smoothing of the residual at a given level cannot proceed until the residual on that level is formed using the information from the previous grids. This process also involves an extra computational overhead in the MG algorithm. Parallelization issues associated with the BPX preconditioner will be discussed in the final section.
Many important problems in fundamental areas, such as structural mechanics, fluid mechanics, and electromagnetics, do not belong to the class of scalar, elliptic PDEs. Some problems are given in so-called mixed forms (2) or as systems of PDEs. The latter is the regular case when multiphysics and multiscale processes and phenomena are modeled. When systems of PDEs or PDEs with unknown vector fields are solved, the resulting discrete operators (coefficient matrices) have spectral properties that are not similar to the spectral properties of discrete elliptic, self-adjoint problems, for which MG is particularly suited. Thus, direct application of standard MG as a preconditioner in such complex cases is not effective. In such cases, it is possible to construct block preconditioners, an approach that has proved efficient in many important cases (2). A coefficient matrix is subdivided into blocks that correspond to different PDEs/unknowns in the PDE system (similarly to the unknown-based approach in the AMG for systems of PDEs). If some main diagonal blocks represent discrete second-order, elliptic PDEs (or their perturbations), they are suitable candidates to be approximated by the standard MG/AMG with simple relaxation methods. The action of the inverse of the whole preconditioner is achieved by block back-substitution. Successful block preconditioners have been developed for important problems in structural mechanics (46), fluid mechanics (2,47), and electromagnetics (48). In the following list, a brief review of some of these techniques is given.
1. Mixed formulation of second-order elliptic problems. The boundary value problem is given by

$A^{-1}\vec u + \nabla p = 0, \qquad \nabla \cdot \vec u = f$    (33)

in $V \subset \mathbb{R}^d$, subject to suitable BCs. In Equation (33), $A: V \to \mathbb{R}^{d \times d}$ is a symmetric and uniformly positive definite matrix function, $p$ is the pressure, and $\vec u = -A\nabla p$ is the velocity. Problems of this type arise in modeling of fluid flow through porous media. The primal variable formulation of Equation (33) is $-\nabla \cdot (A \nabla p) = f$, but if $\vec u$ is the quantity of interest, a numerical solution of the mixed form in Equation (33) is preferable. Discretization of Equation (33) using Raviart-Thomas FEM leads to a linear system with an indefinite coefficient matrix

$\begin{pmatrix} M_A & B^t \\ B & 0 \end{pmatrix} \begin{pmatrix} \vec u \\ p \end{pmatrix} = \begin{pmatrix} g \\ f \end{pmatrix}$    (34)

with $(M_A)_{ij} = (A w_i, w_j)$, $i, j = 1, \ldots, m$. For details, see
Ref. 47. The optimal preconditioner for the linear system in Equation (34) is given by

$M = \begin{pmatrix} M_A & 0 \\ 0 & B M_A^{-1} B^t \end{pmatrix}$    (35)

The exact implementation of this preconditioner requires the factorization of the dense Schur complement matrix $S = B M_A^{-1} B^t$. The matrix $S$ can be approximated without significant loss of efficiency by the matrix $S_D = B\,\mathrm{diag}(M_A)^{-1} B^t$, which is a sparse matrix. The action of $S_D^{-1}$ on a vector can be approximated even better by a small number of AMG V-cycles (47).
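A small sketch of this sparse Schur-complement approximation is given below; the random placeholder matrices stand in for the Raviart-Thomas mass matrix $M_A$ and the divergence matrix $B$, and the comparison against the exact (dense) Schur complement is meaningful only for such a tiny example.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    m, k = 40, 15
    # Placeholder "mass matrix": strongly diagonal, symmetric, positive definite.
    M_A = sp.diags(1.0 + rng.random(m)) + 0.05 * sp.random(m, m, density=0.05, random_state=1)
    M_A = ((M_A + M_A.T) * 0.5).tocsr()
    B = sp.random(k, m, density=0.1, random_state=2, format="csr")   # placeholder "divergence"

    # Sparse approximation S_D = B diag(M_A)^{-1} B^T of the Schur complement.
    S_D = (B @ sp.diags(1.0 / M_A.diagonal()) @ B.T).tocsr()

    Bd = B.toarray()
    S = Bd @ np.linalg.inv(M_A.toarray()) @ Bd.T                     # exact dense Schur complement
    print("nnz(S_D):", S_D.nnz)
    print("relative difference:", np.linalg.norm(S_D.toarray() - S) / np.linalg.norm(S))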
2. Lamé's equations of linear elasticity. This is a fundamental problem in structural mechanics. It also occurs in the design of integrated electronic components (microfabrication). The equations by Lamé represent a displacement formulation of linear elasticity:

$\rho \dfrac{\partial^2 \vec u}{\partial t^2} - \mu \Delta \vec u - (\lambda + \mu)\nabla(\nabla \cdot \vec u) = \vec f$    (36)

subject to a suitable combination of Dirichlet and Neumann BCs. Equation (36) is a special case of the nonlinear equations of visco-elasticity. In Equation (36), $\rho$ denotes the density of a continuous material body, $\vec u$ is the displacement of the continuum from the equilibrium position under the combination of external forces $\vec f$, and $\lambda$ and $\mu$ are the Lamé constants (see Ref. 49 for more details). After the discretization of Equation (36) by the FEM, a linear system $A_e u = f$ is obtained, with $A_e \in \mathbb{R}^{nd \times nd}$ and $u, f \in \mathbb{R}^{nd}$, where $n$ is the number of unconstrained nodes and $d$ is the spatial dimension. If the degrees of freedom are enumerated in such a way that the displacements along one Cartesian direction (unknowns that correspond to the same displacement component) are grouped together, the PDE system in Equation (36) consists of $d$ scalar equations and can be preconditioned effectively by a block-diagonal preconditioner. For this problem, effective solvers/preconditioners based on the variable and the point approach in AMG also have been developed. However, the block-diagonal preconditioner developed in Ref. 46 uses classical (unknown-based) AMG. From Equation (36) it can be seen that each scalar PDE represents a perturbation of the standard Laplacian operator. Thus, the effectiveness of MG applied in this context depends on the relative size of the perturbation term. For compressible materials ($\lambda$ away from $\infty$), the block-diagonal preconditioner (46) is very effective. For nearly incompressible cases ($\lambda \to \infty$), an alternative formulation of the linear elasticity problem in Equation (36) is needed (see the Stokes problem).
3. The biharmonic equation. The biharmonic problem $\Delta^2 u = f$ arises in structural and fluid mechanics (2,49). FEM discretization of the original formulation of the biharmonic problem requires the use of Hermitian elements that result in a system with an SPD coefficient matrix, which is not an M-matrix. Consequently, direct application of the standard GMG/AMG preconditioners will not be effective.
An alternative approach is to reformulate the biharmonic problem as a system of two Poisson equations:

$\Delta u = v, \qquad \Delta v = f$    (37)
subject to suitable BCs (2). Discretization of Equation (37) can be done using the standard Lagrangian FEM, producing an indefinite linear system

$\begin{pmatrix} M_B & A^t \\ A & 0 \end{pmatrix} \begin{pmatrix} v \\ u \end{pmatrix} = \begin{pmatrix} f \\ 0 \end{pmatrix}$    (38)

with $(M_B)_{ij} = (\varphi_i, \varphi_j)$, $i, j = 1, \ldots, m$, and $(A)_{ij} = (\nabla \varphi_j, \nabla \varphi_i)$, $i = 1, \ldots, n$, $j = 1, \ldots, m$. In Ref. 50, the application of a modified MG method to the Schur complement system obtained from Equation (38) is discussed. Another nearly optimal preconditioner for the system in Equation (38) can be obtained by decoupling the free and constrained nodes in the first equation of Equation (37). This decoupling introduces a 3 × 3 block splitting of the coefficient matrix, and the preconditioner is given by (see Ref. 51)
$M = \begin{pmatrix} 0 & 0 & A_I^t \\ 0 & M_B & A_B^t \\ A_I & A_B & 0 \end{pmatrix}$    (39)

In Equation (39), the matrix block $A_I \in \mathbb{R}^{n \times n}$ corresponds to a discrete Dirichlet Laplacian operator, and it can be approximated readily by a small fixed number of GMG/AMG V-cycles.
4. The Stokes problem. The Stokes equations represent a system of PDEs that model viscous flow:

$-\Delta \vec u + \nabla p = 0, \qquad \nabla \cdot \vec u = 0$    (40)

subject to a suitable set of BCs (see Ref. 2). The variable $\vec u$
\u
\p = f
\ u
= 0
(43)
InEquation(43), u
. MIJALKOVIC
$\bigcup_{n=1}^{\infty} A_n \in T$.
Finally, the function P maps subsets in T to real numbers in the range zero to one in a countably additive manner. That is,

- $P: T \to [0, 1]$.
- $P(\emptyset) = 0$.
- $P(V) = 1$.
- $P\big(\bigcup_{n=1}^{\infty} A_n\big) = \sum_{n=1}^{\infty} P(A_n)$ for pairwise disjoint subsets $A_1, A_2, \ldots$.
The events in T are called measurable sets because the function P specifies their sizes as probability allotments. You might well wonder about the necessity of the σ-field T. As noted, when V is a finite outcome set, T normally is taken to be the collection of all subsets of V. This maximal collection clearly satisfies the requirements. It contains all singleton outcomes as well as all possible groupings. Moreover, the rule that P must be countably additive forces the probability of any subset to be the sum of the probabilities of its members. This happy circumstance occurs in experiments with dice, coins, and cards, and subsequent sections will investigate some typical probability assignments in those cases.
A countable set is one that can be placed in one-to-one correspondence with the positive integers. The expedient introduced for finite V extends to include outcome spaces that are countable. That is, we can still choose T to be all subsets of the countably infinite V. The probability of a subset remains the sum of the probabilities of its members, although this sum now contains countably many terms. However, the appropriate V for many scientific purposes is the set of real numbers. In the nineteenth century, Georg Cantor proved that the real numbers are uncountable, as is any nonempty interval of real numbers.
When V is an uncountable set, set-theoretic conundrums beyond the scope of this article force T to be smaller than the collection of all subsets. In particular, the countable additivity requirement on the function P cannot always be achieved if T is the collection of all subsets of V. On most occasions associated with engineering or computer science applications, the uncountable V is the real numbers or some restricted span of real numbers. In this case, we take T to be the Borel sets, which is the smallest σ-field containing all open intervals of the form (a, b).
Consider again the example where our outcomes are interarrival times between requests in a database input queue. These outcomes can be any positive real numbers, but our instrumentation cannot measure them in infinite detail. Consequently, our interest normally lies with events of the form (a, b). The probability assigned to this interval should reflect the relative frequency with which the interarrival time falls between a and b. Hence, we do not need all subsets of the positive real numbers among our measurable sets. We need the intervals and the interval combinations achieved through intersections and unions. Because T must be a σ-field, we must therefore take some σ-field containing the intervals and their combinations. By definition, the collection of Borel sets satisfies this condition, and this choice has happy consequences in the later development of random variables.
Although not specified explicitly in the defining constraints, closure under countable intersections is also a property of a σ-field. Moreover, we may interpret countable as either finite or countably infinite. Thus, every σ-field is closed under finite unions and finite intersections. From set-theoretic arguments, we obtain the following properties of the probability assignment function P:

- $P(\bar A) = 1 - P(A)$.
- $P\big(\bigcup_{n=1}^{\infty} A_n\big) \leq \sum_{n=1}^{\infty} P(A_n)$, without regard to disjointness among the $A_n$.
- If $A \subseteq B$, then $P(A) \leq P(B)$.
- If $A_1 \subseteq A_2 \subseteq \ldots$, then $P\big(\bigcup_{n=1}^{\infty} A_n\big) = \lim_n P(A_n)$.
- If $A_1 \supseteq A_2 \supseteq \ldots$, then $P\big(\bigcap_{n=1}^{\infty} A_n\big) = \lim_n P(A_n)$.
The final two entries above are collectively known as the Continuity of Measure Theorem. A more precise continuity of measure result is possible with more refined limits. For a sequence $a_1, a_2, \ldots$ of real numbers and a sequence $A_1, A_2, \ldots$ of elements from T, we define the limit supremum and the limit infimum as follows:

$\limsup_n a_n = \lim_k \sup_{n \geq k} a_n$, $\qquad \limsup_n A_n = \bigcap_{k=1}^{\infty} \bigcup_{n=k}^{\infty} A_n$

$\liminf_n a_n = \lim_k \inf_{n \geq k} a_n$, $\qquad \liminf_n A_n = \bigcup_{k=1}^{\infty} \bigcap_{n=k}^{\infty} A_n$
We adopt the term extended limits for the limit supremum and the limit infimum. Although a sequence of bounded real numbers, such as probabilities, may oscillate indefinitely and therefore fail to achieve a well-defined limiting value, every such sequence nevertheless achieves extended limits. We find that $\liminf a_n \leq \limsup a_n$ in all cases, and if the sequence converges, then $\liminf a_n = \lim a_n = \limsup a_n$. For a sequence of subsets in T, we adopt equality of the extended limits as the definition of convergence. That is, $\lim A_n = A$ if $\liminf A_n = \limsup A_n = A$. When applied to subsets, we find $\liminf A_n \subseteq \limsup A_n$.
Also, for increasing sequences, $A_1 \subseteq A_2 \subseteq \ldots$, or decreasing sequences, $A_1 \supseteq A_2 \supseteq \ldots$, it is evident that the sequences converge to the countable union and the countable intersection, respectively. Consequently, the Continuity of Measure Theorem above is actually a statement about convergent sequences of sets: $P(\lim A_n) = \lim P(A_n)$. However, even if the sequence does not converge, we can derive a relationship among the probabilities: $P(\liminf A_n) \leq \liminf P(A_n) \leq \limsup P(A_n) \leq P(\limsup A_n)$.
We conclude this section with two more advanced results that are useful in analyzing sequences of events. Suppose $A_n$, for $n = 1, 2, \ldots$, is a sequence of events in T for which $\sum_{n=1}^{\infty} P(A_n) < \infty$. The Borel Lemma states that, under these circumstances, $P(\liminf A_n) = P(\limsup A_n) = 0$.
The second result is the Borel-Cantelli Lemma, for which we need the definition of independent subsets. Two subsets A, B in T are independent if $P(A \cap B) = P(A)P(B)$. Now suppose that $A_i$ and $A_j$ are independent for any two distinct subsets in the sequence. Then the lemma asserts that if $\sum_{n=1}^{\infty} P(A_n) = \infty$, then $P(\limsup A_n) = 1$. A more general result, the Kochen-Stone Lemma, provides a lower bound on $P(\limsup A_n)$ under slightly more relaxed conditions.
COMBINATORICS
Suppose a probability space (V, T, P) features a finite outcome set V. If we use the notation |A| to denote the number of elements in a set A, then this condition is $|V| = n$ for some positive integer n. In this case, we take T to be the collection of all subsets of V. We find that $|T| = 2^n$, and a feasible probability assignment allots probability 1/n to each event containing a singleton outcome. The countable additivity constraint then forces each composite event to receive probability equal to the sum of the probabilities of its constituents, which amounts to the size of the event divided by n. In this context, combinatorics refers to methods for calculating the size of V and for determining the number of constituents in a composite event.
In the simplest cases, this computation is mere enumeration. If V contains the six possible outcomes from the roll of a single die, then V = {1, 2, 3, 4, 5, 6}. We observe that n = 6. If an event E is described as "the outcome is odd or it is greater than 4," then we note that outcomes 1, 3, 5, 6 conform to the description, and we calculate P(E) = 4/6. In more complicated circumstances, neither n nor the size of E is so easily available. For example, suppose we receive 5 cards from a shuffled deck of 52 standard playing cards. What is the probability that we receive five cards of the same suit with consecutive numerical values (a straight flush)? How many possible hands exist? How many of those constitute a straight flush?
A systematic approach to such problems considers sequences or subsets obtained by choosing from a common pool of candidates. A further distinction appears when we consider two choice protocols: choosing with replacement and choosing without replacement. Two sequences differ if they differ in any position. For example, the sequence 1, 2, 3 is different from the sequence 2, 1, 3. However, two sets differ if one contains an element that is not in the other. Consequently, the sets {1, 2, 3} and {2, 1, 3} are the same set.
Suppose we are choosing k items from a pool of n without replacement. That is, each choice reduces the size of the pool available for subsequent choices. This constraint forces $k \leq n$. Let $P_{n,k}$ be the number of distinct sequences that might be chosen, and let $C_{n,k}$ denote the number of possible sets. We have
$P_{n,k} = (n)_k = n(n-1)(n-2)\cdots(n-k+1) = \dfrac{n!}{(n-k)!}$

$C_{n,k} = \dbinom{n}{k} = \dfrac{n!}{k!\,(n-k)!}$
Consider again the scenario mentioned above in which we receive five cards from a shuffled deck. We receive one of $(52)_5 = 311875200$ sequences. To determine whether we have received a straight flush, however, we are allowed to reorder the cards in our hand. Consequently, the size of the outcome space is the number of possible sets, rather than the number of sequences. As there are $\binom{52}{5} = 2598960$ such sets, we conclude that the size of the outcome space is $n = 2598960$. Now, how many of these possible hands constitute a straight flush?
For this calculation, it is convenient to introduce another useful counting tool. If we can undertake a choice as a succession of subchoices, then the number of candidate choices is the product of the number available at each stage. A straight flush, for example, results from choosing one of four suits and then one of nine low cards to anchor the ascending numerical values. That is, the first subchoice has candidates clubs, diamonds, hearts, spades. The second subchoice has candidates 2, 3, ..., 10. The number of candidate hands for a straight flush and the corresponding probability of a straight flush are then

$N(\text{straight flush}) = \dbinom{4}{1}\dbinom{9}{1} = 36$

$P(\text{straight flush}) = \dfrac{36}{2598960} = 0.0000138517$
The multiplicative approach that determines the number of straight flush hands amounts to laying out the hands in four columns, one for each suit, and nine rows, one for each low card anchor value. That is, each candidate from the first subchoice admits the same number of subsequent choices, nine in this example. If the number of subsequent subchoices is not uniform, we resort to summing the values. For example, how many hands exhibit either one or two aces? One-ace hands involve a choice of suit for the ace, followed by a choice of any 4 cards from the 48 non-aces. Two-ace hands require a choice of two suits for the aces, followed by a selection of any 3 cards from the 48 non-aces. The computation is

$N(\text{one or two aces}) = \dbinom{4}{1}\dbinom{48}{4} + \dbinom{4}{2}\dbinom{48}{3} = 882096$

$P(\text{one or two aces}) = \dfrac{882096}{2598960} = 0.3394$
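These counts are easy to confirm with Python's math.comb; the snippet below simply re-evaluates the straight-flush and one-or-two-aces calculations above.

    import math

    hands = math.comb(52, 5)
    straight_flush = 4 * 9                            # suit choice times low-card anchor
    one_or_two_aces = (math.comb(4, 1) * math.comb(48, 4)
                       + math.comb(4, 2) * math.comb(48, 3))

    print(hands)                                      # 2598960
    print(straight_flush / hands)                     # about 0.0000138517
    print(one_or_two_aces, one_or_two_aces / hands)   # 882096, about 0.3394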
When the selections are performed with replacement, the resulting sequences or sets may contain duplicate entries. In this case, a set is more accurately described as a multiset, which is a set that admits duplicate members. Moreover, the size of the selection k may be larger than the size of the candidate pool n. If we let $\bar P_{n,k}$ and $\bar C_{n,k}$ denote the number of sequences and multisets, respectively, we obtain

$\bar P_{n,k} = n^k$

$\bar C_{n,k} = \dbinom{n + k - 1}{k}$
These formulas are useful in occupancy problems. For example, suppose we have n bins into which we must distribute k balls. As we place each ball, we choose one of the n bins for it. The chosen bin remains available for subsequent balls, so we are choosing with replacement. A generic outcome is $(n_1, n_2, \ldots, n_k)$, where $n_i$ denotes the bin selected for ball i. There are $n^k$ such outcomes, corresponding to the number of such sequences.
If we collect birthdays from a group of 10 persons, we obtain a sequence $n_1, n_2, \ldots, n_{10}$, in which each entry is an integer in the range 1 to 365. As each such sequence represents one choice from a field of $\bar P_{365,10} = 365^{10}$, we can calculate the probability that there will be at least one repetition among the birthdays by computing the number of such sequences with at least one repetition and dividing by the size of the field. We can construct a sequence with no repetitions by selecting, without replacement, 10 integers from the range 1 to 365. There are $P_{365,10}$ such sequences, and the remaining sequences must all correspond to at least one repeated birthday among the 10 people. The probability of a repeated birthday is then

$P(\text{repeated birthday}) = \dfrac{365^{10} - P_{365,10}}{365^{10}} = 1 - \dfrac{(365)(364)\cdots(356)}{365^{10}} = 0.117$
As we consider larger groups, the probability of a
repeated birthday rises, although many people are sur-
prised by how quickly it becomes larger than 0.5. For
example, if we redo the above calculation with 23 persons,
we obtain 0.5073 for the probability of a repeated birthday.
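The same computation, written as a short Python function of the group size (the incremental product avoids the huge factorials), reproduces both figures quoted above.

    def p_repeat(k, days=365):
        """Probability of at least one repeated birthday in a group of k people."""
        p_distinct = 1.0
        for i in range(k):
            p_distinct *= (days - i) / days
        return 1.0 - p_distinct

    print(round(p_repeat(10), 3))    # 0.117
    print(round(p_repeat(23), 4))    # 0.5073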
Multisets differ from sequences in that a multiset is not concerned with the order of its elements. In the bin-choosing experiment above, a generic multiset outcome is $k_1, k_2, \ldots, k_n$, where $k_i$ counts the number of times bin i was chosen to receive a ball. That is, $k_i$ is the number of occurrences of i in the generic sequence outcome $n_1, n_2, \ldots, n_k$, with a zero entered when a bin does not appear at all. In the birthday example, there are $\bar C_{365,10}$ such day-count vectors, but we would not consider them equally likely outcomes. Doing so would imply that the probability of all 10 birthdays coinciding is the same as the probability of them dispersing across several days, a conclusion that does not accord with experience.
As an example of where the collection of multisets correctly specifies the equally likely outcomes, consider the ways of writing the positive integer k as a sum of n nonnegative integers. A particular sum k_1 + k_2 + ... + k_n = k is called a partition of k into nonnegative components. We can construct such a sum by tossing k ones at n bins. The first bin accumulates summand k_1, which is equal to the number of times that bin is hit by an incoming one. The second bin plays a similar role for the summand k_2, and so forth. There are C_{4,3} = \binom{6}{4} = 15 ways to partition 4 into 3 components:

0+0+4  0+1+3  0+2+2  0+3+1  0+4+0
1+0+3  1+1+2  1+2+1  1+3+0
2+0+2  2+1+1  2+2+0
3+0+1  3+1+0
4+0+0
We can turn the set of partitions into a probability space
by assigning probability 1/15 to each partition. We would
then speak of a random partition as 1 of these 15 equally
likely decompositions.
When the bin-choosing experiment is performed with distinguishable balls, it is possible to observe the outcome (n_1, n_2, ..., n_k), where n_i is the bin chosen for ball i. There are P_{k,n} = n^k such observable vectors. If the balls are not distinguishable, the outcome will not contain enough information for us to know the numbers n_i. After the experiment, we cannot locate ball i, and hence, we cannot specify its bin. In this case, we know only the multiset outcome (k_1, k_2, ..., k_n), where k_i is the number of balls in bin i. There are only C_{k,n} observable vectors of this latter type. In certain physics contexts, probability models with P_{k,n} equally likely outcomes accurately describe experiments with distinguishable particles across a range of energy levels (bins). These systems are said to obey Maxwell-Boltzmann statistics. On the other hand, if the experiment involves indistinguishable particles, the more realistic model uses C_{k,n} outcomes, and the system is said to obey Bose-Einstein statistics.
The discussion above presents only an introduction to
the vast literature of counting methods and their inter-
relationships that is known as combinatorics. For our
purposes, we take these methods as one approach to estab-
lishing a probability space over a finite collection of outcomes.
RANDOM VARIABLES AND THEIR DISTRIBUTIONS

A random variable is a function that maps a probability space into the real numbers ℝ. There is a subtle constraint. Suppose (Ω, 𝒯, P) is a probability space. Then X : Ω → ℝ is a random variable if X^{-1}(B) ∈ 𝒯 for all Borel sets B ⊆ ℝ. This constraint ensures that all events of the form {ω ∈ Ω : a < X(ω) < b} do indeed receive a probability assignment. Such events are typically abbreviated (a < X < b) and are interpreted to mean that the random variable (for the implicit outcome ω) lies in the interval (a, b). The laws of σ-fields then guarantee that related events, those obtained by unions, intersections, and complements from the open intervals, also receive probability assignments. In other words, X constitutes a measurable function from (Ω, 𝒯) to ℝ.

If the probability space is the real numbers, then the identity function is a random variable. However, for any probability space, we can use a random variable to transfer probability to the Borel sets ℬ via the prescription P′(B) = P({ω ∈ Ω : X(ω) ∈ B}) and thereby obtain a new probability space (ℝ, ℬ, P′). Frequently, all subsequent analysis takes place in the real number setting, and the underlying outcome space plays no further role.
For any x ∈ ℝ, the function F_X(x) = P′(X ≤ x) is called the cumulative distribution of random variable X. It is frequently written F(x) when the underlying random variable is clear from context. Distribution functions have the following properties:

• F(x) is monotone nondecreasing.
• lim_{x→−∞} F(x) = 0; lim_{x→+∞} F(x) = 1.
• At each point x, F is continuous from the right and possesses a limit from the left. That is, lim_{y↑x} F(y) ≤ F(x) = lim_{y↓x} F(y).
• The set of discontinuities of F is countable.
Indeed, any function F with these properties is the distribution of some random variable over some probability space. If a nonnegative function f exists such that F(x) = \int_{-\infty}^{x} f(t)\,dt, then f is called the density of X. A random variable that assumes only a discrete set of values t_1, t_2, ... has no density; its distribution is instead summarized by the point probabilities P(X = t_n). The expected value of such a discrete random variable is

E[X] = \sum_{n=1}^{\infty} t_n P(X = t_n)

where t_1, t_2, ... enumerates the discrete values that X may assume. The expected value represents a weighted average across the possible outcomes. The variance of a discrete random variable is

Var[X] = \sum_{n=1}^{\infty} (t_n - E[X])^2 P(X = t_n)

The variance provides a summary measure of the dispersion of the X values about the expected value, with low variance corresponding to a tighter clustering. For a Bernoulli random variable X with parameter p, we have E[X] = p and Var[X] = p(1 − p).
An indicator random variable is a Bernoulli random variable that takes on the value 1 precisely when some other random variable falls in a prescribed set. For example, suppose we have a random variable X that measures the service time (seconds) of a customer with a bank teller. We might be particularly interested in lengthy service times, say those that exceed 120 seconds. The indicator

I_{(120,\infty)} = \begin{cases} 1, & X > 120 \\ 0, & X \le 120 \end{cases}

is a Bernoulli random variable with parameter p = P(X > 120).
Random variables X_1, X_2, ..., X_n are independent if, for any Borel sets B_1, B_2, ..., B_n, the probability that all n random variables lie in their respective sets is the product of the individual occupancy probabilities. That is,

P\left( \bigcap_{i=1}^{n} (X_i \in B_i) \right) = \prod_{i=1}^{n} P(X_i \in B_i)

This definition is a restatement of the concept of independent events introduced earlier; the events are now expressed in terms of the random variables. Because the Borel sets constitute a σ-field, it suffices to check the above condition on Borel sets of the form (X ≤ t). That is, X_1, X_2, ..., X_n are independent if, for any n-tuple of real numbers t_1, t_2, ..., t_n, we have

P\left( \bigcap_{i=1}^{n} (X_i \le t_i) \right) = \prod_{i=1}^{n} P(X_i \le t_i) = \prod_{i=1}^{n} F_{X_i}(t_i)
The sum of n independent Bernoulli random variables, each with parameter p, exhibits a binomial distribution. That is, if X_1, X_2, ..., X_n are Bernoulli with parameter p and Y = \sum_{i=1}^{n} X_i, then

P(Y = i) = \binom{n}{i} p^i (1 - p)^{n-i}

for i = 0, 1, 2, ..., n. This random variable models, for example, the number of heads in n tosses of a coin for which the probability of a head on any given toss is p. For linear combinations of independent random variables, expected values and variances are simple functions of the component values.
E[a_1 X_1 + a_2 X_2 + \cdots] = a_1 E[X_1] + a_2 E[X_2] + \cdots

Var[a_1 X_1 + a_2 X_2 + \cdots] = a_1^2 Var[X_1] + a_2^2 Var[X_2] + \cdots

For the binomial random variable Y above, therefore, we have E[Y] = np and Var[Y] = np(1 − p).
A Poisson random variable X with parameter λ has P(X = k) = e^{-λ} λ^k / k!. This random variable assumes all nonnegative integer values, and it is useful for modeling the number of events occurring in a specified interval when it is plausible to assume that the count is proportional to the interval width in the limit of very small widths. Specifically, the following context gives rise to a Poisson random variable X with parameter λ. Suppose, as time progresses, some random process is generating events. Let X_{t,Δ} count the number of events that occur during the time interval [t, t + Δ]. Now, suppose further that the generating process obeys three assumptions. The first is a homogeneity constraint:

• P(X_{t_1,Δ} = k) = P(X_{t_2,Δ} = k) for all integers k ≥ 0.

That is, the probabilities associated with an interval of width Δ do not depend on the location of the interval. This constraint allows a notational simplification, and we can now speak of X_Δ because the various random variables associated with different anchor positions t all have the same distribution. The remaining assumptions are as follows:

• P(X_Δ = 1) = λΔ + o_1(Δ)
• P(X_Δ > 1) = o_2(Δ)

where the o_i(Δ) denote anonymous functions with the property that o_i(Δ)/Δ → 0 as Δ → 0. Then the assignment P(X = k) = lim_{Δ→0} P(X_Δ = k) produces a Poisson random variable.
This model accurately describes such diverse phenomena as particles emitted in radioactive decay, customer arrivals at an input queue, flaws in magnetic recording media, airline accidents, and spectator coughing during a concert. The expected value and variance are both λ. If we consider a sequence of binomial random variables B_{n,p}, where the parameters n and p are constrained such that n → ∞ and p → 0 in a manner that allows np → λ > 0, then the distributions approach that of a Poisson random variable Y with parameter λ. That is, P(B_{n,p} = k) → P(Y = k) = e^{-λ} λ^k / k!.
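A quick numerical check of this limit is straightforward. The sketch below (our own illustration, not part of the original article) compares binomial and Poisson probabilities for a few values of k with n = 1000 and p chosen so that np = λ = 3.

from math import comb, exp, factorial

n, lam = 1000, 3.0
p = lam / n

for k in range(6):
    binom_pk = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson_pk = exp(-lam) * lam**k / factorial(k)
    print(k, round(binom_pk, 5), round(poisson_pk, 5))   # the two columns nearly agree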
A geometric random variable X with parameter p exhibits P(X = k) = p(1 − p)^k for k = 0, 1, 2, .... It models, for example, the number of tails before the first head in repeated tosses of a coin for which the probability of heads is p. We have E[X] = (1 − p)/p and Var[X] = (1 − p)/p². Suppose, for example, that we have a hash table in which j of the N addresses are unoccupied. If we generate random address probes in search of an unoccupied slot, the probability of success is j/N for each probe. The number of failures prior to the first success then follows a geometric distribution with parameter j/N.
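The hash-probe example can be simulated directly. The following Python sketch (illustrative only; the table size and the fraction of free slots are made-up values) estimates the mean number of failed probes and compares it with the geometric formula (1 − p)/p.

import random

N, j = 1024, 128          # hypothetical table size and number of free addresses
p = j / N                 # success probability of a single random probe

def failed_probes():
    # count failures before the first successful probe
    failures = 0
    while random.random() >= p:
        failures += 1
    return failures

trials = 100_000
estimate = sum(failed_probes() for _ in range(trials)) / trials
print(estimate, (1 - p) / p)    # both should be close to 7.0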
The sum of n independent geometric random variables displays a negative binomial distribution. That is, if X_1, X_2, ..., X_n are all geometric with parameter p, then Y = X_1 + X_2 + ... + X_n is negative binomial with parameters (n, p). We have

P(Y = k) = \binom{n + k - 1}{k} p^n (1 - p)^k

E[Y] = \frac{n(1 - p)}{p}

Var[Y] = \frac{n(1 - p)}{p^2}

where \binom{n + k - 1}{k} is the number of distinct multisets available when choosing k from a field of n with replacement. This random variable models, for example, the number of tails before the nth head in repeated coin tosses, the number of successful flights prior to the nth accident in an airline history, or the number of defective parts chosen (with replacement) from a bin prior to the nth functional one. For the hash table example above, if we are trying to fill n unoccupied slots, the number of unsuccessful probes in the process will follow a negative binomial distribution with parameters n, j/N. In this example, we assume that n is significantly smaller than N, so that insertions do not materially change the probability j/N of success for each address probe.
Moving on to random variables that assume a continuum of values, we describe each by giving its density function. The summation formulas for the expected value and variance become integrals involving this density. That is, if random variable X has density f, then

E[X] = \int_{-\infty}^{\infty} t f(t)\,dt

Var[X] = \int_{-\infty}^{\infty} (t - E[X])^2 f(t)\,dt
In truth, precise work in mathematical probability uses
a generalization of the familiar Riemann integral known
as a measure-theoretic integral. The separate formulas,
summation for discrete random variables and Riemann
integration against a density for continuous random vari-
ables, are then subsumed under a common notation. This
more general integral also enables computations in cases
where a density does not exist. When the measure in
question corresponds to the traditional notion of length
on the real line, the measure-theoretic integral is known
as the Lebesgue integral. In other cases, it corresponds to a notion of length accorded by the probability distribution: P(a < X ≤ b) for real a and b. In most instances of interest in engineering and computer science, the form involving ordinary integration against a density suffices.
The uniform random variable U on [0, 1] is described by the constant density f(t) = 1 for 0 ≤ t ≤ 1. The probability that U falls in a subinterval (a, b) within [0, 1] is simply b − a, the length of that subinterval. We have

E[U] = \int_{0}^{1} t\,dt = \frac{1}{2}

Var[U] = \int_{0}^{1} \left(t - \frac{1}{2}\right)^2 dt = \frac{1}{12}
The uniform distribution is the continuous analog of the
equally likely outcomes discussed in the combinatorics
section above. It assumes that any outcome in the interval
[0, 1] is equally likely to the extent possible under the
constraint that probability must now be assigned to Borel
sets. In this case, all individual outcomes receive zero
probability, but intervals receive probability in proportion
to their lengths. This random variable models situations
such as the final resting position of a spinning pointer,
where no particular location has any apparent advantage.
The most famous continuous random variable is the Gaussian or normal random variable Z_{μ,σ}. It is characterized by two parameters, μ and σ, and has density, expected value, and variance as follows:

f_{Z_{μ,σ}}(t) = \frac{1}{σ\sqrt{2π}} e^{-(t-μ)^2 / 2σ^2}

E[Z_{μ,σ}] = μ

Var[Z_{μ,σ}] = σ^2
The well-known Central Limit Theorem states that the average of a large number of independent observations behaves like a Gaussian random variable. Specifically, if X_1, X_2, ... are independent random variables with identical finite-variance distributions, say E[X_i] = a and Var[X_i] = c^2, then for any t,

\lim_{n \to \infty} P\left( \frac{1}{\sqrt{n c^2}} \sum_{i=1}^{n} (X_i - a) \le t \right) = P(Z_{0,1} \le t)
For example, if we toss a fair coin 100 times, what is the probability that we will see 40 or fewer heads? To use the Central Limit Theorem, we let X_i = 1 if heads occurs on the ith toss and zero otherwise. With this definition, we have E[X_i] = 0.5 and Var[X_i] = 0.25, which yields

P\left( \sum_{i=1}^{100} X_i \le 40 \right) = P\left( \frac{1}{\sqrt{100(0.25)}} \sum_{i=1}^{100} (X_i - 0.5) \le \frac{40 - 100(0.5)}{\sqrt{100(0.25)}} \right) \approx P(Z_{0,1} \le -2) = 0.0228
The last equality was obtained from a tabulation of such
values for the standard normal random variable with
expected value 0 and variance 1.
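The normal approximation can also be checked against a direct Monte Carlo estimate. The short Python sketch below (our own illustration) simulates the 100-toss experiment and evaluates the Gaussian tail with the standard library's NormalDist.

import random
from statistics import NormalDist

trials = 200_000
hits = sum(1 for _ in range(trials)
           if sum(random.random() < 0.5 for _ in range(100)) <= 40)
print(hits / trials)            # lands near 0.028 for the exact coin-toss probability
print(NormalDist().cdf(-2.0))   # 0.0228, the Central Limit approximation used above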
Because it represents a common limiting distribution for an average of independent observations, the Gaussian random variable is heavily used to calculate confidence intervals that describe the chance of correctly estimating a parameter from multiple trials. We will return to this matter in a subsequent section.
The Gamma function is Γ(t) = \int_{0}^{\infty} x^{t-1} e^{-x} dx, defined for t > 0. The Gamma random variable X with parameters (g, λ) (both positive) is described by the density

f(x) = \begin{cases} λ^g x^{g-1} e^{-λx} / Γ(g), & x \ge 0 \\ 0, & x < 0 \end{cases}

It has E[X] = g/λ and Var[X] = g/λ². For certain specific values of g, the random variable is known by other names. If g = 1, the density reduces to f(x) = λ e^{-λx} for x ≥ 0, and X is then called an exponential random variable. The exponential random variable models the interarrival times associated with events such as radioactive decay and customer queues, which were discussed in connection with Poisson random variables above. Specifically, if a Poisson random variable with parameter λT models the number of events in interval T, then an exponential random variable with parameter λ models their interarrival times. Consequently, the exponential random variable features prominently in queueing theory.
Exponential random variables possess a remarkable feature; they are memoryless. To understand this concept, we must first define the notion of conditional probability. We will use the exponential random variable as an example, although the discussion applies equally well to random variables in general. Notationally, we have a probability space (Ω, 𝒯, P) and a random variable X, for which

P\{ω ∈ Ω : X(ω) > t\} = \int_{t}^{\infty} λ e^{-λx} dx = e^{-λt}

for t ≥ 0. Let t_1 be a fixed positive real number, and consider a related probability space (Ω̂, 𝒯̂, P̂), obtained as follows:

Ω̂ = \{ω ∈ Ω : X(ω) > t_1\}

𝒯̂ = \{A ∩ Ω̂ : A ∈ 𝒯\}

P̂(B) = P(B)/P(Ω̂), for all B ∈ 𝒯̂

By restricting its domain, we can consider X to be a random variable on Ω̂. For any ω ∈ Ω̂, we have X(ω) > t_1, but we can legitimately ask the probability, using the new measure P̂, that X(ω) exceeds t_1 by more than t:

P̂(X > t_1 + t) = \frac{P(X > t_1 + t)}{P(X > t_1)} = \frac{e^{-λ(t_1 + t)}}{e^{-λ t_1}} = e^{-λt} = P(X > t)
The probability P̂(B) is known as the conditional probability of B, given Ω̂. From the calculation above, we see that the conditional probability that X exceeds t_1 by t or more, given that X > t_1, is equal to the unconditional probability that X > t. This is the memoryless property. If X is an exponential random variable representing the time between query arrivals to a database input queue, then the probability that 6 microseconds or more elapses before the next arrival is the same as the probability that an additional 6 microseconds or more elapses before the next arrival, given that we have already waited in vain for 10 microseconds.
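The memoryless property is easy to observe empirically. The Python sketch below (our own illustration; the rate and thresholds are arbitrary) compares P(X > 10 + 6 | X > 10) with P(X > 6) for exponential samples.

import random

lam = 0.1                       # arbitrary rate parameter
samples = [random.expovariate(lam) for _ in range(1_000_000)]

p_exceeds_6 = sum(x > 6 for x in samples) / len(samples)
tail = [x for x in samples if x > 10]
p_cond = sum(x > 16 for x in tail) / len(tail)
print(p_exceeds_6, p_cond)      # both are near exp(-0.1 * 6) = 0.5488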
In general, we can renormalize our probability assignments by restricting the outcome space to some particular event, such as the Ω̂ in the example. The more general notation is P(B|A) for the conditional probability of B given A. Also, we normally allow B to be any event in the original 𝒯, with the understanding that only that part of B that intersects A carries nonzero probability under the new measure. The definition requires that the conditioning event A have nonzero probability. In that case,

P(B|A) = \frac{P(B \cap A)}{P(A)}

specifies the revised probabilities for all events B. Note that

P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \cap B)}{P(A \cap B) + P(A \cap \bar{B})} = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|\bar{B})P(\bar{B})}
This formula, a simple form of Bayes' Law, relates the conditional probability of B given A to that of A given B. It finds frequent use in updating probability assignments to reflect new information. Specifically, suppose we know P(B) and therefore P(B̄) = 1 − P(B). Such probabilities are called prior probabilities because they reflect the chances of a B occurrence in the absence of further knowledge about the underlying random process. If the actual outcome remains unknown to us, but we are told that event A has occurred, we may want to update our probability assignment to reflect more accurately the chances that B has also occurred. That is, we are interested in the posterior probability P(B|A). Bayes' Law allows us to compute this new value, provided we also have the reverse conditional probabilities.
For example, suppose a medical test for a specific disease is applied to a large population of persons known to have the disease. In 99% of the cases, the disease is detected. This is a conditional probability. If we let S be the event "person is sick" and + be the event "medical test was positive," we have P(+|S) = 0.99 as an empirical estimate. Applying the test to a population of persons known not to have the disease might reveal P(+|S̄) = 0.01 as a false-alarm rate. Suppose further that the fraction P(S) = 0.001 of the general population is sick with the disease. Now, if you take the test with positive results, what is the chance that you have the disease? That is, what is P(S|+)? Applying Bayes' Law, we have

P(S|+) = \frac{P(+|S)P(S)}{P(+|S)P(S) + P(+|\bar{S})P(\bar{S})} = \frac{0.99(0.001)}{0.99(0.001) + 0.01(0.999)} \approx 0.090

You have only a 9% chance of being sick, despite having scored positive on a test with an apparent 1% error rate. Nevertheless, your chance of being sick has risen from a prior value of 0.001 to a posterior value of about 0.09. This is nearly a hundredfold increase, which is commensurate with the error rate of the test.
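The arithmetic of this posterior update is a one-liner to code. The following sketch (our own illustration) implements the two-event form of Bayes' Law used above.

def posterior(p_pos_given_sick, p_pos_given_well, p_sick):
    # P(sick | positive) by Bayes' Law over the partition {sick, well}
    p_well = 1.0 - p_sick
    numerator = p_pos_given_sick * p_sick
    return numerator / (numerator + p_pos_given_well * p_well)

print(posterior(0.99, 0.01, 0.001))   # about 0.090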
The full form of Bayes' Law uses an arbitrary partition of the outcome space, rather than a simple two-event decomposition, such as sick and not sick. Suppose the event collection {A_i : 1 ≤ i ≤ n} is a partition of the outcome space Ω. That is, the A_i are disjoint, each has nonzero probability, and their union comprises all of Ω. We are interested in which A_i has occurred, given knowledge of another event B. If we know the reverse conditional probabilities, that is, if we know the probability of B given each A_i, then Bayes' Law enables the computation

P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)}
Returning to the Gamma random variable with parameters (g, λ), we can distinguish additional special cases. If g = n, a positive integer, then the corresponding distribution is called an Erlang distribution. It models the time necessary to accumulate n events in a process that follows a Poisson distribution for the number of events in a specified interval. An Erlang distribution, for example, describes the time span covering the next n calls arriving at a telephone exchange.

If g = n/2 for a positive integer n and λ = 1/2, then the corresponding Gamma random variable is called a chi-square random variable. It exhibits the distribution of the sum of n independent squares, Y = \sum_{i=1}^{n} Z_i^2, where each Z_i is a Gaussian random variable with (μ, σ²) = (0, 1). These distributions are useful in computing confidence intervals for statistical estimates.
Gamma distributions are the limiting forms of negative binomial random variables in the same manner that the Poisson distribution is the limit of binomials. That is, suppose we have a sequence C_n of negative binomial random variables, where the parameters of C_n are (m, p_n). As n → ∞, we assume that p_n → 0 in a manner that allows n p_n → λ > 0. Then the limiting distribution of C_n/n is the Gamma (Erlang) distribution with parameters (m, λ). In particular, if m = 1, the C_n are geometric and the limit is exponential.
Figure 1 summarizes the relationships among the ran-
dom variables discussed in this section.
The renewal count arrow from exponential to Poisson refers to the fact that a phenomenon in which the event interarrival time is exponential (λ) will accumulate events in an interval T according to a Poisson distribution with parameter λT. That is, if the sequence X_1, X_2, ... of random variables measures time between successive events, then the random variable

N_T = \max\left\{ k \;\Big|\; \sum_{i=1}^{k} X_i \le T \right\}

is called a renewal count for the sequence. If the X_i are independent exponentials with parameter λ, then N_T has a Poisson distribution with parameter λT.
A similar relationship holds for a sequence G_1 + 1, G_2 + 1, ... of geometric random variables with a common parameter p. The difference is that the observation interval T is now a positive integer. The renewal count N_T then exhibits a binomial distribution with parameters (T, p).
CONVERGENCE MODES
For a sequence of real numbers, there is a single mode
of convergence: A tail of the sequence must enter and
remain within any given neighborhood of the limit. This
property either holds for some (possibly infinite) limiting
value or it does not. Sequences of random variables
exhibit more variety in this respect. There are three
modes of convergence.
A sequence X_1, X_2, ... of random variables converges pointwise to a random variable Y if X_n(ω) → Y(ω) as a sequence of real numbers for every point ω in the underlying probability space. We may also have pointwise convergence on sets smaller than the full probability space. If pointwise convergence occurs on a set of probability one, then we say that the sequence converges almost surely. In this case, we use the notation X_n → Y a.s.
The sequence converges in probability if, for every positive ε, the measure of the misbehaving sets approaches zero. That is, as n → ∞,

P(\{ω : |X_n(ω) - Y(ω)| > ε\}) → 0

If X_n → Y a.s., then it also converges in probability. However, it is possible for a sequence to converge in probability and at the same time have no pointwise limits.
The final convergence mode concerns distribution func-
tions. The sequence converges in distribution if the corre-
sponding cumulative distribution functions of the X
n
converge pointwise to the distribution function of Y at all
points where the cumulative distribution function of Y is
continuous.
The Weak Law of Large Numbers states that the average of a large number of independent, identically distributed random variables tends in probability to a constant, the expected value of the common distribution. That is, if X_1, X_2, ... is an independent sequence with a common distribution such that E[X_n] = μ and Var[X_n] = σ² < ∞, then for every positive ε,

P\left( \left\{ ω : \left| \frac{\sum_{i=1}^{n} X_i}{n} - μ \right| > ε \right\} \right) → 0

as n → ∞.
Suppose, for example, that a random variable T measures the time between query requests arriving at a database server. This random variable is likely to exhibit an exponential distribution, as described in the previous section, with some rate parameter λ. The expected value and variance are 1/λ and 1/λ², respectively. We take n observations of T and label them T_1, T_2, ..., T_n. The weak law suggests that the number \sum_{i=1}^{n} T_i / n will be close to 1/λ. The precise meaning is more subtle. As an exponential random variable can assume any nonnegative value, we can imagine a sequence of observations that are all larger than, say, twice the expected value. In that case, the average would also be much larger than 1/λ. It is then clear that not all sequences of observations will produce averages close to 1/λ. The weak law states that the set of sequences that misbehave in this fashion is not large, when measured in terms of probability.
We envision an infinite sequence of independent database servers, each with its separate network of clients. Our probability space now consists of outcomes of the form ω = (t_1, t_2, ...), which occurs when server 1 waits t_1 seconds for its next arrival, server 2 waits t_2 seconds, and so forth. Any event of the form (t_1 ≤ x_1, t_2 ≤ x_2, ..., t_p ≤ x_p) has probability equal to the product of the factors P(t_i ≤ x_i), which are in turn determined by the common exponential distribution of the T_i. By taking unions, complements, and intersections of events of this type, we arrive at a σ-field that supports the probability measure. The random variables \sum_{i=1}^{n} T_i / n are well defined on this new probability space, and the weak law asserts that, for large n, the set of sequences (t_1, t_2, ...) with misbehaving prefixes (t_1, t_2, ..., t_n) has small probability.
A given sequence can drift into and out of the misbehaving set as n increases. Suppose the average of the first 100 entries is close to 1/λ, but the next 1000 entries are all larger than twice 1/λ. The sequence is then excluded from the misbehaving set at n = 100 but enters that set before n = 1100. Subsequent patterns of good and bad behavior
can migrate the sequence into and out of the exceptional
set. With this additional insight, we can interpret more
precisely the meaning of the weak law.
Suppose 1/λ = 0.4. We can choose ε = 0.04 and let Y_n = \sum_{i=1}^{n} T_i / n. The weak law asserts that P(|Y_n − 0.4| > 0.04) → 0, which is the same as P(0.36 ≤ Y_n ≤ 0.44) → 1.
Although the law does not inform us about the actual size
of n required, it does say that eventually this latter
probability exceeds 0.99. Intuitively, this means that if
we choose a large n, there is a scant 1% chance that our
average will fail to fall close to 0.4. Moreover, as we choose
larger and larger values for n, that chance decreases.
The Strong Law of Large Numbers states that the average converges pointwise to the common expected value, except perhaps on a set of probability zero. Specifically, if X_1, X_2, ... is an independent sequence with a common distribution such that E[X_n] = μ (possibly infinite), then

\frac{\sum_{i=1}^{n} X_i}{n} → μ \quad \text{a.s.}

as n → ∞.
Applied in the above example, the strong law asserts
that essentially all outcome sequences exhibit averages
that draw closer and closer to the expected value as n
increases. The issue of a given sequence forever drifting
into and out of the misbehaving set is placed in a pleasant
perspective. Such sequences must belong to the set with
probability measure zero. This reassurance does not mean
that the exceptional set is empty, because individual out-
comes (t
1
, t
2
,. . .) have zero probability. It does mean that
we can expect, with virtual certainty, that our average of
n observations of the arrival time will draw ever closer
to the expected 1/l.
Although the above convergence results can be obtained with set-theoretic arguments, further progress is greatly facilitated with the concept of characteristic functions, which are essentially Fourier transforms in a probability space setting. For a random variable X, the characteristic function of X is the complex-valued function β_X(u) = E[e^{iuX}]. The exceptional utility of this device follows because there is a one-to-one relationship between characteristic functions and their generating random variables (distributions). For example, X is Gaussian with parameters μ = 0 and σ² = 1 if and only if β_X(u) = e^{-u²/2}. X is Poisson with parameter λ if and only if β_X(u) = exp(λ(e^{iu} − 1)).
If X has a density f(t), the computation of β_X is a common integration: β_X(u) = \int_{-\infty}^{\infty} e^{iut} f(t)\,dt. Conversely, if β_X is absolutely integrable, then X has a density, which can be recovered by an inversion formula. That is, if \int_{-\infty}^{\infty} |β(u)|\,du < ∞, then the density of X is

f_X(t) = \frac{1}{2π} \int_{-\infty}^{\infty} e^{-iut} β(u)\,du
These remarks have parallel versions if X is integer-valued. The calculation of β_X is a sum: β_X(u) = \sum_{n=-\infty}^{\infty} e^{iun} P(X = n). Also, if β_X is periodic with period 2π, then the corresponding X is integer-valued and the point probabilities are recovered with a similar inversion formula:

P(X = n) = \frac{1}{2π} \int_{-π}^{π} e^{-iun} β(u)\,du

In more general cases, the β_X computation requires the measure-theoretic integral referenced earlier, and the recovery of the distribution of X requires more complex operations on β_X. Nevertheless, it is theoretically possible to translate in both directions between distributions and their characteristic functions.
Some useful properties of characteristic functions are as follows:

• (Linear combinations) If Y = aX + b, then β_Y(u) = e^{iub} β_X(au).
• (Independent sums) If Y = X_1 + X_2 + ... + X_n, where the X_i are independent, then β_Y(u) = \prod_{i=1}^{n} β_{X_i}(u).
• (Continuity Theorem) A sequence of random variables X_1, X_2, ... converges in distribution to a random variable X if and only if the corresponding characteristic functions converge pointwise to a function that is continuous at zero; in which case, the limiting function is the characteristic function of X.
• (Moment Theorem) If E[|X|^n] < ∞, then β_X has derivatives through order n and E[X^n] = (−i)^n β_X^{(n)}(0), where β_X^{(n)}(u) is the nth derivative of β_X.
These features allow us to study convergence in distribu-
tion of random variables by investigating the more tract-
able pointwise convergence of their characteristic
functions. In the case of independent, identically distrib-
uted random variables with finite variance, this method
leads quickly to the Central Limit Theorem cited earlier.
For a nonnegative random variable X, the moment generating function φ_X(u) is less difficult to manipulate. Here φ_X(u) = E[e^{-uX}]. For a random variable X that assumes only nonnegative integer values, the probability generating function ρ_X(u) is another appropriate transform. It is defined by ρ_X(u) = \sum_{n=0}^{\infty} P(X = n) u^n. Both moment and probability generating functions admit versions of the moment theorem and the continuity theorem and are therefore useful for studying convergence in the special cases where they apply.
COMPUTER SIMULATIONS
In various programming contexts, particularly with simu-
lations, the need arises to generate samples from some
particular distribution. For example, if we know that
P(X = 1) = 0:4 and P(X = 0) = 0:6, we may want to realize
this random variable as a sequence of numbers x
1
; x
2
; . . ..
This sequence should exhibit the same variability as
would the original X if outcomes were directly observed.
That is, we expect athoroughmixing of ones andzeros, with
about 40% ones. Notice that we can readily achieve this
result if we have a method for generating samples from a
uniformdistributionUon[0, 1]. Inparticular, eachtime we
need a new sample of X, we generate an observation U and
report X = 1 if U _ 0:4 and X = 0 otherwise.
This argument generalizes in various ways, but the gist of the extension is that essentially any random variable can be sampled by first sampling a uniform random variable and then resorting to some calculations on the observed value. Although this reduction simplifies the problem, the necessity remains of simulating observations from a uniform distribution on [0, 1]. Here we encounter two difficulties. First, the computer operates with a finite register length, say 32 bits, which means that the values returned are patterns from the 2^{32} possible arrangements of 32 bits. Second, a computer is a deterministic device.

To circumvent the first problem, we put a binary point at the left end of each such pattern, obtaining 2^{32} evenly spaced numbers in the range [0, 1). The most uniform probability assignment allots probability 1/2^{32} to each such point. Let U be the random variable that operates on this probability space as the identity function. If we calculate P(a < U < b) for subintervals (a, b) that are appreciably wider than 1/2^{32}, we discover that these probabilities are nearly b − a, which is the value required for a true uniform random variable. The second difficulty is overcome by arranging that the values returned on successive calls exhaust, or very nearly exhaust, the full range of patterns before repeating. In this manner, any deterministic behavior is not observable under normal use. Some modern supercomputer computations may involve more than 2^{32} random samples, an escalation that has forced the use of 64-bit registers to maintain the appearance of nondeterminism.
After accepting an approximation based on 2^{32} (or more) closely spaced numbers in [0, 1), we still face the problem of simulating a discrete probability distribution on this finite set. This problem remains an area of active research today. One popular approach is the linear congruential method. We start with a seed sample x_0, which is typically obtained in some nonreproducible manner, such as extracting a 32-bit string from the computer real-time clock. Subsequent samples are obtained with the recurrence x_{n+1} = (a x_n + b) mod c, where the parameters a, b, c are chosen to optimize the criteria of a long period before repetition and a fast computer implementation. For example, c is frequently chosen to be 2^{32} because the (a x_n + b) mod 2^{32} operation involves retaining only the least significant 32 bits of (a x_n + b). Knuth (1) discusses the mathematics involved in choosing these parameters.
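As a concrete sketch of the recurrence (our own illustration; the constants shown are one widely published choice, popularized by Numerical Recipes, and are not taken from this article), a linear congruential generator can be written in a few lines of Python:

class LCG:
    # x_{n+1} = (a*x_n + b) mod c, scaled into [0, 1)
    def __init__(self, seed, a=1664525, b=1013904223, c=2**32):
        self.state, self.a, self.b, self.c = seed, a, b, c

    def rand(self):
        self.state = (self.a * self.state + self.b) % self.c
        return self.state / self.c

gen = LCG(seed=12345)
print([round(gen.rand(), 4) for _ in range(5)])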
On many systems, the resulting generator is called rand(). A program assignment statement, such as x = rand(), places a new sample in the variable x. From this point, we manipulate the returned value to simulate samples from other distributions. As noted, if we wish to sample B, a Bernoulli random variable with parameter p, we continue by setting B = 1 if x ≤ p and B = 0 otherwise. If we need a random variable U_{a,b}, uniform on the interval [a, b], we calculate U_{a,b} = a + (b − a)x.
If the desired distribution has a continuous cumulative distribution function, a general technique, called distribution inversion, provides a simple computation of samples. Suppose X is a random variable for which the cumulative distribution F(t) = P(X ≤ t) is continuous and strictly increasing. The inverse F^{-1}(u) then exists for 0 < u < 1, and it can be shown that the derived random variable Y = F(X) has a uniform distribution on (0, 1). It follows that the distribution of F^{-1}(U) is the same as that of X, where U is the uniform random variable approximated by rand(). To obtain samples from X, we sample U instead and return the values F^{-1}(U).
For example, the exponential random variable X with parameter λ has the cumulative distribution function F(t) = 1 − e^{-λt}, for t ≥ 0, which satisfies the required conditions. The inverse is F^{-1}(u) = −log(1 − u)/λ. If U is uniformly distributed, so is 1 − U. Therefore, the samples obtained from successive −log(rand())/λ values exhibit the desired exponential distribution.
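The inversion recipe translates directly into code. The sketch below (our own illustration) draws exponential samples this way and checks the sample mean against the theoretical value 1/λ.

import math
import random

def sample_exponential(lam):
    # inversion: F^{-1}(u) = -log(1 - u)/lam, with u drawn from rand()
    return -math.log(1.0 - random.random()) / lam

lam = 2.0
samples = [sample_exponential(lam) for _ in range(100_000)]
print(sum(samples) / len(samples), 1 / lam)   # both near 0.5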
A variation is necessary to accommodate discrete random variables, such as those that assume integer values. Suppose we have a random variable X that assumes nonnegative integer values n with probabilities p_n. Because the cumulative distribution now exhibits a discrete jump at each integer, it no longer possesses an inverse. Nevertheless, we can salvage the idea by acquiring a rand() sample, say x, and then summing the p_n until the accumulation reaches or exceeds x. We return the smallest n such that \sum_{i=0}^{n} p_i ≥ x. A moment's reflection will show that this is essentially the method we used to obtain samples from a Bernoulli random variable above.
For certain cases, we can solve for the required n. For example, suppose X is a geometric random variable with parameter p. In this case, p_n = p(1 − p)^n. Therefore, if x is the value obtained from rand(), we find

\min\left\{ n : \sum_{k=0}^{n} p_k \ge x \right\} = \left\lceil \frac{\log(1 - x)}{\log(1 - p)} \right\rceil - 1

and because 1 − x is also uniformly distributed, we may equivalently return ⌊log(rand())/log(1 − p)⌋.
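In code, the closed form avoids the summation loop entirely. A minimal Python sketch (our own illustration):

import math
import random

def sample_geometric(p):
    # number of failures before the first success; P(X = n) = p*(1-p)**n
    return math.floor(math.log(1.0 - random.random()) / math.log(1.0 - p))

p = 0.3
samples = [sample_geometric(p) for _ in range(100_000)]
print(sum(samples) / len(samples), (1 - p) / p)   # both near 2.33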
For more irregular cases, we may need to perform the summation. Suppose we want to sample a Poisson random variable X with parameter λ. In this case, we have p_n = e^{-λ} λ^n / n!, and the following pseudocode illustrates the technique. We exploit the fact that p_0 = e^{-λ} and p_{n+1} = p_n λ/(n + 1).
x = rand();
p = exp(-lambda);
cum = p;
n = 0;
while (x > cum) {
    n = n + 1;
    p = p * lambda / n;      /* p now holds p_n = p_{n-1} * lambda / n */
    cum = cum + p;
}
return n;
Various enhancements are available to reduce the number of iterations necessary to locate the desired n to return. In the above example, we could start the search near n = ⌊λ⌋, because values near this expected value are most frequently returned.
Another method for dealing with irregular discrete distributions is the rejection filter. If we have an algorithm to simulate distribution X, we can, under certain conditions, systematically withhold some of the returns to simulate a related distribution Y. Suppose X assumes nonnegative integer values with probabilities p_0, p_1, ..., and Y assumes the same values but with different probabilities q_0, q_1, .... The required condition is that a positive K exists such that q_n ≤ K p_n for all n. The following pseudocode shows how to reject just the right number of X returns so as to correctly adjust the return distribution to that of Y. Here the routine X() refers to the existing algorithm that returns nonnegative integers according to the X distribution. We also require that the p_n be nonzero.
while (true) {
    n = X();
    x = rand();
    if (x < q[n] / (K * p[n]))   /* accept n with probability q_n / (K * p_n) */
        return n;
}
STATISTICAL INFERENCE
Suppose we have several random variables X, Y, ... of interest. For example, X might be the systolic blood pressure of a person who has taken a certain drug, whereas Y is the blood pressure of an individual who has not taken it. In this case, X and Y are defined on different probability spaces. Each probability space is a collection of persons who either have or have not used the drug in question. X and Y then have distributions in a certain range, say [50, 250], but it is not feasible to measure X or Y at each outcome (person) to determine the detailed distributions. Consequently, we resort to samples. That is, we observe X for various outcomes by measuring blood pressure for a subset of the X population. We call the observations X_1, X_2, ..., X_n. We follow a similar procedure for Y if we
are interested in comparing the two distributions. Here,
we concentrate on samples from a single distribution.
A sample from a random variable X is actually another random variable. Of course, after taking the sample, we observe that it is a specific number, which hardly seems to merit the status of a random variable. However, we can envision that our choice is just one of many parallel observations that deliver a range of results. We can then speak of events such as P(X_1 ≤ t) as they relate to the disparate values obtained across the many parallel experiments as they make their first observations. We refer to the distribution of X as the population distribution and to that of X_n as the nth sample distribution. Of course, P(X_n ≤ t) = P(X ≤ t) for all n and t, but the term sample typically carries the implicit understanding that the various X_n are independent. That is, P(X_1 ≤ t_1, ..., X_n ≤ t_n) = \prod_{i=1}^{n} P(X_i ≤ t_i). In this case, we say that the sample is a random sample.

With a random sample, the X_n are independent, identically distributed random variables. Indeed, each has the same distribution as the underlying population X. In practice, this property is assured by taking precautions to avoid any selection bias during the sampling. In the blood pressure application, for example, we attempt to choose persons in a manner that gives every individual the same chance of being observed.
Armed with a random sample, we now attempt to infer features of the unknown distribution for the population X. Ideally, we want the cumulative distribution F_X(t), which announces the fraction of the population with blood pressures less than or equal to t. Less complete, but still valuable, information lies with certain summary features, such as the expected value and variance of X.
A statistic is simply a function of a sample. Given the sample X_1, X_2, ..., X_n, the new random variables

\bar{X} = \frac{1}{n} \sum_{k=1}^{n} X_k

S^2 = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \bar{X})^2

are statistics known as the sample mean and sample variance, respectively. If the population has E[X] = μ and Var[X] = σ², then E[\bar{X}] = μ and E[S²] = σ². The expected value and variance are called parameters of the population, and a central problem in statistical inference is to estimate such unknown parameters through calculations on samples. At any point we can declare a particular statistic to be an estimator of some parameter. Typically we only do so when the value realized through samples is indeed an accurate estimate.
accurate estimate.
Suppose y is some parameter of a population distribu-
tion X. We say that a statistic Y is an unbiased estimator
of y if E[Y[ = y. We then have that the sample mean and
sample variance are unbiased estimators of the population
mean and variance. The quantity
^
S
2
=
1
n
n
k=1
(X
k
X)
2
is also called the sample variance, but it is a biased esti-
mator of the population variance s
2
. If context is not clear,
we need to refer to the biased or unbiased sample variance.
In particular E[
^
S
2
[ = s
2
(1 1=n), which introduces a bias
of b = s
2
=n. Evidently, the bias decays to zero with in-
creasing sample size n. A sequence of biased estimators
with this property is termed asymptotically unbiased.
A statistic can be a vector-valued quantity. Consequently, the entire sample (X_1, X_2, ..., X_n) is a statistic. For any given t, we can compute the fraction of the sample values that are less than or equal to t. For a given set of t values, these computations produce a sample distribution function:

F_n(t) = \frac{\#\{k : X_k \le t\}}{n}

Here we use #{...} to denote the size of a set. For each t, the Glivenko-Cantelli Theorem states that the F_n(t) constitute an asymptotically unbiased sequence of estimators for F(t) = P(X ≤ t).
Suppose X_1, X_2, ..., X_n is a random sample of the population random variable X, which has E[X] = μ and Var[X] = σ² < ∞. The Central Limit Theorem gives the limiting distribution for \sqrt{n}(\bar{X} − μ)/σ as the standard Gaussian Z_{0,1}. Let us assume (unrealistically) for the moment that we know σ². Then, we can announce \bar{X} as our estimate of μ, and we can provide some credibility for this estimate in the form of a confidence interval. Suppose we want a 90% confidence interval. From tables for the standard Gaussian, we discover that P(|Z_{0,1}| ≤ 1.645) = 0.9. For large n, we have

0.9 = P(|Z_{0,1}| \le 1.645) \approx P\left( \frac{\sqrt{n}\,|\bar{X} - μ|}{σ} \le 1.645 \right) = P\left( |\bar{X} - μ| \le \frac{1.645\,σ}{\sqrt{n}} \right)

If we let d = 1.645σ/\sqrt{n}, we can assert that, for large n, there is a 90% chance that the estimate \bar{X} will lie within d of the population parameter μ. We can further manipulate the equation above to obtain P(\bar{X} − d ≤ μ ≤ \bar{X} + d) ≈ 0.9. The specific interval obtained by substituting the observed value of \bar{X} into the generic form [\bar{X} − d, \bar{X} + d] is known as the (90%) confidence interval.
It must be properly interpreted. The parameter m is an
unknown constant, not a random variable. Consequently,
either m lies in the specied condence interval or it does
not. The random variable is the interval itself, which
changes endpoints when new values of X are observed.
The width of the interval remains constant at d. The
proper interpretation is that 90% of these nondeter-
ministic intervals will bracket the parameter m.
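As a small worked sketch of this procedure (our own illustration; the data are simulated and σ is treated as known, as in the text), the following Python code builds the 90% interval from a sample:

import random
from statistics import NormalDist

mu_true, sigma, n = 5.0, 2.0, 400
sample = [random.gauss(mu_true, sigma) for _ in range(n)]

x_bar = sum(sample) / n
z = NormalDist().inv_cdf(0.95)          # 1.645 for a two-sided 90% interval
d = z * sigma / n**0.5
print(x_bar - d, x_bar + d)             # brackets mu_true in about 90% of runs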
Under more realistic conditions, neither the mean μ nor the variance σ² of the population is known. In this case, we can make further progress if we assume that the individual X_i samples are normal random variables. Various devices, such as composing each X_i as a sum of a subset of the samples, render this assumption more viable. In any case, under this constraint, we can show that n(\bar{X} − μ)²/σ² and (n − 1)S²/σ² are independent random variables with known distributions. These random variables have chi-squared distributions.
A chi-squared random variable with m degrees of freedom is the sum of the squares of m independent standard normal random variables. It is actually a special case of the gamma distributions discussed previously; it occurs when the parameters are g = m/2 and λ = 1/2. If Y_1 is chi-squared with m_1 degrees of freedom and Y_2 is chi-squared with m_2 degrees of freedom, then the ratio m_2 Y_1/(m_1 Y_2) has an F distribution with (m_1, m_2) degrees of freedom. A symmetric random variable is said to follow a t distribution with m_2 degrees of freedom if its square has an F distribution with (1, m_2) degrees of freedom. For a given random variable R and a given value p in the range (0, 1), the point r_p for which P(R ≤ r_p) = p is called the pth percentile of the random variable. Percentiles for F and t distributions are available in tables.
Returning to our sample X_1, X_2, ..., X_n, we find that under the normal inference constraint, the two statistics mentioned above have independent chi-squared distributions with 1 and n − 1 degrees of freedom, respectively. Therefore the quantity \sqrt{n}(\bar{X} − μ)/\sqrt{S^2} has a t distribution with n − 1 degrees of freedom. Given a confidence level, say 90%, we consult a table of percentiles for the t distribution with n − 1 degrees of freedom. We obtain a symmetric interval [−r, r] such that

0.9 = P\left( \frac{\sqrt{n}\,|\bar{X} - μ|}{\sqrt{S^2}} \le r \right) = P\left( |\bar{X} - μ| \le \frac{r\sqrt{S^2}}{\sqrt{n}} \right)

Letting d = r\sqrt{S^2}/\sqrt{n}, we obtain the 90% confidence interval [\bar{X} − d, \bar{X} + d] for our estimate \bar{X} of the population parameter μ. The interpretation of this interval remains as discussed above.
The discussion above is an exceedingly abbreviated introduction to a vast literature on statistical inference. The references below provide a starting point for further study.
FURTHER READING

B. Fristedt and L. Gray, A Modern Approach to Probability Theory. Cambridge, MA: Birkhäuser, 1997.
A. Gut, Probability: A Graduate Course. New York: Springer, 2006.
I. Hacking, The Emergence of Probability. Cambridge, UK: Cambridge University Press, 1975.
I. Hacking, The Taming of Chance. Cambridge, UK: Cambridge University Press, 1990.
J. L. Johnson, Probability and Statistics for Computer Science. New York: Wiley, 2003.
O. Ore, Cardano, the Gambling Scholar. Princeton, NJ: Princeton University Press, 1953.
O. Ore, Pascal and the invention of probability theory, Amer. Math. Monthly, 67: 409-419, 1960.
C. A. Pickover, Computers and the Imagination. St. Martin's Press, 1991.
S. M. Ross, Probability Models for Computer Science. New York: Academic Press, 2002.
H. Royden, Real Analysis, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall, 1988.
BIBLIOGRAPHY

1. D. E. Knuth, The Art of Computer Programming, Vol. 2, 3rd ed. Reading, MA: Addison-Wesley, 1998.
JAMES JOHNSON
Western Washington University
Bellingham, Washington
PROOFS OF CORRECTNESS IN MATHEMATICS
AND INDUSTRY
THE QUALITY PROBLEM
Buying a product from a craftsman requires some care.
For example, in the Stone Age, an arrow, used for hunting
and hence for survival, needed to be inspected for its
sharpness and proper fixation of the stone head to the
wood. Complex products of more modern times cannot be
checked in such a simple way and the idea of warranty
was born: A nonsatisfactory product will be repaired or
replaced, or else you get your money back. This puts the
responsibility for quality on the shoulders of the manufac-
turer, who has to test the product before selling. In con-
temporary IT products, however, testing for proper
functioning in general becomes impossible. If we have an array of 17 × 17 switches in a device, the number of possible positions is 2^{17^2} = 2^{289} ≈ 10^{87}, more than the estimated number of elementary particles in the universe. Modern chips have billions of switches on them, hence, a state space of a size that is truly dwarfing astronomical numbers.
Therefore, in most cases, simple-minded testing is out of
the question because the required time would surpass by
far the lifetime expectancy of the universe. As these chips
are used in strategic applications, like airplanes, medical
equipment, and banking systems, there is a problem with
how to warrant correct functioning.
Therefore, the need for special attention to the quality of complex products is obvious, both from a user's point of view and that of a producer. This concern is not just academic. In 1994 the computational number theorist T. R. Nicely discovered by chance a bug¹ in a widely distributed Pentium chip. After an initial denial, the manufacturer eventually had to publicly announce a recall, replacement, and destruction of the flawed chip with a budgeted cost of US $475 million.
Fortunately, mathematics has found a way to handle within a finite amount of time a supra-astronomical number of cases, in fact, an infinity of them. The notion of proof provides a way to handle all possible cases with certainty. The notion of mathematical induction is one proof method that can deal with an infinity of cases: If a property P is valid for the first natural number 0 (or if you prefer 1) and if validity of P for n implies that for n + 1, then P is valid for all natural numbers. For example, for all n one has

\sum_{k=0}^{n} k^2 = \frac{1}{6} n(n+1)(2n+1). \qquad P(n)
This can be proved by showing it for n = 0, and then showing that if P(n) holds, then also P(n + 1). Indeed P(0) holds: \sum_{k=0}^{0} k^2 = 0. If P(n) holds, then

\sum_{k=0}^{n+1} k^2 = \left( \sum_{k=0}^{n} k^2 \right) + (n+1)^2 = \frac{1}{6} n(n+1)(2n+1) + (n+1)^2 = \frac{1}{6} (n+1)(n+2)(2n+3),

hence P(n + 1). Therefore P(n) holds for all natural numbers n.
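Although the induction proof already covers all natural numbers, it is in the spirit of the verification theme of this article to spot-check the identity mechanically. A minimal Python sketch (our own illustration):

def sum_of_squares(n):
    return sum(k * k for k in range(n + 1))

# check P(n) for the first few hundred natural numbers
assert all(6 * sum_of_squares(n) == n * (n + 1) * (2 * n + 1) for n in range(300))
print("identity holds for n = 0..299")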
Another method to prove statements valid for an infinite number of instances is to use symbolic rewriting: From the usual properties of addition and multiplication over the natural numbers (proved by induction), one can derive equationally that (x + 1)(x − 1) = x² − 1, for all instances of x.
Proofs have been for more than two millennia the essence of mathematics. For more than two decades, proofs have been essential for warranting the quality of complex IT products. Moreover, by the end of the twentieth century, proofs in mathematics had become highly complex. Three results deserve mention: the Four Color Theorem, the Classification of the Finite Simple Groups, and the correctness of the Kepler Conjecture (about optimal packing of equal three-dimensional spheres). Part of the complexity of these proofs is that they rely on large computations by a computer (involving up to a billion cases). A new technology for showing correctness has emerged: automated verification of large proofs.

Two methodological problems arise: (1) How do proofs in mathematics relate to the physical world of processors and other products? (2) How can we be sure that complex proofs are correct? The first question will be addressed in the next section, and the second in the following section. Finally, the technology is predicted to have a major impact on the way mathematics will be done in the future.
SPECIFICATION, DESIGN, AND PROOFS OF CORRECTNESS
The Rationality Square
The ideas in this section come from Ref. 2 and make explicit what is known intuitively by designers of systems that use proofs. The first thing to realize is that if we want quality of a product, then we need to specify what we want as its behavior. Both the product and its (desired) behavior are in reality, whereas the specification is written in some precise language. Then we make a design with the intention to realize it as the intended product. Also the design is a formal (mathematical) object. If one can prove that the designed object satisfies the formal specification, then it is expected that the realization has the desired behavior
¹ It took Dr. Nicely several months to realize that the inconsistency he noted in some of his output was not due to his algorithms, but was caused by the (microcode on the) chip. See Ref. 1 for a description of the mathematics behind the error.
(see Fig. 1). For this it is necessary that the informal (desired) behavior and the specification are close to each other and can be inspected in a clearly understandable way. The same holds for the design and realization. Then the role of proofs is in its place: They do not apply to an object and desired behavior in reality but to mathematical descriptions of these.

In this setup, the specification language should be close enough to the informal specification of the desired behavior. Similarly, the technology of realization should also be reliable. The latter again may depend on tools that are constructed componentwise and realize some design (e.g., silicon compilers that take as input the design of a chip and have as output the instructions to realize them). Hence, the rationality square may have to be used in an earlier phase.

This raises, however, two questions. Proofs should be based on some axioms. Which ones? Moreover, how do we know that provability of a formal (mathematical) property implies that we get what we want? The answers to these questions come together. The proofs are based on some axioms that hold for the objects of which the product is composed. Based on the empirical facts that the axioms hold, the quality of the realized product will follow.
Products as Chinese Boxes of Components

Now we need to enter some of the details of how the languages for the design and specification of the products should look. The intuitive idea is that a complex product consists of components b_1, ..., b_k put together in a specific way, yielding F^{(k)}(b_1, ..., b_k). The superscript (k) indicates the number of arguments that F needs. The components are constructed in a similar way, until one hits the basic components O_0, O_1, ... that no longer are composed. Think of a playing music installation B. It consists of a CD, CD-player, amplifier, boxes, wires, and an electric outlet, all put together in the right way. So

B = F^{(6)}(CD, CD-player, amplifier, boxes, wires, outlet),

where F^{(6)} is the action that makes the right connections. Similarly the amplifier and other components can be described as a composition of their parts. A convenient way to depict this idea in general is the so-called Chinese box (see Fig. 2). This is a box with a lid. After opening the lid, one finds a (finite) set of neatly arranged boxes that either are open and contain a basic object or are again other Chinese boxes (with lid). Eventually one will find something in a decreasing chain of boxes. This corresponds to the component-wise construction of anything, in particular of hardware, but also of software². It is easy to construct a
grammar for expressions denoting these Chinese boxes. The basic objects are denoted by o_0, o_1, .... Then there are constructors that turn expressions into new expressions. Each constructor has an arity that indicates how many arguments it has. There may be unary, binary, ternary, and so on constructors. Such constructors are denoted by

f^{(k)}_0, f^{(k)}_1, ...,

where k denotes the arity of the constructor. If b_1, ..., b_k are expressions and f^{(k)}_i is a constructor of arity k, then

f^{(k)}_i(b_1, ..., b_k)

is an expression. A precise grammar for such expressions is as follows.
Definition.

1. Consider the following alphabet:

Σ = \{ o_i \mid i \in \mathbb{N} \} \cup \{ f^{(k)}_i \mid i, k \in \mathbb{N} \} \cup \{ , \} \cup \{ (, ) \}

2. Expressions S form the smallest set of words over Σ satisfying

o_i \in S;
b_1, ..., b_k \in S \implies f^{(k)}_i(b_1, ..., b_k) \in S.

An example of a fully specified expression is

f^{(2)}_1(o_0, f^{(3)}_3(o_1, o_2, o_0)).
Figure 1. Wupper's rationality square, relating specification and design (the formal side) to behavior and product (the real side) via requirement, proof, realization, and warranty.
² In order for the sketched design method to work well for software, it is preferable to have declarative software, i.e., in the functional or logic programming style, in particular without side effects.
Figure 2. Partially opened Chinese box.
The partially opened Chinese box in Fig. 2 can be denoted by

f^{(5)}_1(b_1, b_2, o_1, b_4, b_5),

where

b_4 = f^{(6)}_2(o_2, b_{4,2}, b_{4,3}, b_{4,4}, b_{4,5}, b_{4,6}),
b_{4,4} = f^{(12)}_4(b_{4,4,1}, o_3, b_{4,4,3}, o_4, b_{4,4,5}, b_{4,4,6}, o_5, b_{4,4,8}, o_6, o_7, o_8, o_9),

and the other b_k still have to be specified.
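In code, such component expressions are naturally represented as a small tree datatype. The sketch below (our own illustration; the class names and the sample design are hypothetical, not from the article) shows one way to build and print a design such as f^{(2)}_1(o_0, f^{(3)}_3(o_1, o_2, o_0)).

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Basic:
    index: int                      # the basic object o_i

@dataclass(frozen=True)
class Compound:
    constructor: int                # the constructor index i of f^{(k)}_i
    parts: Tuple[object, ...]       # k subdesigns; the arity k is len(parts)

def show(design):
    if isinstance(design, Basic):
        return f"o{design.index}"
    args = ", ".join(show(p) for p in design.parts)
    return f"f{design.constructor}^({len(design.parts)})({args})"

example = Compound(1, (Basic(0), Compound(3, (Basic(1), Basic(2), Basic(0)))))
print(show(example))    # f1^(2)(o0, f3^(3)(o1, o2, o0))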
Definition. A design is an expression b ∈ E.
Specification and Correctness of Design
Following the rationality square, one now can explain the role of mathematical proofs in industrial design.
Some mathematical language is needed to state in a precise way the requirements of the products. We suppose that we have such a specification language L, in which the expressions in E are terms. We will not enter into the details of such a language, but we will mention that for IT products, it often is convenient to be able to express relationships between the states before and after the execution of a command or to express temporal relationships. Temporal statements include "eventually the machine halts" or "there will always be a later moment in which the system receives input." See Refs. 3-6 for possible specification languages, notably for reactive systems, and Ref. 7 for a general introduction to the syntax and semantics of logical languages used in computer science.

Definition. A specification is a unary formula [3] S(·) in L.
Suppose we have the specification S and a candidate
design b as given. The task is to prove in a mathematical
way S(b), i.e., that S holds of b. We did not yet discuss any
axioms, or a way to warrant that the proved property is
relevant. For this we need the following.
Definition. A valid interpretation for L consists of the following.
1. For basic component expressions o, there is an interpretation O in the reality of products.
2. For constructors f^(k), there is a way to put together k products p_1, ..., p_k to form F^(k)(p_1, ..., p_k).
3. By (1) and (2), all designs have a realization. For example, the design f_1^(2)(o_0, f_3^(3)(o_1, o_2, o_0)) is interpreted as F_1^(2)(O_0, F_3^(3)(O_1, O_2, O_0)).
4. There are axioms of the form

    P(c),    ∀x_1 ... x_k [ Q(x_1, ..., x_k) ⇒ R(f^(k)(x_1, ..., x_k)) ].

   Here P, Q, and R are formulas (formal statements) about designs: P and R about one design and Q about k designs.
5. The formulas of L have a physical interpretation.
6. By the laws of physics, it is known that the interpretation given by (5) of the axioms holds for the interpretation described in the basic components and constructors. The soundness of logic then implies that statements proved from the axioms will also hold after interpretation.
This all may sound a bit complex, but the idea is simple
and can be found in any book on predicate logic and its
semantics (see Refs. 7 and 8). Proving starts from the
axioms using logical steps; validity of the axioms and
soundness of logic imply that the proved formulas are also valid.
The industrial task of constructing a product with a desired behavior can be fulfilled as follows.
Design Method (I)
1. Find a language L with a valid interpretation.
2. Formulate a specification S, such that the desired
behavior becomes the interpretation of S.
3. Construct an expression b, intended to solve the task.
4. Prove S(b) from the axioms of the interpretation men-
tioned in (1).
5. The realization of b is the required product.
Of course the last step of realizing designs may be non-
trivial. For example, transforming a chip design to an
actual chip is an industry by itself. But that is not the
concern now. Moreover, such a realization process can be
performed by a tool that is the outcome of a similar specification-design-proof procedure.
The needed proofs have to be given from the axioms in the interpretation. Design method I builds up products from scratch. In order not to reinvent the wheel all the time, one
can base new products on previously designed ones.
Design Method (II). Suppose one wants to construct b
satisfying S.
1. Find subspecifications S_1, ..., S_k and a constructor f^(k) such that

    S_1(x_1) & ... & S_k(x_k)  ⇒  S(f^(k)(x_1, ..., x_k)).

2. Find (off-the-shelf) designs b_1, ..., b_k such that for 1 ≤ i ≤ k, one has

    S_i(b_i).

3. Then the design b = f^(k)(b_1, ..., b_k) solves the task.

Again this is done in a context of a language L with a valid interpretation, and the proofs are from the axioms in the interpretation.
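As a loose illustration of Design Method II (a toy example of our own, not from the text), one can think of specifications as predicates on realized products and of the composition step as a lemma proved once and for all. Every name below (serial, latency_ms, the bounds) is hypothetical.

```python
def serial(*stages):
    """Constructor F^(k): realize the stages one after another."""
    return {"latency_ms": sum(s["latency_ms"] for s in stages)}

def at_most(bound):
    """Specification S(x): 'the product answers within `bound` ms'."""
    return lambda product: product["latency_ms"] <= bound

S1, S2 = at_most(30), at_most(50)      # subspecifications for the two components
S = at_most(80)                        # S(serial(x1, x2)) follows from S1(x1) & S2(x2),
                                       # a composition lemma proved once, on paper

b1 = {"latency_ms": 25}                # off-the-shelf design satisfying S1
b2 = {"latency_ms": 42}                # off-the-shelf design satisfying S2

assert S1(b1) and S2(b2)
assert S(serial(b1, b2))               # the composed design solves the task
```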
[3] Better: a formula S = S(x) = S(·) with one free variable x in S.
After having explained proofs of correctness, the correctness of proofs becomes an issue. In an actual nontrivial industrial design (a software system controlling driverless metro trains in Paris), one needed to prove about 25,000 propositions in order to get reliability. These proofs were provided by a theorem prover. Derivation rules were added to enhance the proving power of the system. It turned out that if no care was taken, 2% to 5% of these added derivation rules were flawed and led to incorrect statements; see Ref. 9. The next section deals with the problem of getting proofs right.
CORRECTNESS OF PROOFS
Methodology
Both in computer science and in mathematics, proofs can become large. In computer science, this is the case because the proofs that products satisfy certain specifications, as explained earlier, may depend on a large number of cases that need to be analyzed. In mathematics, large proofs occur as well, in this case because of the depth of the subject. The example of the Four Color Theorem, in which billions of cases need to be checked, is well known. Then there is the proof of the classification theorem for simple finite groups, needing thousands of pages (in the usual style of informal rigor). That there are long proofs of short statements is not an accident, but a consequence of a famous undecidability result.
Theorem (Turing). Provability in predicate logic is undecidable.

Proof. See, for example, Ref. 10. ∎
Corollary. For predicate logic, there is a number n and a theorem of length n with smallest proof of length at least n^(n!).

Proof. Suppose, on the contrary, that for every n, every theorem of length at least n has a proof of length < n^(n!). Then checking all possible proofs of such length provides a decision method for theoremhood, contradicting the undecidability result. ∎
Of course this does not imply that there are interesting
theorems with essentially long proofs.
The question now arises of how one can verify long proofs and large numbers of shorter ones. This question is of importance both for pure mathematics and for the industrial applications mentioned before.
The answer is that the state of the foundations of mathematics is such that proofs can be written in full detail, making it possible for a computer to check their correctness. Currently, it still requires considerable effort to make such formalizations of proofs, but there is good hope that in the future this will become easier. Anyway, industrial design, as explained earlier, already has proved the viability and value of formal proofs. For example, the Itanium, a successor of the Pentium chip, has a provably correct arithmetical unit; see Ref. 11.
Still one may wonder how one can assure the correctness of mathematical proofs via machine verification, if such proofs need to assure the correctness of machines. It seems that there is here a vicious circle of the chicken-and-the-egg type. The principal founder of machine verification of formalized proofs is the Dutch mathematician N. G. de Bruijn [4]; see Ref. 13. He emphasized the following criterion for reliable automated proof-checkers: Their programs must be small, so small that a human can (easily) verify the code by hand. In the next subsection, we will explain why it is possible to satisfy this so-called de Bruijn criterion.
Foundations of Mathematics
The reason that fully formalized proofs are possible is that for all mathematical activities, there is a solid foundation that has been laid in a precise formal system. The reason that automated proof-checkers exist that satisfy the de Bruijn criterion is that these formal systems are simple enough, allowing a logician to write them down from memory in a couple of pages.
Mathematics is created by three mental activities: structuring, computing, and reasoning. It is an art and craftsmanship with a power, precision, and certainty that is unequalled elsewhere in life [5]. The three activities, respectively, provide definitions and structures, algorithms and computations, and proofs and theorems. These activities are taken as a subject of study by themselves, yielding ontology (consisting either of set, type, or category theory), computability theory, and logic.
During the history of mathematics, these activities enjoyed attention in different degrees. Mathematics started with the structures of the numbers and planar geometry. Babylonian, Chinese, and Egyptian mathematics was mainly occupied with computing. In ancient Greek mathematics, reasoning was introduced. These two activities came together in the work of Archimedes, al-Khwarizmi, and Newton. For a long time, only occasional extensions of the number systems were all that was done as structuring activity. The art of defining more and more structures started in the nineteenth century with the introduction of groups by Galois and non-Euclidean spaces by Lobachevsky and Bolyai. Then mathematics flourished as never before.
[4] McCarthy described machine proof-checking some years earlier (see Ref. 12), but did not come up with a formal system that had a sufficiently powerful and convenient implementation.
[5] From: The Man Without Qualities, R. Musil, Rowohlt.
[6] Formerly called recursion theory.
Activity     | Tools                | Results    | Meta study
Structuring  | Axioms, definitions  | Structures | Ontology
Computing    | Algorithms           | Answers    | Computability [6]
Reasoning    | Proofs               | Theorems   | Logic

Figure 3. Mathematical activity: tools, results, and meta study.
Logic. The quest for finding a foundation for the three activities started with Aristotle. This search for foundation does not imply that one was uncertain how to prove theorems. Plato had already emphasized that any human being of normal intelligence had the capacity to reason that was required for mathematics. What Aristotle wanted was a survey and an understanding of that capacity. He started the quest for logic. At the same time, Aristotle introduced the synthetic way of introducing new structures: the axiomatic method. Mathematics consists of concepts and of valid statements. Concepts can be defined from other concepts. Valid statements can be proved from other such statements. To prevent an infinite regress, one had to start somewhere. For concepts, one starts with the primitive notions, and for valid statements, with the axioms. Not long after this description, Euclid described geometry using the axiomatic method in a way that was only improved by Hilbert, more than 2000 years later. Also, Hilbert gave the right view on the axiomatic method: The axioms form an implicit definition of the primitive notions.
Frege completed the quest of Aristotle by giving a precise description of predicate logic. Gödel proved that his system was complete, i.e., sufficiently strong to derive all valid statements within a given axiomatic system. Brouwer and Heyting refined predicate logic into the so-called intuitionistic version. In their system, one can make a distinction between a weak existence (there exists a solution, but it is not clear how to find it) and a constructive one (there exists a solution, and from the proof of this fact one can construct it) (see Ref. 14).
Ontology. An early contribution to ontology came from Descartes, who introduced what is now called Cartesian products (pairs, or more generally tuples, of entities), thereby relating geometrical structures to arithmetical (in the sense of algebraic) ones. When, in the nineteenth century, there was a need for systematic ontology, Cantor introduced set theory, in which sets are the fundamental building blocks of mathematics. His system turned out to be inconsistent, but Zermelo and Fraenkel removed the inconsistency and improved the theory so that it could act as an ontological foundation for large parts of mathematics (see Ref. 15).
Computability. As soon as the set of consequences of an axiom system had become a precise mathematical object, results about this collection started to appear. From the work of Gödel, it followed that the axioms of arithmetic are essentially incomplete (for any consistent extension of arithmetic, there is an independent statement A that is neither provable nor refutable). An important part of the reasoning of Gödel was that the notion "p is a proof of A" is, after coding, a computable relation. Turing showed that predicate logic is undecidable (it cannot be predicted by machine whether a given statement can be derived or not). To prove undecidability results, the notion of computation needed to be formalized. To this end, Church came with a system of lambda calculus (see Ref. 16), later leading to the notion of functional programming with languages such as Lisp, ML, and Haskell. Turing came with the notion of the Turing machine, later leading to imperative programming with languages such as Fortran and C, and showed that it gave the same notion of computability as Church's. If we assume the so-called Church-Turing thesis that humans and machines can compute the same class of mathematical functions, something that most logicians and computer scientists are willing to do, then it follows that provability in predicate logic is also undecidable by humans.
Mechanical Proof Verification
As soon as logic was fully described, one started to formalize mathematics. In this endeavor, Frege was unfortunate enough to base mathematics on the inconsistent version of Cantorian set theory. Then Russell and Whitehead came with an alternative ontology, type theory, and started to formalize very elementary parts of mathematics. In type theory, which currently exists in various forms, functions are the basic elements of mathematics, and the types form a way to classify these. The formal development of mathematics, initiated by Russell and Whitehead, lay at the basis of the theoretical results of Gödel and Turing. On the other hand, for practical applications, the formal proofs become so elaborate that it is almost undoable for a human to produce them, let alone to check that they are correct.
It was realized by J. McCarthy and independently by N. G. de Bruijn that this verification should not be done by humans but by machines. The formal systems describing logic, ontology, and computability have an amazingly small number of axioms and rules. This makes it possible to construct relatively small mathematical assistants. These computer systems help the mathematician to verify whether the definitions and proofs provided by the human are well founded and correct.
Based on an extended form of type theory, de Bruijn introduced the system AUTOMATH (see Ref. 17), in which this idea was first realized, although somewhat painfully, because of the level of detail in which the proofs needed to be presented. Nevertheless, proof-checking by mathematical assistants based on type theory is feasible and promising. For some modern versions of type theory and assistants based on these, see Refs. 17-21.
Soon after the introduction of AUTOMATH, other mathematical assistants were developed, based on different foundational systems. There is the system MIZAR, based on set theory; the system HOL(-light), based on higher order logic; and ACL2, based on the computational model primitive recursive arithmetic. See Ref. 22 for an introduction and references and Ref. 23 for resulting differences of views in the philosophy of mathematics. To obtain a feel for the different styles of formalization, see Ref. 24.
In Ref. 25, an impressive full development of the Four Color Theorem is described. Tom Hales of the University of Pittsburgh, assisted by a group of computer scientists specializing in formalized proof verification, is well on his way to verifying his proof of the Kepler conjecture (26); see Ref. 27. The Annals of Mathematics published that proof and considered adding (but finally did not do so) a proviso that the referees became exhausted (after 5 years) from checking all of the details by hand; therefore, the full correctness depends on a (perhaps not so reliable) computer computation. If Hales and his group succeed in formalizing and verifying the entire proof, then that will be of a reliability higher than most mathematical proofs, one third of which is estimated to contain real errors, not just typos.[7]
The possibility of formalizing mathematics is not in contradiction with Gödel's theorem, which only states the limitations of the axiomatic method, informal or formal alike. The proof of Gödel's incompleteness theorem does in fact heavily rely on the fact that proof-checking is decidable, and it uses this by reflecting over the notion of provability (the Gödel sentence states: "This sentence is not provable").
One particular technology to verify that statements are valid is the use of model checking. In IT applications, the request "statement A can be proved from assumptions Γ (the situation)" often boils down to "A is valid in a model M = M_Γ depending on Γ." (In logical notation,

    Γ ⊢ A  ⇔  M_Γ ⊨ A.

This is so because of the completeness theorem of logic and because the IT situation is related to models of digital hardware that are finite by their nature.) Now, despite the usual huge size of the model, using some cleverness, the validity in several models in some industrially relevant cases is decidable within a feasible amount of time. One of these methods uses the so-called binary decision diagrams (BDDs). Another ingredient is that universal properties are checked via some rewriting rules, like (x + 1)(x − 1) = x² − 1.
For an introduction to model checkers, see Ref. 28. For successful applications, see Ref. 29. The method of model checking is often somewhat ad hoc, but nevertheless important. Using automated abstraction that works in many cases (see Refs. 30 and 31), the method becomes more streamlined.
SCALING-UP THROUGH REFLECTION
As to the question of whether fully formalized proofs are practically possible, opinions have been divided. Indeed, it seems too much work to work out intuitive steps in full detail. Because of industrial pressure, however, full developments have been given for the correctness of hardware and frequently used protocols. Formalizations of substantial parts of mathematics have been lagging behind.
There is a method that helps in tackling larger proofs. Suppose we want to prove statement A. Then it helps if we can write A ≡ B(f(t)), where t belongs to some collection X of objects, and we also can see that the truth of this is independent of t; i.e., one has a proof of ∀x ∈ X. B(f(x)). Then B(f(t)), hence A.
An easy example of this was conveyed to me by A. Mostowski in 1968. Consider the following formula as a proof obligation in propositional logic:

    A = p ↔ (p ↔ (p ↔ (p ↔ (p ↔ (p ↔ (p ↔ (p ↔ (p ↔ (p ↔ (p ↔ p)))))))))).

Then A = B(12), with B(1) = p and B(n+1) = (p ↔ B(n)). By induction on n, one can show that for all natural numbers n ≥ 1, B(2n) is provable. Therefore, B(12) is provable and hence so is A, because 2 · 6 = 12. A direct proof from the axioms of propositional logic would be long. Much more sophisticated
examples exist, but this is the essence of the method of reflection. It needs some form of computational reasoning inside proofs. Therefore, the modern mathematical assistants contain a model of computation for which equalities like 2 · 6 = 12 and much more complex ones become provable. There are two ways to do this. One possibility is that
able. There are two ways to do this. One possibility is that
there is a deduction rule of the form
A(s)
A(t)
s H
R
t:
This so-called Poincaré Principle should be interpreted as follows: From the assumption A(s) and the side condition that s computationally reduces in several steps to t according to the rewrite system R, it follows that A(t). The alternative is that the transition from A(s) to A(t) is only allowed if s = t has been proved first. These two ways of dealing with proving computational statements can be compared with the styles of, respectively, functional and logic programming. In the first style, one obtains proofs that can be recorded as proof-objects. In the second style, these full proofs become too large to record as one object, because computations may take giga steps. Nevertheless the proof exists, but it appears line by line over time, and one speaks about an ephemeral proof-object.
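As a hedged sketch of the Mostowski example above (our own illustration, not from the text), the computational part of the argument, namely that B(2n) evaluates to true under every assignment of p, can be delegated to a small program:

```python
def B(n, p):
    """Evaluate B(n) at the truth value p, with B(1) = p and B(n+1) = (p <-> B(n))."""
    value = p
    for _ in range(n - 1):
        value = (p == value)          # '==' on booleans is the biconditional
    return value

def tautology(n):
    """True when B(n) holds under both assignments of p."""
    return all(B(n, p) for p in (True, False))

assert all(tautology(2 * k) for k in range(1, 7))   # B(2), B(4), ..., B(12)
assert tautology(12)                                # hence the proof obligation A
```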
In the technology of proof verification, general state-
ments are about mathematical objects and algorithms,
proofs show the correctness of statements and computa-
tions, and computations are dealing with objects and
proofs.
RESULTS
The state-of-the-art of computer-verified proofs is as follows. To formalize one page of informal mathematics, one needs four pages in a fully formalized style, and it takes about five working days to produce these four pages (see
Ref. 22). It is expected that both numbers will go down.
Several nontrivial statements have been formalized, like the fundamental theorem of algebra (also in a constructive fashion; it states that every non-constant polynomial over the complex numbers has a root), the prime number theorem (giving an asymptotic estimation of the number of primes below a given number), and the Jordan curve theorem (every closed curve divides the plane into two regions that cannot be reached without crossing this curve; on the torus surface, this is not true). One of the great success stories is the full formalization of the Four Color Theorem by Gonthier (see Ref. 25). The original proof of
[7] It is interesting to note that, although informal mathematics often contains bugs, the intuition of mathematicians is strong enough that most of these bugs usually can be repaired.
this result could not be completely trusted, as a large number of cases needed to be examined by computer. Gonthier's proof still needs a computer-aided computation, but all steps have been formally verified by an assistant satisfying the de Bruijn principle.
BIBLIOGRAPHY
1. A. Edelman, The mathematics of the Pentium division bug, SIAM Review, 37: 54-67, 1997.
2. H. Wupper, Design as the discovery of a mathematical theorem: What designers should know about the art of mathematics, in Ertas et al. (eds.), Proc. Third Biennial World Conf. on Integrated Design and Process Technology (IDPT), 1998, pp. 86-94; J. Integrated Des. Process Sci., 4 (2): 1-13, 2000.
3. Z. Manna and A. Pnueli, The Temporal Logic of Reactive and Concurrent Systems: Specification, New York: Springer, 1992.
4. K. R. Apt and E.-R. Olderog, Verification of Sequential and Concurrent Programs, Texts and Monographs in Computer Science, 2nd ed., New York: Springer-Verlag, 1997.
5. C. A. R. Hoare and H. Jifeng, Unifying Theories of Programming, Englewood Cliffs, NJ: Prentice Hall, 1998.
6. J. A. Bergstra, A. Ponse, and S. A. Smolka, eds., Handbook of Process Algebra, Amsterdam: North-Holland Publishing Co., 2001.
7. M. Ben-Ari, Mathematical Logic for Computer Science, New York: Springer, 2001.
8. W. Hodges, A Shorter Model Theory, Cambridge, U.K.: Cambridge University Press, 1997.
9. J.-R. Abrial, On B, in D. Bert (ed.), B'98: Recent Advances in the Development and Use of the B Method: Second International B Conference, Montpellier, Vol. 1393 of LNCS, Berlin: Springer, 1998, pp. 1-8.
10. M. Davis, ed., The Undecidable, Mineola, NY: Dover Publications Inc., 2004.
11. B. Greer, J. Harrison, G. Henry, W. Li, and P. Tang, Scientific computing on the Itanium processor, Scientific Prog., 10 (4): 329-337, 2002.
12. J. McCarthy, Computer programs for checking the correctness of mathematical proofs, in Proc. of a Symposium in Pure Mathematics, Vol. V, Providence, RI, 1962, pp. 219-227.
13. N. G. de Bruijn, The mathematical language AUTOMATH, its usage, and some of its extensions, in Symposium on Automatic Demonstration, Versailles, 1968, Lecture Notes in Mathematics, 125: 29-61, Berlin: Springer, 1970.
14. D. van Dalen, Logic and Structure, Universitext, 4th ed., Berlin: Springer-Verlag, 2004.
15. P. R. Halmos, Naive Set Theory, New York: Springer-Verlag, 1974.
16. H. P. Barendregt, Lambda calculi with types, in Handbook of Logic in Computer Science, Vol. 2, New York: Oxford Univ. Press, 1992, pp. 117-309.
17. R. P. Nederpelt, J. H. Geuvers, and R. C. de Vrijer, Twenty-five years of Automath research, in Selected Papers on Automath, Vol. 133 of Stud. Logic Found. Math., Amsterdam: North-Holland, 1994, pp. 3-54.
18. P. Martin-Löf, Intuitionistic Type Theory, Vol. 1 of Studies in Proof Theory, Lecture Notes, Naples, Italy: Bibliopolis, 1984.
19. R. L. Constable, The structure of Nuprl's type theory, in Logic of Computation (Marktoberdorf, 1995), Vol. 157 of NATO Adv. Sci. Inst. Ser. F Comput. Systems Sci., Berlin: Springer, 1997, pp. 123-155.
20. H. P. Barendregt and H. Geuvers, Proof-assistants using dependent type systems, in A. Robinson and A. Voronkov (eds.), Handbook of Automated Reasoning, Elsevier Science Publishers B.V., 2001, pp. 1149-1238.
21. Y. Bertot and P. Castéran, Coq'Art: The Calculus of Inductive Constructions, Texts in Theoretical Computer Science, Berlin: Springer, 2004.
22. H. P. Barendregt and F. Wiedijk, The challenge of computer mathematics, Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci., 363 (1835): 2351-2375, 2005.
23. H. P. Barendregt, Foundations of mathematics from the perspective of computer verification, in Mathematics, Computer Science, Logic: A Never Ending Story, New York: Springer-Verlag, 2006. To appear. Available: www.cs.ru.nl/~henk/papers.html.
24. F. Wiedijk, The Seventeen Provers of the World, Vol. 3600 of LNCS, New York: Springer, 2006.
25. G. Gonthier, A computer checked proof of the Four Colour Theorem, 2005. Available: research.microsoft.com/~gonthier/4colproof.pdf.
26. T. C. Hales, A proof of the Kepler conjecture, Ann. of Math. (2), 162 (3): 1065-1185, 2005.
27. T. C. Hales, The Flyspeck project fact sheet. Available: www.math.pitt.edu/~thales/flyspeck/index.html.
28. E. M. Clarke Jr., O. Grumberg, and D. A. Peled, Model Checking, Cambridge, MA: MIT Press, 1999.
29. G. J. Holzmann, The SPIN Model Checker: Primer and Reference Manual, Reading, MA: Addison-Wesley, 2003.
30. S. Bensalem, V. Ganesh, Y. Lakhnech, C. Muñoz, S. Owre, H. Rueß, J. Rushby, V. Rusu, H. Saïdi, N. Shankar, E. Singerman, and A. Tiwari, An overview of SAL, in C. M. Holloway (ed.), LFM 2000: Fifth NASA Langley Formal Methods Workshop, 2000, pp. 187-196.
31. F. W. Vaandrager, Does it pay off? Model-based verification and validation of embedded systems! in F. A. Karelse (ed.), PROGRESS White Papers 2006, STW, The Netherlands, 2006. Available: www.cs.ru.nl/ita/publications/papers/fvaan/whitepaper.
HENK BARENDREGT
Radboud University Nijmegen
Nijmegen, The Netherlands
REGRESSION ANALYSIS
In statistics, regression is the study of dependence. The goal of regression is to study how the distribution of a response variable changes as values of one or more predictors are changed. For example, regression can be used to study changes in automobile stopping distance as speed is varied. In another example, the response could be the total profitability of a product as characteristics of it, like selling price, advertising budget, placement in stores, and so on, are varied. Key uses for regression methods include prediction of future values and assessing the dependence of one variable on another.
The study of conditional distributions dates at least to the beginning of the nineteenth century and the work of A. Legendre and C. F. Gauss. The use of the term regression is somewhat newer, dating to the work of F. Galton at the end of the nineteenth century; see Ref. 1 for more history.
GENERAL SETUP
For the general univariate regression problem, we use the symbol Y for a response variable, which is sometimes called the dependent variable. The response can be a continuous variable like a distance or a profit, or it could be discrete, like success or failure, or some other categorical outcome. The predictors, also called independent variables, carriers, or features, can also be continuous or categorical; in the latter case they are often called factors or class variables. For now, we assume only one predictor and use the symbol X for it, but we will generalize to many predictors shortly. The goal is to learn about the conditional distribution of Y given that X has a particular value x, written symbolically as F(Y|X = x).
For example, Fig. 1 displays the heights of n = 1375 mother-daughter pairs, with X = mother's height on the horizontal axis and Y = daughter's height on the vertical axis, in inches. The conditional distributions F(Y|X = x) correspond to the vertical spread of points in strips in this plot. In Fig. 1, three of these conditional distributions are highlighted, corresponding to mothers' heights of 58, 61, and 65 inches. The conditional distributions almost certainly differ in mean, with shorter mothers on average having shorter daughters than do taller mothers, but there is substantial overlap between the distributions.
Most regression problems center on the study of the mean function, to learn about the average value of Y given X = x. We write the most general mean function as m(Y|X = x), the mean of Y when X = x. The mean function for the heights data would be a smooth curve, with m(Y|X = x) increasing as x increases. Other characteristics of the conditional distributions, such as the conditional variance Var(Y|X = x), may well be constant across the range of values for mother's height, but in general the variance or indeed any other moment or percentile function can depend on X.
Most regression models are parametric, so the mean function m(Y|X = x) depends only on a few unknown parameters collected into a vector β. We write m(Y|X = x) = g(x; β), where g is completely known apart from the unknown value of β.
In the heights data described above, data are generated by obtaining a sample of units, here mother-daughter pairs, and measuring the values of height for each of the pairs. Study of the conditional distribution of daughter's height given the mother's height makes more sense than the study of the mother's height given the daughter's height because the mother precedes the daughter, but in principle either conditional distribution could be studied via regression. In other problems, the values of the predictor or the predictors may be set by an experimenter. For example, in a laboratory setting, samples of homogeneous material could be assigned to get different levels of stress, and then a response variable is measured with the goal of determining the effect of stress on the outcome. This latter scenario will usually include random assignment of units to levels of predictors and can lead to more meaningful inferences. Considerations for allocating levels of treatments to experimental units are part of the design of experiments; see Ref. 3. Both cases of predictors determined by the experimenter and predictors measured on a sample of units can often be analyzed using regression analysis.
SIMPLE LINEAR REGRESSION
Model
Linear regression is the most familiar and widely used method for regression analysis; see, for example, Ref. 4 for a book-length treatment of simple and multiple regression. This method concentrates almost exclusively on the mean function. Data consist of n independent pairs (x_1, y_1), ..., (x_n, y_n), as with the heights data in Fig. 1. The independence assumption might be violated if, for example, a mother were included several times in the data, each with a different daughter, or if the mothers formed several groups of sisters.
The simple linear regression model requires the following mean and variance functions:

    m(Y|X = x) = g(x; β) = β0 + β1 x,    Var(Y|X = x) = σ²    (1)

so for this model β = (β0, β1)'. β1 is the slope, which is the expected change in Y when X is increased by one unit. The intercept β0 is the mean value of Y when X = 0, although that interpretation may not make sense if X cannot equal zero. The line shown on Fig. 1 is an estimate of the simple regression mean function, computed using least squares, to be described below. For the heights data, the simple regression mean function seems plausible, as it matches the data in the graph.
The simple regression model also assumes a constant variance function, with σ² > 0 generally unknown. This assumption is not a prerequisite for all regression models, but it is a feature of the simple regression model.
Estimation
We can obtain estimates of the unknown parameters, and thus of the mean and the variance functions, without any further assumptions. The most common method of estimation is via least squares, which chooses the estimates b = (b0, b1)' of β = (β0, β1)' via a minimization problem:

    b = arg min over β* of  Σ_{i=1}^{n} [ y_i − g(x_i; β*) ]²    (2)

A generic notation is used in Equation (2) because this same objective function can be used for other parametric mean functions.
The solution to this minimization problem is easily found by differentiating Equation (2) with respect to each element of β*, setting the resulting equations to zero, and solving. If we write m_x and m_y for the sample means of the x_i and the y_i respectively, SD_x and SD_y for the sample standard deviations, and r_xy for the sample correlation, then

    b1 = r_xy · SD_y / SD_x,    b0 = m_y − b1 m_x    (3)
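As an illustration of Equation (3), here is a minimal sketch with synthetic data standing in for the heights pairs (the data-generating numbers are placeholders, not the real data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(62.5, 2.3, size=200)            # stand-in for mothers' heights
y = 30 + 0.54 * x + rng.normal(0, 2.3, 200)    # stand-in for daughters' heights

m_x, m_y = x.mean(), y.mean()                  # sample means
SD_x, SD_y = x.std(ddof=1), y.std(ddof=1)      # sample standard deviations
r_xy = np.corrcoef(x, y)[0, 1]                 # sample correlation

b1 = r_xy * SD_y / SD_x                        # slope, Equation (3)
b0 = m_y - b1 * m_x                            # intercept, Equation (3)
```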
These are linear estimators because both m_y and r_xy are linear combinations of the y_i. They are also unbiased, E(b0) = β0 and E(b1) = β1. According to the Gauss-Markov theorem, the least-squares estimates have minimum variance among all possible linear unbiased estimates. Details are given in Refs. 4 and 5, the latter reference at a higher mathematical level.
As σ² is the mean-squared difference between each data point and its mean, it should be no surprise that the estimate of σ² is similar to the average of the squared fitting errors. Let d be the degrees of freedom for error, which in linear regression is the number of observations minus the number of parameters in the mean function, or n − 2 in simple regression. Then

    s² = (1/d) Σ_{i=1}^{n} ( y_i − g(x_i; b) )²    (4)

The quantity Σ ( y_i − g(x_i; b) )² is called the residual sum of squares. We divide by d rather than the more intuitive sample size n because this results in an unbiased estimate, E(s²) = σ². Many computing formulas for the residual sum of squares depend only on summary statistics. One that is particularly revealing is

    Σ_{i=1}^{n} ( y_i − g(x_i; b) )² = (n − 1) SD_y² (1 − R²)    (5)
In both simple and multiple linear regression, the quantity R² is the square of the sample correlation between the observed response, the y_i, and the fitted values g(x_i; b). In simple regression, R² = r_xy².
Distribution of Estimates
The estimates (b0, b1) are random variables with variances

    Var(b0) = σ² [ 1/n + m_x² / ((n − 1) SD_x²) ],    Var(b1) = σ² / ((n − 1) SD_x²)    (6)

The estimates are correlated, with covariance Cov(b0, b1) = −σ² m_x / ((n − 1) SD_x²). The estimates are
. The estimates are
uncorrelated if the predictor is rescaled to have sample
mean m
x
= 0; that is, replace X by a new predictor
X
+
= X m
x
. This will also change the meaning of the
intercept parameter from the value of E(Y[X = 0) to the
value of E(Y[X = m
x
). Estimates of the variances and cov-
ariances are obtained by substituting s
2
for s
2
. For exam-
ple, the square root of the estimated variance of b
1
is called
its standard error, and is given by
se(b
1
) = s
1
(n 1)SD
2
x
1=2
(7)
Tests and confidence statements concerning the parameters require the sampling distribution of the statistic (b0, b1). This information can come about in three ways. First, we might assume normality for the conditional distributions, F(Y|X = x) = N(g(x; β), σ²). Since the least-squares estimates are linear functions of the y_i, this leads to normal sampling distributions for b0 and b1. Alternatively, by the central limit theorem, b0 and b1 will be approximately normal regardless of the true F, assuming only mild regularity conditions and a large enough sample. A third approach uses the data itself to estimate the sampling distribution of b, and thereby get approximate inference. This last method is generally called the bootstrap and is discussed briefly in Ref. 4 and more completely in Ref. 6.

Figure 1. Heights of a sample of n = 1375 mothers and daughters as reported by Ref. 2. The line shown on the plot is the ordinary least-squares regression line, assuming a simple linear regression model. The darker points display all pairs with mother's height that would round to 58, 61, or 65 inches.

Regardless of distributional assumptions, the estimate s² has a distribution that is independent of b. If we add the normality assumption, then d s² ~ σ² χ²(d), a chi-squared distribution with d df. The ratio (b1 − β1)/se(b1) has a t-distribution with d df, written t(d). Most tests in linear regression models under normality or in large samples are based either on t-distributions or on the related F-distributions.
Suppose we write t_{γ}(d) for the quantile of the t-distribution with d df that cuts off probability γ in its upper tail. Based on normal theory, either a normality assumption or large samples, a test of β1 = β1* versus the alternative β1 ≠ β1* is rejected at level α if |t| = |b1 − β1*| / se(b1) exceeds t_{1−α/2}(d), where d is the number of df used to estimate σ². Similarly, a (1 − α) × 100% confidence interval for β1 is given by the set

    β1 ∈ ( b1 − t_{1−α/2}(d) se(b1),  b1 + t_{1−α/2}(d) se(b1) ).
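As a brief sketch of this test and interval (using the slope estimate, standard error, and df that appear in Table 1 below, and scipy for the t quantiles):

```python
from scipy import stats

b1, se_b1, d = 0.5417, 0.0260, 1373      # estimate, standard error, df (Table 1)

t = b1 / se_b1                           # test of beta_1 = 0
p_two_sided = 2 * stats.t.sf(abs(t), d)  # Pr(>|t|)
t_crit = stats.t.ppf(0.975, d)           # t_{1 - alpha/2}(d) for alpha = 0.05
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # 95% CI for the slope
```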
Computer Output from Heights Data
Typical computer output from a packaged regression program is shown in Table 1 for the heights data. The usual output includes the estimates (3) and their standard errors (7). The fitted mean function is g(x; b) = 29.9174 + 0.5417x; this function is the straight line drawn on Fig. 1. The column marked t-value displays the ratio of each estimate to its standard error, which is an appropriate statistic for testing the hypotheses that each corresponding coefficient is equal to zero against either a one-tailed or two-tailed alternative. The column marked Pr(>|t|) is the significance level of this test assuming a two-sided alternative, based on the t(n − 2) distribution. In this example the p-values are zero to four decimals, and strong evidence is present that the intercept is nonzero given the slope, and that the slope is nonzero given the intercept. The estimated slope of about 0.54 suggests that each inch increase in mother's height corresponds to an increase in daughter's height of only about 0.54 inches, which indicates that tall mothers have tall daughters, but not as tall as themselves. This could have been anticipated from (3): Assuming that heights of daughters and mothers are equally variable, we will have SD_x ≈ SD_y and so b1 ≈ r_xy, the correlation. As the scale-free correlation coefficient is always in [−1, 1], the slope must also be in that range. This observation of regression toward the mean is the origin of the term regression for the study of conditional distributions.
Also included in Table 1 are the estimate s of σ, the degrees of freedom associated with s², and R² = r_xy². This latter value is usually interpreted as a summary of the comparison of the fit of model (1) with the fit of the null mean function

    m_0(Y|X = x) = β0    (8)
Mean function (8) asserts that the mean of Y|X is the same for all values of X. Under this mean function, the least-squares estimate of β0 is just the sample mean m_y, and the residual sum of squares is (n − 1) SD_y². Under mean function (1), the simple linear regression mean function, the residual sum of squares is given by Equation (5). The proportion of variability unexplained by regression on X is just the ratio of these two residual sums of squares:

    Unexplained variability = [ (n − 1) SD_y² (1 − R²) ] / [ (n − 1) SD_y² ] = 1 − R²

and so R² is the proportion of variability in Y that is explained by the linear regression on X. This same interpretation also applies to multiple linear regression.
An important use of regression is the prediction of future values. Consider predicting the height of a daughter whose mother's height is X = 61. Whether data collected on English mother-daughter pairs over 100 years ago are relevant to contemporary mother-daughter pairs is questionable, but if they were, the point prediction would be the estimated mean, g(61; b) = 29.9174 + 0.5417 × 61 ≈ 63 inches. From Fig. 1, even if we knew the mean function exactly, we would not expect the prediction to be perfect, because mothers of height 61 inches have daughters of a variety of heights. We therefore expect predictions to have two sources of error: a prediction error of magnitude σ due to the new observation, and an error from estimating the mean function,

    Var(Prediction | X = x*) = σ² + Var(g(x*; b))

For simple regression, Var(g(x*; b)) = Var(b0 + b1 x*) = Var(b0) + x*² Var(b1) + 2 x* Cov(b0, b1). Simplifying, replacing σ² by s², and taking square roots, we get

    se(Prediction | X = x*) = s [ 1 + 1/n + (x* − m_x)² / ((n − 1) SD_x²) ]^(1/2)

where the sample size n, sample mean m_x, and sample standard deviation SD_x are all from the data used to estimate b. For the heights data, this standard error at x* = 61 is about 2.3 inches. A 95% prediction interval, based on the t(n − 2) distribution, is from 58.5 to 67.4 inches.
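A sketch of this calculation follows; b0, b1, s, and n are taken from Table 1, while the summary statistics m_x and SD_x for mothers' heights are assumed values chosen only so the output lands near the numbers quoted above.

```python
import numpy as np
from scipy import stats

b0, b1 = 29.9174, 0.5417          # estimates from Table 1
s, n = 2.27, 1375                 # residual SD and sample size
m_x, SD_x = 62.5, 2.3             # assumed summary statistics (illustration only)
x_star = 61

pred = b0 + b1 * x_star                                            # about 63 inches
se_pred = s * np.sqrt(1 + 1/n + (x_star - m_x)**2 / ((n - 1) * SD_x**2))
t_crit = stats.t.ppf(0.975, n - 2)
interval = (pred - t_crit * se_pred, pred + t_crit * se_pred)      # roughly (58.5, 67.4)
```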
MULTIPLE LINEAR REGRESSION
Table 1. Typical simple regression computer output for the heights data

        Estimate   Std. Error   t-value   Pr(>|t|)
b0      29.9174    1.6225       18.44     0.0000
b1       0.5417    0.0260       20.87     0.0000

s = 2.27, 1373 df, R² = 0.241.

The multiple linear regression model is an elaboration of the simple linear regression model. We now have a predictor X with p + 1 components, X = (1, X_1, ..., X_p). Also, let x = (1, x_1, ..., x_p) be a vector of possible observed values
for X; the 1 is appended to the left of these quantities to
allow for an intercept. Then the mean function in
Equation (1) is replaced by

    m(Y|X = x) = g(x; β) = β0 + β1 x_1 + ··· + βp x_p = β'x,    Var(Y|X = x) = σ²    (9)

The parameter vector β = (β0, ..., βp)' now has p + 1 components. Equation (9) describes a plane in (p + 1)-dimensional space. Each β_j for j > 0 is called a partial slope and gives the expected change in Y when X_j is increased by one unit, assuming all other X_k, k ≠ j, are fixed. This interpretation can be problematical if changing X_j would require that one or more of the other X_k be changed as well. For example, if X_j were tax rate and X_k were savings rate, changing X_j may necessarily change X_k as well.
Linear models are not really restricted to fitting straight lines and planes, because we are free to define X as we wish. For example, if the elements of X are different powers or other functions of the same base variables, then when viewed as a function of the base variables the mean function will be curved. Similarly, by including dummy variables, which have values of zero and one only, denoting two possible categories, we can fit separate mean functions to subpopulations in the data (see Ref. 4, Chapter 6).
Estimation
Given data (y_i, x_{i1}, ..., x_{ip}) for i = 1, ..., n, we assume that each case in the data is independent of each other case. This may exclude, for example, time-ordered observations on the same case, or other sampling plans with correlated cases. The least-squares estimates minimize Equation (2), but with g(x; β*) from Equation (9) substituting for the simple regression mean function. The estimate s² of σ² is obtained from Equation (4) but with d = n − p − 1 df rather than the n − 2 for simple regression. Numerical methods for least-squares computations are discussed in Ref. 7. High-quality subroutines for least squares are provided by Ref. 8. As with simple regression, the standard least-squares calculations are performed by virtually all statistical computing packages.
For the multiple linear regression model, there is a closed-form solution for b, available in compact form in matrix notation. Suppose we write Y for the n × 1 vector of the response variable and A for the n × (p + 1) matrix of the predictors, including a column of ones. The order of the rows of Y and A must be the same. Then

    b = (A'A)⁻¹ A'Y    (10)

provided that the inverse exists. If the inverse does not exist, then there is not a unique least-squares estimator. If the matrix A is of rank r ≤ p, then most statistical computing packages resolve the indeterminacy by finding r linearly independent columns of A, resulting in a matrix A_1, and then computing the estimator (10) with A_1 replacing A. This will change the interpretation of parameters but not change predictions: All least-squares estimates produce the same predictions.
Equation (10) should never be used in computations; methods based on decompositions such as the QR decomposition are more numerically stable; see Linear systems of equations and Ref. 8.
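A minimal sketch of the QR route to the least-squares estimate (our own illustration with synthetic data; it avoids forming (A'A)⁻¹ directly as in Equation 10):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 29, 5
X = rng.normal(size=(n, p))
A = np.column_stack([np.ones(n), X])           # n x (p + 1), column of ones first
y = A @ rng.normal(size=p + 1) + rng.normal(scale=0.1, size=n)

Q, R = np.linalg.qr(A)                         # A = QR, with R upper triangular
b = np.linalg.solve(R, Q.T @ y)                # solves R b = Q'y

resid = y - A @ b
s2 = resid @ resid / (n - p - 1)               # estimate of sigma^2
cov_b = s2 * np.linalg.inv(R.T @ R)            # s^2 (A'A)^{-1}, for standard errors
```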
Distribution
If the F(Y|X) are normal distributions, or if the sample size is large enough, then we will have, assuming A of full rank,

    b ~ N( β, σ² (A'A)⁻¹ )    (11)

The standard error of any of the estimates is given by s times the square root of the corresponding diagonal element of (A'A)⁻¹. Similarly, if a'b is any linear combination of the elements of b, then

    a'b ~ N( a'β, σ² a'(A'A)⁻¹ a )
In particular, the fitted value at X = x* is given by x*'b, and its variance is σ² x*'(A'A)⁻¹ x*. A prediction of a future value at X = x* is also given by x*'b, and its variance is given by σ² + σ² x*'(A'A)⁻¹ x*. Both of these variances are estimated by replacing σ² by s².
Prescription Drug Cost
As an example, we will use data collected on 29 health plans with pharmacies managed by the same insurance company in the United States in the mid-1990s. The response variable is Cost, the average cost to the health plan for one prescription for one day, in dollars. Three aspects of the drug plan under the control of the health plan are GS, the usage of generic substitute drugs by the plan, an index between 0, for no substitution, and 100, for complete substitution; RI, a restrictiveness index, also between 0 and 100, describing the extent to which the plan requires physicians to prescribe drugs from a limited formulary; and Copay, the cost to the patient per prescription. Other characteristics of the plan that might influence costs are the average Age of patients in the plan, and RXPM, the number of prescriptions per year per patient, as a proxy measure of the overall health of the members in the plan. Although primary interest is in the first three predictors, the last two are included to adjust for demographic differences in the plans. The data are from Ref. 4.
Figure 2 is a scatterplot matrix. Except for the diagonal, a scatterplot matrix is a two-dimensional array of scatterplots. The variable names on the diagonal label the axes. In Fig. 2, the variable Age appears on the horizontal axis of all plots in the fifth column from the left and on the vertical axis of all plots in the fifth row from the top. Each plot in a scatterplot matrix is relevant to a particular one-predictor regression of the variable on the vertical axis given the variable on the horizontal axis. For example, the plot of Cost versus GS in the first plot in the second column of the scatterplot matrix is relevant for the regression of Cost on GS ignoring the other variables. From the first row of plots, the mean of Cost generally decreases as the predictors increase, except perhaps for RXPM, where there is no obvious dependence. This summary is clouded, however, by a few unusual points, in particular one health plan with a very low value for GS and three plans with large values of RI that have relatively high costs. The scatterplot matrix can be very effective in helping the analyst focus on possibly unusual data early in an analysis.
The pairwise relationships between the predictors are displayed in most other frames of this plot. Predictors that have nonlinear joint distributions, or outlying or separated points, may complicate a regression problem; Refs. 4 and 9 present methodology for using the scatterplot matrix to choose transformations of predictors for which a linear regression model is more likely to provide a good approximation.
Table 2 gives standard computer output for the fit of a multiple linear regression model with five predictors. As in simple regression, the value of R² gives the proportion of variability in the response explained by the predictors; about half the variability in Cost is explained by this regression. The estimated coefficient for GS is about −0.012, which suggests that, if all other variables could be held fixed, increasing GS by 10 units is expected to change Cost by 10 × (−0.012) = −$0.12, which is a relatively large change. The t-test for the coefficient for GS equal to zero has a very small p-value, which suggests that this coefficient may indeed be nonzero. The coefficient for Age also has a small p-value, and plans with older members have lower cost per prescription per day. Adjusted for the other predictors, RI appears to be unimportant, whereas the coefficient for Copay appears to be of the wrong sign.
Figure 2. Scatterplot matrix for the drug cost example.

Table 2. Regression output for the drug cost data

              Estimate   Std. Error   t-value   Pr(>|t|)
(Intercept)    2.6829     0.4010       6.69     0.0000
GS            -0.0117     0.0028      -4.23     0.0003
RI             0.0004     0.0021       0.19     0.8483
Copay          0.0154     0.0187       0.82     0.4193
Age           -0.0420     0.0141      -2.98     0.0068
RXPM           0.0223     0.0110       2.03     0.0543

s = 0.0828, df = 23, R² = 0.535.

Model Comparison. In some regression problems, we may wish to test the null hypothesis NH that a subset of the β_j are simultaneously zero versus the alternative AH that at least one in the subset is nonzero. The usual procedure is to do a likelihood ratio test: (1) Fit both the NH and the AH models and save the residual sum of squares and the residual df; (2) compute the statistic

    F = [ (RSS_NH − RSS_AH) / (df_NH − df_AH) ] / ( RSS_AH / df_AH )

Under the normality assumption, the numerator and denominator are independent multiples of χ² random variables, and F has an F(df_NH − df_AH, df_AH) distribution, which can be used to get significance levels. For example, consider testing the null hypothesis that the mean function is given by Equation 8, which asserts that the mean function does not vary with the predictors, versus the alternative given by Equation 9. For the drug data, F = 5.29 with (5, 23) df, p = 0.002, which suggests that at least one of the β_j, j ≥ 1, is nonzero.
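A sketch of this computation, with residual sums of squares reconstructed from the summaries in Table 2 (RSS_AH = s² × df = 0.0828² × 23 ≈ 0.158, and, since R² = 0.535, RSS_NH ≈ RSS_AH/(1 − R²) ≈ 0.339):

```python
from scipy import stats

rss_ah, df_ah = 0.158, 23      # all five predictors (implied by s and df in Table 2)
rss_nh, df_nh = 0.339, 28      # intercept-only mean function (8)

F = ((rss_nh - rss_ah) / (df_nh - df_ah)) / (rss_ah / df_ah)   # about 5.3
p_value = stats.f.sf(F, df_nh - df_ah, df_ah)                  # about 0.002
```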
Model Selection/Variable Selection. Although some regression models are dictated by a theory that specifies which predictors are needed and how they should be used in the problem, many problems are not so well specified. In the drug cost example, Cost may depend on the predictors as given, on some subset of them, or on some functional form other than a linear combination. Many regression problems will therefore include a model selection phase in which several competing specifications for the mean function are to be considered. In the drug cost example, we might consider all 2⁵ = 32 possible mean functions obtained using subsets of the five base predictors, although this is clearly only a small fraction of all possible sensible models. Comparing models two at a time is at best inefficient and at worst impossible, because the likelihood ratio tests can only be used to compare models if the null model is a special case of the alternative model.
One important method for comparing models is based on estimating a criterion function that depends on both lack of fit and complexity of the model (see also Information theory). The most commonly used method is the Akaike information criterion, or AIC, given for linear regression by

    AIC = n log(Residual sum of squares / n) + 2(p + 1)

where p + 1 is the number of estimated coefficients in the mean function. The model that minimizes AIC is selected, even if the difference in AIC between two models is trivially small; see Ref. 10. For the drug cost data, the mean function with all five predictors has AIC = −139.21. The mean function with minimum AIC excludes only RI, with AIC = −141.16. The fitted mean function for this mean function is m(Y|X = x) = 2.6572 − 0.0117 GS + 0.0181 Copay − 0.0417 Age + 0.0229 RXPM. Assuming the multiple linear regression model is appropriate for these data, this suggests that the restrictiveness of the formulary is not related to cost after adjusting for the other variables, and plans with more GS are associated with lower costs. Both Copay and Age seem to have the wrong sign.
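A sketch of this AIC comparison; the residual sum of squares for the full fit is implied by Table 2 (s² × df ≈ 0.158), and, as an approximation, the same value is reused for the fit without RI, since dropping RI barely changes the fit:

```python
import numpy as np

def aic(rss, n, n_coef):
    """AIC for a normal linear regression, up to an additive constant."""
    return n * np.log(rss / n) + 2 * n_coef

n = 29
print(aic(0.158, n, 6))   # all five predictors plus intercept: about -139
print(aic(0.158, n, 5))   # RI dropped (RSS nearly unchanged): about -141
```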
An alternative approach to model selection is model
aggregation, in which a probability or weight is estimated
for each candidate model, and the final model is a
weighted combination of the individual models; see
Ref. 11 for a Bayesian approach and Ref. 12 for a
frequentist approach.
Parameter Interpretation. If the results in Table 2 or the fitted model after selection were a reasonable summary of the conditional mean of Cost given the predictors, how can we interpret the parameters? For example, can we infer that increasing GS would decrease Cost? Or should we be more cautious and only infer that plans with higher GS are associated with lower values of Cost? The answer to this question depends on the way that the data were generated. If GS were assigned to medical plans using a random mechanism, and then we observed Cost after the random assignment, then an inference of causation could be justified. The lack of randomization in these data could explain the wrong sign for Copay, as it is quite plausible that plans raise the copayment in response to higher costs. For observational studies like this one, causal inference based on regression coefficients is problematical, but a substantial literature exists on methods for making causal inference from observational data; see Ref. 13.
Diagnostics. Fitting regression models is predicated upon several assumptions about F(Y|X). Should any of these assumptions fail, then a fitted regression model may not provide a useful summary of the regression problem. For example, if the true mean function were E(Y|X = x) = β0 + β1 x + β2 x², then the fit of the simple linear regression model (1) could provide a misleading summary if β2 were substantially different from zero. Similarly, if the assumed mean function were correct but the variance function was not constant, then estimates would no longer be efficient, and tests and confidence intervals could be badly in error.
Regression diagnostics are a collection of graphical and numerical methods for checking assumptions concerning the mean function and the variance function. In addition, these methods can be used to detect outliers, a small fraction of the cases for which the assumed model is incorrect, and influential cases (14), which are cases that, if deleted, would substantially change the estimates and inferences. Diagnostics can also be used to suggest remedial action, like transforming predictors or the response, or adding interactions to a mean function, that could improve the match of the model to the data. Much of the theory for diagnostics is laid out in Ref. 15; see also Refs. 4 and 9.
Many diagnostic methods are based on examining the residuals, which for linear models are simply the differences r_i = y_i − g(x_i; b), i = 1, ..., n. The key idea is that, if a fitted model is correct, then the residuals should be unrelated to the fitted values, to any function of the predictors, or indeed to any function of data that was not used in the modeling. This suggests examining plots of the residuals versus functions of the predictors, such as the predictors themselves, and also versus fitted values. If these graphs show any pattern, such as a curved mean function or nonconstant variance, we have evidence that the model used does not match the data. Lack of patterns in all plots is consistent with an acceptable model, but not definitive.
Figure 3 shows six plots: the residuals versus each of the predictors, and also the residuals versus the fitted values based on the model with all predictors. Diagnostic analysis should generally be done before any model selection, based on the largest sensible mean function. In each plot, the dashed line is a reference horizontal line at zero. The dotted line is the fitted least-squares regression line for a quadratic regression with the response given by the residuals and the predictor given by the horizontal axis. The t-test that the coefficient for the quadratic term, when added to the original mean function, is zero is a numeric diagnostic that can help interpret the plot. In the case of the plot versus fitted values, this test is called Tukey's test for nonadditivity, and p-values are obtained by comparing with a normal distribution rather than the t-distribution that is used for all other plots.
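The lack-of-fit diagnostic just described can be computed directly. The following Python sketch (illustrative only; the function and variable names are not from the article) fits the working model by least squares, adds the squared fitted values as an extra term, and tests whether that term's coefficient is zero, comparing the statistic with a standard normal as in Tukey's test.

import numpy as np
from scipy import stats

def tukey_nonadditivity(X, y):
    """Approximate Tukey's one-degree-of-freedom test for nonadditivity.

    Fits y on X by least squares, then adds the squared fitted values as an
    extra term and tests whether its coefficient is zero, with the p-value
    taken from a standard normal as described in the text.
    Illustrative sketch; names are not from the article."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit
    fitted = X1 @ beta
    X2 = np.column_stack([X1, fitted ** 2])        # augment with squared fitted values
    beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)
    resid2 = y - X2 @ beta2
    sigma2 = resid2 @ resid2 / (n - X2.shape[1])
    cov = sigma2 * np.linalg.inv(X2.T @ X2)
    t_stat = beta2[-1] / np.sqrt(cov[-1, -1])
    p_value = 2 * stats.norm.sf(abs(t_stat))       # compare with a normal, per the text
    return t_stat, p_value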
In this example, the residual plots display patterns that indicate that the linear model that was fit does not match the data well. The plot for GS suggests that the case with a very small value of GS might be quite different from the others; the p-value for the lack-of-fit test is about 0.02. Similarly, curvature is evident for RI due to the three plans with very high values of RI but very high costs. No other plot is particularly troubling, particularly in view of the small sample size. For example, the p-value for Tukey's test corresponding to the plot versus fitted values is about p = 0.10. The seemingly contradictory result that the mean function matches acceptably overall but not with regard to GS or RI is plausible because the overall test will necessarily be less powerful than a test for a more specific type of model failure.
This analysis suggests that the four plans, one with
very low GS and the other three with very high RI,
may be cases that should be treated separately from the
remaining cases. If we refit without these cases, the resulting residual plots do not exhibit any particular problems. After using AIC to select a subset, we end up with the fitted
model g(x; b) = 2.394  0.014 GS  0.004 RI  0.024 Age  0.020 RXPM. In this fitted model, Copay is deleted. The coefficient estimate for GS is somewhat larger, and the remaining estimates are of the appropriate sign. This seems to provide a useful summary for the data. We would call the four points that were omitted influential observations (14) because their exclusion markedly changes conclusions in the analysis.
In this example, as in many examples, we end up with a
fitted model that depends on choices made about the data.
The estimated model ignores 4 of 29 data points, so we are
admitting that the mean function is not appropriate for all
data.
OTHER REGRESSION MODELS
The linear model given by Equation 9 has surprising gen-
erality, given that so few assumptions are required. For
some problems, these methods will certainly not be useful,
for example if the response is not continuous, if the variance
depends onthe mean, or if additional informationabout the
conditional distributions is available. For these cases,
methods are available to take advantage of the additional
information.
Logistic Regression
Suppose that the response variable Y can only take on two
values, say 1, corresponding perhaps to success, or 0,
corresponding to failure. For example, in a manufacturing plant where all output is inspected, Y could indicate items that either pass (Y = 1) or fail (Y = 0) inspection. We
may want to study how the probability of passing depends
on characteristics such as operator, time of day, quality of
input materials, and so on.
We build the logistic regression model in pieces. First, as each Y can only equal 0 or 1, each Y has a Bernoulli distribution, and

m(Y|X = x) = Prob(Y = 1|X = x) = g(x; b)
Var(Y|X = x) = g(x; b)(1 − g(x; b))     (12)

Each observation can have its own probability of success g(x; b) and its own variance.
Next, assume that Y depends on X = x only through a linear combination h(x) = b_0 + b_1 x_1 + ... + b_p x_p = b'x. The quantity h(x) is called a linear predictor. For the multiple linear regression model, we have g(x; b) = h(x), but for a binary response, this does not make any sense because a probability is bounded between zero and one. We can make a connection between g(x; b) and h(x) by assuming that

g(x; b) = 1 / (1 + exp(−h(x)))     (13)

This is called logistic regression because the right side of Equation (13) is the logistic function. Other choices for g are possible, using any function that maps from (−∞, ∞) to (0, 1), but the logistic is adequate for many applications.
Figure 3. Residual plots for the drug cost data. The + symbol indicates the plan with very small GS, x indicates plans with very high RI, and all other plans are indicated with o.

To make the analogy with linear regression clearer, Equation (13) is often inverted to have just the linear predictor on the right side of the equation,

log[ g(x; b) / (1 − g(x; b)) ] = h(x) = b'x     (14)

In this context, the logit function, log[g(x; b)/(1 − g(x; b))], is called a link function that links the parameter of the Bernoulli distribution, g(x; b), to the linear predictor h(x).
Estimation. If we have data (y_i, x_i), i = 1, ..., n, that are mutually independent, then we can write the log-likelihood function as

L(b) = log Π_{i=1}^{n} g(x_i; b)^{y_i} (1 − g(x_i; b))^{1−y_i}
     = log Π_{i=1}^{n} [ g(x_i; b)/(1 − g(x_i; b)) ]^{y_i} (1 − g(x_i; b))
     = Σ_{i=1}^{n} { y_i (x_i'b) − log[1 + exp(x_i'b)] }
Maximum likelihood estimates are the values of b that maximize this last expression. Computations are generally done using Newton–Raphson iteration or a variant called Fisher scoring; see Ref. 16; for book-length treatments of this topic, see Refs. 17 and 18.
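As a concrete illustration of the estimation step, the sketch below carries out the Newton–Raphson iteration for the logistic log-likelihood using only numpy. It is a minimal example, not the article's own software; the argument names and convergence tolerance are assumptions.

import numpy as np

def logistic_mle(X, y, iterations=25):
    """Maximum likelihood for logistic regression via Newton-Raphson.

    X is an n-by-(p+1) design matrix (first column of ones); y is a 0/1
    response vector.  Illustrative sketch, not production code."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        eta = X @ beta                      # linear predictor h(x) = b'x
        g = 1.0 / (1.0 + np.exp(-eta))      # g(x; b), Equation (13)
        W = g * (1.0 - g)                   # Bernoulli variances
        score = X.T @ (y - g)               # gradient of the log-likelihood
        info = X.T @ (X * W[:, None])       # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:    # convergence check
            break
    return beta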
Poisson Regression
When the response is the count of the number of indepen-
dent events in a xed time period, Poisson regression
models are often used. The development is similar to the
Bernoulli case. We first assume that Y|X = x is distributed as a Poisson random variable with mean m(Y|X = x) = g(x; b),

Prob(Y = y|X = x) = [ g(x; b)^y / y! ] exp(−g(x; b))

For the Poisson, 0 < m(Y|X = x) = Var(Y|X = x) = g(x; b).
The connection between Y and X is assumed to be through the linear predictor h(x), and for a log-linear model, we assume that

g(x; b) = exp(h(x))

giving the exponential mean function, or, inverting, we get the log-link,

h(x) = log(g(x; b))

Assuming independence, the log-likelihood function can be shown to be equal to

L(b) = Σ_{i=1}^{n} [ y_i (x_i'b) − exp(x_i'b) ]
Log-linear Poisson models are discussed in Ref. 19.
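A parallel sketch for the log-linear Poisson model is given below, again as an assumed, minimal illustration rather than anything prescribed by the article; only the mean function and score change relative to the logistic case.

import numpy as np

def poisson_mle(X, y, iterations=25):
    """Newton-Raphson for the log-linear Poisson model (illustrative sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        mu = np.exp(X @ beta)            # g(x; b) = exp(h(x))
        score = X.T @ (y - mu)           # gradient of L(b)
        info = X.T @ (X * mu[:, None])   # expected information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta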
There are obvious connections between the logistic and the Poisson models briefly described here. Both of these models, as well as the multiple linear regression model assuming normal errors, are examples of generalized linear models, described in Ref. 16.
Nonlinear Regression
Nonlinear regression refers in general to any regression
problem for which the linear regression model does not
hold. Thus, for example, the logistic and log-linear Poisson
models are nonlinear models; indeed nearly all regression
problems are nonlinear.
However, it is traditional to use a narrower definition for nonlinear regression that matches the multiple linear regression model except that the mean function m(Y|X = x) = g(x; b) is a nonlinear function of the parameters b. For example, the mean relationship between X = age of a fish and Y = length of the fish is commonly described using the von Bertalanffy function,
E(Y|X = x) = L_∞ (1 − exp(−K(x − x_0)))

The parameters b = (L_∞, K, x_0)' to be estimated are the maximum length L_∞, the growth-rate constant K, and the age x_0 at which length is zero; the mean function is nonlinear in K and x_0.

Nonparametric Regression

In nonparametric regression, no parametric form is assumed for the mean function; instead, E(Y|X = x) is estimated directly from observations with predictor values near x. The Nadaraya–Watson kernel estimator is the weighted average

ĝ(x) = Σ_{i=1}^{n} w_i(h) y_i / Σ_{j=1}^{n} w_j(h),    where    w_i(h) = (1/h) H( (x_i − x)/h )
One choice for H is the standard normal density function, but other choices can have somewhat better properties. The bandwidth h is selected by the analyst; small values of h weigh cases with |x_i − x| small heavily while ignoring other cases, giving a very rough estimate. Choosing h large weighs all cases nearly equally, giving a very smooth, but possibly biased, estimate, as shown in Fig. 4. The bandwidth must be selected to balance bias and smoothness. Other methods for nonparametric regression include smoothing splines, local polynomial regression, and wavelets, among others; see Ref. 21.

Figure 4. Three Nadaraya–Watson kernel smoothing estimates of the mean for the heights data, with h = 1 for the solid line, h = 3 for the dashed line, and h = 9 for the dotted line.
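For concreteness, a small Python sketch of the Nadaraya–Watson estimator follows, using the standard normal density for H; the function name, arguments, and the commented usage are illustrative assumptions, not code from the article.

import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Nadaraya-Watson kernel estimate of E(Y|X = x0).

    Uses the standard normal density as the kernel H; h is the bandwidth.
    Illustrative sketch with made-up argument names."""
    u = (x - x0) / h
    w = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi) / h   # w_i(h)
    return np.sum(w * y) / np.sum(w)

# Example: smooth estimates over a grid for three bandwidths, as in Fig. 4
# grid = np.linspace(x.min(), x.max(), 100)
# for h in (1, 3, 9):
#     fit = [nadaraya_watson(g, x, y, h) for g in grid]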
Semiparametric Regression
A key feature of nonparametric regression is using nearby
observations to estimate the mean at a given point. If the
predictor is in many dimensions, then for most points x,
there maybe either no points or at best just afewpoints that
are nearby. As a result, nonparametric regression does not
scale well because of this curse of dimensionality, Ref. 22.
This has led to the proposal of semiparametric regression models. For example, the additive regression model, Refs. 23 and 24, suggests modeling the mean function as

m(Y|X = x) = Σ_{j=1}^{p} g_j(x_j)

where each g_j is a function of just one predictor x_j that can be estimated nonparametrically. Estimates can be obtained by an iterative procedure that sequentially estimates each of the g_j, continuing until convergence is obtained. This
type of model can also be used in the generalized linear
model framework, where it is called a generalized additive
model.
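The backfitting idea can be sketched in a few lines of Python. The smoother argument below stands in for any one-dimensional nonparametric fit (for example, the kernel estimator above); everything here is an illustrative assumption rather than a reference implementation.

import numpy as np

def backfit_additive(X, y, smoother, iterations=20):
    """Backfitting for an additive model m(Y|X=x) = alpha + sum_j g_j(x_j).

    `smoother(x, r)` should return a smoothed fit of partial residuals r
    against the single predictor x.  Illustrative sketch only."""
    n, p = X.shape
    alpha = y.mean()
    g = np.zeros((n, p))                     # current estimates of each g_j(x_ij)
    for _ in range(iterations):
        for j in range(p):
            partial = y - alpha - g.sum(axis=1) + g[:, j]   # remove all other terms
            g[:, j] = smoother(X[:, j], partial)
            g[:, j] -= g[:, j].mean()        # center each component for identifiability
    return alpha, g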
Robust Regression
Robust regression was developed to address the concern that standard estimates such as least-squares or maximum likelihood estimates may be highly unstable in the presence of outliers or other very large errors. For example, the least-squares criterion (2) may be replaced by

b̂ = arg min_{b*} Σ_{i=1}^{n} ρ( |y_i − g(x_i; b*)| )

where ρ is symmetric about zero and may downweight observations for which |y_i − g(x_i; b*)| is large. The methodology is presented in Ref. 25, although these methods seem to be rarely used in practice, perhaps because they give protection against outliers but not necessarily against model misspecifications, Ref. 26.
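As one concrete (and assumed) instance of this idea, the sketch below performs M-estimation with Huber's rho function via iteratively reweighted least squares; the tuning constant and convergence details are conventional choices, not taken from the article.

import numpy as np

def huber_regression(X, y, c=1.345, iterations=50):
    """M-estimation with Huber's rho via iteratively reweighted least squares.

    X is a design matrix (include a column of ones for an intercept).
    Illustrative sketch of one common robust-regression estimator."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)         # start from least squares
    for _ in range(iterations):
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745 + 1e-12         # robust scale estimate
        u = r / s
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))  # Huber weights
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)    # weighted least squares
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new
    return beta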
Regression Trees
With one predictor, a regression tree would seek to replace the predictor by a discrete predictor, such that the predicted value of Y would be the same for all X in the same discrete category. With two predictors, each category created by discretizing the first variable could be subdivided again according to a discrete version of the second predictor, which leads to a tree-like structure for the predictions.
Basic methods for regression trees are outlined in
Refs. 27 and 28. The exact methodology for implementing
regression trees is constantly changing and is an active
area of research; see Machine learning.
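The basic splitting step of a regression tree can be illustrated with a short sketch. The code below (hypothetical names, single numeric predictor) finds the split point that minimizes the residual sum of squares when each side is predicted by its mean.

import numpy as np

def best_split(x, y):
    """Best single split of predictor x for response y by residual sum of
    squares, the elementary step of regression-tree construction.
    Illustrative sketch only."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                          # cannot split between equal values
        left, right = ys[:i], ys[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best[0]:
            best = (rss, (xs[i - 1] + xs[i]) / 2.0)
    return best[1]                            # split point (None if no split found)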
Dimension Reduction
Virtually all regression methods described so far require assumptions concerning some aspect of the conditional distributions F(Y|X), either about the mean function or some other characteristic. Dimension reduction regression seeks to learn about F(Y|X) but with minimal assumptions. For example, suppose X is a p-dimensional predictor, now not including a 1 for the intercept. Suppose we could find an r × p matrix B of minimal rank r such that F(Y|X) = F(Y|BX), which means that all dependence of the response on the predictor is through r combinations of the predictors. If r < p, then the resulting regression problem is of much lower dimensionality and can be much easier to study. Methodology for finding B and r with no assumptions about F(Y|X) is a very active area of research; see Ref. 29 for the foundations and the references therein for more recent results.
BIBLIOGRAPHY
1. S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Harvard University Press, 1986.
2. K. Pearson and S. Lee, On the laws of inheritance in man. Biometrika, 2: 357–462, 1903.
3. G. Oehlert, A First Course in Design and Analysis of Experi-
ments. New York: Freeman, 2000.
4. S. Weisberg, Applied Linear Regression, Third Edition.
New York: John Wiley & Sons, 2005.
5. R. Christensen, Plane Answers to Complex Questions: The Theory of Linear Models. New York: Springer-Verlag Inc, 2002.
6. B. Efron and R. Tibshirani, An Introduction to the Bootstrap.
Boca Raton, FL: Chapman & Hall Ltd, 1993.
7. C. L. Lawson and R. J. Hanson, Solving Least Squares
Problems. SIAM [Society for Industrial and Applied Mathe-
matics], 1995.
8. LAPACK Linear Algebra PACKAGE. http://www.netlib.org/
lapack/. 1995.
9. R. Dennis Cook and S. Weisberg, Applied Regression Including Computing and Graphics. New York: John Wiley & Sons, 1999.
10. C.-L. Tsai and A. D. R. McQuarrie, Regression and Time Series Model Selection. Singapore: World Scientific, 1998.
11. J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky, Bayesian model averaging: A tutorial. Statistical Science, 14(4): 382–401, 1999.
12. Z. Yuan and Y. Yang, Combining linear regression models: When and how? Journal of the American Statistical Association, 100(472): 1202–1214, 2005.
13. P. R. Rosenbaum, Observational Studies. NewYork: Springer-
Verlag Inc, 2002.
14. R. D. Cook, Detection of influential observation in linear regression. Technometrics, 19: 15–18, 1977.
15. R. D. Cook and S. Weisberg, Residuals and Influence in Regression. Boca Raton, FL: Chapman & Hall Ltd, available online at www.stat.umn.edu/rir, 1982.
16. P. McCullagh and J. A. Nelder, Generalized Linear Models,
Second Edition. Boca Raton, FL: Chapman & Hall Ltd, 1989.
17. D. W. Hosmer and S. Lemeshow, Applied Logistic Regression,
Second Edition. New York: John Wiley & Sons, 2000.
18. D. Collett, Modelling Binary Data, Second Edition.
Boca Raton, FL: Chapman & Hall Ltd, 2003.
19. A. Agresti, Categorical Data Analysis, Second Edition.
New York: John Wiley & Sons, 2002.
20. D. M. Bates and D. G. Watts, Nonlinear Regression Analysis
and Its Applications. New York: John Wiley & Sons, 1988.
21. J. S. Simonoff, Smoothing Methods in Statistics. New York:
Springer-Verlag Inc., 1996.
22. R. E. Bellman, Adaptive Control Processes. Princeton NJ:
Princeton University Press, 1961.
23. T. Hastie andR. Tibshirani, GeneralizedAdditive Models. Boca
Raton, FL: Chapman & Hall Ltd, 1999.
24. P. J. Green and B. W. Silverman, Nonparametric Regression
and Generalized Linear Models: A Roughness Penalty
Approach. Boca Raton, FL: Chapman & Hall Ltd, 1994.
25. R. G. Staudte and S. J. Sheather, Robust Estimation and
Testing. New York: John Wiley & Sons, 1990.
26. R. D. Cook, D. M. Hawkins, and S. Weisberg, Comparison of model misspecification diagnostics using residuals from least mean of squares and least median of squares fits, Journal of the American Statistical Association, 87: 419–424, 1992.
27. D. M. Hawkins, FIRM: Formal inference-based recursive mod-
eling, The American Statistician, 45: 155, 1991.
28. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone,
Classication and Regression Trees. Boca Raton, FL: Wads-
worth Adv. Book Prog., 1984.
29. R. Dennis Cook, Regression Graphics: Ideas for Studying
Regressions through Graphics. New York: John Wiley &
Sons, 1998.
SANFORD WEISBERG
University of Minnesota,
School of Statistics
Minneapolis, Minnesota
Hardware and
Architecture
CARRY LOGIC
Addition is the fundamental operation for performing digi-
tal arithmetic; subtraction, multiplication, and division
rely on it. How computers store numbers and perform
arithmetic should be understood by the designers of digital
computers. For a given weighted number system, a single
digit could represent a maximum value of up to 1 less than
the base or radix of the number system. A plurality of
number systems exist (1). In the binary system, for
instance, the maximum that each digit or bit could repre-
sent is 1. Numbers in real applications of computers are
multibit and are stored as large collections of 16, 32, 64, or
128 bits. If the addition of multibit numbers in such a
number system is considered, the addition of two legal
bits could result in the production of a result that cannot
t within one bit. In such cases, a carry is said to have been
generated. The generated carry needs to be added to the
sum of the next two bits. This process, called carry propa-
gation, continues from the least-signicant bit (LSB) or
digit, the one that has the least weight and is the rightmost,
to the most-signicant bit (MSB) or digit, the one with the
most weight andis the leftmost. This operationis analogous
to the usual manual computation with decimal numbers,
where pairs of digits are added with carries being propa-
gatedtowardthe high-order (left) digits. Carry propagation
serializes the otherwise parallel process of addition, thus
slowing it down.
As a carry can be determined only after the addition of a
particular set of bits is complete, it serializes the process of
multibit addition. If it takes a finite amount of time, say D_g, to calculate a carry, it will take 64 D_g to calculate the
carries for a 64-bit adder. Several algorithms to reduce the
carry propagation overhead have been devised to speed up
arithmetic addition. These algorithms are implemented
using digital logic gates (2) in computers and are termed
carry logic. However, the gains in speed afforded by these
algorithms come withanadditional cost, whichis measured
in terms of the number of logic gates required to implement
them.
In addition to the choice of number systems for repre-
senting numbers, they can further be represented as fixed or floating point (3). These representations use different algorithms to calculate a sum, although the carry propagation mechanism remains the same. Hence, throughout this article, carry propagation with respect to fixed-point binary
addition will be discussed. As a multitude of 2-input logic
gates could be used to implement any algorithm, all mea-
surements are made in terms of the number of 2-input
NAND gates throughout this study.
THE MECHANISM OF ADDITION
Currently, most digital computers use the binary number system to represent data. The legal digits, or bits as they are called in the binary number system, are 0 and 1. During addition, a sum S_i and a carry-out C_i are produced by adding a set of bits at the ith position. The carry-out C_i produced during the process serves as the carry-in for the succeeding set of bits. Table 1 shows the underlying rules for adding two bits, A_i and B_i, with a carry-in C_{i−1}, producing a sum S_i and a carry-out C_i.
FULL ADDER
The logic equations that represent S_i and C_i of Table 1 are shown in Equations (1) and (2). A block of logic that implements these is called a full adder, and it is shown in the inset of Fig. 1. The serial path for data through a full adder, hence its delay, is 2 gates, as shown in Fig. 1. A full adder can be implemented using eight gates (2) by sharing terms from Equations (1) and (2):

S_i = A_i'B_i'C_{i−1} + A_i'B_iC_{i−1}' + A_iB_i'C_{i−1}' + A_iB_iC_{i−1}     (1)

C_i = (A_i ⊕ B_i)C_{i−1} + A_iB_i     (2)
RIPPLE CARRY ADDER
The obvious implementation of an adder that adds two n-bit numbers A and B, where A is A_nA_{n−1}A_{n−2}...A_1A_0 and B is B_nB_{n−1}B_{n−2}...B_1B_0, is a ripple carry adder (RCA). By serially connecting n full adders and connecting the carry-out C_i from each full adder as the carry-in of the succeeding full adder, it is possible to propagate the carry from the
LSB to the MSB. Figure 1 shows the cascading of n full
adder blocks. It is clear that there is no special carry
propagation mechanism in the RCA except the serial con-
nection between the adders. Thus, the carry logic has a
minimal overhead for the RCA. The number of gates
required is 8n, as each full adder is constructed with eight
gates, and there are n such adders. Table 2 shows the
typical gate count and speed for RCAs with varying numbers of bits.
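Equations (1) and (2) and the ripple connection can be simulated directly in software. The following Python sketch is illustrative only (bit lists are assumed to be least-significant bit first); it mirrors the serial carry propagation described above.

def full_adder(a, b, c_in):
    """One-bit full adder implementing Equations (1) and (2)."""
    s = a ^ b ^ c_in                      # S_i
    c_out = (a & b) | ((a ^ b) & c_in)    # C_i = A_i B_i + (A_i xor B_i) C_{i-1}
    return s, c_out

def ripple_carry_add(a_bits, b_bits, c0=0):
    """Ripple carry addition of two bit lists (LSB first); the carry out of
    each stage feeds the next stage, as in the RCA described in the text."""
    carry, sum_bits = c0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry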
CARRY PROPAGATION MECHANISMS
In a scenario where all carries are available right at the beginning, addition is a parallel process. Each set of inputs A_i, B_i, and C_{i−1} could be added in parallel, and the sum for two n-bit numbers could be computed with the delay of a full adder.
The input combinations of Table 1 show that if A_i and B_i are both 0s, then C_i is always 0, irrespective of the value of C_{i−1}. Such a combination is called a carry kill term. For combinations where A_i and B_i are both 1s, C_i is always 1. Such a combination is called a carry generate term. In cases where A_i and B_i are not equal, C_i is equal to C_{i−1}. These are called the propagate terms. Carry propagation originates at a generate term, propagates through any successive propagate terms, and gets terminated at a carry kill or a new carry generate term. A carry chain is a succession of propagate terms that occur for any given input combination of A_i and B_i. For the addition of two n-bit numbers, multiple generates, kills, and propagates could exist. Thus, many carry chains exist. Addition between carry chains can proceed in parallel, as there is no carry propagation necessary over carry generate or kill terms.
Based on the concept of carry generates, propagates, and kills, logic could be designed to predict the carries for each bit of the adder. This mechanism is static in nature. It can be readily seen that different carry chains exist for different sets of inputs. This introduces a dynamic dimension to the process of addition. The dynamic nature of the inputs could also be used and a sum computed after the carry propagation through the longest carry chain is completed. This
An adder that employs static carry propagation always
produces a sum after a xed amount of time, whereas the
time taken to compute the sum in a dynamic adder is
dependent on the inputs. In general, it is easier to design
a digital system with a static adder, as digital systems are
predominantly synchronous in nature; i.e., they work in
lock step based on a clock that initiates each operation and
uses the results after completion of a clock cycle (4).
STATIC CARRY LOGIC
From Equation (2), if A_i and B_i are both true, then C_i is true. If A_i or B_i is true, then C_i depends on C_{i−1}. Thus, the term A_iB_i in Equation (2) is the generate term, g_i, and A_i ⊕ B_i is the propagate term, p_i. Equation (2) can be rewritten as in Equation (3):

C_i = g_i + p_iC_{i−1}     (3)

where g_i = A_iB_i and p_i = A_i ⊕ B_i. Substituting numbers for i in Equation (3) results in Equations (4) and (5):

C_1 = g_1 + p_1C_0     (4)

C_2 = g_2 + p_2C_1     (5)

Substituting the value of C_1 from Equation (4) in Equation (5) yields Equation (6):

C_2 = g_2 + p_2g_1 + p_2p_1C_0     (6)
Generalizing Equation (6) to any carry bit i yields Equation
(7):
C_i = g_i + p_ig_{i−1} + p_ip_{i−1}g_{i−2} + ... + p_ip_{i−1}···p_2g_1 + p_ip_{i−1}···p_1C_0     (7)
By implementing logic for the appropriate value of i in
Equation (7), the carry for any set of input bits can be
predicted.
Table 1. Addition of Bits A_i and B_i with a Carry-in C_{i−1} to Produce Sum S_i and Carry-out C_i

A_i  B_i  C_{i−1}    S_i  C_i
 0    0     0         0    0
 0    0     1         1    0
 0    1     0         1    0
 0    1     1         0    1
 1    0     0         1    0
 1    0     1         0    1
 1    1     0         0    1
 1    1     1         1    1
Table 2. List of Gate Counts and Delay (Gate Count/Delay) of Various Adders

Adder Type   16-Bit    32-Bit    64-Bit
RCA          144/36    288/68    576/132
CLA          200/10    401/14    808/14
CSA          284/14    597/14    1228/14
CKA          170/17    350/19    695/23
Figure 1. A Ripple Carry Adder ripples the carry from stage to
stage using cascaded Full Adders.
Carry Look-Ahead Adder
An adder that uses Equation (7) to generate carries for the various bits, as soon as A and B are available, is called a carry look-ahead adder (CLA). From Equation (7), the carry calculation time for such an adder is two gate delays, and a further two gate delays are required to calculate the sum with bits A_i, B_i, and the generated carry. In general, for a large number of bits n, it is impractical to generate the carries for every bit, as the complexity of Equation (7) increases tremendously. It is common practice in such cases to split the addition process into groups of k-bit CLA blocks that are interconnected. A group CLA is shown in Fig. 2. The groups now provide two new output functions, G* and P*, which are the group generate and propagate terms. Equations (8) and (9) provide examples of how these terms are generated for 4-bit blocks. Equation (10) shows the generation of C_4 using G*_1 and P*_1:

G*_1 = g_4 + p_4g_3 + p_4p_3g_2 + p_4p_3p_2g_1     (8)

P*_1 = p_4p_3p_2p_1     (9)

C_4 = G*_1 + P*_1C_0     (10)
In typical implementations, a CLA computes the sum in O(log_2 n) time and uses on the order of n log n gates. Table 2 lists the gate count and delay of various CLAs. Thus, with some additional gate investment, considerable speed-up is possible using the CLA carry logic algorithm.
Based on the CLA algorithm, several methods have been devised to speed up carry propagation even further. Three such adders that employ circuit-level optimizations to achieve faster carry propagation are the Manchester Carry Adder (4), the Ling Adder (5), and the Modified Ling Adder (6). However, these are specific implementations of the CLA and do not modify carry propagation algorithms.
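The group generate and propagate terms of Equations (8) and (9) are easy to express in software. The sketch below is an illustrative model of the logic, not a hardware description; the second function evaluates the carries of Equation (7) by a simple iteration, standing in for the parallel gate network of a real CLA.

def group_generate_propagate(a_bits, b_bits):
    """Group generate (G*) and propagate (P*) for a k-bit block,
    following Equations (8) and (9); bit 0 is the least significant.
    Illustrative sketch."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # g_i = A_i B_i
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # p_i = A_i xor B_i
    G, P = 0, 1
    for gi, pi in zip(g, p):
        G = gi | (pi & G)     # fold in one more bit of Equation (8)
        P = P & pi            # block propagates only if every bit propagates
    return G, P

def lookahead_carries(a_bits, b_bits, c0=0):
    """Carries C_1..C_n of Equation (7), evaluated iteratively in software
    (hardware computes the same terms in parallel)."""
    carries, c = [], c0
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | ((a ^ b) & c)
        carries.append(c)
    return carries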
Carry Select Adder
The discussion in the previous section shows that the
hardware investment on CLA logic is severe. Another
mechanism to extract parallelism in the addition process
is to calculate two sums for each bit, one assuming a carry
input of 0 and another assuming a carry input of 1, and
choosing one sum based on the real carry generated. The
idea is that the selection of one sum is faster than actually
propagating carries through all bits of the adder. An adder
that employs this mechanism is called a carry select adder
(CSA) and is shown in Fig. 3. A CSA works on groups of k bits, and each group works like an independent RCA. The real carry-in is always known at the LSB, and it is used as C_0. In Fig. 3, C_k is used to select one sum, like S^1_{3k..2k+1} or S^0_{3k..2k+1}, from the next group, gp2. In general, the selection time and the addition time per bit are approximately equal. Thus, for a group that is k bits wide, it approximately takes 2k units of time to compute the sums and a further two units of time to select the right sum, based on the actual carry. Thus, the total time for a valid carry to propagate from one group to another is 2(k + 1) time units. Thus, for an optimal implementation, the groups in the CSA should be unequal in size, with each succeeding group being 1 bit wider than the preceding group. The gate count and speed of various CSAs are listed in Table 2.
Carry Skip Logic
If an adder is split into groups, gp0, gp1, and so on, of RCAs of equal width k, and if a carry-in of 0 is forced into each group, then the carry out from each group is its generate term. The propagate term is simple to compute and can be computed by using Equation (9). As the group generate terms and propagate terms are thus available, the real carry-in at each group could be predicted and used to calculate the sum. An adder employing this mechanism for carry propagation is called a carry skip adder (CKA) and is shown in Fig. 3. The logic gates outside of the groups in Fig. 3 implement Equation (11), which is a generalization of Equation (10) for the carry at any group boundary i. Thus, the process of carry propagation takes place at the group level, and it is possible to skip carry propagation over groups of bits:

C_i = G*_{i/k} + P*_{i/k}C_{i−k}     (11)

It takes 2k time units to calculate the carry from any group of size k. Carry propagation across groups takes an additional n/k − 2 time units, and it takes another 2k time units to calculate the final sum. Thus, the total time is 4k + n/k − 2 time units. By making the inner blocks larger in size, it is possible to calculate the sum faster, as it is then possible to skip carry propagation over bigger groups. Table 2 lists the gate count and performance of various CKAs.
Figure 2. A group Carry Look-Ahead Scheme with n/k groups, each of size k.
Prex Computation
Binary addition can be viewed as a parallel computation. By introducing an associative operator ∘, carry propagation and carry generation can be defined recursively. If C_i ≡ G_i in Equation (3), then Equation (12), with ∘ as the concatenation operator, holds. P_i is the propagate term, and G_i is the generate term at bit position i at the boundary of a group of size k:

(G_i, P_i) = (g_i, p_i)                         if i = 1
(G_i, P_i) = (g_i, p_i) ∘ (G_{i−1}, P_{i−1})    if n ≥ i > 1     (12)

where (g_t, p_t) ∘ (g_s, p_s) = (g_t + p_tg_s, p_tp_s), by modifying Equation (3). Note that ∘ is NOT commutative. All C_i can be computed in parallel. As ∘ is associative, the recursive Equation (12) can be broken in arbitrary ways. The logic to compute carries can be constructed recursively too. Figure 4 shows an example of carry computation using the prefix computation strategy described in Equation (12), with block size k = 4, and how a combination of two 4-bit carry-logic blocks can perform 8-bit carry computation.
The CLA, CKA, CSA, and prefix computation have been discussed in detail by Swartzlander (7), Hennessy (8), and Koren (9).
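The operator of Equation (12) can be modeled directly, as in the following sketch. A left-to-right scan is used here for clarity; an actual prefix adder evaluates the same associative operator as a balanced tree in O(log n) levels. Names and data representation are assumptions made for illustration.

def prefix_carries(a_bits, b_bits, c0=0):
    """Carries via the associative operator of Equation (12):
    (g, p) o (G, P) = (g + p*G, p*P).  Illustrative sketch."""
    combine = lambda hi, lo: (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])
    pairs = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]   # (g_i, p_i)
    carries, acc = [], (c0, 0)        # seed with the carry-in (propagate = 0)
    for gp in pairs:
        acc = combine(gp, acc)        # (G_i, P_i) relative to C_0
        carries.append(acc[0])        # C_i = G_i
    return carries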
DYNAMIC CARRY LOGIC
Dynamic carry propagation mechanisms exploit the nature
of the input bit patterns to speed up carry propagation and
rely on the fact that the carry propagation on average is of the order of log_2 n. Due to the dynamic nature of this mechanism, valid results from addition are available at different times for different input patterns. Thus, adders that employ this technique have completion signals that flag valid results.
Carry Completion Sensing Adder
The carry-completion sensing adder (CCA) works on the principle of creating two carry vectors, C and D, the primary and secondary carry vectors, respectively. The 1s in C are the generate terms shifted once to the left and are determined by detecting 1s in a pair of A_i and B_i bits, which represent the ith position of the addend and augend, A and B, respectively. The 1s in D are generated by checking the carries triggered by the primary carry vector C, and these are the propagate terms. Figure 5 shows an example of such a carry computation process. The sum can be obtained by adding A, B, C, and D without propagating carries. An n-bit CCA has an approximate gate count of 17n + 1 and a speed of n + 4. Hwang (10) discusses the carry-completion sensing adder in detail. Sklansky (11) provides an evaluation of several two-summand binary adders.

Figure 3. The CSA and CKA propagate carries over groups of k bits.
Carry Elimination Adder
Ignoring carry propagation, Equation (1) describes a half-adder, which can be implemented by a single XOR gate. In principle, it is possible to determine the sum of two n-bit numbers by performing half addition on the input operands at all bit positions in parallel and by iteratively adjusting the result to account for carry propagation. This mechanism is similar to the CCA. However, the difference is that the CCA uses primary and secondary carry vectors to account for carry propagation, whereas the carry elimination adder (CEA) iterates. The CEA algorithm for adding two numbers A and B is formalized by the following steps:
1. Load A and B into two n-bit storage elements called
SUM and CARRY.
2. Compute the bit-wise XOR and the bit-wise AND of SUM and CARRY simultaneously.
3. Route the XORed result back to SUM, and left-shift the ANDed result and route it back to CARRY.
4. Repeat the operations until the CARRY register
becomes zero. At this point, the result is available
in SUM.
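The four steps above translate directly into software. The sketch below models the CEA iteration on unsigned integers (an illustrative assumption; the article describes a hardware implementation).

def cea_add(a, b, width=32):
    """Carry elimination addition of two unsigned integers (illustrative).

    SUM holds the bit-wise XOR and CARRY the left-shifted bit-wise AND;
    the loop repeats until CARRY becomes zero, as in steps 1-4."""
    mask = (1 << width) - 1
    summ, carry = a & mask, b & mask
    iterations = 0
    while carry != 0:
        summ, carry = (summ ^ carry) & mask, ((summ & carry) << 1) & mask
        iterations += 1            # counts the adjustment iterations performed
    return summ, iterations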
The implementation of the algorithm and detailed com-
parisons with other carry-propagation mechanisms have
been discussed by Ramachandran (12).
Figure 5 shows an example of adding two numbers using the CEA algorithm. Note that the primary carry vector C in the CCA is the same as the CARRY register value after the first iteration. The number of iterations that the CEA performs before converging to a sum is equal to the maximum length of the carry chain for the given inputs A and B. On average, the length of a carry chain for n-bit random patterns is log_2 n. The gate count of the CEA is about 8n + 22 gates. It approaches the CLA in terms of speed and the RCA in terms of gate count.
MATHEMATICAL ESTIMATION OF THE CARRY-CHAIN
LENGTH
For a given carry chain of length j, the probability of being in the propagate state is k/k^2 = 1/k. Define P_n(j) as the probability that the addition of two uniformly random n-bit numbers has a maximal-length carry chain of length j:

P_n(j) = 0 if n < j,   and   P_n(n) = (1/k)^n     (13)
P_n(j) can be computed using dynamic programming if all outcomes contributing to probability P_n(j) are split into suitable disjoint classes of events, which include each contributing outcome exactly once. All outcomes contributing to P_n(j) can be split into two disjoint classes of events:
Figure 4. Performing 4-bit prefix computation and extending it to 8-bit numbers.

Figure 5. The CCA and CEA use dynamic carry propagation.
Class 1: A maximal carry chain of length j does not start at the first position. The events of this class consist precisely of outcomes with initial prefixes having 0 up to (j − 1) propagate positions followed by one nonpropagate position and then followed with the probability that a maximal carry chain of length j exists in the remainder of the positions. A probability expression for this class of events is shown in Equation (14):

Σ_{t=0}^{j−1} (1/k)^t ((k − 1)/k) P_{n−t−1}(j)     (14)

In Equation (14), each term represents the condition that the first (t + 1) positions are not a part of a maximal carry chain. If a position k < t in some term was instead listed as nonpropagating, it would duplicate the outcomes counted by the earlier case t = k − 1. Thus, all outcomes with a maximal carry chain beginning after the initial carry chains of length less than j are counted, and none is counted twice. None of the events contributing to P_n(j) in class 1 is contained by any case of class 2 below.
Class 2: A maximal carry chain of length j does begin in the first position. What occurs after the end of each possible maximal carry chain beginning in the first position is of no concern. In particular, it is incorrect to rule out other maximal carry chains in the space following the initial maximal carry chain. Thus, initial carry chains of lengths j through n are considered. Carry chains of length less than n are followed immediately by a nonpropagate position. Equation (15) represents this condition:

(1/k)^m + Σ_{t=j}^{m−1} (1/k)^t ((k − 1)/k)     (15)

The term P_m(m) = (1/k)^m handles the case of a carry chain of full length m, and the summand handles the individual cases of maximal-length carry chains of length j, j + 1, j + 2, ..., m − 1. Any outcome with a maximal carry chain of length j not belonging to class 2 belongs to class 1. In summary, any outcome with a maximal carry chain of length j, which contributes to P_n(j), is included once and only once in the disjoint collections of classes 1 and 2.
Adding the probabilities for collections 1 and 2 gives the dynamic programming solution for P_n(j), where P_n(j) = p_n(j) + p_n(j + 1) + ... + p_n(n − 1) + p_n(n), and p_n(i) is the probability of the occurrence of a maximal-length carry chain of precisely length i. Thus, the expected value of the carry length [being the sum from i = 1 to n of i · p_n(i)] becomes simply the sum of the P_n(j) from j = 1 to n. Results of dynamic programming indicate that the average carry length in the 2-ary number system for 8 bits is 2.511718750000; for 16 bits it is 3.425308227539; for 32 bits, 4.379535542335; for 64 bits, 5.356384595083; and for 128 bits, 6.335725789691.
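The two-class recursion can be evaluated with a short dynamic program. The following sketch (illustrative; function names are assumptions) computes the cumulative probabilities P_n(j) and sums them to obtain the expected maximal carry-chain length, reproducing values close to those quoted above.

from functools import lru_cache

def expected_carry_chain(n, k=2):
    """Expected maximal carry-chain length for n-digit radix-k addition,
    using the two-class dynamic program of Equations (13)-(15).
    Illustrative sketch of the recursion described in the text."""
    @lru_cache(maxsize=None)
    def P(n, j):
        # Cumulative probability that the maximal chain has length >= j.
        if j > n:
            return 0.0
        # Class 1: the maximal chain does not start at the first position.
        class1 = sum((1.0 / k) ** t * ((k - 1.0) / k) * P(n - t - 1, j)
                     for t in range(j))
        # Class 2: a chain of length >= j starts at the first position.
        class2 = (1.0 / k) ** n + sum((1.0 / k) ** t * ((k - 1.0) / k)
                                      for t in range(j, n))
        return class1 + class2
    # Expected value = sum over j of Prob(maximal chain >= j).
    return sum(P(n, j) for j in range(1, n + 1))

# expected_carry_chain(8) is about 2.51; expected_carry_chain(64) about 5.36.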
APPROXIMATION ADDITION
To generate the correct final result, the calculation must consider all input bits to obtain the final carry out. However, carry chains are usually much shorter, so a design that considers only the previous k inputs, instead of all previous input bits, for the current carry bit can approximate the result (13). Given that the delay cost of calculating the full carry chain of N bits is proportional to log(N), if k equals the square root of N, the new approximation adder will perform twice as fast as the fully correct adder. With random inputs, the probability of having a correct result considering only k previous inputs is:

P(N, k) = (1 − 1/2^{k+2})^{N−k−1}

This is derived with the following steps. First, consider why the prediction is incorrect. If we only consider k previous bits to generate the carry, the result will be wrong only if the carry propagation chain is greater than k − 1. Moreover, the previous bit must be in the carry generate condition. This can only happen with a probability of 1/2^{k+2} if we consider a k-segment. Thus, the probability of being correct is one minus the probability of being wrong. Second, there are a total of N − k − 1 segments in an N-bit addition. To produce the final correct result, no segment should be in the error condition. We multiply all the probabilities to produce the final product. This equation can determine the risk taken by selecting the value of k. For example, assuming random input data, a 64-bit approximation adder with 8-bit look-ahead (k = 8) produces a correct result 95% of the time.
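The probability expression is simple to evaluate, as in the sketch below (an illustrative helper, not part of the original design).

def approx_adder_accuracy(N, k):
    """Probability that a k-bit look-ahead approximation adder produces the
    exact N-bit sum for uniformly random inputs:
    P(N, k) = (1 - 1/2**(k+2))**(N-k-1)."""
    return (1.0 - 1.0 / 2 ** (k + 2)) ** (N - k - 1)

# approx_adder_accuracy(64, 8) is roughly 0.95, matching the example above.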
Figure 6 shows a sample approximation adder design with k = 4. The top and bottom rows are the usual carry, propagate, and generate circuits. The figure also shows the sum circuits used in other parallel adders. However, the design implements the carry chain with 29 4-bit carry blocks and three boundary cells. These boundary cells are similar but smaller in size. A Manchester carry chain could implement the 4-bit carry blocks. Thus, the critical path delay is asymptotically constant with this design, and the cost complexity approaches N. In comparison with Kogge–Stone or Han–Carlson adders, this design is faster and smaller.

It is worthwhile to note that an error indication circuit could be, and probably should be, implemented because we know exactly what causes a result to be incorrect. Whenever a carry propagation chain is longer than k bits, the result given by the approximation adder circuit will be incorrect. That is, for the ith carry bit, if the logic function (a_{i−1} XOR b_{i−1}) AND (a_{i−2} XOR b_{i−2}) AND ... AND (a_{i−k} XOR b_{i−k}) AND (a_{i−k−1} AND b_{i−k−1}) is true, the prediction will be wrong. We could implement this logic function for each carry bit and perform the logical OR of all these n − 4 outputs to signal us if the approximation is incorrect.
HIGH PERFORMANCE IMPLEMENTATION

The most recently reported adder implemented with state-of-the-art CMOS technology is described in Ref. 14. The adder style used in that implementation is a variation of the prefix adder previously mentioned. Consideration was given not only to gate delay but also to fan-in and fan-out, as well as to wiring delay in the design. Careful tuning was done to make sure the design is balanced and the critical path is minimized.
BIBLIOGRAPHY
1. H. L. Garner, Number systems and arithmetic, in F. Alt and M. Rubinoff (eds.), Advances in Computers. New York: Academic Press, 1965, pp. 131–194.
2. E. J. McCluskey, Logic Design Principles. Englewood Cliffs,
NJ: Prentice-Hall, 1986.
3. S. Waser andM. J. Flynn, Introductionto Arithmetic for Digital
System Designers. New York: Holt, Reinhart and Winston,
1982.
4. N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed. Reading, MA: Addison-Wesley, 1993, chaps. 5–8.
5. H. Ling, High speed binary parallel adder, IEEE Trans. Comput., 799–802, 1966.
6. Milos D. Ercegovac and Tomas Lang, Digital Arithmetic. San
Mateo, CA: Morgan Kaufmann, 2003.
7. E. E. Swartzlander Jr., Computer Arithmetic. Washington,
D.C.: IEEE Computer Society Press, 1990, chaps. 58.
8. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed. San Mateo, CA: Morgan Kaufmann, 1996, pp. A-38–A-46.
9. I. Koren, Computer Arithmetic Algorithms, 2nded. A. K. Peters
Ltd., 2001.
10. K. Hwang, Computer Arithmetic. New York: Wiley, 1979,
chap. 3.
11. J. Sklansky, An evaluation of several two-summand binary
adders, IRE Trans. EC-9 (2): 213226, 1960.
12. R. Ramachandran, Efficient arithmetic using self-timing, Master's Thesis, Corvallis, OR: Oregon State University, 1994.
13. S.-L. Lu, Speeding up processing with approximation circuits,
IEEE Comput. 37 (3): 6773, 2004.
14. S. Mathew et al., A 4-GHz 130-nm address generation unit with 32-bit sparse-tree adder core, IEEE J. Solid-State Circuits, 38(5): 689–695, 2003.
SHIH-LIEN LU
Intel Corporation
Santa Clara, California
RAVICHANDRAN RAMACHANDRAN
National Semiconductor
Corporation
Santa Clara, California
Figure 6. An example 32-bit approximation adder.
AUTOMATIC TEST GENERATION
INTRODUCTION
This article describes the topic of automatic test generation (ATG) for digital circuits and systems. Considered within the scope of ATG are methods and processes that support computer-generated tests and supporting methodologies.
Fundamental concepts necessary to understand defect
modeling and testing are presented to support later dis-
cussions on ATG techniques. In addition, several closely
related topics are also presented that affect the ATG pro-
cess, such as design for test (DFT) methodologies and
technologies.
One can test digital systems to achieve one of several
goals. First, testing can be used to verify that a system
meets its functional specifications. In functional testing, algorithms, capabilities, and functions are verified to
ensure correct design and implementation. Once a system
has been veried to be correct, it can be manufactured in
quantity. Second, one wishes to know whether each man-
ufactured system is defect free. Third, testing can be used in the field to determine whether a system remains defect free. Functional tests
can provide the basis for defect tests but are ineffective in
providing acceptable defect tests. Defect tests can be devel-
oped by creating tests in an ad hoc fashion followed by
evaluation using a fault simulator. In complex systems, this
process can be challenging and time consuming for the test
engineer. As a result, many effective techniques have been
developed to perform ATG as well as to make ATG more
effective.
Technologic trends have continued to offer impressive increases in capability and performance in computing function and capacity. Moore's law noted the annual doubling of circuit complexities in 1966 that continued through 1976 (1).¹ Although the rate of complexity doubling slowed, such increases are both nontrivial and continuous, and with the increases in complexity comes the added burden of testing these increasingly complex systems.
Supporting the technologic improvements are comple-
mentary advances in the many supporting technologies
including design tools, design practices, simulation, man-
ufacturing, and testing. Focusing on testing, the technolo-
gic advances have impacted testing in several ways. First,
the increase in the number of pins for an integrated circuit
has not increased at the same rate as the number of devices
on the integrated circuit. In the context of testing, the
increasing relative scarcity of pins creates a testing bottle-
neck because more testing stimulus and results must be
communicated through relatively fewer pins. Second,
design methodologies have changed to reect the trends
in the introduction of increasingly more complex systems.
So-called systems on chip (SoC) approaches enable
designers to assemble systems using intellectual property
(IP) cores purchased from vendors. SoCs present their own
challenges in testing. SoC testing is complicated even more by observing that vendors are, understandably, reluctant to provide sufficient detail on the inner workings of their cores to enable the development of a suitable defect test. Indeed, vendors may be unwilling to provide test vectors that can provide hints on the inner workings. As a result, the idea of embedded test has grown out of these challenges.
Third, the sheer complexity of the systems can make it
prohibitively expensive to develop effective tests manually.
As a result, reliance on tools that can generate tests auto-
matically can reduce manufacturing costs. Furthermore,
effective testing schemes also rely on the integration of
testing structures to improve the coverage, reduce the
number of tests required, and complement the ATG pro-
cess. Fourth, ATG serves as an enabling technology for
other testing techniques. For example, synthesis tools
remove the burden of implementing systems down to the
gate level. At the same time, gate level detail, necessary for
assembling a testing regimen, may be hidden from the
designer. ATG fills this gap by generating tests for synthesis without requiring the designer to develop test synthesis tools as well.
FUNDAMENTALS OF TESTING
In this section, fundamental concepts from testing are introduced. First, fault modeling is presented to define the target for testing techniques. Second, testing measures are presented to provide a metric for assessing the efficacy of a given testing regimen. Finally, fault simulation is used to quantify the testing measures. We will use the three-universe model (2) to differentiate the defect from the manifestation of the fault and also from the system malfunction. A fault is the model of the defect that is present physically in the circuit. An error is the manifestation of the fault, where a signal will have a value that differs from the desired value. A failure is the malfunction of the system that results from errors.
Fault Modeling
Circuits can fail in many ways. The failures can result from manufacturing defects, infant mortality, random failures, age, or external disturbances (2). The defects can be localized, which affect the function of one circuit element, or distributed, which affect many or all circuit elements. The failures can result in temporary or permanent circuit failure. The fault model provides an analytical target for testing methodologies and strategies. Thus, the fidelity of the fault models, in the context of the implementation technology, can impact the efficacy of the testing (3). For
example, the stuck fault model is considered to be an
ineffective model for many faults that occur in CMOS
1
Moore actually stated his trend in terms of the number of
components per integrated circuit for minimum cost (1)
1
Wiley Encyclopedia of Computer Science and Engineering, edited by Benjamin Wah.
Copyright # 2008 John Wiley & Sons, Inc.
circuits. In addition, the fault model may influence the overall test strategy.
The fault models selected depend on the technology used to implement the circuits. Manufacturing defects exist as a consequence of manufacturing the circuit. The intro-
as a consequence of manufacturing the circuit. The intro-
duction and study of manufacturing defects is a heavily
studied topic because of its impact on the profitability of the device. Dust or other aerosols in the air can affect the defect statistics of a particular manufacturing run. In addition, mask misalignment and defects in the mask can also increase the defect densities. Other fault models can account for additional failure processes such as transient faults, wear out, and external disturbances. Figure 1 gives some example faults that are discussed in more detail in the following sections.
Stuck-at Fault Models. Stuck-at fault models are the simplest and most widely used fault models. Furthermore, the stuck-at fault model also models a common failure mode in digital circuits. The stuck-at fault model requires the adoption of several fundamental assumptions. First, a stuck-at fault manifests itself as a node being stuck at either of the allowable logic levels, zero or one, regardless of the inputs that are applied to the gate that drives the node. Second, the stuck-at fault model assumes that the faults are permanent. Third, the stuck-at fault model assumes a value for the fault, but otherwise preserves the gate function. The circuit shown in Fig. 1 is used to illustrate the fault model. The output of gate G_1 can be stuck-at 1 (SA-1) as a result of a defect. When the fault is present, the corresponding input to G_4 will always be one. To express the error, a discrepancy with respect to fault-free operation must occur in the circuit as a consequence of the fault. To force the discrepancy, the circuit inputs are manipulated so that A = B = 1, and a discrepancy is observed at the output of G_1. A second example circuit is shown in Fig. 2, which
consists of an OR gate (G_1) that drives one input in each of three AND gates (G_2, G_3, and G_4). Consider the presence of an SA-1 fault on any input to G_1; the fault results in the output being 1 as a consequence of the fault. In G_1, input and output SA-1 faults are indistinguishable and, for modeling purposes, can be collapsed into a single fault. A somewhat higher fidelity model can also include gate input stuck faults. For example, in the event gate input I_2 has a stuck-at 1 fault, the situation is somewhat different. In this case, O_1 = I_3I_4, with G_3 and G_4 not affected directly by the fault.
Delay Fault Models. A delay fault is a fault in which a part of the circuit operates slowly compared with a correctly operating circuit. Because normal manufacturing variations result in delay differences, the operation must result in a sufficient delay discrepancy to produce a circuit malfunction. For example, if a delay fault delays a change to a flip-flop excitation input until after the expected clock edge, then a fault is manifest. Indeed, when a delay fault is present, the circuit may operate correctly at slower clock rates, but not at speed. Delay faults can be modeled at several levels (4). Gate delay fault models are represented as excessive propagation delay. The transition fault model is slow to transition either from 0 to 1 or from 1 to 0. A path delay fault is present when the propagation delay through a series of gates is longer than some tolerable worst-case delay. Indeed, a current industry practice is to perform statistical timing analysis of parts. The manufacturer can determine that the parts can be run at a higher speed with a certain probability so that higher levels of performance can be delivered to customers. However, this relies on the statistical likelihood that delays will not be worst case (4). By running the device at a higher clock rate indicated by statistical grading, devices and structures that satisfy worst-case timing along the critical path may not meet the timing at the new, higher clock rate. Hence, a delay fault can seem to be a consequence of the manufacturing decisions. Assuming the indicated delay fault in Fig. 1, Fig. 3 gives a timing diagram that shows the manifestation of the fault. In this circuit, the delay fault causes the flip-flop input, J, to change later, which results in a clock-period delay in the flip-flop state change. Because of the nature of delay faults, circuits must be tested at speed to detect the delay fault. As posed here, the delay fault is dynamic, requiring two or more test vectors to detect the fault.
Figure 1. An illustration of fault models.
Figure 2. Illustration of input stuck-at faults.
Bridging Faults. Bridging faults exist when an undesir-
able electrical connection occurs between two nodes result-
ing in circuit performance degradation or malfunction.
Bridging faults between a circuit node and power supply
or ground may be manifest as stuck faults. Furthermore,
bridging faults may result in behavioral changes such as
wired-and, wired-or, and even sequential characteristics
when the bridging fault creates a feedback connection (5).
Bridging faults require physical proximity between the
circuit structures aficted by the bridging faults.
Figure 1 gives an example of a bridging fault that changes
the combinational circuit into a sequential circuit.
CMOS Fault Models. CMOS technology has several fault
modes that are unique to the technology (5). Furthermore, as a consequence of the properties of the technology, alternative methods for detecting faults in CMOS circuits are necessary. CMOS gates consist of complementary networks of PMOS and NMOS transistors structured such that significant currents are drawn only when signal changes occur. In fault-free operation, when no signal changes occur, the circuit draws very low leakage currents. Understanding the different fault models requires deeper exploration into CMOS circuit structures. CMOS circuits are constructed from complementary pull-up networks of PMOS transistors and pull-down networks of NMOS transistors. In addition, MOS transistors switch based on voltage levels relative to the other transistor terminals. The switching input, or gate, is the input and draws no current other than very low leakage currents. The gate does, however, have significant parasitic capacitance that must be charged and discharged to switch the transistor. Thus, significant currents are drawn when transistors are switched.
In addition to the stuck faults, CMOS circuits have an interesting failure mode where an ordinary gate can be transformed into a dynamic sequential circuit for certain types of faults. The fault is a consequence of a transistor failure, low quiescent currents, and capacitive gate inputs. In Fig. 4, if transistor Q_1 is stuck open and if A = 0, the past value on node C is isolated electrically and will act as a storage element through the capacitance on the inverter input. Table 1 summarizes the sequence of inputs necessary to detect the transistor Q_1 stuck-open fault. To detect this fault, node C must first be set by assigning A = B = 0, followed by setting B = 1 to store the value at the input of G_2. Each of the four transistors in the gate will require a similar test.
The CMOS circuit's current draw can be used as a diagnostic for detecting faults. For example, because the CMOS circuit should only draw significant currents when the circuit is switching, any significant deviation from a known current profile suggests faults. Indeed, the CMOS
Figure 3. Illustration of a delay fault.
Figure 4. An illustration of a CMOS memory fault.
Table 1. Input sequence to detect transistor Q_1 stuck-open

A  B  Note
0  0  Set C to 1
0  1  C remains 1 for fault, 0 for no fault
logic gates may function correctly, but when faults are present, the circuit may draw abnormally large power supply currents. Testing for faults based on this observation is called I_DDQ testing. Bridging faults are common in CMOS circuits (6) and are detected effectively with I_DDQ testing (7). I_DDQ faults can have a significant impact on portable designs where the low current drawn by CMOS circuits is required. Increased I_DDQ currents can result from transistors that are degraded because of manufacturing defects such that measurable leakage currents are drawn. In addition, bridging faults can also show increased I_DDQ currents.
Memory Faults. Semiconductor memories have structures that are very regular and very dense. As a result, memories can exhibit faults that are not observed ordinarily in other circuits, which can complicate the testing process. The faults can affect the memory behavior in unusual ways (8). First, a fault can link two memory cells such that when a value is written into one cell, the value in the linked cell toggles. Second, the memory cell can only be written to 0 or 1 but cannot be written with the opposite value. Third, the behavior of a memory cell may be sensitive to the contents of neighboring cells. For example, a particular pattern of values stored in surrounding cells may prevent writing into the affected cell. Fourth, a particular pattern of values stored in the cells can result in a change in the value in the affected cell. The nature of these faults makes detection challenging because the test must take into account the physical locality of memory cells.
Crosspoint Faults. Crosspoint faults (9) are a type of defect that can occur in programmable logic arrays (PLAs). PLAs consist of AND arrays and OR arrays, with functional terms contributing through programming transistors to either include or exclude a term. In field-programmable devices, a transistor is programmed to be on or off, respectively, to represent the presence or absence of a connection. A crosspoint fault is the undesired presence or absence of a connection in the PLA. Clearly, because the crosspoint fault can result in a change in the logic function, the stuck fault model cannot model crosspoint defects effectively. A crosspoint fault with a missing connection in the AND array results in a product term of fewer variables, whereas an extra connection results in more variables in the product term. For example, consider the function f(A, B, C, D) = AB + CD implemented on a PLA. The existence of a crosspoint fault can change the function to f_cpf(A, B, C, D) = ABC + CD. Figure 5 diagrams the structure of the PLA and the functional effect of the crosspoint fault.

Figure 5. An illustration of a crosspoint fault.
I_DDQ Defects. An ideal CMOS circuit draws current only when logic values change. In practice, because transistors are not ideal, a small leakage current is drawn when no circuit changes occur. Many circuit defects result in anomalous significant currents that are one type of fault manifestation. Furthermore, many circuits can have a characteristic I_DDQ current when switching. Again, this characteristic current can change in response to defects. Detectable I_DDQ defects have no relation to the expected correct circuit outputs, which requires that testing for I_DDQ-detectable defects be supplemented with other testing approaches. Furthermore, an integrated circuit is generally integrated with other circuit technologies that draw significant quiescent currents, for example, IC pads, bipolar, and analog circuits. Thus, testing for I_DDQ defects requires that the supply for I_DDQ-testable circuits be isolated from the supply for other parts of the circuit.
Deep Sub-Micron (DSM). Deep sub-micron (DSM) technologies offer the promise of increased circuit densities and speeds. For several reasons, the defect manifestations change with decreasing feature size (3). First, supply voltages are reduced along with an associated reduction in noise margins, which makes circuits more susceptible to malfunctions caused by noise. Second, higher operating frequencies affect defect manifestations in many ways. Capacitive coupling increases with increasing operating frequency, which increases the likelihood of crosstalk. Furthermore, other errors may be sensitive to operating frequency and may not be detected if testing is conducted at slower frequencies. Third, leakage currents increase with decreasing feature size, which increases the difficulty of using tests to detect current anomalies. Fourth, increasing circuit density has resulted in an increase in the number of interconnect levels, which increases the likelihood of interconnect-related defects. Although the classic stuck-at fault model was not conceived with these faults in mind, the stuck-at fault model does focus testing goals on controlling and observing circuit nodes, thereby detecting many interconnect faults that do not conform to the stuck-at fault model.
Measures of Testing. To gauge the success of a test methodology, some metric for assessing the test regimen and any associated overhead is needed. In this section, the measures of test set fault coverage, test set size, hardware overhead, performance impact, testability, and computational complexity are presented.
Figure 5. An illustration of a crosspoint fault.

Fault Coverage. Fault coverage, sometimes termed test coverage, is the percentage of targeted faults that have been covered by the test regimen. Ideally, 100% fault coverage is desired; however, this statistic can be misleading when the fault model does not accurately reflect the types of faults that can be expected to occur (10). As noted earlier, the stuck-at fault model is a simple and popular fault model that works well in many situations. CMOS circuits, however, have several failure modes that are beyond the scope of the simple stuck-at fault model. Fault coverage is determined through fault simulation of the respective circuit. To assess the performance of a test, a fault simulator should accurately model the targeted fault to get a realistic measure of fault coverage.
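As a quick illustration of the metric itself, the following sketch computes coverage as the fraction of modeled faults that fault simulation reports as detected; the fault names and detection results are hypothetical placeholders.

```python
# A minimal sketch of the fault-coverage metric: detected faults divided by all
# modeled faults.  The sets below are hypothetical fault-simulation results.
detected = {"a_sa0", "b_sa1", "f_sa0"}                      # flagged by fault simulation
modeled  = {"a_sa0", "a_sa1", "b_sa0", "b_sa1", "f_sa0", "f_sa1"}

coverage = 100.0 * len(detected & modeled) / len(modeled)
print(f"fault coverage: {coverage:.1f}%")                   # 50.0% for this toy example
```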
Size of Test Set. The size of the test set is an indirect measure of the complexity of the test set. Larger test sets increase the testing time, which has a direct impact on the final cost if expensive circuit testers are employed. In addition, the test set size is related to the computational effort and personnel required to develop the test. The size of the test set depends on many factors, including the ease with which the design can be tested as well as the integration of DFT methodologies. Use of scan path approaches, with flip-flops interconnected as shift registers, gives excellent fault coverage, yet the process of scanning into and out of the shift register may result in large test sets.
Hardware Overhead. The addition of circuitry to improve testability through the integration of built-in test and built-in self test (BIST) capabilities increases the size of a system and can have a significant impact on circuit costs. The ratio of the circuit size with test circuitry to the circuit size without test circuitry is a straightforward measure of the hardware overhead. If improved testability is a requirement, then increased hardware overhead can be used as a criterion for evaluating different designs. The additional hardware can simplify the test development process and enable testing for cases that are otherwise impractical. System failure rates are a function of the size and the complexity of the implementation, where larger circuits ordinarily have higher failure rates. As a result, the additional circuitry for test can increase the likelihood of system failure.
Impact on Performance. Likewise, the addition of test circuitry can impact system performance. The impact can be measured in terms of reduced clock rate, higher power requirements, and/or increased cost. For example, scan design methods add circuitry to flip-flops so that they can switch between normal and test modes, and these flip-flops typically have longer delays compared with circuits not so equipped. For devices with fixed die sizes and for PLAs, the addition of test circuitry may displace circuitry that contributes to the functional performance.
Testability. Testability is an analysis and a metric that describes how easily a system may be tested for defects. In circuit defect testing, the goal is to supply inputs to the circuit so that it behaves correctly when no defects are present but malfunctions if a single defect is present. In other words, the only way to detect the defect is to force the circuit to malfunction. In general, testability is measured in terms of the specific and the collective observability and controllability of nodes within a design. For example, a circuit that provides the test engineer direct access (setting and reading) to flip-flop contents is testable more easily than one that does not, which gives a correspondingly better testability measure. In the test community, testability often is described in the context of controllability and observability. Controllability of a circuit node is the ability to set the node to a particular value. Observability of a circuit node is the ability to observe the value of the node (either complemented or uncomplemented) at the circuit outputs. Estimating the difficulty of controlling and observing circuit nodes forms the basis for testability measures. Figure 6 presents a simple illustration of the problem and the process. The node S is susceptible to many types of faults. The general procedure for testing the correct operation of node S is to control the node to a value complementary to the fault value. Next, the observed value of the signal is propagated to system outputs for observation.

Detecting faults in systems that have redundancy of any sort requires special consideration to detect all possible faults. For example, fault-tolerant systems that employ triple modular redundancy will not show any output discrepancies when one masked fault is present (2). To make the modules testable, individual modules must be isolated so that the redundancy does not mask the presence of faults. In addition, redundant gates necessary to remove hazards from combinational circuits result in a circuit where certain faults are untestable. Improved testability can be achieved by making certain internal nodes observable through the addition of test points. The Sandia controllability/observability analysis program is an example application that evaluates the testability of a circuit or system (11).
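To make the notion of controllability estimates concrete, the following sketch applies simplified SCOAP-style cost rules to a two-gate example; the specific rules ("1 plus the minimum or the sum of the input costs") are the commonly cited ones and are an assumption here, not a description of the Sandia program itself.

```python
# A sketch of SCOAP-style combinational controllability estimates, assuming
# two-input gates and the usual "1 + min / 1 + sum" cost rules.  CC0/CC1 are
# the estimated costs of setting a node to 0 or 1; primary inputs cost 1.

def and_gate(cc_a, cc_b):
    cc0 = 1 + min(cc_a[0], cc_b[0])          # any single input at 0 forces the output to 0
    cc1 = 1 + cc_a[1] + cc_b[1]              # all inputs must be set to 1
    return (cc0, cc1)

def not_gate(cc_a):
    return (1 + cc_a[1], 1 + cc_a[0])        # inverter swaps the two costs

A = B = C = (1, 1)                           # primary inputs
n1 = and_gate(A, B)
n2 = and_gate(n1, C)
print(n2)                                    # (2, 5): the deeper node is harder to set to 1
```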
²For example, an exponential-time algorithm may double the required resources when the problem size is increased by one; this result is much like what happens when the number of pennies doubles on each successive square of a chessboard.
Figure 6. Representative circuit with a fault: node S lies between the primary inputs and the rest of the circuit driving the primary outputs.
Computational Complexity. The computational complexity measures both the number of computations and the storage requirements necessary to achieve a particular algorithmic goal. From these measures, bound estimates of the actual amount of time required can be determined. In testing applications, the worst-case computational complexity for many algorithms used to find tests is unfortunately bad. Many such problems fall in the class of NP-Complete problems, for which no polynomial-time (i.e., good) algorithm is known. Instead, the best known algorithms to solve NP-Complete problems require exponential time.² Although devising a perfect test is highly desirable, in practice, 100% coverage generally is not achieved. In most cases, tests are found more quickly than the worst case, and cases taking too long are stopped, which results in a test not being found. Some have noted that most tests are generated in a reasonable amount of time and provide an empirical rationale to support this assertion (12).
Fault Simulation

Fault simulation is a simulation capable of determining whether a set of tests can detect the presence of faults within the circuit. In practice, a fault simulator simulates the fault-free system concurrently with the faulty system. In the event that faults produce circuit responses that differ from the fault-free cases, the fault simulator records the detection of the fault.

To validate a testing approach, fault simulation is employed to determine the efficacy of the test. Fault simulation can be used to validate the success of a test regimen and to give a quantitative measure of the fault coverage achieved in the test. In addition, test engineers can use fault simulation for assessing functional test patterns. By examining the faults covered, the test engineer can identify circuit structures that have not received adequate coverage and can target these structures for more intensive tests.

To assess different fault models, the fault simulator should both model the effect of the faults and also report the faults detected. In the test for bridging faults detectable by I_DDQ testing, traditional logic and fault simulators are incapable of detecting such faults because these faults may not produce a fault value that can differentiate faulty from fault-free instances. In Ref. 13, a fault simulator capable of detecting I_DDQ faults is described.
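The sketch below illustrates the idea behind fault simulation with a simple serial (one fault at a time) simulator rather than the concurrent method mentioned above; the two-gate circuit, the node names, and the test set are assumptions chosen only for illustration.

```python
# A minimal serial fault-simulation sketch: each stuck-at fault is injected in
# turn and the circuit is re-evaluated; a fault is detected when its output
# differs from the fault-free response.  The circuit and fault list are toys.

def simulate(x1, x2, x3, fault=None):
    """f = (x1 AND x2) OR x3, with an optional stuck-at fault on one named node."""
    def val(name, v):
        return fault[1] if fault and fault[0] == name else v
    a = val("a", val("x1", x1) & val("x2", x2))
    return val("f", a | val("x3", x3))

faults = [(n, v) for n in ("x1", "x2", "x3", "a", "f") for v in (0, 1)]
test_set = [(1, 1, 0), (0, 1, 1), (0, 0, 0), (1, 0, 0)]

detected = {f for f in faults for t in test_set
            if simulate(*t, fault=f) != simulate(*t)}
print(f"{len(detected)} of {len(faults)} stuck-at faults detected by this test set")
```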
BASIC COMBINATIONAL ATG TECHNIQUES

In ATG, a circuit specification is used to generate a set of tests. In this section, several basic techniques for ATG are presented. The stuck-at fault model described previously provides the test objective for many ATG approaches. The single stuck fault is a fault on a node within the circuit that is either SA-0 or SA-1. Furthermore, only one fault is assumed to be in the circuit at any given time. Presented in detail here are algebraic approaches for ATG, Boolean satisfiability ATG, the D-Algorithm (one of the first ATG algorithms), and PODEM. Subsequent developments in ATG are compared largely with the D-Algorithm and other derived works.
Algebraic Approaches for ATG

Algebraic techniques may be used to derive tests for faults and can be used in ATG. The Boolean difference (2,14,15) is an algebraic method for finding a test should one exist. Given a Boolean function F, the Boolean difference is defined as

$$\frac{dF(X)}{dx_i} = F(x_1, x_2, \ldots, x_{i-1}, 0, x_{i+1}, \ldots, x_n) \oplus F(x_1, x_2, \ldots, x_{i-1}, 1, x_{i+1}, \ldots, x_n) \quad (1)$$
where $\frac{dF(X)}{dx_i}$ is the Boolean difference of the Boolean function F, $x_i$ is an input, and $\oplus$ is the exclusive-or. One interpretation of the quantity $\frac{dF(X)}{dx_i}$ is that it shows the dependence of F(X) on input $x_i$. If $\frac{dF(X)}{dx_i} = 0$, the function is independent of $x_i$, which indicates that it is impossible to find a test for a fault on $x_i$. On the other hand, if $\frac{dF(X)}{dx_i} = 1$, then the output depends on $x_i$, and a test for a fault on $x_i$ can be found.
The Boolean difference can be used to determine a test because it can be used in an expression that encapsulates both controllability and observability into a Boolean tautology that, when satisfied, results in a test for the fault:

$$x_i \cdot \frac{dF(X)}{dx_i} = 1 \quad (2)$$

for $x_i$ SA-0 faults, and

$$\bar{x}_i \cdot \frac{dF(X)}{dx_i} = 1 \quad (3)$$

for $x_i$ SA-1 faults. Note that the Boolean difference represents the observability of input $x_i$, and the assertion associated with $x_i$ represents its controllability. Equations (2) and (3) can be reduced to SOP or POS forms. A suitable assignment of inputs that satisfies the tautology is the test pattern. Finding a suitable test pattern is computationally intractable if product and sum terms have more than two terms (16).
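The following sketch, which is not from the article, evaluates Equations (1) and (2) by exhaustive enumeration for a small example function; the function F and the input ordering are arbitrary assumptions made for illustration.

```python
# A small sketch of Equations (1)-(3): the Boolean difference dF/dx_i is
# computed by exhaustive evaluation of the two cofactors, and a test for
# x_i stuck-at-0 is any assignment with x_i = 1 and dF/dx_i = 1.
from itertools import product

def F(x1, x2, x3):                       # example function, chosen arbitrarily
    return (x1 and x2) or x3

def boolean_difference(f, i, bits):
    lo, hi = list(bits), list(bits)
    lo[i], hi[i] = 0, 1
    return f(*lo) ^ f(*hi)               # exclusive-or of the two cofactors

# tests for x1 stuck-at-0 satisfy x1 * dF/dx1 = 1 (Equation (2))
tests_sa0 = [b for b in product([0, 1], repeat=3)
             if b[0] == 1 and boolean_difference(F, 0, b) == 1]
print(tests_sa0)                         # [(1, 1, 0)]: the output changes if x1 is stuck at 0
```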
Boolean Satisfiability ATPG

Boolean satisfiability ATPG (SAT-ATPG) (17) is related to the Boolean difference method for determining test patterns. As in the Boolean difference method, SAT-ATPG constructs the Boolean difference between the fault-free and the faulty circuits. Rather than deriving a formula to derive a test, SAT-ATPG creates a satisfiability problem such that the variable assignments that achieve satisfiability are a test for the fault. The satisfiability problem is derived from the combinational circuit by mapping the circuit structure into a directed acyclic graph (DAG). From the DAG, a formula in conjunctive normal form (CNF) is derived that, when satisfied, produces a test for a fault in the circuit. Although the SAT problem is NP-Complete (16), the structure of the resulting formula has specific features that tend to reduce the computational effort compared with the general SAT problem. Indeed, the resulting CNF formula is characterized by having a majority of its factors contain only two terms. Note that satisfiability of a CNF formula in which every factor has two terms (2SAT) is solvable in linear time. Reference 17 notes that as many as 90% of the factors have two elements. This structure suggests strongly that satisfiability for expressions resulting from this construction is solvable in a reasonable amount of time, although a polynomial-time bound is not guaranteed. The basic SAT-ATPG is described in more detail in the following paragraphs.
As an example, consider the circuit example used to illustrate the D-Algorithm in Fig. 7. The DAG derived is given in Fig. 8. The mapping of the circuit to the DAG is straightforward, with each logic gate, input, output, and fanout point mapping to a node in the DAG. Assuming inputs X, Y, and output Z, the conjunctive normal forms for the NAND and OR gates are (17)

$$\text{NAND: } (X + Z)(Y + Z)(\bar{X} + \bar{Y} + \bar{Z}) \qquad \text{OR: } (\bar{X} + Z)(\bar{Y} + Z)(X + Y + \bar{Z}) \quad (4)$$
CNF formulas for other gates and the procedure for handling gates with three or more inputs are also summarized in Ref. 17. The CNFs for the fault-free and faulty circuits are derived from the DAG. The exclusive-or of the outputs of the fault-free and faulty circuits must be 1 to distinguish between the two, expressed as

$$F_{\text{faulty}} \oplus F_{\text{fault-free}} = 1 \quad (5)$$

where satisfaction of the tautology results in a test for the desired fault. Starting with the output, the conjunction of all nodes is formed following the edges of the DAG. The fault-free CNF for the circuit is the conjunction of the following conjunctive normal forms for the various circuit structures:
$$\begin{aligned}
F_f:\quad & (\bar{f} + F)(\bar{i} + F)(f + i + \bar{F}) && G5\\
& (X_1 + f)(b_1 + f)(\bar{X}_1 + \bar{b}_1 + \bar{f}) && G1\\
& (g + i)(h + i)(\bar{g} + \bar{h} + \bar{i}) && G4\\
& (b_2 + g)(X_3 + g)(\bar{b}_2 + \bar{X}_3 + \bar{g}) && G2\\
& (\bar{X}_2 + b_1)(X_2 + \bar{b}_1) && \text{top fan-out at } b\\
& (\bar{X}_2 + b_2)(X_2 + \bar{b}_2) && \text{bottom fan-out at } b\\
& (X_4 + h)(X_5 + h)(\bar{X}_4 + \bar{X}_5 + \bar{h}) && G3
\end{aligned} \quad (6)$$
The CNF for the i-SA-0 fault requires a modification to the circuit DAG to represent the presence of the fault. From the DAG structure derived from the circuit and the target fault, a CNF formula is derived. A CNF that represents the fault test is formed by taking the exclusive-or of the fault-free circuit with a CNF form that represents the circuit with the fault. The CNF for the faulted circuit is derived by modifying the DAG by breaking the connection at the point of the fault and adding a new variable to represent the fault. The variable $i_0$ is used to represent the fault i-SA-0. Note that in the faulty circuit, the combinational circuitry determining i in the fault-free circuit is redundant, and the CNF formula for the faulted circuit is

$$\begin{aligned}
F'_f:\quad & (\bar{f} + F')(\bar{i}_0 + F')(f + i_0 + \bar{F}') && G5\\
& (\bar{i}_0) && \text{the fault}\\
& (X_1 + f)(b_1 + f)(\bar{X}_1 + \bar{b}_1 + \bar{f}) && G1\\
& (\bar{X}_2 + b_1)(X_2 + \bar{b}_1) && \text{top fan-out at } b
\end{aligned} \quad (7)$$
Combining Equations (6) and (7), taking the exclusive-or of the faulty and fault-free formulas, and eliminating redundant terms gives the following formula whose satisfaction is a test for the fault:

$$\begin{aligned}
& (\bar{f} + F)(\bar{i} + F)(f + i + \bar{F}) && G5\\
& (X_1 + f)(b_1 + f)(\bar{X}_1 + \bar{b}_1 + \bar{f}) && G1\\
& (g + i)(h + i)(\bar{g} + \bar{h} + \bar{i}) && G4\\
& (b_2 + g)(X_3 + g)(\bar{b}_2 + \bar{X}_3 + \bar{g}) && G2\\
& (\bar{X}_2 + b_1)(X_2 + \bar{b}_1)(\bar{X}_2 + b_2)(X_2 + \bar{b}_2) && \text{fan-outs at } b\\
& (X_4 + h)(X_5 + h)(\bar{X}_4 + \bar{X}_5 + \bar{h}) && G3\\
& (\bar{f} + F')(\bar{i}_0 + F')(f + i_0 + \bar{F}')(\bar{i}_0) && \text{faulty } G5 \text{ and the fault}\\
& (F + F' + \overline{BD})(\bar{F} + \bar{F}' + \overline{BD})(F + \bar{F}' + BD)(\bar{F} + F' + BD)(BD) && \text{output exclusive-or}
\end{aligned} \quad (8)$$
Note that the last line of Equation (8) represents the exclusive-or of the faulty and the fault-free circuits, and the variable BD is the output that represents the exclusive-or of the two. Significantly, most factors in Equation (8) have two or fewer terms.

Figure 7. Example combinational circuit.

Figure 8. DAG for the circuit of Fig. 7. Logic gate nodes are labeled to show the original circuit functions.
The next step is to determine an assignment that satisfies Equation (8). The problem is broken into two parts, where one part represents the satisfaction of the three-term factors and the second the satisfaction of the two-term factors (solvable in polynomial time) in a manner consistent with the three-term factors. Efforts that followed Ref. 17 concentrated on identifying heuristics that improved the efficiency of finding assignments.
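The sketch below is a toy illustration of the construction only: a two-gate circuit stands in for Fig. 7, and brute-force enumeration stands in for a real SAT solver; the variable names and the injected fault are assumptions.

```python
# A toy illustration of the SAT-ATPG construction: gate CNF factors for the
# fault-free and faulty copies are conjoined with factors forcing the two
# outputs to differ (Equation (5)); any satisfying assignment is a test.
from itertools import product

VARS = ["x1", "x2", "a", "f", "af", "ff"]      # af, ff are the faulty-copy nodes

def factors(v):
    x1, x2, a, f, af, ff = (v[n] for n in VARS)
    return [
        # fault-free copy: a = AND(x1, x2); f = NOT(a)
        (not a) or x1, (not a) or x2, a or not x1 or not x2,
        f == (not a),
        # faulty copy: the x1 input of the AND gate replaced by constant 0 (x1 SA-0)
        (not af) or False, (not af) or x2, af or True or (not x2),
        ff == (not af),
        # the two outputs must differ
        f != ff,
    ]

tests = [v for bits in product([False, True], repeat=6)
         if all(factors(v := dict(zip(VARS, bits))))]
print([(int(t["x1"]), int(t["x2"])) for t in tests])   # [(1, 1)] detects x1 SA-0
```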
D-Algorithm

The D-Algorithm (18) is an ATG algorithm for combinational logic circuits. Furthermore, the D-Algorithm was the first combinational ATG algorithm to guarantee the ability to find a test for a SA-0/1 fault should a test exist. In addition, the D-Algorithm provides a formalism for composing tests for combinational circuits constructed modularly or hierarchically. The D-Algorithm relies on a five-valued Boolean algebra to generate tests, which is summarized in Table 2. Note that the values D and D̄ represent a discrepancy between the fault-free and faulty signal values, where these values can be either the seminal error of the fault or a discrepancy attributable to the fault that has been propagated through several layers of combinational logic. The D-Algorithm also requires two additional assumptions. First, exactly one stuck-at fault may be present at any given time. Second, other than the faulted node, circuit structures are assumed to operate fault free (i.e., normally).
To begin the algorithm, the discrepancy that represents the direct manifestation of the fault is assigned to the output of a primitive component. For this component, the input/output combination that forces the manifestation of the fault is called the primitive D-cube of failure (PDCF). The PDCF provides a representation of the inputs necessary to produce discrepancies for the faults of interest. The effect of the fault is propagated through logic circuits using the propagation D-cubes (PDCs) for each circuit element. The application of PDCs continues through primitive elements until the discrepancy is propagated to one or more primary outputs. Next, the inputs are justified through a backward propagation step using the singular cover for each component in the backward path. The singular cover is a compressed truth table for the fault-free circuit. Singular covers, PDCFs, and PDCs for several basic gates are shown in Table 3. Note that the PDCFs and PDCs follow straightforwardly from the logic functions and the five-valued Boolean logic summarized in Table 2. Theoretic derivations of these terms are presented in Ref. 18.

The D-Algorithm consists principally of two phases. The first phase is the D-drive phase, where the fault is set through the selection of an appropriate PDCF and then propagated to a primary output. Once the D-drive is complete, justification is performed. Justification is the process of determining signal values for internal nodes and primary inputs consistent with node assignments made during the D-drive and intermediate justification steps. In the event a conflict occurs, where at some point a node must be both 0 and 1 to satisfy the algorithm, backtracking occurs to the various points in the algorithm where choices for assignments were possible, and an alternate choice is made. An input combination that propagates the fault to the circuit outputs and can be justified at the inputs is a test pattern for the fault. To generate a test for a combinational logic circuit, the D-Algorithm is applied for all faults for which tests are desired.
The D-Algorithm is applied to the circuit given in Fig. 8. For an example demonstration, consider the fault i-SA-0. Figure 9 gives a summary of the algorithmic steps that result in the determination of the test pattern. The resulting test pattern for the example in Fig. 9 is $X_1 X_2 X_3 X_4 X_5 = 111XX$, where X is as defined in Table 2. Either fault simulation can be used to identify other faults detected by this test, or the don't-cares can be used to combine tests for two or more different faults.
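The sketch below implements only the five-valued value algebra for a two-input NAND gate, not the full D-Algorithm search; the encoding of D and D̄ as good/faulty value pairs is an assumption chosen for illustration.

```python
# A sketch of D-propagation through a two-input NAND gate in the five-valued
# algebra of Table 2 (values 0, 1, X, D, D').  Each input is evaluated twice,
# once for the good circuit and once for the faulty one; the resulting pair
# determines the five-valued output.

GOOD  = {"0": 0, "1": 1, "D": 1, "D'": 0}
FAULT = {"0": 0, "1": 1, "D": 0, "D'": 1}

def nand5(a, b):
    if "X" in (a, b):
        # an unknown input is handled conservatively: only a controlling 0 forces the output
        return "1" if a == "0" or b == "0" else "X"
    g = 1 - (GOOD[a] & GOOD[b])
    f = 1 - (FAULT[a] & FAULT[b])
    return {(0, 0): "0", (1, 1): "1", (1, 0): "D", (0, 1): "D'"}[(g, f)]

print(nand5("D", "1"))    # D': the discrepancy propagates, inverted by the NAND
print(nand5("D", "0"))    # 1 : a controlling 0 on the other input blocks propagation
print(nand5("D", "X"))    # X : an unknown side input leaves the output unknown
```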
Path-Oriented Decision Making (PODEM)

The D-Algorithm was pioneering in that it provided a complete algorithmic solution to the problem of test pattern generation for combinational circuits. In the years after it was introduced, researchers and practitioners noticed that the D-Algorithm had certain undesirable asymptotic properties and that the general problem is NP-Complete (10), meaning that the worst-case performance is an exponential number of steps in the number of circuit nodes. Despite this finding, for many types of problems the D-Algorithm can find tests in a reasonable amount of time. In Ref. 17, it was noted that the D-Algorithm was particularly inefficient in determining tests for circuit structures typically used in error-correcting circuits (ECC). Typically, ECC circuits have a tree of exclusive-OR gates with reconvergent fanout through two separate exclusive-OR trees. The path-oriented decision making (PODEM) test pattern generation algorithm (20) was proposed to speed the search for tests for circuits similar to and used in ECC. In fact, the researchers learned that their approach was in general as effective and more efficient computationally compared with the D-Algorithm.

PODEM is fundamentally different from the D-Algorithm in that test searches are conducted on the primary inputs. As a result, the amount of backtracking that might occur is less than in the D-Algorithm because fewer places exist where backtracking can occur.
Table 2. Boolean values
Value   Meaning
1       Logic one
0       Logic zero
X       Unspecified (don't care)
D       Discrepancy: expected one, but is zero due to the fault
D̄       Discrepancy: expected zero, but is one due to the fault

Considering the fault simulation of the counter with the Da-SA-1 fault (Fig. 14), we see that the pattern
repeats every eight clocks compared with the expected four
clocks. Furthermore, the effect of the fault can be latent
because discrepancies are stored in the state and not
observable immediately. Because multiple discrepancies
can be present simultaneously, the circuit may occasion-
ally show correct operation, for example at T8, despite the
occurrence of several discrepancies in prior times. In this
example, if the state is observable, the fault can be detected
at T2. Suppose that only the inputs and the clock can be
controlled and only the output, Z, can be observed. In this
case, the detection of the fault is delayed for two clock
cycles, until T4.
Synchronous ATG is challenging, but it can be under-
stood and in some cases solved using concepts from combi-
national ATG methods. Indeed, this leads to one of three
approaches. First, tests can be created in an ad hoc fashion.
Then, using a technique called fault simulation similar to
that presented in Fig. 14, detected faults can be tabulated
and reported. Second, the circuit can be transformed in
specic ways so that combinational ATG can be applied
directly. Such approaches are presented in the following
subsections. Third, the circuit operation can be modeled so
that the circuit operation gives the illusion of a combina-
tional circuit, and it allows combinational ATG to be used.
Indeed, any of these three techniques can be used together
as testing needs dictate.
Iterative Array Models for Deterministic ATPG

Clearly, ATG for sequential circuits is more challenging compared with combinational circuits. Many sequential ATG techniques rely on the ability to model a sequential machine in a combinational fashion, which makes possible the use of combinational ATG in sequential circuits. Analytical strategies for determining test patterns use iterative array models (19). The iterative array models provide a theoretic framework for making sequential machines seem combinational from the perspective of the ATG process. Iterative arrays follow from unrolling the operation of the state machine, where each unrolling exposes another time frame or clock edge that can update the state of the state machine. The unrolling process replicates the excitation functions in each time frame, applying the inputs consistent with that particular state. Memory elements are modeled using structures that make them seem combinational. Furthermore, output functions are replicated as well. Each input and output is now represented as a vector of values where each element gives the value within a specific time frame. With this model, combinational ATG methods can be used to determine tests for specific times.

Figure 11. General model of a sequential circuit.

Figure 12. Five-state counter: (a) state diagram; (b) implementation.
Iterative array methods complicate the application of sequential ATG methods in four major ways: (1) the size of the combinational circuit is not known, (2) the state contribution is a constrained input, (3) multiple iterations express faults multiple times, and (4) integration-level issues complicate when and how inputs are controlled and outputs are observed. For test pattern generation, the concatenation of the inputs from the different time frames, $X^{t-1} X^{t} X^{t+1} \ldots$, serves as the input for the combinational ATG algorithms, where for the actual circuit the superscript specifies the time frame in which that input is set to the required value. The combinational ATG algorithm outputs are the concatenation of the frame outputs, $Z^{t-1} Z^{t} Z^{t+1} \ldots$, where again the superscript identifies when an output should be observed to detect a particular fault (Fig. 15).
The size of the circuit is determined by the initial state and the number of array iterations necessary to control and to observe a particular fault. In each iteration, the iterative model replicates the combinational circuitry and all inputs and outputs for that time step. The ATG process will, by necessity, be required to instantiate combinational logic iterations on demand because the number of iterations necessary to test for a specific fault is not known in advance. For some initial states, it is impossible to determine a test for specific faults because multiple expressions of a fault may mask its presence. Different algorithms will assume either an arbitrary initial state or a fixed state.
From the timing diagram in Fig. 13, the iterative circuit model for the counter example with a Da-SA-1 fault is given in Fig. 16. A discrepancy occurs after the fourth clock cycle, where the test pattern required to detect the fault is $Dc^0 Db^0 Da^0\, E^0 E^1 E^2 E^3\, Z^0 Z^1 Z^2 Z^3 = 00000000001$, where the input pattern is the concatenation of the initial state $Dc^0 Db^0 Da^0$ and the inputs at each time frame $E^0 E^1 E^2 E^3$, and the outputs at each time frame are $Z^0 Z^1 Z^2 Z^3$. Because the circuit in Fig. 16 is combinational, algorithms such as the D-Algorithm and PODEM can be employed to determine test patterns. Both algorithms would require modification to handle both variable iterations and multiple faults.
From the iterative array model, two general approaches can be pursued to produce a test for a fault. The first approach identifies a propagation path from the point of the fault to the output. From the selected output, reverse-time processing is employed to sensitize a path through the frame iterations back to the point of the fault. In the event a path is not found, backtracking and other heuristics are employed. After the propagation path for the fault is established, reverse-time processing is used to justify the inputs required to express the fault and to maintain consistency with previous assignments. The second approach employs forward processing from the point of the fault to propagate it to the primary outputs and then reverse-time processing to justify the conditions to express the fault and maintain consistency (21,22). Additional heuristics are employed to improve the success and the performance of the test pattern generation process, including using hints from fault simulation (22). Several approaches can be applied for determining test patterns.
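The sketch below illustrates the time-frame unrolling idea on a deliberately tiny machine (one flip-flop), not the counter of Fig. 12; the next-state and output functions, the injected Da-SA-1-style fault, and the brute-force search standing in for combinational ATG are all assumptions.

```python
# A sketch of the iterative-array idea: a single-flip-flop machine is unrolled
# into k combinational time frames, and a brute-force search over the input
# sequence stands in for a combinational ATG algorithm.
from itertools import product

def frame(state, x, da_sa1=False):
    """One time frame: next_state = x XOR state, z = x AND state."""
    d = 1 if da_sa1 else (x ^ state)          # fault: the flip-flop input stuck at 1
    return d, (x & state)

def unrolled_outputs(inputs, initial_state, fault):
    s, outs = initial_state, []
    for x in inputs:
        s, z = frame(s, x, fault)
        outs.append(z)
    return outs

# search for the shortest input sequence whose output history exposes the fault
for k in range(1, 5):
    for seq in product([0, 1], repeat=k):
        if unrolled_outputs(seq, 0, False) != unrolled_outputs(seq, 0, True):
            print(f"detecting sequence of length {k}: {seq}")   # length 2: (0, 1)
            raise SystemExit
```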
Figure 13. Effect of faults on the state diagram.

Figure 14. Fault simulation of the counter with Da-SA-1.

Genetic Algorithm-Based ATPG

Because of the challenges of devising tests for sequential circuits, many approaches have been studied. Genetic algorithms (GAs) operate by generating and evaluating
populations of organisms for fitness to a particular purpose and then using the fitness to identify the organisms from which the next generation is created (23). Organism structure and capabilities are defined by a genetic code represented as a string of values that set the configuration of the organism. The organism constructed from the genetic code is then evaluated for fitness. Organisms in the next generation are generated from fit organisms in the current generation using simulated variations of crossover and mutation.

In the context of a DUT, a string of test patterns can serve as the genome for the organism, and the fitness can be a function of whether and how well a particular fault is detected by the sequence. The work of Hsiao et al. (24) used GAs to assemble test sequences for faults from promising pieces. In their work, the GA operates in two phases; the first targets controlling the fault and the second targets propagating the faulted node to the primary outputs. Sequences from which a test is assembled are from one of three categories: (1) distinguishing sequences, (2) set/clear sequences, and (3) pseudoregister justification sequences. Distinguishing sequences propagate flip-flop fault effects to the primary outputs. Set/clear sequences justify (i.e., force) flip-flops to specific states. Finally, pseudoregister justification sequences put sets of flip-flops into specific states. These sequences form the basis from which tests are generated for a particular fault. The process is initialized in some random fashion with a set of strings that represent the initial GA generation of test sequences. The likelihood that any given string can detect the fault depends on the complexity and the size of the circuit. Assuming the circuit is sufficiently complex that no organism can detect the fault, the fitness function must include the ability to identify organisms that have good qualities (i.e., have sequencing similar to what one might expect of the actual test sequence). Two fitness functions were formed in Ref. 24 for the justification and the propagation phases. Each was the weighted sum of six components that include the ability to detect the fault, measures of controllability, distinguishing measures, circuit activity measures, and flip-flop justification measures. The GA test generation operates in three stages, where the GA is run in each stage until the fault coverage plateaus. In addition, test lengths are allowed to grow in subsequent stages, on the assumption that faults not detectable with shorter sequences may be detected with longer sequences.
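The toy sketch below shows only the generic GA loop (selection, crossover, mutation) over input sequences; the single-component fitness function is a crude stand-in for the multi-component fitness of Ref. 24, and the population and sequence sizes are arbitrary assumptions.

```python
# A toy sketch of GA-based sequence generation: organisms are input sequences,
# the fitter half survives, and children are formed by crossover and mutation.
# The fitness here only rewards input activity, a placeholder for real measures.
import random

SEQ_LEN, POP, GENERATIONS = 8, 20, 30
random.seed(2)

def fitness(seq):
    return sum(a != b for a, b in zip(seq, seq[1:]))    # hypothetical activity proxy

def crossover(p1, p2):
    cut = random.randrange(1, SEQ_LEN)
    child = p1[:cut] + p2[cut:]
    if random.random() < 0.2:                           # occasional mutation
        i = random.randrange(SEQ_LEN)
        child[i] ^= 1
    return child

pop = [[random.randint(0, 1) for _ in range(SEQ_LEN)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[: POP // 2]
    pop = survivors + [crossover(*random.sample(survivors, 2))
                       for _ in range(POP - len(survivors))]
print(max(pop, key=fitness))                            # best test sequence found
```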
Figure 15. Excerpt from the iterative array sequential circuit model.

Figure 16. Iterative array model for the example circuit with the Da-SA-1 fault.

DESIGN FOR TESTABILITY

Many ATG techniques provide capabilities that are practical in a variety of situations. DFT is the process and the discipline of designing digital systems to make them easier to test. Furthermore, constructing circuits that facilitate testing also simplifies ATG. In this capacity, DFT can affect the entire testing process, from ATG to testing. DFT techniques apply other concepts and design paradigms, including circuit fault models and techniques for detecting faults. Modifying designs themselves to improve and to simplify the testing process can offer many benefits. First, time devoted to test development is reduced, with a higher likelihood of guaranteed results. Second, DFT reduces the test set size because individual tests are generally more effective. Third, DFT can reduce the time necessary to test a system. DFT approaches fall into one of two categories. The first category starts with a traditional design and adds circuit structures to facilitate the testing process. The second category attacks the testing process from basic design principles, resulting in systems that are inherently easy to test. In this section, DFT approaches that support ATG are presented.
Circuit Augmentation to Facilitate ATG

Digital systems designed in an ad hoc fashion may be difficult to test. Several basic techniques are available to improve the testability of a given system. These techniques include test point insertion, scan design methods, and boundary scan techniques. These approaches preserve the overall structure of the system.

Test Point Insertion. Test point insertion is a simple and straightforward approach for providing direct controllability and observability of problem circuit nodes. In test point insertion, additional inputs and outputs are provided to serve as inputs and outputs for testing purposes. These additional test points do not provide any additional functional capabilities. The identification of these points can follow from a testability analysis of the entire system by identifying difficult-to-test internal nodes. Circuit nodes are selected as test points to facilitate testing of difficult-to-test nodes or modules. As test points are identified, the testability analysis can be repeated to determine how well the additional test points improve testability and to determine whether additional test points are necessary. Indeed, test points clearly enhance the efficacy of ATG (25). Furthermore, test point insertion can provide the basis for structured design-for-testability approaches. The principal disadvantage of adding test points is the increased expense and reduced performance that result from their addition. One study showed (26) that adding 1% additional test points increases circuit overhead by only 0.5% but can impact system performance by as much as 5%. In that study, the test points were internal and were inserted by state-of-the-art test point insertion software.
Scan Design Methods. The success of test point insertion relies on the selection of good test points to ease the testing burden. Determining these optimal points for test point insertion can be difficult. As a result, structured approaches can be integrated to guarantee ATG success. One particularly successful structured design approach is scan design. Scan design methods improve testability by making the internal system state both easily controllable and easily observable by configuring, in test mode, selected flip-flops as a shift register (27-30), effectively making each flip-flop a test point. Taken to the logical extreme, all flip-flops are part of the shift register and are therefore also test points. The power of this configuration is that the storage elements can be decoupled from the combinational logic, enabling the application of combinational ATG to generate fault tests. The shift register organization has the additional benefit that it can be controlled by relatively few inputs and outputs. Because a single shift register might be excessively long, it may be broken into several smaller shift registers.

Historically, scan path approaches follow from techniques incorporated into the IBM System/360, where shift registers were employed to improve the testability of the system (27). A typical application of a scan design is given in Fig. 17. Note that the switching of the multiplexer at the flip-flop inputs controls whether the circuit is in test or normal operation. Differences between scan design methods occur in the flip-flop characteristics or in the clocking.

Scan Path Design. Relevant aspects of the design include the integration of race-free D flip-flops to make the flip-flops fully testable (29). Level-sensitive scan design (28) is formulated similarly. One fundamental difference is that the machine state is implemented using special master-slave flip-flops clocked with nonoverlapping clocks to enable testing of all stuck faults in flip-flops.
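The sketch below models only the functional idea of a scan test (shift a state in, apply one capture clock, shift the response out); the next-state function and the chosen test state are arbitrary placeholders, and no real scan-cell timing is modeled.

```python
# A sketch of the scan-path idea: in test mode the flip-flops form a shift
# register, so a chosen state can be shifted in, one functional clock captures
# the combinational response, and the result is shifted out for comparison.

def comb_logic(state, x):
    a, b, c = state
    return [x ^ a, a & b, b | c]            # placeholder next-state function under test

def scan_test(state_in, x):
    chain = list(state_in)                  # shift the test state into the chain (test mode)
    captured = comb_logic(chain, x)         # one capture clock (normal mode)
    return captured                         # shifted out and compared off-chip

print(scan_test([1, 0, 1], x=1))            # expected fault-free response: [0, 0, 1]
```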
Boundary Scan Techniques. In many design approaches, applying design for testability to some components is impossible. For example, standard parts used on printed circuit boards are not typically designed with a full system scan design in mind. As another example, more and more ASIC designs are integrated from cores, which are subsystems designed by third-party vendors. The core subsystems are typically processors, memories, and other devices that until recently were individual integrated circuits themselves. To enable testing in these situations, boundary scan methods were developed. Boundary scan techniques employ shift registers to achieve controllability and observability for the inputs and outputs of circuit boards, chips, and cores. An important application of boundary scan approaches is to test the interconnect between chips and circuit boards that employ boundary scan techniques. In addition, the boundary scan techniques provide a minimal capability to perform defect testing of the components at the boundary. The interface to the boundary scan is a test access port (TAP) that enables setting and reading of the values at the boundary. In addition, the TAP may also allow internal testing of the components delimited by the boundary scan. Applications of boundary scan approaches include BIST applications (31), test of cores (32), and hierarchical circuits (33). The IEEE (Piscataway, NJ) has created and approved the IEEE Std 1149.1 boundary scan standard (34). This standard encourages designers to employ boundary scan techniques by making possible testable designs constructed with subsystems from different companies that conform to the standard.
ATG and Built-In Test

Background. Requiring built-in tests affects the test generation process in two ways. First, the mechanism for generating test patterns must be self-contained within the circuitry itself. Although in theory circuitry can be designed to produce any desired test pattern sequence, in reality most such circuits are impractical. As a result, simpler circuitry must be employed to generate test patterns. ATG with built-in test requires the circuits to have the ability to generate test sequences and also to determine that the circuits operate correctly after being presented with the test sequence.

Three classes of circuits are typically employed because the patterns have good properties or because they have the ability to produce any required sequence of test patterns. The first type of circuit is a simple N-bit counter, which can generate all possible assignments to N bits. Counter solutions may be impractical because test sequences become long even for circuits with relatively few inputs, and they may also be ineffective in producing sequences that test for delay or CMOS faults. Researchers have investigated optimizing count sequences to achieve more reasonable test lengths (35). The second type of circuit generates pseudorandom sequences using linear feedback shift registers (LFSRs). For combinational circuits, as the number of random test patterns applied to the circuit increases, fault coverage increases asymptotically toward 100%. Much research has been conducted in the development of efficient pseudorandom sequence generators. An excellent source on many aspects of pseudorandom techniques is Ref. 36. A third type of circuit is constructed to generate specific test patterns efficiently for specific types of circuit structures. In this case, the desired sequence of test patterns is examined and a machine is synthesized to recreate the sequence. Memory tests have shown some success in using specialized test pattern generator circuits (37).
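As a brief illustration of the second class of pattern generators, the sketch below steps a 4-bit LFSR whose feedback taps were chosen, as an assumption for this example, so that the register cycles through all 15 nonzero states; each state serves as one pseudorandom test pattern.

```python
# A sketch of a pseudorandom pattern generator built from a 4-bit LFSR.
# The tap positions are chosen here so the register produces a maximal-length
# sequence; each clock yields one test pattern on the register outputs.

def lfsr_patterns(seed=0b1000, n=15):
    state, patterns = seed, []
    for _ in range(n):
        patterns.append(state)
        feedback = ((state >> 3) ^ (state >> 2)) & 1     # XOR of the two top bits
        state = ((state << 1) | feedback) & 0xF
    return patterns

print([f"{p:04b}" for p in lfsr_patterns()])   # 15 distinct nonzero 4-bit patterns
```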
To determine whether a fault is present, the outputs of the circuit must be monitored and compared with the expected fault-free outputs. Test equipment solves this problem by storing the expected circuit outputs for a given sequence of inputs applied by the tester. As noted above, it may be impractical to store or to recreate the exact circuit responses. Alternate approaches employ duplication (several are summarized in Ref. 2), where a duplicate subsystem guarantees the ability to generate correct circuit outputs, assuming a single fault model. A discrepancy between the outputs of the duplicated modules implies the presence of a fault. Although duplication often is used in systems that require fault tolerance or safety, duplication may be an undesirable approach in many situations. An alternative approach, signature analysis, compresses the circuit output responses into a single code word, a signature, which is used to detect the presence of faults. Good circuit responses are taken either from a known good circuit or, more frequently, from circuit simulations. A fault in a circuit results in a signature that differs from the expected good signature with high probability.
Figure 17. One scan path approach.

Signature Analysis. The cost of testing is a function of many influences that include design costs, testing time, and
test equipment costs. Thus, reducing the test set size and easing the determination of whether a fault is present have a great influence on the success of the system. In signature analysis, the circuit responses are reduced effectively to one representative value, which is termed a signature. The signature comparison can be performed internally, using the same technology as the system proper. In signature analysis, the signature register is implemented as an LFSR, as shown in Fig. 18. The LFSR consists of a shift register of length N, with linear connections fed back to stages nearer the beginning of the shift register. With successive clocks, the LFSR combines its current state with the updated test point values. Figure 19 shows two different LFSR configurations: (1) a single-input shift register (SISR) compresses the results of a single test point into a signature, and (2) a multiple-input shift register (MISR) compresses several test point results. After the test sequence is complete, the contents of the LFSR are compared with the known good signature to determine whether faults are present. A single fault may result in the LFSR contents differing from the good signature, but this generally does not provide sufficient information to identify the specific fault. Furthermore, a single fault may result in a final LFSR state that is identical to the good signature, which is termed aliasing. This outcome is acceptable if aliasing occurs with low probability. In Ref. 36, upper bounds on the aliasing probability were derived for signatures computed with SISRs. In addition, in Ref. 30, methods for developing MISRs with no aliasing for single faults were developed. The circuitry necessary to store the signature, to generate a vector to compare with the signature, and to compare the signatures is modular and simple enough to be integrated with circuit functions of reasonable sizes, which makes signature analysis an important BIST technique. LFSRs can be used in signature analysis in several ways.
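The following sketch folds a response stream into a 4-bit single-input signature register; the tap positions, the seed, and the two response streams are assumptions made for illustration, with the faulty stream differing from the good one in a single bit.

```python
# A sketch of signature analysis with a 4-bit single-input signature register
# (SISR): each response bit is folded into the register state, and a faulty
# response stream leaves a different final signature with high probability.

def sisr_signature(bits, seed=0b0000):
    state = seed
    for b in bits:
        feedback = (((state >> 3) ^ (state >> 2)) & 1) ^ b   # fold in one response bit
        state = ((state << 1) | feedback) & 0xF
    return state

good_response   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
faulty_response = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]       # one bit flipped
print(f"{sisr_signature(good_response):04b}", f"{sisr_signature(faulty_response):04b}")
```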
BILBO. The built-in logic block observer (BILBO) approach has gained fairly wide usage as a result of its modularity and flexibility (40). The BILBO approach can be used in both scan path and signature analysis test applications by encapsulating several important functions. BILBO registers operate in one of four modes. The first mode is used to hold the state for the circuitry, as D flip-flops. In the second mode, the BILBO register can be configured as a shift register that can be used to scan values into the register. In the third mode, the register operates as a multiple-input signature register (MISR). In the fourth mode, the register operates as a parallel random pattern generator (PRPG). These four modes make possible several test capabilities. A four-bit BILBO register is shown in Fig. 20. One example application of BILBO registers is shown in Fig. 21. In test mode, two BILBO registers are configured to isolate one combinational logic block. The BILBO register at the input, R1, is configured as a PRPG, whereas the register at the output, R2, is configured as a MISR. In test mode operation, for each random pattern generated, one output is taken and used to compute the next intermediate signature in the MISR. When all tests have been performed, the signature is read and compared with the known good signature. Any deviation indicates the presence of faults in the combinational circuit. To test the other combinational logic block, the functions of R1 and R2 need only be swapped. Configuration of the data path to support test using BILBO registers is best achieved by performing register allocation and data path design with testability in mind (41).
Figure 18. Linear feedback shift register.

Figure 19. Different signature register configurations: (a) SISR and (b) MISR.

Memory BIST

Semiconductor memories are designed to achieve high storage densities in specific technologies. High storage densities are achieved by developing manufacturing processes that result in the replication, organization, and optimization of a basic storage element. Although in principle the memories can be tested as other sequential storage elements are, in reality the overhead associated with using scan path and similar test approaches would severely impact the storage capacity of the devices. Furthermore, the basic function of a memory typically allows straightforward observability and controllability of stored information. On the other hand, the regularity of the memory's physical structure and the requisite optimizations result in fault manifestations such as linkage between adjacent memory cells. From a testing perspective, the manifestation of the fault is a function of the state of a memory cell and its physically adjacent memory cells. Among the test design considerations for memories is the number of tests as a function of the memory capacity. For example, a test methodology was developed (37) for creating test patterns that could be computed using a state machine that is relatively simple and straightforward. The resulting state machine was shown to be implementable in random logic and as a microcode-driven sequencer.
Programmable Devices

With several vendors currently (2004) marketing FPGA devices capable of providing in excess of one million usable gates, testing of the programmed FPGA becomes a serious design consideration. Indeed, the capacity and performance of FPGAs make this class of technology viable in many applications. Furthermore, FPGA devices are an integration of simpler programmable device architectures, each requiring its own testing approach. FPGAs include the ability to integrate memories and programmable logic arrays, which requires the ATG approaches most suitable for each component. As summarized previously, PLAs exhibit fault models not observed in other implementation technologies. One approach for testing is to apply the BIST approaches described previously (42). PLA test can be viewed from two perspectives. First, the blank device can be tested and deemed suitable for use in an application. Second, once a device is programmed, the resulting digital system can be tested according to the methods already described. One early and notable ATG method applied to PLAs is PLATYPUS (43). The method balances random TPG with deterministic TPG to devise tests for both traditional stuck-at faults and crosspoint faults. Modern FPGAs support standard test interfaces such as the JTAG/IEEE 1149 standards. In this context, ATG techniques can be applied in the context of boundary scan for the programmed device.
Figure 20. BILBO register.

Figure 21. Application of BILBO registers to testing.

Minimizing the Size of the Test Vector Set

In the combinational ATG algorithms presented, specific faults are targeted in the test generation process. To develop a comprehensive test for a combinational circuit, one may develop a test for each fault individually. In most cases, however, the number of tests that results is much larger than necessary to achieve the same fault coverage. Indeed, the techniques of fault collapsing, test compaction, and fault simulation can produce a significant reduction in the test set size.
Fault Collapsing. Distinct faults in a circuit may produce identical effects when observed at the circuit outputs. As a result, such faults cannot be differentiated by any test. For example, if any AND gate input is SA-0, the input fault cannot be differentiated from the output SA-0 fault. In this case, the faults can be collapsed into one, the output SA-0 fault, and a test is required only for the collapsed fault.
Test Compaction. The D-Algorithm and PODEM can generate test vectors with incompletely specified inputs, providing an opportunity to merge different tests through test compaction. For example, consider a combinational circuit whose tests are given in Table 4. Examining the first two faults in the table, a and b, shows that the test 0010 will detect both faults. Test compaction can be either static or dynamic. In static test compaction, all tests are generated by the desired ATG algorithm and then analyzed off-line to determine which tests can be combined, thereby creating tests that detect several faults by specifying undetermined input values. In dynamic test compaction, after each new test is generated, the test is compacted with the cumulative compacted list. For the tests in Table 4, static compaction results in two tests that together detect all faults: 0000 and 0011, for the fault sets {a, c} and {b, d}, respectively. Dynamic compaction produces a different result, as summarized in Table 5. Note that the number of tests that results from dynamic compaction is greater than for static compaction. From a practical perspective, optimal compaction is computationally expensive, and heuristics are often employed (1). Reference 9 also notes that in cases where heuristics are used in static compaction, dynamic compaction generally produces superior results while consuming fewer computational resources. Furthermore, dynamic compaction processes vectors as they come, but more advanced dynamic compaction heuristics may choose not to compact immediately but rather to wait until a more opportune time.
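The sketch below merges the test cubes of Table 4 in a single greedy pass; because it processes cubes in arrival order, it reproduces the dynamic-compaction result of Table 5, whereas exhaustive pairing would find the two-test static solution (0000 and 0011). The merge rule and processing order are the only assumptions.

```python
# A sketch of test-cube compaction: two cubes are compatible when they agree
# wherever both are specified, and compatible cubes merge by filling in each
# other's don't-cares (X).  The cubes below are those of Table 4.

def merge(t1, t2):
    out = []
    for a, b in zip(t1, t2):
        if a != "X" and b != "X" and a != b:
            return None                        # conflict: cubes are incompatible
        out.append(a if a != "X" else b)
    return "".join(out)

cubes = ["00X0", "001X", "X00X", "X011"]       # tests for faults a, b, c, d
compacted = []
for cube in cubes:                             # greedy, one pass (dynamic-style)
    for i, c in enumerate(compacted):
        m = merge(c, cube)
        if m is not None:
            compacted[i] = m
            break
    else:
        compacted.append(cube)
print(compacted)                               # ['0010', 'X00X', 'X011'], as in Table 5
```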
Compaction Using Fault Simulation. A test that has been found for one fault may also detect several other faults. Those additional detections can be determined experimentally by performing a fault simulation for the test and identifying the additional faults that are also detected. The process is outlined as follows:

1. Initialize the fault set.
2. Select a fault from the fault set.
3. Generate a test pattern for the selected fault.
4. Run fault simulation.
5. Remove the additional detected faults from the fault set.
6. If the fault set is empty or the fault coverage threshold is met, then exit; otherwise go to step 2.
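The runnable toy below follows the loop above; a small lookup table saying which faults each candidate test detects stands in for both a real ATG algorithm (step 3) and a real fault simulator (steps 4 and 5), and the fault and test names are hypothetical.

```python
# A toy sketch of the fault-dropping procedure above.  DETECTS is a stand-in
# for ATG plus fault simulation: it maps each candidate test to the faults
# that test detects.

DETECTS = {
    "0010": {"a", "b"},
    "X00X": {"c"},
    "X011": {"b", "d"},
}

def generate_test(fault):                      # stand-in for combinational ATG (step 3)
    return next((t for t, fs in DETECTS.items() if fault in fs), None)

faults, tests = {"a", "b", "c", "d"}, []
while faults:                                  # steps 2 through 6
    test = generate_test(next(iter(faults)))
    if test is None:                           # untestable fault: drop it and continue
        break
    tests.append(test)
    faults -= DETECTS[test]                    # fault simulation drops the extra detections
print(tests)
```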
Test Vector Compression. Test vector compression takes two forms, lossless and lossy. Lossless compression is necessary in circumstances where the precise input and output values must be known, as might be the case on an integrated circuit tester. Under certain limited circumstances, lossless compression might make it possible to store compressed test vectors in the system itself. In lossy compression, the original test vectors cannot be reconstituted. Lossy compression is used in such applications as pseudorandom pattern generation and signature registers, where input and output vectors are compressed through the inherent structure of the linear feedback shift registers, as described previously. Lossy compression is suitable when the probability of not detecting an existing fault is much smaller than the proportion of uncovered faults in the circuit.
ATG CHALLENGES AND RELATED AREAS

As circuit complexity and capabilities evolve, so does the art and science of ATG. Technological innovations have resulted in the ability to implement increasingly complex and innovative designs. The same innovations also drive the evolution of testing. One trend, for example, is that newer CMOS circuits have increased quiescent currents, which impacts the ability to apply I_DDQ testing technologies.
ATG in Embedded Systems

The natural evolution of technology advances and design has resulted in systems composed of IP obtained from many sources. Clearly, the use of IP provides developers with faster development cycles and quicker entry to market. The use of IP presents several ATG challenges. First, testing embedded IP components can be difficult. Second, because the internals of the IP components are often not known, the success of ATG techniques that require full knowledge of the circuit structure will be limited. Third, IP developers may be hesitant to provide fault tests for fear that doing so would give undesired insights into the IP implementation.
Table 4. Test pattern compaction
Fault   Test
a       00X0
b       001X
c       X00X
d       X011

Table 5. Simple dynamic test compaction
Sequence   New test   Compacted tests
1          00X0       {00X0}
2          001X       {0010}
3          X00X       {0010, X00X}
4          X011       {0010, X00X, X011}

Functional ATG

Design tools and rapid prototyping paradigms result in the designer specifying hardware systems in an increasingly abstract fashion. As a result, the modern digital system designer may not get the opportunity to develop tests based on gate-level implementations. Employing ATG technologies relieves the designer of this task, provided the design tools can define and analyze a gate-level implementation.
SUMMARY

In this article, many aspects of ATG have been reviewed. ATG is the process of generating tests for a digital system in an automated fashion. The ATG algorithms are grounded in fault models that provide the objective for the test generation process. Building on fault models, ATG approaches for combinational circuits have been shown to be effective. Sequential circuits are more difficult to test because they require the circuits to be unrolled in a symbolic fashion or to be the object of specialized test pattern search algorithms. Because of the difficulties encountered in testing sequential circuits, the circuits themselves are occasionally modified to simplify the process of finding test patterns and to improve the overall test fault coverage. The inexorable technology progression provides many challenges in the testing process. As technology advances, new models and techniques must continue to be developed to keep pace.
BIBLIOGRAPHY
1. G. E. Moore, Cramming more components onto integrated
circuits, Electronics, 38 (8): 1965.
2. B. W. Johnson, Design and Analysis of Fault-Tolerant Digital
Systems, Reading, MA: Addison-Wesley Publishing Company,
1989.
3. R. C. Aitken, Nanometer technology effects on fault models for
IC testing, Computer, 32 (11): 4752, 1999.
4. M. Sivaraman and A. J. Strojwas, A Unied Approach for
Timing Verication and Delay Fault Testing, Boston: Kluwer
Academic Publishers, 1998.
5. N. K. Jha and S. Kindu, Testing and Reliable Design of CMOS
Circuits, Boston: Kluwer, 1992.
6. J. Gailay, Y. Crouzet, and M. Vergniault, Physical versus logical fault models in MOS LSI circuits: Impact on their testability, IEEE Trans. Computers, 29(6): 286-293, 1980.
7. C. F. Hawkins, J. M. Soden, R. R. Fritzmeier, and L. K.
Horning, Quiescent power supply current measurement for
CMOS IC defect detection, IEEE Trans. Industrial Electron.,
36(2): 211218, 1989.
8. R. Dekker, F. Beenker, and L. Thijssen, A realistic fault model
and test., algorithms for static randomaccess memories, IEEE
Trans. Comp.-Aided Des., 9(6): 567572, 1996.
9. M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital Systems Testing and Testable Design, Piscataway, NJ: IEEE Press, revised printing edition, 1990.
10. N. K. Jha and S. Kindu, Assessing Fault Model and Test
Quality, Boston: Kluwer, 1992.
11. L. H. Goldstein and E. L. Thigpen, SCOAP: Sandia controll-
ability/observability analysis program, Proceedings of the 17th
Conference on Design Automation, Minneapolis, MN: 1980, pp.
190196.
12. V. D. Agrawal, C. R. Kime, and K. K. Saluja, A tutorial on built-in self-test, part 2: Applications, IEEE Design and Test of Computers, 69-77, 1993.
13. S. Chakravarty and P. J. Thadikaran, Simulation and genera-
tion of IDDQ tests for bridging faults in combinational circuits,
IEEE Trans. Comp., 45(10): 1131–1140, 1996.
14. A. D. Friedman and P. R. Menon, Fault Detection in Digital
Circuits. Englewood Cliffs, NJ: Prentice-Hall, 1971.
15. Z. Kohavi, Switching and Finite Automata Theory, 2nd ed. New
York: McGraw-Hill, 1978.
16. M. R. Garey and D. S. Johnson, Computers and Intractability:
A Guide to the Theory of NP-Completeness, San Francisco, CA:
W.H. Freeman and Company, 1979.
17. T. Larrabee, Test pattern generation using Boolean satisfia-
bility, IEEE Trans. Comp. Aided Des., 5(1): 4–15, 1992.
18. J. Paul Roth, Diagnosis of automata failures: A calculus and a
method, IBM J. Res. Devel., 10: 277291, 1966.
19. O. H. Ibarra and S. K. Sahni, Polynomially complete fault
detection problems, IEEE Trans. Computers, C-24(3): 242
249, 1975.
20. P. Goel, An implicit enumeration algorithm to generate tests
for combinational logic circuits, IEEE Trans. Comp., C-30(3):
215–222, 1981.
21. I. Hamzaoglu and J. H. Patel, Deterministic test pattern gen-
eration techniques for sequential circuits, DAC, 2000, pp. 538
543.
22. T. M. Niermann and J. H. Patel, HITEC: A test generation
package for sequential circuits, Proceedings European Design
Automation Conference, 1990, pp. 214218.
23. D. E. Goldberg, Genetic Algorithms in Search, Optimization,
and Machine Learning. Reading, MA: Addison Wesley, 1989.
24. M. S. Hsiao, E. M. Rudnick, and J. H. Patel, Application of
genetically engineered finite-state-machine sequences to
sequential circuit ATPG, IEEE Trans. Comp.-Aided Design
of Integrated Circuits Sys., 17(3): 239–254, 1998.
25. M. J. Geuzebroek, J. Th. van der Linden, and A. J. van de Goor,
Test point insertion that facilitates ATPG in reducing test time
and data volume, Proceedings of the 2002 International Test
Conference (ITC 2002), 2002, pp. 138–147.
26. H. Vranken, F. S. Sapei, and H. Wunderlich, Impact of test
point insertion on silicon area and timing during layout, Pro-
ceedings of the Design, Automation and Test in Europe Confer-
ence and Exhibition (DATE04), 2004.
27. W. C. Carter, H. C. Montgomery, R. J. Preiss, and J. J. Rein-
heimer, Design of serviceability features for the IBM system/
360, IBM J. Res. & Devel., 115126, 1964.
28. E. B. Eichelberger and T. W. Williams, A logic design structure
for LSI testability, Proceedings of the Fourteenth Design Auto-
mation Conference, New Orleans, LA: 1977, pp. 462468.
29. S. Funatsu, N. Wakatsuki, and T. Arima, Test generation
systems in Japan, Proceedings of the Twelfth Design Automa-
tion Conference, 1975, pp. 114122.
30. M. J. Y. Williams and J. B. Angell, Enhancing testability of
large-scale integrated circuits via test points and additional
logic, IEEE Trans. Computers, C-22(1): 46–60, 1973.
31. A. S. M. Hassan, V. K. Agarwal, B. Nadeau-Dostie, and J.
Rajski, BIST of PCB interconnects using boundary-scan archi-
tecture, IEEE Trans. Comp., 41(10): 12781288, 1992.
32. N. A. Touba and B. Pouya, Using partial isolation rings to test
core-based designs, IEEE Design and Test of Computers, 1997,
pp. 5257.
33. Y. Zorian, A structured testability approach for multi-chip
modules based on BIST and boundary-scan, IEEE Trans.
Compon., Packag., Manuf. Technol. Part B, 17(3): 283–290,
1994.
34. IEEE Standard Test Access Port and Boundary-Scan
Architecture, Piscataway, NJ: IEEE, 1990.
35. D. Kagaris, S. Tragoudas, and A. Majumdar, On the use of
counters for reproducing deterministic test sets, IEEE Trans.
Comp., 45(12): 14051419, 1996.
36. P. H. Bardell, W. H. McAnney, and J. Savir, Built-in Test for
VLSI: Pseudorandom Techniques, New York: John Wiley &
Sons, 1987.
37. M. Franklin, K. K. Saluja, and K. Kinoshita, A built-in self-test
algorithm for row/column pattern sensitive faults in RAMs,
IEEE J. Solid-State Circuits, 25(2): 514524, 1990.
38. S. Feng, T. Fujiwara, T. Kasami, and K. Iwasaki, On the
maximum value of aliasing probabilities for single input sig-
nature registers, IEEE Trans. Comp., 44(11): 1265–1274, 1995.
39. M. Lempel and S. K. Gupta, Zero aliasing for modeled faults,
IEEE Trans. Computers, 44(11): 12831295, 1995.
40. B. Koenemann, B. J. Mucha, and G. Zwiehoff, Built-in test for
complex digital integrated circuits, IEEE J. Solid-State Circuits,
SC-15(3): 315318, 1980.
41. M. Tien-Chien Lee, High-Level Test Synthesis of Digital VLSI
Circuits, Boston: Artech House, Inc, 1997.
42. M. R. Prasad, P. Chong, and K. Keutzer, Why is ATPG easy?
Design Automation Conference, 1999.
43. R. Wei and A. Sangiovanni-Vincentelli, PLATYPUS: A PLA
test pattern generation tool, 22nd Design Automation Confer-
ence, 1985, pp. 197203.
LEE A. BELFORE II
Old Dominion University,
Norfolk, Virginia
CD-ROMs AND COMPUTER SYSTEMS
HISTORY OF DEVELOPMENT: CD-ROM AND DVD
To distribute massive amounts of digital audio data, at
reasonable cost and high quality, industry giants, such as
Philips (Amsterdam, the Netherlands) and Sony (Tokyo,
Japan), developed CD-ROM optical storage technology in
the early 1980s, when digital fever was taking over the
analog stereo music industry.
The obvious attractive features of the audio CD versus
the vinyl LP are the relatively low cost of production,
duplication, and distribution, as well as the robustness of
the media and the significantly clearer and better (a feature
that is still disputed by some artistically bent ears) sound
quality that the digital technology offers over that of the
analog.
It might be interesting to note that, as with many new,
revolutionary technologies, even in the United States,
where society accepts technological change at a faster
rate than in most other countries, it took approximately
5 years for the audio CD to take over the vinyl phonograph
record industry. (Based on this experience, one wonders
how long it will take to replace the current combustion
automobile engine with clean electric or other types of
power....)
For the computer industry, the compact disc digital
audio (CD-DA) became an exciting medium for storing any
data (i.e., not just audio), including computer-controlled
interactive multimedia, one of the most interesting tech-
nological innovations of the twentieth century. The
approximately $1.00 cost of duplicating 650 Mb of data
and then selling it as a recorded product for approximately
$200.00 (in those days) created a new revolution that
became the multimedia CD-ROM as we know it today.
Although there are not just read-only but also read/write
CD-ROMs (see CD-RW below), typically a CD-ROM is
an optical read-only medium, capable of storing approxi-
mately 650 Mb of uncompressed digital data (as an example,
a Sony CD-ROM stores 656.10 Mb in 335,925 blocks, uncom-
pressed), meaning any mixture of text, digital video, voice,
images, and others.
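As a quick arithmetic check of the Sony figure quoted above, 335,925 blocks at 2048 user bytes per block (the usual Mode 1 block size, assumed here rather than stated in the text) indeed work out to about 656.10 Mb:

blocks = 335_925
bytes_per_block = 2048            # assumed Mode 1 user data per block
total_bytes = blocks * bytes_per_block
print(total_bytes / 2**20)        # about 656.10 (binary megabytes)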
It is important to note that, with the advancement of
real-time compression and decompression methods and
technologies, CD recording software packages can put on
a CD-ROM over 1.3 Gb of data, instead of the usual 650 Mb.
It is expected that, with increasing computer processor
speeds and better integration [see what Apple's (Cupertino,
CA) backside cache can do to the overall speed of the
machine], real-time compression and decompression will
be an excellent solution for many applications that need
more than 650 Mb on one CD. Obviously, this depends on the
cost of the competing DVD technology too! This solution
makes the CD-ROM and the emerging even higher capacity
DVD technology essential to the digitalization and compu-
terization of photography, the animation and video
industries, and the mass archiving and document storage
and retrieval business.
To illustrate the breadth and the depth of the opportu-
nities of electronic image capture, manipulation, storage,
and retrieval, consider Fig. 1(a) and 1(b), a solid model
animation sequence of a short battle created by Richard G.
Ranky, illustrating 200 high-resolution frame-by-frame
rendered complex images, integrated into an Apple Quick-
Time digital, interactive movie and stored on CD-ROM, and
Fig. 2(a)–(d), by Mick F. Ranky, an interactively naviga-
ble, pannable, Apple QuickTime VR virtual reality movie
of Budapest by night, allowing user-controlled zoom-in/out
and other hot-spot controlled interactivity. (Note that some
of these images and sequences are available in full color at
www.cimwareukandusa.com and that more interactive
demonstrations are available in the electronic version of
this encyclopedia.) Note the difference in terms of the
approach and methods used between the two figures.
The first set was created entirely by computer modeling
and imaging, and it illustrates a totally artificial world,
whereas the second was first photographed from real, physical
objects and then digitized, pasted, and integrated into
an interactive QTVR (see below) movie (1–11).
CD-ROM TECHNOLOGY, MEDIUM, AND THE STORAGE
DENSITY OF DATA
The most important differences between the magnetic
(hard disk) and the optical (compact disc) technology
include the storage density of data as well as the way data
are coded and stored. This difference is because CD-ROMs
(and DVDs) use coherent light waves, or laser beams,
versus magnetic fields that are spread much wider than
laser beams to encode information.
The other major advantage is that the laser beam does
not need to be as close to the surface of the medium as is the
case with the magnetic hard disk read/write heads. Mag-
netic read/write heads can be as close as 1.6 μm, or
0.0016 mm, to the surface, increasing the opportunity for
the jet-fighter-shaped, literally flying read/write head to
crash into the magnetic surface, in most cases meaning
catastrophic data loss to the user.
The principle of the optical technology is that binary
data can be encoded by creating a pattern of black-and-
white splotches, just as ON/OFF electrical signals do or as the
well-known bar code appears in the supermarket. Reading
patterns of light and dark requires a photo detector, which
changes its resistance depending on the brightness levels it
senses through a reflected laser beam.
In terms of manufacturing, i.e., printing/duplicating the
compact disc, the major breakthrough came when engi-
neers found that, by altering the texture of a surface
mechanically, its reflectivity could be changed too, which
means that a dark pit does not reflect light as well as a
bright mirror. Thus, the CD-ROM should be a reflective
mirror dotted with dark pits to encode data,
by means of a laser beam traveling along a long spiral, just
as with the vinyl audio LPs, that blasts pits accurately onto
the disc.
The CD is an 80-mm- (i.e., the minidisc) or a 120-mm-
diameter (i.e., the standard) disc, which is 1.2 mm thick
and spins, enabling direct data access, just as with the
vinyl audio LP, when the needle was dropped onto any of the
songs in any order. (Note that the more obvious 100-mm
diameter would have been too small to provide the approxi-
mately 150-Mb-per-square-inch storage density of the opti-
cal technology of the 1980s, preferred by the classical
music industry.) This meant solving the data access pro-
blem on a piece of coated plastic disc, in comparison with
the magnetic hard disk, mechanically in a much simpler
way.
To maximize the data storage capacity of the disc, the
linear velocity recording of the compact disc is a constant
1.2 m per second. To achieve this rate both at the inside as
well as the outside tracks of the disc, the spin varies
between 400 rpm (revolutions per minute) and 200 rpm
(at the outside edge). This way, the same length of track
passes the read/write head every second.
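A back-of-the-envelope calculation shows how constant linear velocity translates into spindle speeds of this order; the innermost and outermost data radii used below (roughly 25 mm and 58 mm) are typical values assumed for illustration, so the results only approximate the 200 to 400 rpm range quoted above.

import math

def rpm_for_clv(radius_m, linear_velocity=1.2):
    # Spindle speed (rpm) needed to hold a constant linear velocity at a radius.
    return linear_velocity / (2 * math.pi * radius_m) * 60

print(round(rpm_for_clv(0.025)))   # innermost data track: roughly 460 rpm
print(round(rpm_for_clv(0.058)))   # outermost data track: roughly 200 rpm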
Furthermore, with regard to the access time the drive is
capable of achieving, it is important to note that the track
pitch of the CD is 1.6 μm. This is the distance the head
moves from the center toward the outside of the disc as it
reads/writes data; the data bits themselves are at least
0.83 μm long. (In other words, a CD-ROM and its drive are
precision electro-mechanical and software instruments.)
It should be noted that, due to such small distances
between the tracks, it is extremely important to properly
cool CD writers as they cut master CD-ROMs in a profes-
sional studio, or inside or outside a PC or Mac on the
desktop. Usually fan cooling is adequate; nevertheless,
on a warm summer day, air-conditioning even in a desktop
environment is advisable! Furthermore, such equipment
should not be operated at all if the inside cooling fan
breaks down!
In terms of mass duplication, a master CD (often
referred to as the gold CD) is recorded first; then this
master is duplicated by means of stamping equipment.
In principle, this is very similar to vinyl audio LP produc-
tion or the photocopying process. A crucial aspect of this
process is that the data pits are sealed within layers of the
disc and are never reached mechanically, only optically by
the laser beam. Therefore, theoretically, quality mass-pro-
duced (silver) CDs never wear out; they are damaged only
when harsh abuse leaves scratches on the surface of the disc
or when they are exposed to extreme temperatures (12–17).
CD-ROM BLOCKS AND SECTORS
The recordable part of the compact disc consists of at least
three blocks. These are as follows:
Creative Digital Research CDR Publisher HyCD 4.6.5
Dataware Technologies CD Record 2.2
Electroson GEAR 3.50
JVC Personal RomMaker Plus UNIX 3.6
Young Minds Makedisc/CD Studio 1.2

Sun Solaris
  Creative Digital Research CDR Publisher HyCD 4.6.5
  Dataware Technologies CD Record 2.2
  Electroson GEAR 3.50
  JVC Personal RomMaker Plus UNIX 3.6
  Kodak Built-It 1.2
  Luminex Fire Series 1.9
  Smart Storage SmartCD for integrated recording & access 2.00
  Young Minds Makedisc/CD Studio 1.2

HP HP/UX
  Electroson GEAR 3.50
  Smart Storage SmartCD for integrated recording & access 2.00
  Young Minds Makedisc/CD Studio 1.20
  JVC Personal RomMaker Plus UNIX 1.0
  Luminex Fire Series 1.9

SGI IRIX
  Creative Digital Research CDR Publisher HyCD 4.6.5
  Electroson GEAR 3.50
  JVC Personal RomMaker Plus UNIX 1.0
  Luminex Fire Series 1.9
  Young Minds Makedisc/CD Studio 1.20

DEC OSF
  Electroson GEAR 3.50
  Young Minds Makedisc/CD Studio 1.20

IBM AIX
  Electroson GEAR 3.50
  Luminex Fire Series 1.9
  Smart Storage SmartCD for integrated recording & access 2.00
  Young Minds Makedisc/CD Studio 1.20

SCO SVR/ODT
  Young Minds Makedisc/CD Studio 1.20

Novell NetWare
  Celerity Systems Virtual CD Writer 2.1
  Smart Storage SmartCD for recording 3.78
  Smart Storage SmartCD for integrated recording & access 3.78

Amiga
  Asimware Innovations MasterISO 2.0
ACKNOWLEDGMENTS
We hereby would like to express our thanks to NSF (USA),
NJIT (in particular co-PIs in major research grants, Pro-
fessors Steven Tricamo, Don Sebastian, and Richard
Hatch, and the students), Professor T. Pato at ISBE, Swit-
zerland (co-PI in our European research grants), the stu-
dents, and faculty who have helped us a lot with their
comments in the United Kingdom, the United States,
Switzerland, Sweden, Germany, Hungary, Austria, Hong
Kong, China, and Japan. We also thank DARPA in the
United States, The U.S. Department of Commerce, The
National Council for Educational Technology (NCET, Uni-
ted Kingdom), The University of East London, the Enter-
prise Project team, the Ford Motor Company, General
Motors, Hitachi Seiki (United Kingdom) Ltd, FESTO (Uni-
ted Kingdom and United States), Denford Machine Tools,
Rolls-Royce Motor Cars, HP (United Kingdom) Ltd.,
Siemens Plessey, Marconi Instruments, and Apple Com-
puters Inc. for their continuous support in our research,
industrial, and educational multimedia (and other) pro-
jects. Furthermore, we would like to express our thanks to
our families for their unconditional support and encour-
agement, including our sons, Gregory, Mick Jr. and
Richard, for being our first test engineers and for their
valuable contributions to our interactive multimedia
projects.
FURTHER READING
A. Kleijhorst, E. T. Van der Velde, M. H. Baljon, M. J. G. M.
Gerritsen and H. Oon, Secure and cost-effective exchange of
cardiac images over the electronic highway in the Netherlands,
computers in cardiology, Proc. 1997 24th Annual Meeting on
Computers in Cardiology, Lund, Sweden, Sept. 710, 1997,
pp. 191194.
P. Laguna, R. G. Mark, A. Goldberg and G. B. Moody, Database
for evaluation of algorithms for measurement of qt and other
waveform intervals in the ecg, Proc. 1997 24th Annual Meeting
on Computers in Cardiology, Lund, Sweden, Sept. 710, 1997,
pp. 673676.
B. J. Dutson, Outlook for interactivity via digital satellite, IEE
Conference Publication, Proc. 1997 International Broadcasting
Convention, Amsterdam, the Netherlands, Sept. 1216, 1997,
pp. 15.
Physical properties of polymers handbook, CD-ROM, J. Am.
Chemi. Soc., 119(46): 1997.
J. Phillips, Roamable imaging gets professional: Putting immer-
sive images to work, Adv. Imag., 12(10): 4750, 1997.
H. Yamauchi, H. Miyamoto, T. Sakamoto, T. Watanabe, H. Tsuda
and R. Yamamura, 24×-speed CIRC decoder for a CD-DSP/CD-
ROM decoder LSI, Sanyo Electric Co., Ltd., Digest of Technical
Papers, IEEE International Conference on Consumer Electronics,
Proc. 1997 16th International Conference on Consumer Electronics,
Rosemont, IL, 11–13, 1997, pp. 122–123.
K. Holtz and E. Holtz, Carom: A solid-state replacement for the CD-
ROM, Record Proc. 1997 WESCON Conference, San Jose, CA, Nov.
4–6, 1997, pp. 478–483.
Anonymous, Trenchless technology research in the UK water
industry, Tunnelling Underground Space Technol., 11(Suppl 2):
6166, 1996.
J. Larish, IMAGEGATE: Making web image marketing work for
the individual photographer, Adv. Imag., 13(1): 7375, 1998.
B. C. Lamartine, R. A. Stutz and J. B. Alexander, Long, long-term
storage, IEEE Potentials, 16(5): 1719, 1998.
A. D. Stuart and A. W. Mayers, Two examples of asynchronous
learning programs for professional development, Conference Proc.
1997 27th Annual Conference on Frontiers in Education. Part 1
(of 3), Nov. 58, Pittsburgh, PA, 1997, pp. 256260.
P. Jacso, CD-ROM databases with full-page images, Comput.
Libraries, 18(2): 1998.
J. Hohle, Computer-assisted teaching and learning in photogram-
metry, ISPRSJ. Photogrammetry Remote Sensing, 52(6): 266276,
1997.
Y. Zhao, Q. Zhao, C. Zhu and W. Huang, Laser-induced tempera-
ture field distribution in multi-layers of CDs and its effect on the
stability of the organic record-layer, Chinese J. Lasers, 24(6): 546–
550, 1997.
K. Sakamoto and H. Urabe, Standard high precision picture-
s:SHIPP, Proc. 1997 5th Color Imaging Conference: Color Science,
Systems, and Applications, Scottsdale, AZ, Nov. 1720, 1997,
pp. 240244.
S. M. Zhu, F. H. Choo, K. S. Low, C. W. Chan, P. H. Kong and M.
Suraj, Servo system control in digital video disc, Proc. 1997 IEEE
International Symposium on Consumer Electronics, ISCE97
Singapore, Dec. 24, 1997, pp. 114117.
Robert T. Parkhurst, Pollution prevention in the laboratory, Proc.
Air &Waste Management Associations Annual Meeting &Exhibi-
tion Proceedings, Toronto, Canada, June 813, 1997.
V. W. Sparrow and V. S. Williams, CD-ROM development for a
certicate program in acoustics, Engineering Proc. 1997 National
Conference on Noise Control Engineering, June 1517, 1997,
pp. 369374.
W. H. Abbott, Corrosion of electrical contacts: Review of owing
mixed gas test developments, Br. Corros. J., 24(2): 153, 1989.
M. Parker, et al., Magnetic and magneto-photoellipsometric eva-
luation of corrosion in metal-particle media, IEEE Trans. Mag-
netics, 28(5): 2368, 1992.
P. C. Searson and K. Sieradzki, Corrosion chemistry of magneto-
optic data storage media, Proc. SPIE, 1663: 397, 1992.
Y. Gorodetsky, Y. Haibin and R. Heming, Effective use of multi-
mediafor presentations, Proc. 1997IEEEInternational Conference
on Systems, Man, and Cybernetics, Orlando, FL, Oct. 1215, 1997,
pp. 23752379.
J. Lamont, Latest federal information on CD-ROMs, Comput.
Libraries, 17: 1997.
M. F. Iskander, A. Rodriguez-Balcells, O. de los Santos, R. M.
Jameson and A. Nielsen, Interactive multimedia CD-ROM for
engineering electromagnetics, Proc. 1997 IEEE Antennas and
Propagation Society International Symposium, Montreal, Quebec,
Canada, July 1318, 1997, pp. 24862489.
M. Elphick, Rapid progress seen in chips for optical drives, Comput.
Design, 36(9): 46, 48–50, 1997.
H. Iwamoto, H. Kawabe, and N. Mutoh, Telephone directory
retrieval technology for CD-ROM, Telecommun. Res. Lab Source,
46(7): 639646, 1997.
J. Deponte, H. Mueller, G. Pietrek, S. Schlosser and B. Stoltefuss,
Design and implementation of a system for multimedial distrib-
uted teaching and scientic conferences, Proc. 1997 3rd Annual
Conference on Virtual Systems and Multimedia, Geneva, Switzer-
land, Sept. 1012, 1997, pp. 156165.
B. K. Das and A. C. Rastogi, Thin lms for secondary data storage
IETE, J. Res., 43(23): 221232, 1997.
D. E. Speliotis et al., Corrosion study of metal particle, metal lm,
and ba-ferrite tape, IEEE Trans. Magnetics, 27(6): 1991.
J. VanBogart et al., Understanding the Battelle Lab accelerated
tests, NML Bits, 2(4): 2, 1992.
P. G. Ranky, An Introduction to Concurrent Engineering, an
Interactive Multimedia CD-ROMwithoff-line andon-line Internet
support, over 700 interactive screens following an Interactive
Multimedia Talking Book format, Design & Programming by
P. G. Ranky and M. F. Ranky, CIMware 1996, 97. Available:
http://www.cimwareukandusa.com.
P. G. Ranky, An Introduction to Computer Networks, an Inter-
active Multimedia CD-ROM with off-line and on-line Internet
support, over 700 interactive screens following an Interactive
Multimedia Talking Book format, Design & Programming by
P. G. Ranky and M. F. Ranky, CIMware 1998. Available: http://
www.cimwareukandusa.com.
K. Nice, Available: http://electronics.howstuffworks.com/
dvd1.htm, 2005.
Available: http://en.wikipedia.org/wiki/Blu-Ray, 2006.
Available: http://en.wikipedia.org/wiki/Dvd, Feb. 2006.
Available: http://en.wikipedia.org/wiki/HD_DVD, Feb. 2006.
R. Silva, Available: http://hometheater.about.com/od/
dvdrecorderfaqs/f/dvdrecgfaq5.htm, 2006.
Herbert, Available: http://www.cdfreaks.com/article/186/1, Mar.
2005.
J. Taylor, Available: http://www.dvddemystified.com/dvdfaq.html,
Feb. 10, 2005.
B. Greenway, Available: http://www.hometheaterblog.com/home-
theater/blu-ray_hd-dvd/, Feb. 14, 2006.
L. Magid, Available: http://www.pcanswer.com/articles/
synd_dvds.htm, Oct. 2, 2003.
BIBLIOGRAPHY
1. C. F. Quist, L. Lindegren and S. Soderhjelm, Synthesis ima-
ging, Eur. Space Agency, SP-402: 257262, 1997.
2. H. Schrijver, Hipparcos/Tycho ASCII CD-ROM and access
software, Eur. Space Agency, SP-402: 6972, 1997.
3. J. Zedler, and M. Ramadan, I-Media: An integrated media
server and media database as a basic component of a cross
media publishing system, Comput. Graph., 21(6): 693702,
1997.
4. V. Madisetti, A. Gadient, J. Stinson, J. Aylor, R. Klenke, H.
Carter, T. Egolf, M. Salinas and T. Taylor, DARPA's digital
system design curriculum and peer-reviewed educational
infrastructure, Proc. 1997 ASEE Annual Conference, Milwau-
kee, WI, 1997.
5. T. F. Hess, R. F. Rynk, S. Chen, L. G. King and A. L. Kenimer,
Natural systems for wastewater treatment: Course material
and CD-ROM development, ASEE Annual Conference Proc.,
Milwaukee, WI, 1997.
6. R. Guensler, P. Chinowsky and C. Conklin, Development of a
Web-based environmental impact monitoring and assessment
course, Proc. 1997 ASEE Annual Conference, Milwaukee, WI,
1997.
7. M. G. J. M. Gerritsen, F. M. VanRappard, M. H. Baljon, N. V.
Putten, W. R. M. Dassen, W. A. Dijk, DICOM CD-R, your
guarantee to interchangeability? Proc. 1997 24th Annual
Meeting on Computers in Cardiology, Lund, Sweden, 1997,
pp. 175178.
8. T. Yoshida, N. Yanagihara, Y. Mii, M. Soma, H. Yamada,
Robust control of CD-ROMdrives using multirate disturbance
observer, Trans. Jpn. Soc. Mech. Eng. (Part C), 63(615):
pp. 39193925, 1997.
9. J. Glanville and I. Smith, Evaluating the options for developing
databases to support research-based medicine at the NHS
Centre for Reviews and Dissemination, Int. J. Med. Infor-
matics, 47(12): pp. 8386, 1997.
10. J.-L. Malleron and A. Juin, with R.-P. Rorer, Database of
palladium chemistry: Reactions, catalytic cycles and chemical
parameters on CD-ROM Version 1.0 J. Amer. Chem. Soc.,
120(6): p. 1347, 1998.
11. M. S. Park, Y. Chait, M. Steinbuch, Inversion-free design
algorithms for multivariable quantitative feedback theory:
An application to robust control of a CD-ROM park, Automa-
tica, 33(5): pp. 915920, 1997.
12. J.-H. Zhang and L. Cai, Prolometry using an optical stylus
with interferometric readout, Proc. IEEE/ASME Interna-
tional Conference on Advanced Intelligent Mechatronics,
Tokyo, Japan, p. 62, 1997.
13. E. W. Williams and T. Kubo, Cross-substitutional alloys of InSb
for write-once read-many optical media, Jpn. J. Appl. Phys.
(Part 2), 37(2A): pp. L127–L128, 1998.
14. P. Nicholls, Apocalypse now or orderly withdrawal for CD-
ROM? Comput. Libraries, 18(4): p. 57, 1998.
15. N. Honda, T. Ishiwaka, T. Takagi, M. Ishikawa, T. Nakajima,
Information services for greater driving enjoyment, SAE Spe-
cial Publications on ITS Advanced Controls and Vehicle Navi-
gation Systems, Proc. 1998 SAE International Congress &
Exposition, Detroit, MI, 1998, pp. 5169.
16. J. K. Whitesell, Merck Index, 12th Edition, CD-ROM (Macin-
tosh): An encyclopedia of chemicals, drugs & biologicals, J.
Amer. Chem. Soc., 120(9): 1998.
17. C. Van Nimwegen, C. Zeelenberg, W. Cavens, Medical devices
database on CD, Proc. 1996 18th Annual International Con-
ference of the IEEE Engineering in Medicine and Biology
Society (Part 5), Amsterdam, 1996, pp. 19731974.
18. S. G. Stan, H. Van Kempen, G. Leenknegt, T. H. M. Akker-
mans, Look-ahead seek correction in high-performance CD-
ROM drives, IEEE Transactions on Consumer Electronics,
44(1): 178186, 1998.
19. W. P. Murray, CD-ROM archivability, NML Bits, 2(2): p. 4,
1992.
20. W. P. Murray, Archival life expectancy of 3M magneto-optic
media, J. Magnetic Soci. Japan, 17(S1): 309, 1993.
21. F. L. Podio, Research on methods for determining optical disc
media life expectancy estimates, Proc. SPIE, 1663: 447, 1992.
22. On CD-ROMs that set a new standard, Technol. Rev., 101(2):
1998.
23. National Geographic publishes 108 years on CD-ROM, Ima-
ging Mag., 7(3): 1998.
24. Shareware Boosts CD-ROM performance, tells time, EDN,
43(4): 1998.
25. Ford standardizes training with CD-ROMs, Industrial Paint
Powder, 74(2): 1998.
26. O. Widell and E. Egis, Geophysical information services, Eur.
Space Agency, SP-397: 1997.
27. J. Tillinghast, G. Beretta, Structure and navigation for elec-
tronic publishing, HP Laboratories Technical Report, 97162,
Hewlett Packard Lab Technical Publ Dept, Palo Alto, CA, Dec.
1997.
28. J. Fry, A cornerstone of tomorrows entertainment eco-
nomy, Proc. 1997 WESCON Conference, San Jose, CA, 1997,
pp. 6573.
29. M. Kageyama, A. Ohba, T. Matsushita, T. Suzuki, H. Tanabe,
Y. Kumagai, H. Yoshigi and T. Kinoshita, Free time-shift DVD
video recorder, IEEE Trans. Consumer Electron., 43(3): 469
474, 1997.
30. S. P. Schreiner, M. Gaughan, T. Myint and R. Walentowicz,
Exposure models of library and integrated model evaluation
system: A modeling information system on a CD-ROM with
World-Wide Web links, Proc. 1997 4th IAWQ International
Symposium on Systems Analysis and Computing in Water
Quality Management, Quebec, Canada, June 1720, 1997,
pp. 243249.
31. Anonymous, software review, Contr. Eng., 44(15): 1997.
32. M. William, Using multimedia and cooperative learning in
and out of class, Proc. 1997 27th Annual Conference on
Frontiers in Education. Part 1 (of 3), Pittsburgh, PA, Nov.
58,1997 pp. 4852.
33. P. G. Ranky, Amethodology for supporting the product innova-
tionprocess, Proc. USA/JapanInternational IEEEConference
on Factory Automation, Kobe, Japan, 1994, pp. 234239.
34. P. AshtonandP. G. Ranky, The development andapplicationof
an advanced concurrent engineering research tool set at Rolls-
Royce Motor Cars Limited, UK, Proc. USA/Japan Interna-
tional IEEE Conference on Factory Automation, Kobe, Japan,
1994, pp. 186190.
35. K. L. Ho and P. G. Ranky, The designand operation control of a
recongurable exible material handling system, Proc. USA/
Japan International IEEEConference on Factory Automation,
Kobe, Japan, 1994, pp. 324328.
36. P. G. Ranky, The principles, application and research of inter-
active multimedia and open/distance learning in advanced
manufacturing technology, Invited Keynote Presentation,
The Fourth International Conference on Modern Industrial
Training, Xilan, China, 1994, pp. 1628.
37. D. A. Norman and J. C. Spohner, Learner-centered education,
Commun. ACM, 39(4): 2427, 1996.
38. R. C. Schankand A. Kass, Agoal-basedscenario for highschool
students, Commun. ACM, 39(4): 2829, 1996.
39. B. Woolf, Intelligent multimedia tutoring systems, Commun.
ACM, 39(4): 3031, 1996.
40. M. Flaherty, M. F. Ranky, P. G. Ranky, S. Sands and S.
Stratful, FESTO: Servo Pneumatic Positioning, an Interactive
Multimedia CD-ROM with off-line and on-line Internet sup-
port, Over 330 interactive screens, CIMware & FESTO Auto-
mation joint development 1995,96, Design & Programming by
P.G. Ranky and M. F. Ranky. Available: http://www.cimwar-
eukandusa.com.
41. P. G. Ranky, An Introduction to Total Quality (including
ISO9000x), an Interactive Multimedia CD-ROM with off-line
and on-line Internet support, over 700 interactive screens
following an Interactive Multimedia Talking Book format,
Design & Programming by P. G. Ranky and M. F. Ranky,
CIMware 1997. Available: http://www.cimwareukandusa.com.
42. P. G. Ranky, AnIntroductionto Flexible Manufacturing, Auto-
mation &Assembly, an Interactive Multimedia CD-ROMwith
off-line and on-line Internet support, over 700 interactive
screens following an Interactive Multimedia Talking Book
format, Design & Programming by P. G. Ranky and M. F.
Ranky, CIMware 1997. Available: http://www.cimwareukan-
dusa.com.
PAUL G. RANKY
New Jersey Institute of
Technology
Newark, New Jersey
GREGORY N. RANKY
MICK F. RANKY
Ridgewood, New Jersey
COMPUTER ARCHITECTURE
The term computer architecture was coined in the 1960s
by the designers of the IBM System/360 to mean the
structure of a computer that a machine language program-
mer must understand to write a correct program for a
machine (1).
The task of a computer architect is to understand the
state-of-the-art technologies at each design level and the
changing design tradeoffs for their specific applications.
The tradeoff of cost, performance, and power consumption
is fundamental to a computer system design. Different
designs result from the selection of different points on
the cost-performance-power continuum, and each applica-
tion will require a different optimum design point. For high-
performance server applications, chip and system costs are
less important than performance. Computer speedup can
be accomplished by constructing more capable processor
units or by integrating many processor units on a die. For
cost-sensitive embedded applications, the goal is to mini-
mize processor die size and system power consumption.
Technology Considerations
Modern computer implementations are based on silicon
technology. The two driving parameters of this technology
are die size and feature size. Die size largely determines
cost. Feature size is dependent on the lithography used in
wafer processing and is defined as the length of the smallest
realizable device. Feature size determines circuit density,
circuit delay, and power consumption. Current feature
sizes range from 90 nm to 250 nm. Feature sizes below
100 nm are called deep submicron. Deep submicron tech-
nology allows microprocessors to be increasingly more
complicated. According to the Semiconductor Industry
Association (2), the number of transistors (Fig. 1) for
high-performance microprocessors will continue to grow
exponentially in the next 10 years. However, there are
physical and program behavioral constraints that limit
the usefulness of this complexity. Physical constraints
include interconnect and device limits as well as practical
limits on power and cost. Program behavior constraints
result from program control and data dependencies and
unpredictable events during execution (3).
Much of the improvement in microprocessor performance
has been a result of technology scaling that allows increased
circuit densities at higher clock frequencies. As feature sizes
shrink, device area shrinks roughly as the square of the
scaling factor, whereas device speed (under constant field
assumptions) improves linearly with feature size.
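A small numeric illustration of this scaling behavior (the 0.7 node-to-node shrink factor is a typical value, assumed here rather than taken from the text):

s = 0.7                                   # assumed linear scaling factor
print(f"device area  -> {s**2:.2f}x")     # shrinks roughly as the square (~0.49x)
print(f"device delay -> {s:.2f}x")        # improves roughly linearly (~0.70x)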
On the other hand, there are a number of major tech-
nical challenges in the deep submicron era, the most impor-
tant of which is that interconnect delay (especially global
interconnect delay) does not scale with the feature size. If
all three dimensions of an interconnect wire are scaled
down by the same scaling factor, the interconnect delay
remains roughly unchanged, because the fringing field
component of wire capacitance does not vary with feature
size. Consequently, interconnect delay becomes a limiting
factor in the deep submicron era.
Another very important technical challenge is the
difficulty of dissipating heat from processor chip packages
as chip complexity and clock frequency increase.
Indeed, special cooling techniques are needed for pro-
cessors that consume more than 100W of power. These
cooling techniques are expensive and economically infea-
sible for most applications (e.g., PC). There are also a
number of other technical challenges for high-performance
processors. Custom circuit designs are necessary to enable
GHz signals to travel in and out of chips. These challenges
require that designers provide whole-system solutions
rather than treating logic design, circuit design, and
packaging as independent phases of the design process.
Performance Considerations
Microprocessor performance has improved by approxi-
mately 50% per year for the last 20 years, which can be
attributed to higher clock frequencies, deeper pipelines,
and improved exploitation of instruction-level parallelism.
However, the cycle time at a given technology cannot be
too small, or we will sacrifice overall performance by incur-
ring too much clock overhead and suffering long pipeline
breaks. Similarly, the instruction-level parallelism is
usually limited by the application and is further dimin-
ished by code generation inefficiencies, processor resource
limitations, and execution disturbances. The overall sys-
tem performance may deteriorate if the hardware to exploit
the parallelism becomes too complicated.
High-performance server applications, in which chip and
system costs are less important than total performance, en-
compass a wide range of requirements, from computation-
intensive to memory-intensive to I/O-intensive. The need
to customize implementation to specific applications may
even alter manufacturing. Although expensive, high-
performance servers may require fabrication microproduction
runs to maximize performance.
Power Considerations
Power consumption has received increasingly more atten-
tion because both high-performance processors and pro-
cessors for portable applications are limited by power
consumption. For CMOS design, the total power dissipa-
tion has three major components as follows:
1. switching loss,
2. leakage current loss, and
3. short circuit current loss.
Among these three factors, switching loss is usually the
most dominant factor. Switching loss is proportional to
operating frequency and is also proportional to the square
of supply voltage. Thus, lowering supply voltage can effec-
tively reduce switching loss. In general, operating
frequency is roughly proportional to supply voltage. If
supply voltage is reduced by 50%, operating frequency is
also reduced by 50%, and total power consumption
becomes one-eighth of the original power. On the other
hand, leakage power loss is a function of CMOS threshold
voltage. As supply voltage decreases, threshold voltage has
to be reduced, which results in an exponential increase in
leakage power loss. When feature size goes below 90 nm,
leakage power loss can be as high as switching power loss.
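The one-eighth figure follows directly from the proportionality of switching loss to operating frequency times the square of supply voltage; a minimal sketch (the unit constant of proportionality is an assumption for illustration):

def relative_switching_power(f_scale, v_scale):
    # Switching loss ~ f * V**2 (proportionality constant omitted).
    return f_scale * v_scale**2

# Halving the supply voltage roughly halves the attainable frequency,
# so total switching power falls to one-eighth of the original.
print(relative_switching_power(0.5, 0.5))   # 0.125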
For many DSP applications, the acceptable performance
can be achieved at a low operating frequency by exploiting
the available program parallelism using suitable parallel
forms of processor configurations.
technology, obviously, can allow processors to run for an
extended period of time. Conventional nickel-cadmium
battery technology has been replaced by high-energy den-
sity batteries such as the NiMH battery. Nevertheless, the
energy density of a battery is unlikely to improve drasti-
cally for safety reasons. When the energy density is too
high, a battery becomes virtually an explosive.
Cost Considerations
Another design tradeoff is to determine the optimum die
size. In the high-performance server market, the processor
cost may be relatively small compared with the overall
system cost. Increasing the processor cost by 10 times
may not significantly affect the overall system cost. On
the other hand, system-on-chip implementations tend to be
very cost sensitive. For these applications, the optimum use
of die size is extremely important.
The area available to a designer is largely a function of
the manufacturing processing technology, which includes
the purity of the silicon crystals, the absence of dust and
other impurities, and the overall control of the diffusion and
process technology. Improved manufacturing technology
allows larger die with higher yields, and thus lower man-
ufacturing costs.
At a given technology, die cost is affected by chip size in
two ways. First, as die area increases, fewer die can be
realized from a wafer. Second, as the chip size increases, the
yield decreases, generally following a Poisson distribution
of defects. For certain die sizes, doubling the area can
increase the die cost by 10 times.
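A rough sketch of this effect using a simple Poisson yield model, as the text suggests; the wafer cost, wafer area, and defect density below are illustrative assumptions only, chosen so that doubling a fairly large die raises the cost per good die by roughly an order of magnitude:

import math

def cost_per_good_die(die_area_cm2, wafer_cost=5000.0,
                      wafer_area_cm2=700.0, defects_per_cm2=0.8):
    # Larger die: fewer die per wafer (edge losses ignored) and lower yield.
    dies_per_wafer = wafer_area_cm2 / die_area_cm2
    yield_fraction = math.exp(-defects_per_cm2 * die_area_cm2)   # Poisson model
    return wafer_cost / (dies_per_wafer * yield_fraction)

for area in (1.0, 2.0, 4.0):
    print(area, round(cost_per_good_die(area), 2))
# Doubling the die from 2 to 4 cm^2 raises the cost per good die by roughly 10x.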
Other Considerations
As VLSI technology continues to improve, there are new
design considerations for computer architects. The simple
traditional measures of processor performance (cycle time
and cache size) are becoming less relevant in evaluating
application performance. Some of the new considerations
include:
1. Creating high-performance processors with enabling
compiler technology.
2. Designing power-sensitive system-on-chip proces-
sors in a very short turnaround time.
3. Improving features that ensure the integrity and
reliability of the computer.
4. Increasing the adaptability of processor structures,
such as cache and signal processors.
Performance-Cost-Power Tradeoffs
In the era of deep-submicron technology, two classes of
microprocessors are evolving: (1) high-performance server
processors and (2) embedded client processors. The major-
ity of implementations are commodity system-on-chip
processors devoted to end-user applications. These highly
cost-sensitive client processors are used extensively in
consumer electronics. Individual applications may have
specific requirements; for example, portable and wireless
applications require very low power consumption. The
other class consists of high-end server processors, which
are performance-driven. Here, other parts of the system
dominate cost and power issues.
At a fixed feature size, area can be traded off for perfor-
mance (expressed in terms of execution time, T). VLSI
complexity theorists have shown that an A·T^n bound
exists for microprocessor designs (1), where n usually falls
between 1 and 2. By varying the supply voltage, it is also
possible to trade off time T for power P with a P·T^3 bound.
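Under these bounds, accepting a longer execution time buys a disproportionate saving in area or power. A tiny illustration (the exponent n = 2 and the factor-of-two slowdown are assumptions chosen for the example):

n = 2              # assumed; the text gives 1 <= n <= 2
slowdown = 2.0     # allow execution time T to double
print(1 / slowdown**n)    # area can shrink to about 1/4 (A * T**n held constant)
print(1 / slowdown**3)    # power can shrink to about 1/8 (P * T**3 held constant)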
Figure 1. Number of transistors per chip, in millions, versus
technology node and year of first introduction (90 nm/2004,
65 nm/2007, 32 nm/2013, 18 nm/2018). (Source: National
Technology Roadmap for Semiconductors.)
Figure 2 shows the possible tradeoff involving area, time,
and power in a processor design (3). Embedded and high-
end processors operate in different design regions of this
three-dimensional space. The power and area axes are
typically optimized for embedded processors, whereas the
time axis is typically optimized for high-end processors.
Alternatives in Computer Architecture
In computer architecture, the designer must understand
the technology and the user requirements as well as the
available alternatives in configuring a processor. The
designer must apply what is known of user program beha-
vior and other requirements to the task of realizing an
area-time-power optimized processor. User programs offer
differing types and forms of parallelism that can be
matched by one or more processor configurations. A pri-
mary design goal is to identify the most suitable processor
configuration and then scale the concurrency available
within that configuration to match cost constraints.
The next section describes the principal functional ele-
ments of a processor. Then the various types of parallel and
concurrent processor configurations are discussed. Finally,
some recent architectures are compared and some conclud-
ing remarks are presented.
PROCESSOR ARCHITECTURE
The processor architecture consists of the instruction
set, the memory that it operates on, and the control and
functional units that implement and interpret the instruc-
tions. Although the instruction set implies many imple-
mentation details, the resulting implementation is a great
deal more than the instruction set. It is the synthesis of the
physical device limitations with area-time-power tradeoffs
to optimize cost-performance for specified user require-
ments. As shown in Fig. 3, the processor architecture
may be divided into a high-level programming model and
a low-level microarchitecture.
Instruction Set
Computers deal with many different kinds of data and
data representations. The operations available to perform
the requisite data manipulations are determined by the
data types and the uses of such data. Processor design
issues are closely bound to the instruction set. Instruction
set behavior data affects many of these design issues
The instruction set for most modern machines is based
on a register set to hold operands and addresses. The
register set size varies from 8 to 64 words, each word
consisting of 32 to 64 bits. An additional set of floating-
point registers (16 to 64 bits) is usually also available.

Figure 2. Design tradeoffs for high-end and low-end processors
(axes: area A, time T, power P; cost- and power-sensitive client
designs optimize area and power, high-performance server designs
optimize time; A·T^n = constant, P·T^3 = constant).

Figure 3. Processor architecture block diagram.

A typical instruction set specifies a program status word,
which consists of various types of control status informa-
tion, including condition codes set by the instruction.
Common instruction sets can be classified by format differ-
ences into three types as follows:
1. the L/S, or Load-Store architecture;
2. the R/M, or Register-Memory architecture; and
3. the RM, or Register-plus-Memory architecture.
The L/S or Load-Store instruction set describes many of
the RISC (reduced instruction set computer) microproces-
sors (5). All values must be loaded into registers before an
execution can take place. An ALU ADD instruction must
have both operands with the result specified as registers
(three addresses). The purpose of the RISC architecture is
to establish regularity of execution and ease of decoding
in an effort to improve overall performance. RISC archi-
tects have tried to reduce the amount of complexity in
the instruction set itself and regularize the instruction
format so as to simplify decoding of the instruction. A
simpler instruction set with straightforward timing is
more readily implemented. For these reasons, it was
assumed that implementations based on the L/S instruc-
tion set would always be faster (higher clock rates and
performance) than other classes, other parameters being
generally the same.
The R/M or Register-Memory architectures include
instructions that can operate both on registers and with
one of the operands residing in memory. Thus, for the R/M
architecture, an ADD instruction might be defined as the
sum of a register value and a value contained in memory,
with the result going to a register. The R/M instruction sets
generally trace their evolution to the IBM System/360
introduced in 1964. The mainframe computers follow the
R/M style (IBM, Amdahl, Hitachi, Fujitsu, etc., which all
use the IBM instruction set), as well as the basic Intel x86
series.
The RM or Register-plus-Memory architectures allow
formats to include operands that are either in memory or
in registers. Thus, for example, an ADD may have all of
its operands in registers or all of its operands in memory,
or any combination thereof. The RM architecture gene-
ralizes the formats of R/M. The classic example of the RM
architecture was Digital Equipment's VAX series of
machines. VAX also generalized the use of the register
set through the use of register modes. The use of an
extended set of formats and register modes allows a power-
ful and varied specification of operands and operation type
within a single instruction. Unfortunately, format and
mode variability complicates the decoding process so that
the interpretation of instructions can be slow (but
RM architectures make excellent use of memory/bus
bandwidth).
From the architect's point of view, the tradeoff in instruc-
tion sets is an area-time compromise. The register-memory
(R/M and RM) architectures offer a more concise program
representation using fewer instructions of variable size
compared with L/S. Programs occupy less space in memory
and smaller instruction caches can be used effectively. The
variable instruction size makes decoding more difficult.
The decoding of multiple instructions requires predicting
the starting point of each. The register-memory processors
require more circuitry (and area) to be devoted to instruc-
tion fetch and decode. Generally, the success of Intel-type
x86 implementations in achieving high clock rates and
performance has shown that the limitations of a register-
memory instruction set can be overcome.
Memory
The memory system comprises the physical storage ele-
ments in the memory hierarchy. These elements include
those specied by the instruction set (registers, main mem-
ory, and disk sectors) as well as those elements that are
largely transparent to the user's program (buffer registers,
cache, and page-mapped virtual memory). Registers have
the fastest access and, although limited in capacity (32 to
128 bits), are in program execution the most often referenced
type of memory. A processor cycle time is usually defined
by the time it takes to access one or more registers, operate
on their contents, and return the result to a register.
Main memory is the type of storage usually associated
with the simple term memory. Most implementations are
based on DRAM (e.g., DDR and DDR-2 SDRAM), although
SRAM and Flash technologies have also been used. DRAM
memory is accessible in order of 10s of cycles (typically
20 to 30) and usually processors have between 128 MB and
4 GB of such storage. The disk contains all the programs
and data available to the processor. Its addressable unit
(sector) is accessible in 1 to 10 ms, with a typical single-unit
disk capacity of 10 to 300 GB. Large server systems may
have 100s or more such disk units. As the levels of the
memory system have such widely differing access times,
additional levels of storage (buffer registers, cache, and
paged memory) are added that serve as a buffer between
levels attempting to hide the access time differences.
Memory Hierarchy. There are basically three para-
meters that define the effectiveness of the memory system:
latency, bandwidth, and the capacity of each level of the
system. Latency is the time for a particular access request
to be completed. Bandwidth refers to the number of
requests supplied per unit time. To provide large memory
spaces with desirable access time latency and bandwidths,
modern memory systems use a multiple-level memory
hierarchy.
Smaller, faster levels have greater cost per bit than
larger, slower levels. The multiple levels in the storage
hierarchy can be ordered by their size and access time from
the smallest, fastest level to the largest, slowest level. The
goal of a good memory system design is to provide the
processor with an effective memory capacity of the largest
level with an access time close to the fastest. How well this
goal is achieved depends on a number of factors: the
characteristics of the device used in each level as well as
the behavioral properties of the programs being executed.
Suppose we have a memory system hierarchy consisting
of a cache, a main memory, and a disk or backing storage.
The disk contains the contents of the entire virtual memory
space. Typical size (S) and access time ratios (t) are as
follows:

Size: memory/cache ~ 1000
Access time: memory/cache ~ 30
Size: disk/memory ~ 100–1000
Access time: disk/memory ~ 100,000
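Folding these ratios into an effective access time shows why the hierarchy can approach the speed of its fastest level; the cache time of 1 unit, the cache hit rate, and the page fault rate below are assumptions used only for illustration:

t_cache, t_mem, t_disk = 1.0, 30.0, 30.0 * 100_000   # from the ratios above
cache_hit = 0.98          # assumed cache hit rate
page_fault_rate = 1e-6    # assumed fraction of memory accesses that go to disk

t_effective = (cache_hit * t_cache
               + (1 - cache_hit) * ((1 - page_fault_rate) * t_mem
                                    + page_fault_rate * t_disk))
print(round(t_effective, 2))   # about 1.64, close to the cache access time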
Associated with both the cache and a paged main memory
are corresponding tables that define the localities that are
currently available at that level. The page table contains
the working set of the disk: those disk localities that have
been recently referenced by the program and are contained
in main memory. The cache table is completely managed by
the hardware and contains those localities (called lines) of
memory that have been recently referenced by the pro-
gram.
The memory system operates by responding to a virtual
effective address generated by a user program, which is
translated into a real address in main memory. This real
address accesses the cache table to find the entry in the
cache for the desired value.
Paging and caching are the mechanisms to support the
efficient management of the memory space. Paging is
the mechanism by which the operating system brings
fixed-size blocks (or pages), typically 4 to 64 KB in size,
into main memory. Pages are fetched from backing store
(usually disk) on demand (or as required) by the processor.
When a referenced page is not present, the operating
system is called and makes a request for the page, then
transfers control to another process, allowing the processor
resources to be used while waiting for the return of the
requested information. The real address is used to access
the cache and main memory. The low-order (least signifi-
cant) bits address a particular location in a page. The
upper bits of a virtual address access a page table (in
memory) that:
1. determines whether this particular page lies
in memory, and
2. translates the upper address bits if the page is pre-
sent, producing the real address.
Usually, the tables performing address translation are in
memory, and a mechanism for the translation called the
translation lookaside buffer (TLB) must be used to speed up
this translation. The TLB is a simple register system
usually consisting of between 64 and 256 entries that
save recent address translations for reuse.
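A minimal sketch of the translation process just described; the 4 KB page size, the page-table contents, and the small dictionary standing in for the TLB are all illustrative assumptions:

PAGE_BITS = 12                                      # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_BITS) - 1

page_table = {0x00400: 0x12345, 0x00401: 0x0BEEF}   # virtual page -> page frame
tlb = {}                                            # recently used translations

def translate(virtual_address):
    vpage, offset = virtual_address >> PAGE_BITS, virtual_address & PAGE_MASK
    frame = tlb.get(vpage)
    if frame is None:                     # TLB miss: consult the page table
        if vpage not in page_table:
            raise RuntimeError("page fault: operating system must fetch the page")
        frame = page_table[vpage]
        tlb[vpage] = frame                # save the translation for reuse
    return (frame << PAGE_BITS) | offset

print(hex(translate(0x00400ABC)))         # -> 0x12345abc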
Control and Execution
Instruction Execution Sequence. The semantics of the
instruction determines that a sequence of actions must
be performed to produce the specified result (Fig. 4). These
actions can be overlapped (as discussed in the pipelined
processor section), but the result must appear in the speci-
fied serial order. These actions include the following:
1. fetching the instruction into the instruction register
(IF),
2. decoding the op code of the instruction (ID),
3. generating the address in memory of any data item
residing there (AG),
4. fetching data operand(s) into executable registers
(DF),
5. executing the specified operation (EX), and
6. returning the result to the specified register (WB).

Figure 4. Instruction execution sequence (IF, ID, AG, DF, EX, WB).
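Because the actions can be overlapped, several instructions may be in flight at once while their results still retire in the specified order; the toy schedule below (one instruction issued per cycle, no stalls, all assumptions for illustration) simply makes that overlap visible.

STAGES = ["IF", "ID", "AG", "DF", "EX", "WB"]

def schedule(num_instructions):
    # Instruction i occupies stage s during cycle i + s (ideal pipeline, no stalls).
    for cycle in range(num_instructions + len(STAGES) - 1):
        active = {STAGES[cycle - i]: f"I{i}"
                  for i in range(num_instructions)
                  if 0 <= cycle - i < len(STAGES)}
        print(f"cycle {cycle}: {active}")

schedule(3)   # three instructions overlap; each still completes in program order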
Decode: Hardwired and Microcode. The decoder pro-
duces the control signals that enable the functional units
to execute the actions that produce the result specified
by the instruction. Each cycle, the decoder produces a
new set of control values that connect various registers
and functional units. The decoder takes as an initial input
the op code of the instruction. Using this op code, it gene-
rates the sequence of actions, one per cycle, which com-
pletes the execution process. The last step of the current
instruction's execution is the fetching of the next instruc-
tion into the instruction register so that it may be decoded.
The implementation of the decoder may be based on
Boolean equations that directly implement the specied
actions for each instruction. When these equations are
implemented with logic gates, the resultant decoder is
called a hardwired decoder.
For extended instruction sets or complex instructions,
another implementation is sometimes used, which is based
on the use of a fast storage (or microprogram store). A
particular word in the storage (called a microcode) contains
the control information for a single action or cycle. A
sequence of microinstructions implements the instruction
execution.
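The idea of a microprogram store can be sketched as a lookup of control words; the op codes and control-signal names below are invented purely for illustration and do not come from the original text:

# Illustrative microprogram store: each op code maps to a sequence of
# control words, one word (a set of asserted control signals) per cycle.
MICROCODE = {
    "ADD":  [{"IF"}, {"ID"}, {"DF"}, {"EX", "ALU_ADD"}, {"WB"}],
    "LOAD": [{"IF"}, {"ID"}, {"AG"}, {"DF", "MEM_READ"}, {"WB"}],
}

def execute(opcode):
    # Step through the microinstruction sequence for a single instruction.
    for cycle, control_word in enumerate(MICROCODE[opcode]):
        print(f"cycle {cycle}: assert {sorted(control_word)}")

execute("LOAD")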
Data Paths: Busses and Functional Units. The data paths
of the processor include all the functional units needed
to implement the vocabulary (or op codes) of the instruc-
tion set. Typical functional units are the ALU (arithmetic
logic unit) and the floating-point unit. Busses and other
structured interconnections between the registers and
the functional units complete the data paths.
PROGRAM PARALLELISM AND PARALLEL ARCHITECTURE
Exploiting program parallelism is one of the most impor-
tant elements in computer architecture design. Programs
written in imperative languages encompass the following
four levels of parallelism:
1. parallelism at the instruction level (fine-grained),
2. parallelism at the loop level (middle-grained),
3. parallelism at the procedure level (middle-grained),
and
4. parallelism at the program level (coarse-grained).
Instruction-level parallelism (ILP) means that multiple
operations can be executed in parallel within a program.
ILP may be achieved with hardware, compiler, or operating
system techniques. At the loop level, consecutive loop itera-
tions are ideal candidates for parallel execution provided
that there is no data dependency between subsequent loop
iterations. Next, there is parallelism available at the pro-
cedure level, which depends largely on the algorithms used
in the program. Finally, multiple independent programs
can obviously execute in parallel.
Different computer architectures have been built to
exploit this inherent parallelism. In general, a computer
architecture consists of one or more interconnected proces-
sor elements that operate concurrently, solving a single
overall problem. The various architectures can be conve-
niently described using the stream concept. A stream is
simply a sequence of objects or actions. There are both
instruction streams and data streams, and there are four
simple combinations that describe the most familiar par-
allel architectures (6):
1. SISD (single instruction, single data stream): the
traditional uniprocessor (Fig. 5).
2. SIMD (single instruction, multiple data stream),
which includes array processors and vector proces-
sors (Fig. 6).
3. MISD (multiple instruction, single data stream),
which are typically systolic arrays (Fig. 7).
4. MIMD (multiple instruction, multiple data stream),
which includes traditional multiprocessors as well as
the newer networks of workstations (Fig. 8).
The stream description of computer architectures
serves as a programmer's view of the machine. If the pro-
cessor architecture allows for parallel processing of one
sort or another, then this information is also visible to
the programmer. As a result, there are limitations to the
stream categorization. Although it serves as a useful short-
hand, it ignores many subtleties of an architecture or an
implementation. Even an SISD processor can be highly
parallel in its execution of operations. This parallelism is
typically not visible to the programmer even at the assem-
bly language level, but becomes visible at execution time
with improved performance.
There are many factors that determine the overall effec-
tiveness of a parallel processor organization. The interconnec-
tion network, for instance, can affect the overall speedup.
The characterizations of both processors and networks are
complementary to the stream model and, when coupled
with the stream model, enhance the qualitative under-
standing of a given processor configuration.
SISD Single Instruction, Single Data Stream
The SISD class of processor architecture includes most
commonly available computers. These processors are
known as uniprocessors and can be found in millions of
embedded processors in video games and home appliances
as well as stand-alone processors in home computers, engi-
neering workstations, and mainframe computers. Although
Figure 5. SISD single instruction, single data stream.
Figure 6. SIMD single instruction, multiple data stream.
Figure 7. MISD multiple instruction, single data stream.
Figure 8. MIMD multiple instruction, multiple data stream.
a programmer may not realize the inherent parallelism
within these processors, a good deal of concurrency can be
present. Pipelining is a powerful technique that is used in
almost all current processor implementations. Other tech-
niques aggressively exploit parallelism in executing code
whether it is declared statically or determined dynamically
from an analysis of the code stream.
During execution, a SISD processor executes one or
more operations per clock cycle from the instruction
stream. An instruction is a container that represents the
smallest execution packet managed explicitly by the pro-
cessor. One or more operations are contained within an
instruction. The distinction between instructions and
operations is crucial to distinguish between processor beha-
viors. Scalar and superscalar processors consume one or
more instructions per cycle where each instruction contains
a single operation. VLIW processors, on the other hand,
consume a single instruction per cycle where this instruc-
tion contains multiple operations.
A SISD processor has four primary characteristics.
The first characteristic is whether the processor is capable
of executing multiple operations concurrently. The second
characteristic is the mechanism by which operations
are scheduled for execution: statically at compile time,
dynamically at execution, or possibly both. The third char-
acteristic is the order in which operations are issued and retired
relative to the original program order: these operations
can be in order or out of order. The fourth characteristic is
the manner in which exceptions are handled by the pro-
cessor: precise, imprecise, or a combination. This last
condition is not of immediate concern to the applications
programmer, although it is certainly important to the
compiler writer or operating system programmer who
must be able to properly handle exception conditions.
Most processors implement precise exceptions, although
a few high-performance architectures allow imprecise
floating-point exceptions.
Tables 1, 2, and 3 describe some representative scalar
processors, superscalar processors, and VLIW processors.
Scalar Processor. Scalar processors process a maximum
of one instruction per cycle and execute a maximum of one
operation per cycle. The simplest scalar processors, sequen-
tial processors, process instructions atomically one after
another. This sequential execution behavior describes the
sequential execution model, which requires that each instruction
be executed to completion in sequence. In the sequential
execution model, execution is instruction-precise if the
following conditions are met:
1. All instructions (or operations) preceding the current
instruction (or operation) have been executed and all
results have been committed.
2. All instructions (or operations) after the current
instruction (or operation) are unexecuted and no
results have been committed.
3. The current instruction (or operation) is in an arbi-
trary state of execution and may or may not have
completed or had its results committed.
For scalar and superscalar processors with only a single
operation per instruction, instruction-precise and operation-
precise executions are equivalent. The traditional definition
of sequential execution requires instruction-precise
execution behavior at all times, mimicking the execution
of a nonpipelined sequential processor.
Sequential Processor. Sequential processors directly
implement the sequential execution model. These proces-
sors process instructions sequentially from the instruction
stream. The next instruction is not processed until all
execution for the current instruction is complete and its
results have been committed.
Although conceptually simple, executing each instruction
sequentially has significant performance drawbacks:
a considerable amount of time is spent in overhead and not
in actual execution. Thus, the simplicity of directly imple-
menting the sequential execution model has significant
performance costs.
Pipelined Processor. Pipelining is a straightforward
approach to exploiting parallelism that is based on concur-
rently performing different phases (instruction fetch,
decode, execution, etc.) of processing an instruction. Pipe-
lining assumes that these phases are independent between
different operations and can be overlapped; when this
condition does not hold, the processor stalls the down-
stream phases to enforce the dependency. Thus, multiple
operations can be processed simultaneously with each
operation at a different phase of its processing. Figure 9
illustrates the instruction timing in a pipelined processor,
assuming that the instructions are independent. The
Table 1. Typical Scalar Processors (SISD)
Processor | Year of introduction | Number of function units | Issue width | Scheduling | Number of transistors
Intel 8086 | 1978 | 1 | 1 | Dynamic | 29K
Intel 80286 | 1982 | 1 | 1 | Dynamic | 134K
Intel 80486 | 1989 | 2 | 1 | Dynamic | 1.2M
HP PA-RISC 7000 | 1991 | 1 | 1 | Dynamic | 580K
Sun SPARC | 1992 | 1 | 1 | Dynamic | 1.8M
MIPS R4000 | 1992 | 2 | 1 | Dynamic | 1.1M
ARM 610 | 1993 | 1 | 1 | Dynamic | 360K
ARM SA-1100 | 1997 | 1 | 1 | Dynamic | 2.5M
meaning of each pipeline stage is described in the Instruction
Execution Sequence section.
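The overlapped timing can be made concrete with a minimal C sketch. It assumes the six stages named above (IF, ID, AG, DF, EX, WB), one stage per cycle, and no stalls; the instruction count and output format are illustrative only.

/* Minimal sketch (not from the original article): which cycle each stage of
 * each instruction occupies in a six-stage pipeline, assuming independent
 * instructions and no stalls. */
#include <stdio.h>

int main(void) {
    const char *stages[] = {"IF", "ID", "AG", "DF", "EX", "WB"};
    const int n_stages = 6;
    const int n_instr = 4;          /* number of independent instructions */

    for (int i = 0; i < n_instr; i++) {
        printf("instruction %d:", i + 1);
        for (int s = 0; s < n_stages; s++) {
            /* instruction i enters the pipeline at cycle i+1 and occupies
             * one stage per cycle, so it reaches stage s at cycle i+s+1 */
            printf("  %s@cycle%2d", stages[s], i + s + 1);
        }
        printf("\n");
    }
    /* After the pipeline fills, one instruction completes per cycle even
     * though each instruction still takes six cycles end to end. */
    return 0;
}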
For a simple pipelined machine, only one operation
occurs in each phase at any given time. Thus, one operation
is being fetched, one operation is being decoded, one opera-
tion is accessing operands, one operation is in execution,
and one operation is storing results. The most rigid form of a
pipeline, sometimes called the static pipeline, requires the
processor to go through all stages or phases of the pipeline
whether required by a particular instruction or not.
A dynamic pipeline allows the bypassing of one or more of
the stages of the pipeline depending on the requirements of
the instruction. There are at least three levels of sophisti-
cation within the category of dynamic pipeline processors
as follows:
• Type 1: Dynamic pipelines that require instructions to
be decoded in sequence and results to be executed and
written back in sequence. For these types of simpler
dynamic pipeline processors, the advantage over a
static pipeline is relatively modest. In-order execution
requires the actual change of state to occur in the order
specified in the instruction sequence.
• Type 1-Extended: A popular extension of the Type 1
pipeline is to require the decode to be in order, but
the execution stage of ALU operations need not be in
order. In these organizations, the address generation
stage of the load and store instructions must be
completed before any subsequent ALU instruction
does a writeback. The reason is that the address gen-
eration may cause a page fault and affect the
Table 3. Typical VLIW Processors (SISD)
Processor | Year of introduction | Number of function units | Issue width | Scheduling | Issue/complete order
Multiflow Trace 7/200 | 1987 | 7 | 7 | Static | In-order/in-order
Multiflow Trace 14/200 | 1987 | 14 | 14 | Static | In-order/in-order
Multiflow Trace 28/200 | 1987 | 28 | 28 | Static | In-order/in-order
Cydrome Cydra 5 | 1987 | 7 | 7 | Static | In-order/in-order
Philips TM-1 | 1996 | 27 | 5 | Static | In-order/in-order
TI TMS320/C62x | 1997 | 8 | 8 | Static | In-order/in-order
Intel Itanium | 2001 | 9 | 6 | Static | In-order/in-order
Intel Itanium 2 | 2003 | 11 | 6 | Static | In-order/in-order
Table 2. Typical Superscalar Processors (SISD)
Processor | Year of introduction | Number of function units | Issue width | Scheduling | Number of transistors
HP PA-RISC 7100 | 1992 | 2 | 2 | Dynamic | 850K
Motorola PowerPC 601 | 1993 | 4 | 3 | Dynamic | 2.8M
MIPS R8000 | 1994 | 6 | 4 | Dynamic | 3.4M
DEC Alpha 21164 | 1994 | 4 | 4 | Dynamic | 9.3M
Motorola PowerPC 620 | 1995 | 4 | 2 | Dynamic | 7M
MIPS R10000 | 1995 | 5 | 4 | Dynamic | 6.8M
HP PA-RISC 7200 | 1995 | 3 | 2 | Dynamic | 1.3M
Intel Pentium Pro | 1995 | 5 | 3/6 (1) | Dynamic | 5.5M
DEC Alpha 21064 | 1992 | 4 | 2 | Dynamic | 1.7M
Sun Ultra I | 1995 | 9 | 4 | Dynamic | 5.2M
Sun Ultra II | 1996 | 9 | 4 | Dynamic | 5.4M
AMD K5 | 1996 | 6 | 4/4 (1) | Dynamic | 4.3M
Intel Pentium II | 1997 | 5 | 3/6 (1) | Dynamic | 7.5M
AMD K6 | 1997 | 7 | 2/6 (1) | Dynamic | 8.8M
Motorola PowerPC 740 | 1997 | 6 | 3 | Dynamic | 6.4M
DEC Alpha 21264 | 1998 | 6 | 4 | Dynamic | 15.2M
HP PA-RISC 8500 | 1998 | 10 | 4 | Dynamic | 140M
Motorola PowerPC 7400 | 1999 | 10 | 3 | Dynamic | 6.5M
AMD K7 | 1999 | 9 | 3/6 (1) | Dynamic | 22M
Intel Pentium III | 1999 | 5 | 3/6 (1) | Dynamic | 28M
Sun Ultra III | 2000 | 6 | 4 | Dynamic | 29M
DEC Alpha 21364 | 2000 | 6 | 4 | Dynamic | 100M
AMD Athlon 64 FX51 | 2003 | 9 | 3/6 (1) | Dynamic | 105M
Intel Pentium 4 Prescott | 2003 | 5 | 3/6 (1) | Dynamic | 125M
(1) For some Intel x86 family processors, each instruction is broken into a number of
microoperation codes in the decoding stage. In this article, two different issue widths
are given for these processors: the first one is the maximum number of instructions
issued per cycle, and the second one is the maximum number of microoperation codes
issued per cycle.
processor state. As a result of these restrictions and the
overall frequency of load and store instructions, the
Type 1-Extended pipeline behaves much as the basic
Type 1 pipeline.
• Type 2: Dynamic pipelined machines that can be con-
figured to allow out-of-order execution yet retain in-
order instruction decode. For this type of pipelined
processor, the execution and writeback of all instruc-
tions is a function only of dependencies on prior
instructions. If a particular instruction is independent
of all preceding instructions, its execution can be com-
pleted independently of the successful completion of
prior instructions.
• Type 3: The third type of dynamic pipeline allows
instructions to be issued as well as completed out of
order. A group of instructions is analyzed together, and
the first instruction that is found to be independent of
prior instructions is decoded.
Instruction-level Parallelism. Although pipelining does
not necessarily lead to executing multiple instructions at
exactly the same time, there are other techniques that do.
These techniques may use some combination of static sche-
duling and dynamic analysis to perform concurrently the
actual evaluation phase of several different operations,
potentially yielding an execution rate of greater than one
operation every cycle. This kind of parallelism exploits
concurrency at the computation level. As historically
most instructions consist of only a single operation, this
kind of parallelism has been named instruction-level par-
allelism (ILP).
Two architectures that exploit ILP are superscalar
and VLIW, which use radically different techniques to
achieve greater than one operation per cycle. A superscalar
processor dynamically examines the instruction stream to
determine which operations are independent and can be
executed. A VLIW processor depends on the compiler to
analyze the available operations (OP) and to schedule
independent operations into wide instruction words; the
processor then executes these operations in parallel with no further
analysis.
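The compiler's role can be sketched as a simple greedy packing pass. The operation set, dependencies, and issue width below are hypothetical, and real VLIW compilers perform far more elaborate scheduling; the sketch only shows the basic idea of placing each operation into the earliest instruction word permitted by its producers and the issue width.

/* Minimal sketch (hypothetical example, not the article's algorithm): a
 * greedy compile-time pass that packs operations into wide instruction
 * words. Each operation is placed in the earliest word after all of its
 * producers, subject to the machine's issue width. */
#include <stdio.h>

#define N_OPS 6
#define ISSUE_WIDTH 2    /* operations per VLIW instruction word (assumed) */

int main(void) {
    /* dep[i][j] = 1 means operation i depends on operation j (j < i) */
    int dep[N_OPS][N_OPS] = {0};
    dep[2][0] = 1;               /* op2 uses the result of op0 */
    dep[3][1] = 1;               /* op3 uses the result of op1 */
    dep[5][2] = dep[5][3] = 1;   /* op5 uses op2 and op3 */

    int word[N_OPS];             /* instruction word assigned to each op */
    int used[N_OPS] = {0};       /* slots already used in each word */

    for (int i = 0; i < N_OPS; i++) {
        int earliest = 0;
        for (int j = 0; j < i; j++)
            if (dep[i][j] && word[j] + 1 > earliest)
                earliest = word[j] + 1;          /* must follow producers */
        while (used[earliest] >= ISSUE_WIDTH)    /* word already full */
            earliest++;
        word[i] = earliest;
        used[earliest]++;
        printf("op%d -> instruction word %d\n", i, earliest);
    }
    return 0;
}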
Figure 10 shows the instruction timing of a pipelined
superscalar or VLIW processor executing two instructions
per cycle. In this case, all the instructions are independent
so that they can be executed in parallel.
Superscalar Processor. Dynamic out-of-order pipelined
processors reach the limits of performance for a scalar
processor by allowing out-of-order operation execution.
Unfortunately, these processors remain limited to execut-
ing a single operation per cycle by virtue of their scalar
nature. This limitation can be avoided with the addition
of multiple functional units and a dynamic scheduler to
process more than one instruction per cycle. These result-
ing superscalar processors can achieve execution rates of
more than one instruction per cycle. The most significant
advantage of a superscalar processor is that processing
multiple instructions per cycle is done transparently to
the user, and that it can provide binary compatibility while
achieving better performance.
Compared with an out-of-order pipelined processor, a
superscalar processor adds a scheduling instruction win-
dow that dynamically analyzes multiple instructions from
the instruction stream. Although processed in parallel,
these instructions are treated in the same manner as in
an out-of-order pipelined processor. Before an instruction is
issued for execution, dependencies between the instruction
and its prior instructions must be checked by hardware.
As a result of the complexity of the dynamic scheduling
logic, high-performance superscalar processors are limited
to processing four to six instructions per cycle (refer to the
Examples of Recent Architecture section). Although super-
scalar processors can take advantage of dynamic execution
behavior and exploit instruction-level parallelism from
the dynamic instruction stream, exploiting high degrees
of instruction-level parallelism requires a different approach. An alter-
native approach is to rely on the compiler to perform the
dependency analyses and to eliminate the need for complex
analyses performed in hardware.
VLIW Processor. In contrast to dynamic analyses in
hardware to determine which operations can be executed
in parallel, VLIW processors rely on static analyses in the
compiler. VLIW processors are, thus, less complex than
superscalar processors and have the potential for higher
performance. A VLIW processor executes operations from
statically scheduled instructions that contain multiple
Figure 9. Instruction timing in a pipelined processor.
independent operations. Although it is not required that
static processors exploit instruction-level parallelism, most
statically scheduled processors use wide instruction
words. As the complexity of a VLIW processor is not
significantly greater than that of a scalar processor, the
improved performance comes without the complexity
penalties.
On the other hand, VLIW processors rely on the static
analyses performed by the compiler and are unable to take
advantage of any dynamic execution characteristics. As
issue widths become wider and wider, the avoidance of
complex hardware logic will rapidly erode the benefits of
out-of-order execution. This benefit becomes more significant
as memory latencies increase and the benefits from
out-of-order execution become a less significant portion of
the total execution time. For applications that can be
statically scheduled to use the processor resources effec-
tively, a simple VLIW implementation results in high
performance.
Unfortunately, not all applications can be effectively
scheduled for VLIW processors. In real systems, execution
rarely proceeds exactly along the path defined by the code
scheduler in the compiler. There are two classes of execution
variations that can develop and affect the scheduled
execution behavior:
1. Delayed results from operations whose latency differs
from the assumed latency scheduled by the compiler.
2. Interruptions from exceptions or interrupts, which
change the execution path to a completely different
and unanticipated code schedule.
Although stalling the processor can control delayed
results, this solution can result in significant performance
penalties from having to distribute the stall signal across
the processor. Delays occur from many causes including
mismatches between the architecture and an implementa-
tion as well as from special-case conditions that require
additional cycles to complete an operation. The most com-
mon execution delay is a data cache miss; another example
is a floating-point operation that requires an additional
normalization cycle. For processors without hardware
resource management, delayed results can cause resource
conflicts and incorrect execution behavior. VLIW proces-
sors typically avoid all situations that can result in a delay
by not using data caches and by assuming worst-case
latencies for operations. However, when there is insufficient
parallelism to hide the exposed worst-case operation
latency, the instruction schedule will have many incompletely
filled or empty instructions that can result in poor
performance.
Interruptions are usually more difficult to control than
delayed results. Managing interruptions is a significant
problem because of their disruptive behavior and because
the origins of interruptions are often completely beyond a
program's control. Interruptions develop from execution-
related internal sources (exceptions) as well as arbitrary
external sources (interrupts). The most common interruption
is an operation exception resulting from either an error
condition during execution or a special-case condition that
requires additional corrective action to complete operation
execution. Whatever the source, all interruptions require
the execution of an appropriate service routine to resolve
the problem and to restore normal execution at the point of
the interruption.
SIMD Single Instruction, Multiple Data Stream
The SIMD class of processor architecture includes both
array and vector processors. The SIMD processor is a
natural response to the use of certain regular data struc-
tures, such as vectors and matrices. From the reference
point of an assembly-level programmer, programming
SIMD architecture appears to be very similar to pro-
gramming a simple SISD processor except that some opera-
tions perform computations on aggregate data. As these
Figure 10. Instruction timing of a pipelined ILP processor.
regular structures are widely used in scientific programming,
the SIMD processor has been very successful in these
environments.
The two popular types of SIMD processor are the array
processor and the vector processor. They differ both in
their implementations and in their data organizations.
An array processor consists of many interconnected pro-
cessor elements that each have their own local memory
space. A vector processor consists of a single processor that
references a single global memory space and has special
function units that operate specifically on vectors. Tables 4
and 5 describe some representative vector processors and
array processors.
Array Processors. The array processor is a set of parallel
processor elements connected via one or more networks,
possibly including local and global interelement commu-
nications and control communications. Processor elements
operate in lockstep in response to a single broadcast
instruction from a control processor. Each processor ele-
ment has its own private memory and data is distributed
across the elements in a regular fashion that is dependent
on both the actual structure of the data and also on the
computations to be performed on the data. Direct access
to global memory or another processor element's local
memory is expensive, so intermediate values are pro-
pagated through the array through local interprocessor
connections, which requires that the data be distributed
carefully so that the routing required to propagate these
values is simple and regular. It is sometimes easier to
duplicate data values and computations than it is to effect
a complex or irregular routing of data between processor
elements.
As instructions are broadcast, there is no means local
to a processor element of altering the flow of the instruction
stream; however, individual processor elements can con-
ditionally disable instructions based on local status
information: these processor elements are idle when
this condition occurs. The actual instruction stream con-
sists of more than a fixed stream of operations; an array
processor is typically coupled to a general-purpose control
processor that provides both scalar operations as well as
array operations that are broadcast to all processor ele-
ments in the array. The control processor performs the
scalar sections of the application, interfaces with the out-
side world, and controls the ow of execution; the array
processor performs the array sections of the application as
directed by the control processor.
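The lockstep, conditionally disabled execution described above can be illustrated with a brief C sketch; the number of processor elements, the data, and the two broadcast operations are invented purely for illustration.

/* Minimal sketch (illustrative only): lockstep execution in an array
 * processor. A single broadcast operation is applied by every processor
 * element to its own local data; elements whose local condition fails
 * simply sit idle for that instruction. */
#include <stdio.h>

#define N_PE 8

int main(void) {
    int local[N_PE]  = {3, -1, 4, -1, 5, 9, -2, 6};
    int enable[N_PE];

    /* Broadcast 1: each element tests its own datum (sets local status). */
    for (int pe = 0; pe < N_PE; pe++)
        enable[pe] = (local[pe] > 0);

    /* Broadcast 2: "double your value"; disabled elements ignore it. */
    for (int pe = 0; pe < N_PE; pe++)
        if (enable[pe])
            local[pe] *= 2;

    for (int pe = 0; pe < N_PE; pe++)
        printf("PE %d: %d\n", pe, local[pe]);
    return 0;
}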
A suitable application for use on an array processor
has several key characteristics: a significant amount of
data that has a regular structure; computations on the
data that are uniformly applied to many or all elements
of the dataset; and simple and regular patterns relating
the computations and the data. An example of an appli-
cation that has these characteristics is the solution of the
Navier-Stokes equations, although any application that
has significant matrix computations is likely to benefit from
the concurrent capabilities of an array processor.
The programmer's reference point for an array processor
is typically the high-level language level; the programmer
is concerned with describing the relationships between the
data and the computations but is not directly concerned
with the details of scalar and array instruction scheduling
or the details of the interprocessor distribution of data
within the processor. In fact, in many cases, the program-
mer is not even concerned with the size of the array processor.
In general, the programmer specifies the size and any
Table 4. Typical Vector Computers (SIMD)
Processor | Year of introduction | Memory- or register-based | Number of processor units | Maximum vector length
Cray 1 | 1976 | Register | 1 | 64
CDC Cyber 205 | 1981 | Memory | 1 | 65535
Cray X-MP | 1982 | Register | 1-4 | 64
Cray 2 | 1985 | Register | 5 | 64
Fujitsu VP-100/200 | 1985 | Register | 3 | 32-1024
ETA ETA | 1987 | Memory | 2-8 | 65535
Cray Y-MP/832 | 1989 | Register | 1-8 | 64
Cray Y-MP/C90 | 1991 | Register | 16 | 64
Convex C3 | 1991 | Register | 1-8 | 128
Cray T90 | 1995 | Register | 1-32 | 128
NEC SX-5 | 1998 | Register | 1-512 | 256
Table 5. Typical Array Processors (SIMD)
Processor | Year of introduction | Memory model | Processor element | Number of processors
Burroughs BSP | 1979 | Shared | General purpose | 16
Thinking Machine CM-1 | 1985 | Distributed | Bit-serial | Up to 65,536
Thinking Machine CM-2 | 1987 | Distributed | Bit-serial | 4,096-65,536
MasPar MP-1 | 1990 | Distributed | Bit-serial | 1,024-16,384
specific distribution information for the data and the
compiler maps the implied virtual processor array onto
the physical processor elements that are available and
generates code to perform the required computations.
Thus, although the size of the processor is an important
factor in determining the performance that the array
processor can achieve, it is not a fundamental characteristic
of an array processor.
The primary characteristic of a SIMD processor is
whether the memory model is shared or distributed. In
this section, only processors using a distributed memory
model are described as this configuration is used by SIMD
processors today and the cost of scaling a shared-memory
SIMD processor to a large number of processor elements
would be prohibitive. Processor elements and network
characteristics are also important in characterizing a
SIMD processor.
Vector Processors. A vector processor is a single proces-
sor that resembles a traditional SISD processor except
that some of the function units (and registers) operate
on vectors: sequences of data values that are seemingly
operated on as a single entity. These function units are
deeply pipelined and have a high clock rate; although the
vector pipelines have as long or longer latency than a
normal scalar function unit, their high clock rate and the
rapid delivery of the input vector data elements results in a
significant throughput that cannot be matched by scalar
function units.
Early vector processors processed vectors directly from
memory. The primary advantage of this approach was
that the vectors could be of arbitrary lengths and were
not limited by processor resources; however, the high start-
up cost, limited memory system bandwidth, and memory
system contention proved to be significant limitations.
Modern vector processors require that vectors be explicitly
loaded into special vector registers and stored back into
memory, the same course that modern scalar processors
have taken for similar reasons. However, as vector regis-
ters can rapidly produce values for or collect results from
the vector function units and have low startup costs, mod-
ern register-based vector processors achieve significantly
higher performance than the earlier memory-based vector
processors for the same implementation technology.
Modern vector processors have several features that enable
themto achieve high performance. One feature is the ability
to concurrently load and store values between the vector
register le and main memory while performing computa-
tions on values in the vector register le. This feature is
important because the limited length of vector registers
requires that longer vectors be processed in segments, a
technique called strip-mining. Not being able to overlap
memory accesses and computations would pose a significant
performance bottleneck.
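Strip-mining can be illustrated with a short C sketch; the segment length MAX_VL is an assumed stand-in for the machine's vector register length, and the inner loop stands in for the single vector instruction that would process one segment.

/* Minimal sketch (assumed vector length, not from the article): strip-mining
 * a long vector operation into segments no longer than the vector registers. */
#include <stddef.h>

#define MAX_VL 64   /* assumed vector register length */

void vector_add(double *c, const double *a, const double *b, size_t n) {
    for (size_t start = 0; start < n; start += MAX_VL) {
        size_t vl = (n - start < MAX_VL) ? (n - start) : MAX_VL;
        /* The inner loop is the work one vector instruction would perform on
         * a segment of at most MAX_VL elements; the vector length register
         * would be set to vl for the final, shorter segment. */
        for (size_t i = 0; i < vl; i++)
            c[start + i] = a[start + i] + b[start + i];
    }
}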
Just like SISD processors, vector processors support a
form of result bypassing, in this case called chaining,
that allows a follow-on computation to commence as soon
as the first value is available from the preceding compu-
tation. Thus, instead of waiting for the entire vector to be
processed, the follow-on computation can be significantly
overlapped with the preceding computation that it is
dependent on. Sequential computations can be efficiently
compounded and behave as if they were a single opera-
tion with a total latency equal to the latency of the first
operation plus the pipeline and chaining latencies of the
remaining operations but none of the startup overhead
that would be incurred without chaining. For example,
division could be synthesized by chaining a reciprocal
with a multiply operation. Chaining typically works for
the results of load operations as well as normal computa-
tions. Most vector processors implement some form of
chaining.
A typical vector processor conguration consists of a
vector register le, one vector addition unit, one vector
multiplication unit, and one vector reciprocal unit (used
in conjunction with the vector multiplication unit to per-
form division); the vector register file contains multiple
vector registers. In addition to the vector registers, there
are also a number of auxiliary and control registers, the
most important of which is the vector length register.
The vector length register contains the length of the vector
(or the loaded subvector if the full vector length is longer
than the vector register itself) and is used to control the
number of elements processed by vector operations. There
is no reason to perform computations on non-data, which
would be useless or could cause an exception.
As with the array processor, the programmer's reference
point for a vector machine is the high-level language. In
most cases, the programmer sees a traditional SISD
machine; however, as vector machines excel on vectorizable
loops, the programmer can often improve the performance
of the application by carefully coding the application, in
some cases explicitly writing the code to perform strip-
mining, and by providing hints to the compiler that help
to locate the vectorizable sections of the code. This situation
is purely an artifact of the fact that the programming
languages are scalar oriented and do not support the treat-
ment of vectors as an aggregate data type but only as a
collection of individual values. As languages are defined
(such as Fortran 90 or High-Performance Fortran) that
make vectors a fundamental data-type, then the program-
mer is exposed less to the details of the machine and to its
SIMD nature.
The vector processor has one primary characteristic.
This characteristic is the location of the vectors; vectors
can be memory- or register-based. There are many fea-
tures that vector processors have that are not included
here because of their number and many variations. These
features include variations on chaining, masked vector
operations based on a boolean mask vector, indirectly
addressed vector operations (scatter/gather), compressed/
expanded vector operations, reconfigurable register files,
multiprocessor support, and so on. Vector processors have
developed dramatically from simple memory-based vector
processors to modern multiple-processor vector processors
that exploit both SIMD vector and MIMD style processing.
MISD Multiple Instruction, Single Data Stream
Although it is easy to both envision and design MISD
processors, there has been little interest in this type of
parallel architecture. The reason, so far anyway, is that
12 COMPUTER ARCHITECTURE
there are no ready programming constructs that easily map
programs into the MISD organization.
Conceptually, MISD architecture can be represented as
multiple independently executing function units operating
on a single stream of data, forwarding results from one
function unit to the next, which, on the microarchitecture
level, is exactly what the vector processor does. However, in
the vector pipeline, the operations are simply fragments of
an assembly-level operation, as distinct from being a com-
plete operation. Surprisingly, some of the earliest attempts
at computers in the 1940s could be seen as the MISD
concept. They used plug boards for programs, where data
in the form of a punched card was introduced into the first
stage of a multistage processor. A sequential series of
actions was taken where the intermediate results were
forwarded from stage to stage until, at the final stage, a
result would be punched into a new card.
There are, however, more interesting uses of the MISD
organization. Nakamura has pointed out the value of an
MISD machine called the SHIFT machine. In the SHIFT
machine, all data memory is decomposed into shift regis-
ters. Various function units are associated with each shift
column. Data is initially introduced into the rst column
and is shifted across the shift register memory. In the
SHIFT machine concept, data is regularly shifted from
memory region to memory region (column to column) for
processing by various function units. The purpose behind
the SHIFT machine is to reduce memory latency. In a
traditional organization, any function unit can access
any region of memory and the worst-case delay path for
accessing memory must be taken into account. In the
SHIFT machine, we must only allow for access time to
the worst element in a data column. The memory latency
in modern machines is becoming a major problem; the
SHIFT machine has a natural appeal for its ability to
tolerate this latency.
MIMD Multiple Instruction, Multiple Data Stream
The MIMD class of parallel architecture brings together
multiple processors with some form of interconnection. In
this configuration, each processor executes completely
independently, although most applications require some
form of synchronization during execution to pass informa-
tion and data between processors. Although no require-
ment exists that all processor elements be identical, most
MIMD configurations are homogeneous with all processor
elements identical. There have been heterogeneous MIMD
configurations that use different kinds of processor
elements to perform different kinds of tasks, but these
configurations have not lent themselves to general-purpose appli-
cations. We limit ourselves to homogeneous MIMD orga-
nizations in the remainder of this section.
MIMD Programming and Implementation Considerations.
Up to this point, the MIMD processor with its multiple
processor elements interconnected by a network appears
to be very similar to a SIMD array processor. This simi-
larity is deceptive because there is a significant difference
between these two configurations of processor elements:
in the array processor, the instruction stream delivered to
each processor element is the same, whereas in the MIMD
processor, the instruction stream delivered to each proces-
sor element is independent and specic to each processor
element. Recall that in the array processor, the control
processor generates the instruction stream for each pro-
cessor element and that the processor elements operate
in lockstep. In the MIMD processor, the instruction stream
for each processor element is generated independently by
that processor element as it executes its program. Although
it is often the case that each processor element is running
pieces of the same program, there is no reason that differ-
ent processor elements should not run different programs.
The interconnection network in both the array proces-
sor and the MIMD processor passes data between processor
elements; however, in the MIMD processor, it is also used
to synchronize the independent execution streams between
processor elements. When the memory of the processor is
distributed across all processors and only the local proces-
sor element has access to it, all data sharing is performed
explicitly using messages and all synchronization is
handled within the message system. When the memory
of the processor is shared across all processor elements,
synchronization is more of a problem: certainly messages
can be used through the memory system to pass data
and information between processor elements, but it is
not necessarily the most effective use of the system.
When communications between processor elements is
performed through a shared-memory address space, either
global or distributed between processor elements (called
distributed shared memory to distinguish it from distri-
buted memory), there are two significant problems that
develop. The first is maintaining memory consistency; the
programmer-visible ordering effects of memory references
both within a processor element and between different
processor elements. The second is cache coherency; the
programmer-invisible mechanism ensures that all proces-
sor elements see the same value for a given memory loca-
tion. Neither of these problems is significant in SISD or
SIMD array processors. In a SISD processor, there is only
one instruction stream and the amount of reordering is
limited so the hardware can easily guarantee the effects of
perfect memory reference ordering and thus there is no
consistency problem; because a SISD processor has only
one processor element, cache coherency is not applicable. In
a SIMD array processor (assuming distributed memory),
there is still only one instruction stream and typically no
instruction reordering; because all interprocessor element
communications is via message, there is neither a consis-
tency problem nor a coherency problem.
The memory consistency problem is usually solved
through a combination of hardware and software techni-
ques. At the processor element level, the appearance of
perfect memory consistency is usually guaranteed for local
memory references only, which is usually a feature of the
processor element itself. At the MIMD processor level,
memory consistency is often only guaranteed through
explicit synchronization between processors. In this case,
all nonlocal references are only ordered relative to these
synchronization points. Although the programmer must be
aware of the limitations imposed by the ordering scheme,
COMPUTER ARCHITECTURE 13
the added performance achieved using nonsequential
ordering can be significant.
The cache coherency problem is usually solved exclu-
sively through hardware techniques. This problem is significant
because of the possibility that multiple processor
elements will have copies of data in their local caches, each
copy of which can have different values. There are two
primary techniques to maintain cache coherency. The first
is to ensure that all processor elements are informed of any
change to the shared-memory state: these changes are
broadcast throughout the MIMD processor and each pro-
cessor element monitors these changes (commonly referred
to as snooping). The second is to keep track of all users of a
memory address or block in a directory structure and to
specifically inform each user when there is a change made
to the shared-memory state. In either case, the result of a
change can be one of two things, either the new value is
provided and the local value is updated or all other copies of
the value are invalidated.
As the number of processor elements in a system
increases, a directory-based system becomes significantly
better as the amount of communications required to main-
tain coherency is limited to only those processors holding
copies of the data. Snooping is frequently used within a
small cluster of processor elements to track local changes;
here the local interconnection can support the extra
traffic used to maintain coherency because each cluster
has only a few processor elements in it.
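A directory entry can be sketched as a small data structure; the fields, sharer count, and the simple invalidate-on-write routine below are illustrative assumptions and do not correspond to any particular commercial protocol.

/* Minimal sketch (hypothetical data structure, not a real protocol
 * implementation): a directory entry records which processor elements hold
 * a copy of a memory block, so that only those sharers need to be informed
 * when the block is written. */
#include <stdio.h>
#include <stdint.h>

#define N_PROCS 32

struct dir_entry {
    uint32_t sharers;   /* bit i set => processor element i has a copy */
    int owner;          /* element holding the block exclusively, or -1 */
};

/* On a write by processor `writer`, invalidate every other cached copy. */
static void invalidate_sharers(struct dir_entry *e, int writer) {
    for (int p = 0; p < N_PROCS; p++) {
        if (p != writer && (e->sharers & (1u << p)))
            printf("send invalidate to PE %d\n", p);  /* stand-in for a message */
    }
    e->sharers = 1u << writer;   /* writer is now the only holder */
    e->owner = writer;
}

int main(void) {
    struct dir_entry e = { .sharers = (1u << 3) | (1u << 7) | (1u << 12), .owner = -1 };
    invalidate_sharers(&e, 7);   /* PE 7 writes the block */
    return 0;
}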
The primary characteristic of a MIMD processor is the
nature of the memory address space; it is either separate or
shared for all processor elements. The interconnection
network is also important in characterizing a MIMD pro-
cessor and is described in the next section. With a separate
address space (distributed memory), the only means of
communications between processor elements is through
messages and thus these processors force the programmer
to use a message-passing paradigm. With a shared address
space (shared memory), communications between proces-
sor elements is through the memory system; depending on
the application needs or programmer preference, either a
shared memory or message passing paradigm can be used.
The implementation of a distributed-memory machine is
far easier than the implementation of a shared-memory
machine when memory consistency and cache coherency is
taken into account. However, programming a distributed-
memory processor can be much more difficult because the
applications must be written to exploit and not be limited by
the use of message passing as the only form of communica-
tions between processor elements. On the other hand,
despite the problems associated with maintaining consis-
tency and coherency, programming a shared-memory pro-
cessor can take advantage of whatever communications
paradigm is appropriate for a given communications
requirement and can be much easier to program. Both
distributed- and shared-memory processors can be extre-
mely scalable and neither approach is significantly more
difficult to scale than the other.
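The shared-memory style, with explicit synchronization between the executing streams, can be sketched using POSIX threads; the array size, thread count, and use of a single mutex are assumptions made purely for illustration (compile with -pthread).

/* Minimal sketch (POSIX threads, assumed values): each thread works on its
 * own slice of a shared array and the threads synchronize explicitly through
 * a mutex when combining their partial sums. */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N 1000000

static double data[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / N_THREADS;
    double partial = 0.0;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        partial += data[i];
    pthread_mutex_lock(&lock);     /* explicit synchronization point */
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[N_THREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("sum = %.0f\n", total);   /* expect 1000000 */
    return 0;
}

In a distributed-memory configuration the same combining step would instead be expressed as explicit messages between processor elements.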
MIMD Rationale. MIMD processors usually are
designed for at least one of two reasons: fault tolerance
or program speedup. Ideally, if we have n identical pro-
cessors, the failure of one processor should not affect the
ability of the multiprocessor to continue program execu-
tion. However, this case is not always true. If the operating
system is designated to run on a particular processor and
that processor fails, the system fails. On the other hand,
some multiprocessor ensembles have been built with the
sole purpose of high-integrity, fault-tolerant computation.
Generally, these systems may not provide any program
speedup over a single processor. Systems that duplicate
computations or that triplicate and vote on results are
examples of designing for fault tolerance.
MIMD Speedup: Partitioning and Scheduling. As multi-
processors simply consist of multiple computing elements,
each computing element is subject to the same basic
design issues. These elements are slowed down by branch
delays, cache misses, and so on. The multiprocessor configuration,
however, introduces speedup potential as well
as additional sources of delay and performance degrada-
tion. The sources of performance bottlenecks in multipro-
cessors generally relate to the way the program was
decomposed to allow concurrent execution on multiple
processors. The speedup (Sp) of an MIMD processor ensemble
is defined as

Sp = T(1)/T(n)

or the execution time of a single processor (T(1)) divided by
the execution time for n processors executing the same
application (T(n)). The achievable MIMD speedup depends
on the amount of parallelism available in the program
(partitioning) and how well the partitioned tasks are scheduled.
Partitioning is the process of dividing a program into
tasks, each of which can be assigned to an individual pro-
cessor for execution at run time. These tasks can be repre-
sented as nodes in a control graph. The arcs in the graph
specify the order in which execution of the subtasks must
occur. The partitioning process occurs at compile time, well
before program execution. The goal of the partitioning
process is to uncover the maximum amount of parallelism
possible without going beyond certain obvious machine
limitations.
The program partitioning is usually performed with
some a priori notion of program overhead. Program over-
head (o) is the added time a task takes to be loaded into a
processor before beginning execution. The larger the size of
the minimum task defined by the partitioning program,
the smaller the effect of program overhead. Table 6 gives
an instruction count for various program grain sizes.
The essential difference between multiprocessor concurrency
and instruction-level parallelism is the amount of
overhead expected to be associated with each task. Overhead
affects speedup. If uniprocessor program P_1 does
operations W_1, then the parallel version of P_1 does opera-
tions W_p, where W_p ≥ W_1.
For each task T_i, there is an associated number of over-
head operations o_i, so that if T_i takes W_i operations without
overhead, then

W_p = Σ_i (W_i + o_i) ≥ W_1

where W_p is the total work done by P_p, including overhead.
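A small numeric sketch makes the effect of overhead concrete; the work, task count, and overhead figures below are invented for illustration only.

/* Minimal sketch (invented numbers): speedup Sp = T(1)/T(n) and the way
 * per-task overhead o_i inflates the parallel work W_p = sum(W_i + o_i)
 * relative to the uniprocessor work W_1. */
#include <stdio.h>

int main(void) {
    double W1 = 1000.0;          /* uniprocessor operations */
    int n_tasks = 8;             /* partitioned into 8 equal tasks */
    double overhead = 25.0;      /* assumed overhead operations per task */

    double Wp = 0.0;
    for (int i = 0; i < n_tasks; i++)
        Wp += W1 / n_tasks + overhead;          /* W_i + o_i */

    /* With perfect load balance and one task per processor, T(n) is the time
     * for one task including its overhead, and T(1) is W1. */
    double T1 = W1;
    double Tn = W1 / n_tasks + overhead;
    printf("W_p = %.0f (vs W_1 = %.0f)\n", Wp, W1);       /* 1200 vs 1000 */
    printf("Sp  = %.2f (ideal %d)\n", T1 / Tn, n_tasks);  /* 6.67 vs 8 */
    return 0;
}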
To achieve speedup over a uniprocessor, a multiprocessor
system must achieve the maximum degree of parallelism
among executing subtasks or control nodes. On the other
hand, if we increase the amount of parallelism by using
finer- and finer-grain task sizes, we necessarily increase
the amount of overhead. Moreover, the overhead depends
on the following factors.
• Overhead time is configuration-dependent. Different
shared-memory multiprocessors may have significantly
different task overheads associated with
them, depending on cache size, organization, and
the way caches are shared.
• Overhead may be significantly different depending on
how tasks are actually assigned (scheduled) at run
time. A task returning to a processor whose cache
already contains significant pieces of the task code
or dataset will have a significantly lower overhead
than the same task assigned to an unrelated processor.
Increased parallelism usually corresponds to finer task
granularity and larger overhead. Clustering is the group-
ing together of subtasks into a single assignable task.
Clustering is usually performed both at partitioning time
and during scheduling run time. The reasons for clustering
during partition time might include when
• The available parallelism exceeds the known number
of processors that the program is being compiled for.
• The placement of several shorter tasks that share the
same instruction or data working set into a single task
provides lower overhead.
Scheduling can be performed statically at compile time
or dynamically at run time. Static scheduling information
can be derived on the basis of the probable critical paths,
which alone is insufficient to ensure optimum speedup or
even fault tolerance. Suppose, for example, one of the
processors scheduled statically was unavailable at run
time, having suffered a failure. If only static scheduling
had been done, the program would be unable to execute if
assignment to all n processors had been made. It is also
oftentimes the case that program initiation does not begin
with n designated idle processors. Rather, it begins with a
smaller number as previously executing tasks complete
their work. Thus, the processor availability is difficult to
predict and may vary from run to run. Although run-time
scheduling has obvious advantages, handling changing
systems environments, as well as highly variable program
structures, it also has some disadvantages, primarily its
run-time overhead.
Run-time scheduling can be performed in a number
of different ways. The scheduler itself may run on a parti-
cular processor or it may run on any processor. It can
be centralized or distributed. It is desirable that the
scheduling not be designated to a particular processor,
but rather any processor, and then the scheduling process
itself can be distributed across all available processors.
Types of MIMD processors. Although all MIMD archi-
tectures share the same general programming model, there
are many differences in programming detail, hardware
configuration, and speedup potential. Most differences
develop from the variety of shared hardware, especially
the way the processors share memory. For example,
processors may share at one of several levels:
• Shared internal logic (floating point, decoders, etc.),
shared data cache, and shared memory.
• Shared data cache, shared memory.
• Separate data cache but shared bus, shared memory.
• Separate data cache with separate busses leading to a
shared memory.
• Separate processors and separate memory modules
interconnected with a multistage interconnection net-
work.
• Separate processor-memory systems cooperatively
executing applications via a network.
The basic tradeoff in selecting a type of multiprocessor
architecture is between resource limitations and syn-
chronization delay. Simple architectures are generally
resource-limited and have rather low synchronization and
communications delay overhead. More robust processor-
memory configurations may offer adequate resources for
extensive communications among various processors in
memory, but these configurations are limited by
• delay through the communications network and
• multiple accesses of a single synchronization variable.
The simpler and more limited the multiprocessor configuration,
the easier it is to provide synchronization com-
munications and memory coherency. Each of these
functions requires an access to memory. As long as memory
bandwidth is adequate, these functions can be readily
handled. As processor speed and the number of processors
increase, eventually shared data caches and busses run out
of bandwidth and become the bottleneck in the multipro-
cessor system. Replicating caches or busses to provide
additional bandwidth requires management of not only
Table 6. Grain Size
Grain | Description / Program construct | Typical number of instructions
Fine grain | Basic block; instruction-level parallelism | 5 to 10
Medium grain | Loop/Procedure; loop-level and procedure-level parallelism | 100 to 100,000
Coarse grain | Large task; program-level parallelism | 100,000 or more
the original traffic, but the coherency traffic also. From the
system's point of view, one would expect to find an optimum
level of sharing for each of the shared resources (data
cache, bus, memory, and so on), fostering a hierarchical
view of shared-memory multiprocessing systems.
Multithreaded or shared resource multiprocessing. The
simplest and most primitive type of multiprocessor system
is what is sometimes called multithreaded or what we call
here shared-resource multiprocessing (SRMP). In SRMP,
each of the processors consists of basically only a register
set, which includes a program counter, general registers,
instruction counter, and so on. The driving principle behind
SRMP is to make the best use of processor silicon area.
The functional units and busses are time-shared. The
objective is to eliminate context-switching overhead and
to reduce the realized effect of branch and cache miss
penalties. Each processor executes without significant
instruction-level concurrency, so it executes more slowly
than a more typical SISD, which reduces the per-instruction
effect of processing delays; but the MIMD ensemble can
achieve excellent speedup because of the reduced overhead.
Note that this speedup is relative to a much slower single
processor.
Shared-memory multiprocessing. In the simplest of these
configurations, several processors share a common memory
via a common bus. They may even share a common data
cache or level-2 cache. As bus bandwidth is limited, the
number of processors that can be usefully configured in
this way is quite limited. Several processors sharing a bus
are sometimes referred to as a cluster.
Interconnected multiprocessors. Realizing multiproces-
sor configurations beyond the cluster requires an inter-
connection network capable of connecting any one of n
processor memory clusters to any other cluster. The inter-
connection network provides n switched paths, thereby
increasing the intercluster bandwidth at the expense of
the switch latency in the network and the overall (consider-
able) cost of the network. Programming such systems may
be done either as a shared-memory or message-passing
paradigm. The shared-memory approach requires significant
additional hardware support to ensure the con-
sistency of data in the memory. Message passing has
simpler hardware but is a more complex programming
model.
Cooperative computing: networked multiprocessors.
Simple processor-memory systems with LAN or even Internet
connection can, for particular problems, be quite effective
multiprocessors. Such configurations are sometimes
called network of workstations (NOW).
Table 7 illustrates some of the tradeoffs possible in
configuring multiprocessor systems. Note that the applica-
tion determines the effectiveness of the system. As archi-
tects consider various ways of facilitating interprocessor
communication in a shared-memory multiprocessor, they
must be constantly aware of the cost required to improve
interprocessor communications. In a typical shared-mem-
ory multiprocessor, the cost does not scale linearly; each
additional processor requires additional network services
and facilities. Depending on the type of interconnection, the
cost for an additional processor may increase at a greater
than linear rate.
For those applications that require rapid communica-
tions and have a great deal of interprocessor communica-
tions traffic, this added cost is quite acceptable. It is readily
justified on a cost-performance basis. However, many other
applications, including many naturally parallel applica-
tions, may have limited interprocessor communications.
In many simulation applications, the various cases to be
simulated can be broken down and treated as independent
tasks to be run on separate processors with minimum
interprocessor communication. For these applications, sim-
ple networked systems of workstations provide perfectly
adequate communications services. For applications whose
program execution time greatly exceeds their interprocessor
communication time, the message-passing time is quite
acceptable.
The problem for the multiprocessor systems architect is
to create a system that can generally satisfy a broad spec-
trum of applications, which requires a system whose costs
scale linearly with the number of processors and whose
overall cost effectively competes with the NOW (the simple
network of workstations) on the one hand, and satisfies
the more aggressive communications requirement for
those applications that demand it on the other. As with
any systems design, it is impossible to satisfy the require-
ments of all applications. The designer simply must choose
Table 7. Various Multiprocessor Configurations
Type | Physical sharing | Programmer's model | Remote data access latency | Comments
Multi-threaded | ALU, data cache, memory | Shared memory | No delay | Eliminates context switch overhead but limited possible Sp.
Clustered | Bus and memory | Shared memory | Small delay due to bus congestion | Limited Sp due to bus bandwidth limits.
Interconnection network (1) | Interconnection network and memory | Shared memory | Order of 100 cycles | Typically 16-64 processors; requires memory consistency support.
Interconnection network (2) | Interconnection network | Message passing | Order of 100 cycles plus message decode overhead | Scalable by application; needs programmer's support.
Cooperative multiprocessors | Only LAN or similar network | Message passing | More than 0.1 ms | Limited to applications that require minimum communications.
a broad enough set of applications and design a system
robust enough to satisfy those applications. Table 8 shows
some representative MIMD computer systems from 1990 to
2000.
COMPARISONS AND CONCLUSIONS
Examples of Recent Architectures
This section describes some recent microprocessors and
computer systems, and it illustrates how computer archi-
tecture has evolved over time. In the last section, scalar
processors are described as the simplest kind of SISD
processor, capable of executing only one instruction at a
time. Table 1 depicts some commercial scalar processors
(7,8).
Intel 8086, which was released in 1978, consists of
only 29,000 transistors. In contrast, Pentium III (from
the same x86 family) contains more than 28,000,000 tran-
sistors. The huge increase in the transistor count is made
possible by the phenomenal advancement in VLSI technol-
ogy. These transistors allow simple scalar processors to
evolve into more complicated architectures and achieve
better performance. Many processor families, such as Intel
x86, HP PA-RISC, Sun SPARC, and MIPS, have evolved
from scalar processors to superscalar processors, exploiting
a higher level of instruction-level parallelism. In most
cases, the migration is transparent to the programmers,
as the binary codes running on the scalar processors can
continue to run on the superscalar processors. At the same
time, simple scalar processors (such as MIPS R4000 and
ARM processors) still remain very popular in embedded
systems because performance is less important than cost,
power consumption, and reliability for most embedded
applications.
Table 2 shows some representative superscalar pro-
cessors from 1992 to 2004 (7,8). In this period of time,
the number of transistors in a superscalar processor has
escalated from 1,000,000 to more than 100,000,000. Inter-
estingly, most transistors are not used to improve the
instruction-level parallelism in the superscalar architec-
tures. Actually, the instruction issue width remains
roughly the same (between 2 to 6) because the overhead
(such as cycle time penalty) to build a wider machine, in
turn, can adversely affect the overall processor perfor-
mance. In most cases, many of these transistors are used
in the on-chip cache to reduce the memory access time. For
instance, most of the 140,000,000 transistors in HP PA-
8500 are used in the 1.5MB on-chip cache (512 KB instruc-
tion cache and 1MB data cache).
Table 3 presents some representative VLIW processors
(7,9). There have been very few commercial VLIW proces-
sors in the past, mainly due to poor compiler technology.
Recently, there has been major advancement in VLIW
compiler technology. In 1997, TI TMS320/C62x became
the rst DSP chip using VLIW architecture. The simple
architecture allows TMS320/C62x to run at a clock fre-
quency (200MHz) much higher than traditional DSPs.
After the demise of Multiflow and Cydrome, HP acquired
their VLIW technology and co-developed the IA-64 archi-
tecture (the rst commercial general-purpose VLIW pro-
cessor) with Intel.
Although SISD processors and computer systems are
commonly used for most consumer and business applica-
tions, SIMD and MIMD computers are used extensively for
scientic and high-end business applications. As described
in the previous section, vector processors and array pro-
cessors are the two different types of SIMD architecture. In
the last 25 years, vector processors have developed from a
single processor unit (Cray 1) to 512 processor units (NEC
SX-5), taking advantage of both SIMD and MIMD proces-
sing. Table 4 shows some representative vector processors.
On the other hand, there has not been a significant
number of array processors due to a limited application
base and market requirement. Table 5 shows several repre-
sentative array processors.
For MIMD computer systems, the primary considera-
tions are the characterization of the memory address space
and the interconnection network among the processing
elements. The comparison of shared-memory and mes-
sage-passing programming paradigms was discussed in
the last section. At this time, shared-memory programming
paradigm is more popular, mainly because of its flexibility
and ease of use. As shown in Table 8, the latest Cray
supercomputer (Cray T3E-1350), which consists of up to
2176 DEC Alpha 21164 processors with distributed
Table 8. Typical MIMD Systems
System | Year of introduction | Processor element | Number of processors | Memory distribution | Programming paradigm | Interconnection type*
Alliant FX/2800 | 1990 | Intel i860 | 4-28 | Central | Shared memory | Bus + crossbar
Stanford DASH | 1992 | MIPS R3000 | 4-64 | Distributed | Shared memory | Bus + mesh
Cray T3D | 1993 | DEC 21064 | 128-2048 | Distributed | Shared memory | 3D torus
MIT Alewife | 1994 | Sparcle | 1-512 | Distributed | Message passing | Mesh
Convex C4/XA | 1994 | Custom | 1-4 | Global | Shared memory | Crossbar
Thinking Machines CM-500 | 1995 | SuperSPARC | 16-2048 | Distributed | Message passing | Fat tree
Tera Computers MTA | 1995 | Custom | 16-256 | Distributed | Shared memory | 3D torus
SGI Power Challenge XL | 1995 | MIPS R8000 | 2-18 | Global | Shared memory | Bus
Convex SPP1200/XA | 1995 | PA-RISC 7200 | 8-128 | Global | Shared memory | Crossbar + ring
Cray T3E-1350 | 2000 | DEC 21164 | 40-2176 | Distributed | Shared memory | 3D torus
* An in-depth discussion of various interconnection networks may be found in Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J. P. Singh with Anoop Gupta.
COMPUTER ARCHITECTURE 17
memory modules, adopts the shared-memory program-
ming paradigm.
Concluding Remarks
Computer architecture has evolved greatly over the past decades. It is now much more than the programmer's view of the processor. The process of computer design starts with the implementation technology. As the semiconductor technology changes, so too does the way it is used
in a system. At some point in time, cost may be largely
determined by transistor count; later as feature sizes
shrink, wire density and interconnection may dominate
cost. Similarly, the performance of a processor is dependent
on delay, but the delay that determines performance
changes as the technology changes. Memory access
time is only slightly reduced by improvements in feature
size because memory implementations stress size and the
access delay is largely determined by the wire length
across the memory array. As feature sizes shrink, the array
simply gets larger.
The computer architect must understand technology,
not only today's technology but also the projection of that
technology into the future. A design begun today may
not be broadly marketable for several years. It is the
technology that is actually used in manufacturing, not
today's technology, that determines the effectiveness of a
design.
The foresight of the designer in anticipating changes in
user applications is another determinant in design effec-
tiveness. The designer should not be blinded by simple test
programs or benchmarks that fail to project the dynamic
nature of the future marketplace.
The computer architect must bring together the tech-
nology and the application behavior into a system config-
uration that optimizes the available process concurrency,
which must be done in a context of constraints on cost,
power, reliability, and usability. Although the objective is formidable, a successful design is one that provides lasting value to the user community.
FURTHER READING
W. Stallings, Computer Organization and Architecture, 5th ed., Englewood Cliffs, NJ: Prentice-Hall, 2000.
K. Hwang, Advanced Computer Architecture, New York: McGraw
Hill, 1993.
J. Hennessy and D. Patterson, Computer Architecture: A Quanti-
tative Approach, San Francisco, CA: Morgan Kaufman Publishers,
1996.
A. J. Smith, Cache memories, Comput. Surv., 14(3): 473-530, 1982.
D. Culler and J. P. Singh with A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, San Francisco, CA: Morgan Kaufmann Publishers, 1999.
D. Sima, T. Fountain, and P. Kacsuk, Advanced Computer Archi-
tectures: A Design Space Approach, Essex, UK: Addison-Wesley,
1997.
K. W. Rudd, VLIW Processors: Efficiently Exploiting Instruction Level Parallelism, Ph.D. thesis, Stanford University, 1999.
M. J. Flynn, Computer Architecture: Pipelined and Parallel Pro-
cessor Design, Sudbury, MA: Jones and Bartlett Publishers, 1995.
P. M. Kogge, The Architecture of Pipelined Computers, New York:
McGraw-Hill, 1981.
S. Kunkel and J. Smith, Optimal pipelining in supercomputers, Proc. 13th Annual Symposium on Computer Architecture, 1986, pp. 404-411.
W. M. Johnson, Superscalar Microprocessor Design, Englewood
Cliffs, NJ: Prentice-Hall, 1991.
BIBLIOGRAPHY
1. G. M. Amdahl, G. H. Blaauw, and F. P. Brooks, Architecture of the IBM System/360, IBM J. Res. Develop., 8(2): 87-101, 1964.
2. Semiconductor Industry Association, The National Technology Roadmap for Semiconductors, San Jose, CA: Semiconductor Industry Association, 1997.
3. M. J. Flynn, P. Hung, and K. W. Rudd, Deep-submicron microprocessor design issues, IEEE Micro Mag., July-August: 11-22, 1999.
4. J. D. Ullman, Computational Aspects of VLSI, Rockville, MD: Computer Science Press, 1984.
5. W. Stallings, Reduced Instruction Set Computers, Tutorial, 2nd ed., New York: IEEE Comp. Soc. Press, 1989.
6. M. J. Flynn, Very high speed computing systems, Proc. IEEE, 54: 1901-1909, 1966.
7. MicroDesign Resources, Microprocessor Report, various issues, Sebastopol, CA, 1992-2001.
8. T. Burd, General Processor Information, CPU Info Center, University of California, Berkeley, 2001. Available: http://bwrc.eecs.berkeley.edu/CIC/summary/.
9. M. J. Flynn and K. W. Rudd, Parallel architectures, ACM Comput. Surv., 28(1): 67-70, 1996.
MICHAEL FLYNN
PATRICK HUNG
Stanford University
Stanford, California
COMMUNICATION PROCESSORS
FOR WIRELESS SYSTEMS
INTRODUCTION
In this article, we define the term communication processor as a device in a wired or wireless communication system that carries out operations on data in terms of either modifying the data, processing the data, or transporting the data to other parts of the system. A communication processor has certain optimizations built inside its hardware and/or software that enable it to perform its task in an efficient manner. Depending on the application, communication processors may also have additional constraints on area, real-time processing, and power, while providing software flexibility close to that of general-purpose microprocessors or microcontrollers. Although general-purpose microprocessors and microcontrollers are designed to support high processing requirements or low power, the need to process data in real time is an important distinction for communication processors.
The processing in a communication system is performed in multiple layers, according to the open systems interconnection (OSI) model. (For details on the OSI model, please see Ref. 1.) When the communication is via a network of intermediate systems, only the lower three layers of the OSI protocols are used in the intermediate systems. In this article, we will focus on these lower three layers of the OSI model, shown in Fig. 1.
The bottom-most layer is called the physical layer (or layer 1 in the OSI model). This layer serializes the data to be transferred into bits and sends it across a communication circuit to the destination. The form of communication can be wired using a cable or can be wireless using a radio device. In a wireless system, the physical layer is composed of two parts: the radio frequency (RF) layer and the baseband frequency layer. Both layers describe the frequency at which the communication circuits are working to process the transmitted wireless data. The RF layer processes signals at the analog level, whereas the baseband operations are mostly performed after the signal has been down-converted from the radio frequency to the baseband frequency and converted to a digital form for processing using an analog-to-digital converter. All signal processing needed to capture the transmitted signal, as well as error correction, is performed in this layer.
Above the physical layer is the data link layer, which is known more commonly as the medium access control (MAC) layer. The MAC layer is one of the two sublayers in the data link layer of the OSI model. The MAC layer manages and maintains communication between multiple communication devices by coordinating access to a shared medium and by using protocols that enhance communication over that medium.
The third layer in the OSI model is the network layer. The network layer knows the addresses of the neighboring nodes in the network, packages output with the correct network address information, selects routes and quality of service (QoS), and recognizes and forwards to the transport layer incoming messages for the local host domain.
Communication processors primarily have optimizations for the lower three layers of the OSI model. Depending on which layer has the most optimizations, communication processors are classified further into physical layer (or baseband) processors, medium access control processors, or network processors.
The desire to support higher data rates in wireless
communication systems implies meeting cost, area, power,
and real-time processing requirements in communication
processors. These constraints have the greatest impact on
the physical layer design of the communication processor.
Hence, although we mention the processing requirements
of multiple layers, we focus this article on challenges in
designing the physical layer of communication processors.
Evolution of Wireless Communication Systems
Over the past several years, communication systems have evolved from low data-rate systems for voice and data (with data rates of several Kbps, such as dial-up modems, cellular systems, and 802.11b local area networks) to high data-rate systems that support multimedia and video applications with data rates of several Mbps and going toward Gbps, such as DSL, cable modems, 802.11n local area networks (LANs), and ultra-wideband personal area networks (PANs) (2). The first generation systems (1G) came in the 1980s, mostly for cellular analog voice using AMPS (advanced mobile phone service). This standard evolved into the second generation standard (2G) in the 1990s to support digital voice and low bit-rate data services. An example of such a cellular system is IS-54 (2). At the same time, wireless local area networks began service starting at 1 Mbps for 802.11b standards and extending to 11 Mbps close to the year 2000. In the current generation of the standards (3G), cellular services have progressed to higher data rates in terms of hundreds of Kbps to support voice, data, and multimedia, and wireless LANs have evolved to 802.11a and 802.11g to support data rates around 100 Mbps. In the future, for the fourth generation systems (4G), the data rates are expected to continue to increase and will provide IP-based services along with QoS (3). Table 1 presents the evolution of wireless communication systems as they have evolved from 1G to 4G systems. A range of data rates is shown in the table to account for both cellular and W-LAN data rates in communication systems.
CHALLENGES FOR COMMUNICATION PROCESSORS
This evolution of communication systems has involved
radical changes in processor designs for these systems for
multiple reasons. First, the increase in data rates has come
at the cost of increased complexity in the system design.
Second, the performance of communication systems has been increasing consistently as communication system designers develop sophisticated signal processing algorithms that enhance the performance of the system at the expense of increased computational complexity. Flexibility is also an important emerging characteristic in communication processors because of the need to support multiple protocols and environments. Also, as newer applications become more complex, they need to be backward-compatible with existing systems. As the number of standards and protocols increases, the demand increases for new standards to be spectrum-efficient, to avoid interference to other systems, and also to mitigate interference from other systems. The flexibility needed in the baseband and radio, and the regulatory requirements on spectrum and transmit power, also add challenges in testing the design of these processors. The interaction and integration between different layers of the communication system also present interesting challenges. The physical layer is signal processing-based, involving complex mathematical computations, whereas the MAC layer is data processing-based, involving data movement, scheduling, and control of the physical layer. Finally, the range of consumer applications for communication systems has increased from small low-cost devices, such as RFID tags, to cellular phones, PDAs, laptops, personal computers, and high-end network servers. Processors for different applications have different optimization constraints, such as the workload characteristics, cost, power, area, and data rate, and require significant trade-off analysis. The above changes put additional constraints on the processor design for communication systems.
Increasing Data Rates
Figure 2 shows the increase in data rates provided by communication systems over time. The figure shows that over the past decade, communication systems have seen a 1000x increase in data rate requirements. Systems such as wireless LANs and PANs have evolved from 1-Mbps systems such as 802.11b and Bluetooth to 100-Mbps 802.11a LANs to now Gbps systems being proposed for ultra-wideband personal area networks. The same has been true even for wired communication systems, going from 10-Mbps ethernet cards to now Gbps ethernet systems. The increase in processor clock frequencies across generations cannot keep up with the increase in raw data rate requirements. During the same period, processor clock frequencies have only gone up by one order of magnitude. Also, applications (such as multimedia) are demanding more compute resources and more memory than previous processors provided. This demand implies that silicon process technology advances are insufficient to meet the increase in raw data rate requirements, and additional architecture innovations, such as exploiting parallelism, pipelining, and algorithm complexity reduction, are needed to meet the data rate requirements. We discuss this in more detail in the section on Area, Time, and Power Tradeoffs.
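As a rough back-of-the-envelope check of this gap, the sketch below (using the 1000x data-rate and 10x clock-frequency factors from the trend just stated, not measured values) estimates how much extra concurrency the architecture itself must supply; it is only an illustration of the argument.

```python
# Rough sketch: if data rates grow ~1000x over a decade while clock frequencies
# grow only ~10x, the remaining factor must come from architectural concurrency
# (parallel units, pipelining, reduced algorithm cost). Numbers are illustrative.

data_rate_growth = 1000.0   # e.g., ~1-Mbps WLAN class to ~1-Gbps class systems
clock_growth = 10.0         # assumed processor clock improvement over the same period

# Extra operations that must complete per clock cycle to close the gap.
required_concurrency_growth = data_rate_growth / clock_growth
print(f"Architectural concurrency must grow by roughly {required_concurrency_growth:.0f}x")
```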
[Figure 1 depicts the seven OSI layers (L1 physical, L2 data link, L3 network, L4 transport, L5 session, L6 presentation, L7 application) beneath the application programs, with the communication processor spanning the lower layers and an interface to other communication devices.]
Figure 1. Layers in the OSI model. The communication processors defined in this article are processors that have specific optimizations for the lower three layers.
Table 1. Evolution of communication systems
Generation | Year | Function | Data rates
1G | 1980-1990 | Analog voice | Kbps
2G | 1990-2000 | Voice + low-rate data | 10 Kbps-10 Mbps
3G | 2000-2010 | Voice + data + multimedia | 100 Kbps-100 Mbps
4G | 2010-2020 | Voice + data + multimedia + QoS + IP | 10 Mbps-Gbps
[Figure 2 plots data rates (in Mbps) and clock frequency (in MHz) on a logarithmic scale against the years 1996-2006, with curves for processor clock frequency, WLAN data rate, and cellular data rate.]
Figure 2. Increase in data rates for communication systems. The data rates in communication systems are increasing at a much greater rate than typical processor clock frequencies, necessitating new processor designs for communication systems.
Increasing Algorithm Complexity
Although the data rate requirements of communication processors are increasing, the processor design difficulty is exacerbated by the introduction of more sophisticated algorithms that give significant performance improvements for communication systems. Figure 3 shows the increase in computational complexity as standards have progressed from the first generation to the second and third generations (4). The figure shows that even if the data rates are assumed constant, the increase in algorithmic complexity cannot be met solely with advances in silicon process technology.
As an example, we consider decoding of error-control codes at the receiver of a communication processor. Figure 4 shows the benefits of coding in a communication system by reducing the bit error rate at a given signal-to-noise ratio. We can see that advanced coding schemes, such as low-density parity-check codes (5) and turbo codes (6), which use iterative decoders, can give 4-dB benefits over conventional convolutional decoders. A 4-dB gain translates into roughly a 60% improvement in the communication range of a wireless system. Such advanced coding schemes are proposed and implemented in standards such as HSDPA, VDSL, gigabit ethernet, digital video broadcast, and Wi-Fi. However, this improvement comes at a significant increase in computational complexity. Figure 5 shows the increased complexity of some advanced coding schemes (7). It can be observed that the iterative decoders have a 3-5 orders of magnitude increase in computational complexity over convolutional decoders. Thus, in order to implement these algorithms, reduced-complexity versions of them should be investigated for communication processors, versions that allow simpler hardware designs with significant parallelism without significant loss in performance. An example of such a design is presented in Ref. 8.
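The 60% figure quoted above can be reproduced with a short calculation. The sketch below converts a link-budget gain in dB into a range factor under the simplifying assumption of free-space propagation (path-loss exponent 2); for more lossy environments the improvement is smaller.

```python
import math

def range_gain(gain_db: float, path_loss_exponent: float = 2.0) -> float:
    """Factor by which communication range grows for a given link-budget gain.

    Assumes received power falls off as distance**(-path_loss_exponent);
    free-space propagation (exponent 2) is an illustrative assumption.
    """
    return 10 ** (gain_db / (10.0 * path_loss_exponent))

factor = range_gain(4.0)                                  # 4-dB coding gain
print(f"Range improvement: {100 * (factor - 1):.0f}%")    # roughly 58-59%
```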
Flexibility
As communication systems evolve over time, a greater need exists for communication processors to be increasingly flexible. Communication systems are designed to support several parameters, such as variable coding rates, variable modulation modes, and variable frequency bands. This flexibility allows the communication system to adapt itself better to the environment to maximize data rates over the channel and/or to minimize power. For example, Fig. 6 shows base-station computational requirements
[Figure 3 plots complexity on a logarithmic scale against the years 1980-2020, comparing processor performance with algorithm complexity across the 1G, 2G, 3G, and 4G generations.]
Figure 3. Algorithm complexity increasing faster than silicon process technology advances. (Reprinted with permission from Ref. 4.)
[Figure 4 plots BER versus SNR (dB) for an uncoded system, a convolutional code with ML decoding, and an iteratively decoded code, together with the capacity bound; the iterative code is about 4 dB better than the convolutional code.]
Figure 4. Decoder performance with advanced coding schemes. (Reprinted with permission from Ref. 7.)
[Figure 5 plots relative decoder complexity (logarithmic scale) versus SNR (dB) for convolutional codes of rates 8/9, 2/3, and 1/2, LDPC codes of rates 8/9 and 1/2, and turbo codes of rates 8/9, 2/3, and 1/2, over various block lengths and iteration counts.]
Figure 5. Decoder complexity for various types of coding schemes. (Reprinted with permission from Ref. 7.)
and the flexibility needed to support several users at variable constraint lengths (9). The figure also shows an example of a 2G base-station at 16 Kbps/user supporting only voice and a 3G base-station at 128 Kbps/user supporting voice, data, and multimedia. A 3G base-station processor now must be backward-compatible with a 2G base-station processor and, hence, must support both standards as well as adapt its compute resources to save power when the processing requirements are lower. The amount of flexibility provided in communication processors can make the design for test of these systems extremely challenging because of the large number of parameters, algorithms, and radio interfaces that must be tested.
Along with the support for variable standards and protocols, researchers are also investigating the design of a single communication processor that can switch seamlessly between different standards, depending on the availability and cost of each standard. The Rice Everywhere Network project (10) demonstrates the design of a multitier network interface card with a communication processor that supports outdoor cellular (CDMA) and indoor wireless (LAN) and changes over the network seamlessly when the user moves from an office environment with wireless LAN into an outdoor environment using cellular services. Figure 7 shows the design of the wireless multitier network interface card concept at Rice University. Thus, flexibility to support various standards is becoming an increasingly desired feature in communication processors.
Spectrum Issues
The wireless spectrum is a scarce resource and is regulated by multiple agencies worldwide. As new standards evolve, they have to coexist with spectrum that is already allocated to existing standards. The regulatory bodies, such as the Federal Communications Commission (see www.fcc.gov), demand that new standards meet certain limitations on transmit power and interference avoidance to make sure that existing services are not degraded by the new standard. Also, because of a plethora of wireless standards in the 1-5 GHz wireless spectrum, new standards are forced to look at much higher RF frequencies, which makes the design of radios more challenging and increases the needed transmit power because of larger attenuation at higher frequencies. Newer standards also need to have interference detection and mitigation techniques to coexist with existing standards. This involves challenges at the radio level, such as transmitting at different frequencies to avoid interference, and motivates software-defined radios (11). Spectrum regulations vary across countries worldwide, and devices need to have the flexibility to support different programming to meet regulatory specifications.
Area, Time, and Power Tradeoffs
The design of communication processors is complicated further by the nature of the optimizations needed for the application and for the market segment. A mobile market segment may place greater emphasis on cost (area) and power, whereas a high data-rate market segment may place a greater focus on performance. Thus, even after new algorithms are designed and computationally efficient versions of the algorithms have been developed, tradeoffs between area-time and power consumption occur for the implementation of the algorithm on the communication processor. Also, other parameters exist that need to be traded off, such as the silicon process technology (0.18- vs. 0.13- vs. 0.09-µm CMOS process) and the voltage and clock frequencies. For example, the area-time tradeoffs for Viterbi decoding are shown in Fig. 8 (12). The curve shows that the area needed for the Viterbi decoder can be reduced at the cost of increasing the execution time of the Viterbi decoder.
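The AT = constant behavior in Fig. 8 can be illustrated with a small sketch: fixing the area-time product of a decoder family, halving the area roughly doubles the achievable symbol time. The constant used below is a made-up value for illustration, not a number from Ref. 12.

```python
# Illustration of an area-time (AT = constant) tradeoff curve, as in Fig. 8.
# The constant below is a hypothetical value, not a figure from Ref. 12.

AT_CONSTANT = 20.0  # assumed area (mm^2) x symbol time (ns) for one design family

def symbol_time_ns(area_mm2: float) -> float:
    """Symbol time achievable for a given decoder area, assuming AT = constant."""
    return AT_CONSTANT / area_mm2

for area in (2.0, 4.0, 8.0):
    print(f"area = {area:4.1f} mm^2 -> symbol time = {symbol_time_ns(area):5.2f} ns")
```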
In programmable processors, the number of functional
units and the clock frequency can be adjusted to meet
[Figure 6 plots operation count (in GOPs) for (number of users, constraint length) pairs from (32,9) down to (4,7), comparing a 2G base-station at 16 Kbps/user with a 3G base-station at 128 Kbps/user.]
Figure 6. Flexibility needed to support various users, rates (for example), and backward-compatibility to standards. (Reprinted with permission from Ref. 9.)
[Figure 7 shows a multitier network interface card (mNIC) containing an RF interface and a baseband communications processor, connecting indoor W-LAN and outdoor WCDMA links to the mobile host, which runs the MAC and network layers.]
Figure 7. Multi-tier network interface card concept. (Reprinted with permission from Ref. 10.)
real-time requirements for an application. An example of this is shown in Fig. 9 (13). The figure shows that as the number of adders and multipliers in a programmable processor is increased, the clock frequency needed to meet real time for an application decreases
until a certain point, at which no more operations can
be scheduled on the additional adders and multipliers in
the processor. The numbers on the graph indicate the
functional unit use of the adders and multipliers in the
processor.
Interaction Between Multiple Layers
The interaction between the different layers in a communication system also presents challenges to the processor design. As will be shown in the following sections, the characteristics of the physical layer in a communication system are completely different from the characteristics of the MAC or network layer. The physical layer of a communication system consists of signal processing algorithms that work on estimation of the channel, detection of the received bits, and decoding of the data, and it requires significant computational resources. The MAC and network layers are more dataflow-oriented and have more control and data-grouping operations. The combination of these two diverse requirements makes the task of designing a single integrated communication processor (which does both the PHY as well as the MAC) difficult. Hence, these layers are implemented typically as separate processors, although they may be present on the same chip.
For example, it is very common in communication processor design to have a coprocessor-based approach for the physical layer that performs sophisticated mathematical operations in real time, while having a microcontroller that handles the control and data management.
PHYSICAL LAYER OR BASEBAND PROCESSORS
The physical layer of wireless communication systems presents more challenges to the communication processor design than that of wired communication systems. The nature of the wireless channel implies the need for sophisticated algorithms at the receiver to receive and decode the data. Challenges exist in both the analog/RF radio and the digital baseband of the physical layer in emerging communication processors. The analog and RF radio design challenge is dominated by the need to support multiple communication protocols with varying requirements on the components in the transmitter and receiver chain of the radio. This need has spawned a stream of research called software-defined radios (11). We focus on the challenges in meeting the computational, real-time processing requirements and the flexibility requirements of the physical layer in the communication processor.
Characteristics of Baseband Communication Algorithms
Algorithms for communication systems in the physical layer process signals for transmission and reception of analog signals over the wireless (or even the wired) link. Hence, most algorithms implemented on communication processors are signal-processing algorithms and show certain characteristics that can be exploited in the design of communication processors.
1. Communication processors have stringent real-time requirements that imply the need to process data at a certain throughput rate while also meeting certain latency requirements.
2. Signal processing algorithms are typically compute-bound, which implies that the bottleneck in the processing is the computation (as opposed to memory) and that the architectures require a significant number of adders and multipliers.
[Figure 8 plots symbol time (ns) against decoder area (mm^2) for several Viterbi decoder implementations, which lie along an AT = constant curve.]
Figure 8. Normalized area-time efficiency for Viterbi decoding. (Reprinted with permission from Ref. 12.)
[Figure 9 plots the real-time clock frequency (in MHz) needed for a workload against the number of adders and multipliers in the processor; each design point is annotated with the functional unit utilization of the adders and multipliers.]
Figure 9. Number of adders and multipliers to meet real-time requirements in a programmable processor. (Reprinted with permission from Ref. 13.)
3. Communication processors require very low fixed-point precision in computations. At the transmitter, the inputs typically arrive as bits. At the receiver, the ADCs reduce the dynamic range of the input signal by quantizing the signal. Quantization in communication processors is acceptable because the quantization errors are typically small compared with the noise added through the wireless channel (a small quantization sketch follows this list). This finding is very useful for designing low-power, high-speed arithmetic and for keeping the memory requirements small in communication processors.
4. Communication algorithms exhibit significant amounts of data parallelism and show regular patterns in computation that can be exploited for hardware design.
5. Communication algorithms have a streaming dataflow in a producer-consumer fashion between blocks, with very little data reuse. This dataflow can be exploited to avoid storage of intermediate values and to eliminate hardware, such as caches that try to exploit temporal reuse, from processors.
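The fixed-point observation in item 3 can be illustrated with a minimal sketch: a uniform quantizer (standing in for the receiver ADC) introduces an error that is small next to typical channel noise. The bit width, signal level, and noise standard deviation below are illustrative assumptions.

```python
import random

def quantize(sample: float, bits: int = 6, full_scale: float = 1.0) -> float:
    """Uniform quantizer, standing in for the ADC in a baseband receiver."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    clipped = max(-full_scale, min(full_scale - step, sample))
    return round(clipped / step) * step

random.seed(0)
tx_symbol = 1.0                          # transmitted BPSK symbol (assumed)
noise = random.gauss(0.0, 0.3)           # assumed channel noise, sigma = 0.3
rx = tx_symbol + noise
print(f"quantization error: {abs(quantize(rx) - rx):.4f}  vs  channel noise: {abs(noise):.4f}")
```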
Figure 10 shows a typical transmitter in a communication processor. The transmitter in the physical layer of a communication system is typically much simpler than the receiver. The transmitter operations typically consist of taking the data from the MAC layer and then scrambling it to make it look sufficiently random, encoding it for error protection, modulating it on certain frequencies, and then precompensating it for any RF impairments or distortions.
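A minimal software model of the transmitter chain just described is sketched below. The scrambler polynomial, the toy rate-1/3 repetition code, and the BPSK mapping are simplified stand-ins for whatever a particular standard actually specifies.

```python
def scramble(bits, seed=0b1011011):
    """Additive LFSR scrambler (polynomial x^7 + x^4 + 1) to randomize the MAC data."""
    state, out = seed, []
    for b in bits:
        fb = ((state >> 6) ^ (state >> 3)) & 1   # feedback from taps 7 and 4
        state = ((state << 1) | fb) & 0x7F
        out.append(b ^ fb)
    return out

def encode(bits):
    """Toy rate-1/3 repetition code standing in for a real error-protection code."""
    return [b for b in bits for _ in range(3)]

def modulate(bits):
    """BPSK mapping: bit 0 -> +1.0, bit 1 -> -1.0."""
    return [1.0 - 2.0 * b for b in bits]

mac_payload = [1, 0, 1, 1, 0, 0, 1, 0]
samples = modulate(encode(scramble(mac_payload)))
print(samples[:6])
```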
Figure 11 shows a typical receiver in a communications processor. The receiver estimates the channel to compensate for it, and then it demodulates the transmitted data and decodes the data to correct any errors introduced during transmission. Although not shown in the figure, many other impairments in the channel and the radio, such as fading, interference, I/Q imbalance, frequency offsets, and phase offsets, are also corrected at the receiver. The algorithms used at the receiver involve sophisticated signal
[Figure 10 shows the transmit path of a base-station baseband processor: the network interface (E1/T1 or packet network) and BSC/RNC interface feed packet/circuit switch control, symbol encoding, chip-level modulation and spreading, filtering and predistortion, digital up-conversion (DUC), DAC, and the MCPA and RF transmitter, with power measurement, gain control (AGC), and a power supply and control unit alongside.]
Figure 10. Typical operations at a transmitter of a baseband processor. (Reprinted with permission from Texas Instruments.)
[Figure 11 shows the receive path of a base-station baseband processor: the RF receiver and LNA feed an ADC, digital down-conversion (DDC), frequency offset compensation, channel estimation, chip-level demodulation and despreading, symbol detection, and symbol decoding, followed by packet/circuit switch control and the BSC/RNC network interface (E1/T1 or packet network), with power measurement and control alongside.]
Figure 11. Typical operations at the receiver of a baseband processor. (Reprinted with permission from Texas Instruments.)
processing and, in general, have increased in complexity
over time while providing more reliable and stable com-
munication systems.
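As a schematic counterpart to Fig. 11, the sketch below walks through the receive steps named above for a single-tap channel: estimate the channel gain from known pilot symbols, equalize, detect the symbols, and decode the toy repetition code used in the transmitter sketch. All values are illustrative.

```python
def estimate_channel(pilots_tx, pilots_rx):
    """Least-squares estimate of a single real channel gain from known pilots."""
    num = sum(r * t for r, t in zip(pilots_rx, pilots_tx))
    den = sum(t * t for t in pilots_tx)
    return num / den

def detect(samples, gain):
    """Equalize by the estimated gain, then hard-detect BPSK symbols to bits."""
    return [0 if s / gain >= 0 else 1 for s in samples]

def decode_repetition(bits, rate=3):
    """Majority vote over each group of `rate` repeated bits."""
    return [int(sum(bits[i:i + rate]) > rate // 2) for i in range(0, len(bits), rate)]

pilots_tx = [1.0, -1.0, 1.0, 1.0]
gain = 0.8                                     # assumed channel gain, unknown to the receiver
pilots_rx = [gain * p for p in pilots_tx]
rx_samples = [gain * s for s in [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]]

h = estimate_channel(pilots_tx, pilots_rx)
print(decode_repetition(detect(rx_samples, h)))   # -> [0, 1]
```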
Architecture Designs
A wide range of architectures can be used to design a communication processor. Figure 12 shows the desired characteristics of communication processors and shows how different architectures meet those characteristics in terms of performance and flexibility. The efficiency metric on the x-axis is characterized as MOPS/mW (millions of operations performed per mW of power). The architectures shown in the figure trade flexibility for performance/power and are suited for different applications. A custom ASIC has the best efficiency in terms of data rate per unit power consumption, but at the same time it has the least amount of flexibility (14). On the other hand, a fully programmable processor is extremely flexible but is not area/power/throughput efficient. We discuss the tradeoffs among the different types of architectures when they are used as communication processors.
Custom ASICs. Custom ASICs are the solution for communication processors that provides the highest efficiency and the lowest cost in terms of chip area and price. This, however, comes at the expense of a fairly large design and test time and a lack of flexibility and scalability with changes in standards and protocols. Another issue with custom ASICs is that fabrication of these ASICs is extremely expensive (millions of dollars), which implies that extreme care needs to be taken in the functional design to ensure first-pass success. Also, the volume of shipment for these custom chips must be high to amortize the development cost. A partial amount of flexibility can be provided as register settings for setting transmission or reception parameters or for tuning the chip, which can then be controlled by the MAC or higher layers in firmware (software). For example, the data rate to be used for transmission can be programmed into a register in the custom ASIC from the MAC, and that register can be used to set the appropriate controls in the processor.
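The register-based flexibility described above might look like the following sketch, in which MAC firmware writes a rate code into a control register of the PHY ASIC. The register offset, field layout, and rate codes are invented for illustration and do not correspond to any particular device.

```python
# Hypothetical control-register layout for a PHY ASIC: the MAC firmware picks a
# transmit rate and writes the corresponding code into a memory-mapped register.
# Register offset, field positions, and rate codes are all invented for illustration.

TX_CTRL_REG = 0x10          # assumed memory-mapped register offset
RATE_CODES = {"6Mbps": 0x0, "12Mbps": 0x1, "24Mbps": 0x2, "54Mbps": 0x3}

registers = {}              # stand-in for the memory-mapped register file

def set_tx_rate(rate: str) -> None:
    """Program the rate field (bits [1:0] of TX_CTRL) without touching other bits."""
    value = registers.get(TX_CTRL_REG, 0)
    registers[TX_CTRL_REG] = (value & ~0x3) | RATE_CODES[rate]

set_tx_rate("24Mbps")
print(hex(registers[TX_CTRL_REG]))
```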
Reconfigurable Processors. Reconfigurable processors are a relatively new addition to the area of communication processors. Typically, reconfigurable processors consist of a CISC-type instruction set processor with a reconfigurable fabric attached to the processor core. This reconfigurable fabric is used to run complex signal processing algorithms that have sufficient parallelism and need a large number of adders and multipliers. The benefit of the reconfigurable fabric compared with FPGAs is that the reconfiguration can be done dynamically during run time. Figure 13 shows an example of the Chameleon reconfigurable communication processor (15).
The reconfigurable fabric and the instruction set computing seek to provide the flexibility needed for a communication processor while providing dedicated logic in the reconfiguration fabric for efficient computing that can be reprogrammed dynamically. One of the major disadvantages of reconfigurable processors is that the software tools and compilers have not progressed to a state where performance/power benefits are easily visible along with ease of programming the processor. The Chameleon reconfigurable processor is no longer an active product. However, several researchers in academia, such as GARP at Berkeley (16), RAW at MIT (17), and Stallion at Virginia Tech (18), and in industry, such as PACT (19), are still pursuing this promising architecture for communication processors.
Application-Specific Instruction Processors. Application-specific instruction processors (ASIPs) are processors with an instruction set for programmability and with customized hardware tailored for a given application (20). The programmability of these processors, combined with customization for a particular application to meet data rate and power requirements, makes ASIPs a viable candidate for communication processors.
A DSP is an example of such an application-specific instruction processor, with specific optimizations to support signal processing operations. Because standards are typically
[Figure 12 plots flexibility against efficiency (MOPS/mW) for communication processors: general-purpose processors are the most flexible but least efficient, followed by application-specific instruction processors, reconfigurable processors, and finally custom ASICs, which are the most efficient but least flexible.]
Figure 12. Desired characteristics of communication processors.
[Figure 13 shows the Chameleon reconfigurable communication processor: an embedded processor system (ARC processor, memory controller, DMA, PCI controller, and configuration subsystem) coupled over a 128-bit RoadRunner bus to a 32-bit reconfigurable processing fabric and 160-pin programmable I/O, with data streams entering and leaving the fabric.]
Figure 13. Reconfigurable communication processors. (Reprinted with permission from Ref. 15.)
driven by what is feasible in an ASIC implementation with respect to cost, performance, and power, it is difficult for a programmable architecture to compete with a fully custom ASIC design for wireless communications. DSPs fail to meet real-time requirements for implementing sophisticated algorithms because of the lack of sufficient functional units. However, it is not simple to increase the number of adders and multipliers in a DSP. Traditional single-processor DSP architectures, such as the C64x DSP by Texas Instruments (Dallas, TX) (21), employ VLIW architectures and exploit instruction-level parallelism (ILP) and subword parallelism. Such single-processor DSPs can have only a limited number of arithmetic units (fewer than 10) and cannot directly extend their architectures to 100s of arithmetic units. This limitation arises because, as the number of arithmetic units increases in an architecture, the size of the register files increases and the port interconnections start to dominate the chip area (21,22). This growth is shown as a cartoon in Fig. 14 (23). Although the use of distributed register files may alleviate the register file explosion at the cost of an increased penalty in register allocation (21), an associated cost exists in exploiting ILP because of the limited size of register files, dependencies in the computations, and the register and functional unit allocation and use efficiency of the compiler. It has been shown that even with extremely good techniques, it is very difficult to exploit ILP beyond 5 (24). The large number of arithmetic and logic units (ALUs) also makes the task of compiling and scheduling algorithms on the ALUs, and keeping all the ALUs busy, difficult.
Another popular approach to designing communication processors is to use a DSP with coprocessors (25-27). The coprocessors are needed to perform the more sophisticated operations that cannot be done in real time on the DSP because of the lack of sufficient adders and multipliers. Coprocessor support in a DSP can be either tightly coupled or loosely coupled. In the tightly coupled coprocessor (TCC) approach, the coprocessor interfaces directly to the DSP core and has access to specific registers in the DSP core. The TCC approach is used for algorithms that work with small datasets and require only a few instruction cycles to complete. The DSP processor freezes when the coprocessor is used because the DSP will have to interrupt the coprocessor immediately in the next few cycles. In time, the TCC is integrated into the DSP core with a specific instruction or is replaced with code in a faster or lower-power DSP. An example of such a TCC approach would be the implementation of a Galois field bit manipulation that may not be part of the DSP instruction set (27). The loosely coupled coprocessor (LCC) approach is used for algorithms that work with large datasets and require a significant number of cycles to complete without interruption from the DSP. The LCC approach allows the DSP and coprocessor to execute in parallel. The coprocessors are loaded with the parameters and data and are initiated through application-specific instructions. The coprocessors sit on an external bus and do not interface directly to the DSP core, which allows the DSP core to execute in parallel. Figure 15 shows an example of the TMS320C6416 processor from Texas Instruments, which has Viterbi and turbo coprocessors for decoding (28) using the LCC approach. The DSP provides the flexibility needed for applications, and the coprocessors provide the compute resources for the more sophisticated computations that cannot be met on the DSP.
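The loosely coupled coprocessor flow can be mimicked in software as an asynchronous offload: the DSP queues data and parameters to the coprocessor, continues executing, and collects the result later. The sketch below uses a Python thread as a stand-in for the decoder coprocessor; it only illustrates the control flow, not the actual TI hardware interface.

```python
import threading, queue

work_q, done_q = queue.Queue(), queue.Queue()

def coprocessor():
    """Stand-in for a loosely coupled decoder coprocessor draining a command queue."""
    while True:
        job = work_q.get()
        if job is None:
            break
        params, data = job
        done_q.put([b ^ params["mask"] for b in data])   # placeholder "decode" step

worker = threading.Thread(target=coprocessor, daemon=True)
worker.start()

work_q.put(({"mask": 1}, [0, 1, 1, 0]))   # the "DSP" launches the coprocessor job ...
host_side_work = sum(range(1000))          # ... and keeps executing in parallel
print(done_q.get())                        # later, it collects the decoded block
work_q.put(None)                           # shut the coprocessor thread down
```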
Programmable Processors. To be precise with definitions, in this subsection we consider programmable processors to be processors that do not have an application-specific optimization or instruction set. For example, DSPs without coprocessors are considered programmable processors in this subsection.
Stream processors are programmable processors that have optimizations for media and signal processing. They are able to provide hundreds of ALUs in a processor by
[Figure 14 contrasts a single ALU fed by one register file with a processor containing many ALUs, whose register file and port count grow rapidly with the number of functional units.]
Figure 14. Register file expansion with increasing number of functional units in a processor. (Reprinted with permission from Ref. 23.)
[Figure 15 shows the TMS320C6416: a C64x DSP core with L1 program, L1 data, and L2 caches, an EDMA controller, turbo (TCP) and Viterbi (VCP) coprocessors for decoding, I/O interfaces (PCI, HPI, GPIO), and memory interfaces (EMIF, McBSP).]
Figure 15. DSP with coprocessors for decoding. (Reprinted with permission from Ref. 28.)
arranging the ALUs into groups of clusters and by exploiting data parallelism across clusters. Stream processors are able to support giga-operations per second of computation in the processor. Figure 16 shows the distinction between DSPs and stream processors. Although typical DSPs exploit ILP and subword parallelism (SubP), stream processors also exploit data parallelism across clusters to provide the needed computational horsepower.
Streams are stored in a stream register file, which can transfer data efficiently to and from a set of local register files between major computations. Local register files, colocated with the arithmetic units inside the clusters, feed those units directly with their operands. Truly global data, data that is persistent throughout the application, is stored off-chip only when necessary. These three explicit levels of storage form an efficient communication structure to keep hundreds of arithmetic units efficiently fed with data. The Imagine stream processor developed at Stanford is the first implementation of such a stream processor (29). Figure 17 shows the architecture of a stream processor with C + 1 arithmetic clusters. Operations in a stream processor all consume and/or produce streams that are stored in the centrally located stream register file (SRF). The two major stream instructions are memory transfers and kernel operations. A stream memory transfer either loads an entire stream into the SRF from external memory or stores an entire stream from the SRF to external memory. Multiple stream memory transfers can occur simultaneously, as hardware resources allow. A kernel operation performs a computation on a set of input streams to produce a set of output streams. Kernel operations are performed within a data-parallel array of arithmetic clusters. Each cluster performs the same sequence of operations on independent stream elements. The stream buffers (SBs) allow the single port into the SRF array (limited for area/power/delay reasons) to be time-multiplexed among all the interfaces to the SRF, making it seem that many logical ports into the array exist. The SBs also act as prefetch buffers and prefetch the data for kernel operations. Both the SRF and the stream buffers are banked to match the number of clusters. Hence, kernels that need to access data in other SRF banks must use the inter-cluster communication network for communicating data between the clusters.
The similarity between stream computations and communication processing in the physical layer makes stream-based processors an attractive architecture candidate for communication processors (9).
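A behavioral sketch of the stream programming model described above: a stream is staged into per-cluster banks (standing in for the SRF), a kernel applies the same operation to the independent elements in every bank, and the result stream is written back. This is only an analogy for the execution model, not the Imagine hardware.

```python
NUM_CLUSTERS = 4

def load_stream_to_srf(memory_stream):
    """Stage a stream from 'external memory' into banks, one bank per cluster (SRF analogue)."""
    return [memory_stream[c::NUM_CLUSTERS] for c in range(NUM_CLUSTERS)]

def run_kernel(srf_banks, kernel):
    """Each cluster applies the same kernel to the independent elements in its bank."""
    return [[kernel(x) for x in bank] for bank in srf_banks]

def store_stream_to_memory(srf_banks):
    """Write the result stream back to 'external memory', re-interleaving the banks."""
    out = [None] * sum(len(b) for b in srf_banks)
    for c, bank in enumerate(srf_banks):
        out[c::NUM_CLUSTERS] = bank
    return out

samples = list(range(8))
banks = load_stream_to_srf(samples)
result = run_kernel(banks, kernel=lambda x: 2 * x + 1)   # same operation on every element
print(store_stream_to_memory(result))
```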
[Figure 16 contrasts (a) a traditional embedded processor (DSP), whose ALUs exploit ILP and subword parallelism (SubP) over a single internal memory, with (b) a data-parallel embedded stream processor, which adds cluster-level data parallelism (CDP) with internal memory banked to the number of clusters.]
Figure 16. DSP and stream processors. (Reprinted with permission from Ref. 13.)
[Figure 17 shows a stream processor: external memory (DRAM) connected to a stream register file (SRF) whose banks (SRF0 to SRFC) feed, through stream buffers (SBs), the clusters of arithmetic units (cluster 0 to cluster C), which communicate through an inter-cluster communication network under the control of a micro-controller.]
Figure 17. Stream processor architecture. (Reprinted with permission from Ref. 25.)
MAC AND NETWORK PROCESSORS
Although the focus of this article is on the physical layer of the communication processor, the MAC and network layers have a strong interaction with the physical layer, especially in wireless networks. In this section, we briefly discuss the challenges and the functionality needed in processors for the MAC and network layers (30).
MACs for wireless networks involve greater challenges than MACs for wired networks. The wireless channel necessitates retransmissions when the received data is not decoded correctly in the physical layer. Wireless MACs also need to send out beacons to notify the access point that an active device is present on the network. Typical functions of a wireless MAC include:
1. Transmission of beacons at regular intervals to indicate the presence of the device on the network.
2. Buffering frames of data that are received from the physical layer and sending requests for retransmission of lost frames.
3. Monitoring radio channels for signals, noise, and interference.
4. Monitoring the presence of other devices on the network.
5. Encryption of data using AES/DES to provide security over the wireless channel.
6. Rate control of the physical layer to decide what data rates should be used for transmission of the data.
From the above, it can be seen that the MAC layer typically involves significant data management and processing. Typically, MACs are implemented as a combination of a RISC core, which provides the control for different parts of the processor, and dedicated logic for parts such as encryption for security and host interfaces.
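The buffering and retransmission behavior in item 2 of the list above can be summarized as a stop-and-wait style loop: the MAC keeps a frame buffered until an acknowledgment arrives or a retry limit is exceeded. The retry limit and the loss probability below are illustrative assumptions.

```python
import random

MAX_RETRIES = 4
random.seed(1)

def send_over_radio(frame) -> bool:
    """Stand-in for the PHY: returns True if an ACK came back (70% assumed success rate)."""
    return random.random() < 0.7

def mac_transmit(frame) -> bool:
    """Stop-and-wait style MAC transmit: keep the frame buffered and retransmit until ACKed."""
    for attempt in range(1, MAX_RETRIES + 1):
        if send_over_radio(frame):
            print(f"frame delivered on attempt {attempt}")
            return True
        print(f"attempt {attempt} lost, retransmitting")
    return False   # give up; higher layers may recover

mac_transmit({"seq": 7, "payload": b"hello"})
```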
Some functions of the network layer can be implemented in the MAC layer and vice versa, depending on the actual protocol and application used. Typical functions at the network layer include:
1. Pattern matching and lookup. This involves matching the IP address and TCP port.
2. Computation of checksums to check whether a frame is valid, and any additional encryption and decryption (a checksum sketch follows this list).
3. Data manipulation, which involves extracting and inserting fields in the IP header and also fragmentation and reassembly of packets.
4. Queue management for low-priority and high-priority traffic for QoS.
5. Control processing for updating routing tables and timers, to check for retransmissions and backoff, and so on.
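Item 2 above mentions checksum computation; the sketch below shows the 16-bit ones'-complement checksum of RFC 1071, which is the kind of per-packet arithmetic a network processor accelerates.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over 16-bit words (RFC 1071), as used for IP headers."""
    if len(data) % 2:
        data += b"\x00"                              # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)      # fold carries back into 16 bits
    return ~total & 0xFFFF

# For a received IP header, computing the same sum over the full header (including
# the stored checksum field) yields zero when the header is intact.
print(hex(internet_checksum(b"network processor")))  # checksum over illustrative bytes
```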
CONCLUSIONS
Communication processor designs are evolving rapidly as silicon process technology advances have proven unable to keep up with increasing data rates and algorithm complexity. The need for greater flexibility to support multiple protocols and to be backward compatible exacerbates the design problem because of the need to design programmable solutions that can provide high throughput and meet real-time requirements while being area- and power-efficient. The stringent regulatory requirements on spectrum, transmit power, and interference mitigation make the design of the radio difficult, while the complexity, diverse processing characteristics, and interaction between the physical layer and the higher layers complicate the design of the digital part of the communication processor. Various tradeoffs can be made in communication processors to optimize throughput versus area versus power versus cost, and the decisions depend on the actual application under consideration. We have presented a detailed look at the challenges involved in designing these processors and sample architectures that are being considered for communication processors in the future.
ACKNOWLEDGMENTS
Sridhar Rajagopal and Joseph Cavallaro were supported in
part by Nokia Corporation, Texas Instruments, Inc., and by
NSF under grants EIA-0224458 and EIA-0321266.
BIBLIOGRAPHY
1. H. Zimmermann, OSI reference model - The ISO model of architecture for open systems interconnection, IEEE Trans. Communicat., 28: 425-432, 1980.
2. T. Ojanpera and R. Prasad, eds., Wideband CDMA for Third Generation Mobile Communications, Norwood, MA: Artech House Publishers, 1998.
3. H. Honkasalo, K. Pehkonen, M. T. Niemi, and A. T. Leino, WCDMA and WLAN for 3G and beyond, IEEE Wireless Communicat., 9(2): 14-18, 2002.
4. J. M. Rabaey, Low-power silicon architectures for wireless communications, Design Automation Conference ASP-DAC 2000, Asia and South Pacific Meeting, Yokohama, Japan, 2000, pp. 377-380.
5. T. Richardson and R. Urbanke, The renaissance of Gallager's low-density parity-check codes, IEEE Communicat. Mag., 126-131, 2003.
6. B. Vucetic and J. Yuan, Turbo Codes: Principles and Applications, 1st ed., Dordrecht: Kluwer Academic Publishers, 2000.
7. E. Yeo, Shannon's bound: at what cost? Architectures and implementations of high throughput iterative decoders, Berkeley Wireless Research Center Winter Retreat, 2003.
8. S. Rajagopal, S. Bhashyam, J. R. Cavallaro, and B. Aazhang, Real-time algorithms and architectures for multiuser channel estimation and detection in wireless base-station receivers, IEEE Trans. Wireless Communicat., 1(3): 468-479, 2002.
9. S. Rajagopal, S. Rixner, and J. R. Cavallaro, Improving power efficiency in stream processors through dynamic cluster reconfiguration, Workshop on Media and Streaming Processors, Portland, OR, 2004.
10. B. Aazhang and J. R. Cavallaro, Multi-tier wireless communications, Wireless Personal Communications, Special Issue on Future Strategy for the New Millennium Wireless World, Kluwer, 17: 323-330, 2001.
11. J. H. Reed, ed., Software Radio: A Modern Approach to Radio Engineering, Englewood Cliffs, NJ: Prentice Hall, 2002.
12. T. Gemmeke, M. Gansen, and T. G. Noll, Implementation of scalable and area efficient high throughput Viterbi decoders, IEEE J. Solid-State Circuits, 37(7): 941-948, 2002.
13. S. Rajagopal, S. Rixner, and J. R. Cavallaro, Design-space exploration for real-time embedded stream processors, IEEE Micro, 24(4): 54-66, 2004.
14. N. Zhang, A. Poon, D. Tse, R. Brodersen, and S. Verdu, Trade-offs of performance and single chip implementation of indoor wireless multi-access receivers, IEEE Wireless Communications and Networking Conference (WCNC), vol. 1, New Orleans, LA, September 1999, pp. 226-230.
15. B. Salefski and L. Caglar, Re-configurable computing in wireless, Design Automation Conference, Las Vegas, NV, 2001, pp. 178-183.
16. T. C. Callahan, J. R. Hauser, and J. Wawrzynek, The GARP architecture and C compiler, IEEE Computer, 62-69, 2000.
17. A. Agarwal, RAW computation, Scientific American, 281(2): 60-63, 1999.
18. S. Srikanteswara, R. C. Palat, J. H. Reed, and P. Athanas, An overview of configurable computing machines for software radio handsets, IEEE Communicat. Mag., 2003, pp. 134-141.
19. PACT: eXtreme Processing Platform (XPP) white paper. Available: http://www.pactcorp.com.
20. K. Keutzer, S. Malik, and A. R. Newton, From ASIC to ASIP: The next design discontinuity, IEEE International Conference on Computer Design, 2002, pp. 84-90.
21. S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. D. Owens, A bandwidth-efficient architecture for media processing, 31st Annual ACM/IEEE International Symposium on Microarchitecture (Micro-31), Dallas, TX, 1998, pp. 3-13.
22. H. Corporaal, Microprocessor Architectures - from VLIW to TTA, 1st ed., Wiley International, 1998.
23. S. Rixner, Stream Processor Architecture, Dordrecht: Kluwer Academic Publishers, 2002.
24. D. W. Wall, Limits of instruction-level parallelism, 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Santa Clara, CA, 1991, pp. 176-188.
25. C.-K. Chen, P.-C. Tseng, Y.-C. Chang, and L.-G. Chen, A digital signal processor with programmable correlator architecture for third generation wireless communication system, IEEE Trans. Circuits Systems-II: Analog Digital Signal Proc., 48(12): 1110-1120, 2001.
26. A. Gatherer and E. Auslander, eds., The Application of Programmable DSPs in Mobile Communications, New York: John Wiley and Sons, 2002.
27. A. Gatherer, T. Stetzler, M. McMahan, and E. Auslander, DSP-based architectures for mobile communications: past, present and future, IEEE Communicat. Mag., 38(1): 84-90, 2000.
28. S. Agarwala, et al., A 600-MHz VLIW DSP, IEEE J. Solid-State Circuits, 37(11): 1532-1544, 2002.
29. U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens, Programmable stream processors, IEEE Computer, 36(8): 54-62, 2003.
30. P. Crowley, M. A. Franklin, H. Hadimioglu, and P. Z. Onufryk, Network Processor Design: Issues and Practices, vol. 1, San Francisco, CA: Morgan Kaufmann Publishers, 2002.
SRIDHAR RAJAGOPAL
WiQuest Communications, Inc.
Allen, Texas
JOSEPH R. CAVALLARO
Rice University
Houston, Texas
DATA STORAGE ON MAGNETIC DISKS
INTRODUCTION
Data storage can be performed with several devices, such as hard disk drives (HDDs), tape drives, semiconductor memories, optical disks, and magneto-optical disks. However, the primary choice of data recording in computer systems has been based on HDDs. The reasons for this choice include faster access times at relatively cheaper costs compared with other candidates. Although semiconductor memories may be faster, they are relatively more expensive than HDDs. On the other hand, storage on optical disks used to be cheaper than hard disk storage, but it is slower. In the recent past, HDD technology has grown at such a rapid pace that it is faster than optical disks but costs almost the same in terms of cost per bit.
HDDs have grown at an average rate of 60% during the last decade. This growth has led to enhanced capacity. The increase in the capacity of hard disks, while maintaining the same manufacturing costs, leads to a lower cost per GB (gigabyte, 10^9 bytes). At the time of writing (August 2007), HDDs with a capacity of about 1 TB (terabyte, 10^12 bytes) have been released. At the same time, the size of the bits in HDDs has been reduced significantly, which has led to an increase in areal density (bits per unit area, usually expressed as Gb/in^2, gigabits per square inch). The increase in areal density has enabled HDDs to be used in portable devices as well as in digital cameras, digital video cameras, and MP3 players.
To fuel this tremendous improvement in technology, significant inventions need to be made in the components as well as in the processes and technologies associated with HDDs. This article will provide a brief background of the magnetic disks that store the information in HDDs. Several articles on the Web are targeted at the novice (1-3). Also, several articles in research journals are targeted at the advanced reader (4-7). This article will try to strike a balance between basic and advanced information about HDDs.
DATA STORAGE PRINCIPLES
Any data storage device needs to satisfy certain criteria. It needs to have a storage medium (or media) to store the data, ways to write the data, ways to read the data, and ways to interpret the data. Let us take the example of a book. In a printed book, paper is the storage medium. Writing information (printing) is done using ink, and reading is carried out with the user's eyes. Interpretation of the data is carried out using the user's brain. Components with similar functions exist in an HDD too.
Figure 1(a) shows the components of a typical HDD used in desktop PCs. Some key components that make up an HDD are marked. A disk, a head, a spindle motor, an actuator, and several other components are included. The disk is the recording medium that stores the information, which is similar to the paper of a book. The head performs two functions (similar to the pen and eye in our example), which are writing and reading information. The spindle motor spins the disk so that the actuator (which moves along the radial direction) can carry the head to any part of the disk and read/write information. The printed circuit board [Fig. 1(b)] is the brain of the HDD, which helps to control the HDD and pass meaningful information to the computer or whatever device uses the HDD. Before we learn about these components in detail, let us understand the principles of magnetic recording, which is the basis for storage of information in HDDs.
Magnetic recording was demonstrated successfully as a voice signal recorder by Valdemar Poulsen about 100 years ago. Later, it was exploited for storing all kinds of information. In simple terms, magnetic recording relies on two basic principles: (1) Magnets have north and south poles. The field from these poles can be sensed by a magnetic field sensor, which is a way of reading information. (2) The polarity of the magnets can be changed by applying external magnetic fields, which is a way of writing information. Although earlier systems of magnetic recording, such as audio tapes and video tapes, are analog devices, HDDs are digital devices, which make use of a string of 1s and 0s to store information. In the next few sections, the key technologies that aid in advancing the storage density of HDDs will be discussed briefly. The main focus of the article is, however, on the magnetic disks that store the information.
HEADS
The head is a tiny device [as shown in Fig. 1(a)] that performs the read-write operation in an HDD. Heads have undergone tremendous changes over the years. In the past, both reading and writing operations were carried out using an inductive head. Inductive heads are a kind of transducer that makes use of current-carrying coils wound on a ferromagnetic material to produce magnetic fields (see Fig. 2). The direction of the magnetic field produced by the poles of the ferromagnetic material can be changed by changing the direction of the electric current. This field can be used to change the magnetic polarities of the recording media (writing information).
Inductive heads can also be used for reading information, based on Faraday's law, which states that a voltage will be generated in a coil if a time-varying flux (magnetic field lines) is in its vicinity. When the magnetic disk (in which information is written) rotates, the field that emanates from the recorded bits will produce a time-varying flux, which will lead to a voltage in the inductive head. This voltage can be used to represent 1s or 0s. The inductive head technology was prevalent until the early 1990s. However, to increase the bit density, the bit size had to be decreased, which led to a decrease in the magnetic flux
from the bits. The inductive heads were not sensitive enough to the magnetic field from the smaller bits as the technology progressed. Therefore, more advanced read sensors found their way into the head design. Modern HDDs have two elements in a head: One is a sensor for reading information (analogy: eye), and the other is an inductive writer for writing information (analogy: pen). Such components, in which the sensor and writer are integrated, are called integrated heads or simply heads.
Figure 2 shows the schematic diagram of an integrated head, which consists of a writer and a reader. The inductive head on the right side of the image is the writer that produces the magnetic field to change the magnetic state of a region of the magnetic disk one way or the other. Figure 2 also shows a read sensor (or reader) sandwiched between two shields. Shield 2, as shown in the figure, is usually substituted with one side of the core of the inductive head. The shields are ferromagnetic materials that shunt away any field from bits that are away from the region of interest, much as blinders prevent a horse from seeing to either side. Because of the shields, only the magnetic field from the bit that is being read is sensed by the reader.
HDDs used magneto-resistive (MR) heads for some time (early to late 1990s) before switching to the now-prevalent giant magneto-resistive (GMR) sensors. Unlike inductive heads, MR and GMR heads work on the basis of a change in resistance in the presence of a magnetic field. The GMR sensor, which is shown in multicolor, is in fact made of several magnetic and nonmagnetic layers. GMR devices make use of the spin-dependent scattering of electrons. Electrons have up and down spins. When an electric current is passed through a magnetic material, the magnetic orientation of the material favors the movement of electrons with a particular spin, up or down. In GMR devices, the magnetic layers can be designed in such a way that the device is more resistive or less resistive to the flow of electrons depending on the direction of the field sensed by the sensor. Such a change in the resistance can be used to define the 1 and 0 needed for digital recording.
RECORDING MEDIA
Figure 1. Typical components of an HDD: (a) view of the inside parts of the HDD and (b) view of the printed-circuit-board side.

Figure 2. Schematic diagram of a head in an HDD showing the writer (inductive head), reader, and media below.

As mentioned, this article will discuss the recording media, that is, the magnetic disks that store the information, in more detail. Historically, the first recording media for HDDs were made by spraying a magnetic paint on 24-inch aluminium disks. These types of recording media, in which fine magnetic particles are embedded in a polymer-based solvent, are called particulate media. However, they became obsolete in the late 1980s. Thin film technology took the place of the old technology to coat the recording media.
Fabrication of Magnetic Disks
In thin film technology, the recording medium consists of several layers of thin films deposited by a process called sputtering. Figure 3 shows a typical schematic cross section of the layers that may constitute a perpendicular recording medium, which is an emerging technology at the time of writing. Sputtering, simply said, is a process of playing billiards with atoms to create thin film layers. The sputtering process is carried out in vacuum chambers and is described here briefly. The sputtering chamber, where the thin film deposition takes place, is first pumped down to a low pressure. The objective of achieving this low pressure, called the base pressure, is to minimize the water vapor and other contaminating elements in the deposition chamber. After the base pressure is achieved, whose value depends on the application, argon (Ar) or some other inert gas is admitted into the chamber. When a high negative voltage is applied between the target (the material that needs to be ejected) and the sputtering chamber (ground), the Ar ions that were created by the discharge in the sputtering chamber and accelerated by the high voltage knock on the target and release atoms. These atoms are scattered in different directions, and if the sputtering system is optimized well, most of them arrive at the substrate (the disk that has to be coated).
Several modifications can be made to the process described above to improve the quality of the films obtained. In modern sputtering machines used in the hard disk industry, magnetron sputtering, in which magnets are arranged below the targets and can even be rotated, is used to deposit thin films at faster deposition rates with good uniformity over the entire substrate. Moreover, since the recording media have several layers, modern machines also have several sputtering chambers within a single sputtering machine, so that all layers can be deposited sequentially without exposing the layers to ambient conditions. Figure 4 illustrates the sputtering process. Figure 5 shows the configuration of chambers in a sputtering system from Oerlikon (8). In the media fabrication process, the disk goes through all the chambers to coat the various layers depicted in Fig. 3.
Magnetic Recording Schemes
Figure 6 shows the way data are organized on magnetic disks (also called platters). Several platters may be stacked in an HDD to multiply capacity. In almost all HDDs, the information is stored on both sides of the disks, and read/write heads exist on both surfaces to perform reading and writing of information. For the sake of simplicity, only one side of the disk surface is shown here. As can be seen, the information is stored in circular tracks. Within the tracks, addressed sectors exist, in which the information can be written or read. The random access to and storage of information at an address provided by the CPU comes from the ability to move the head to the desired sector. In state-of-the-art hard disk media, about 150,000 tracks exist, running from the inner diameter (ID) of the disk to the outer diameter (OD). The tracks are packed at such a high density that it is equivalent to squeezing about 600 tracks into the width of a typical human hair (considering the average diameter of a human hair to be around 100 μm). In each track, the bits are packed at a density of 1 million bits per inch (about 4,000 bits in the width of a human hair). If a nano-robot had to run along all the tracks to look at the bits of a contemporary hard disk, it would have almost completed a marathon; that is how dense the bits and tracks in an HDD are. This trend of squeezing the tracks and bits into a smaller area will continue.

Figure 3. Typical cross-sectional view of a recording medium that shows different layers.

Figure 4. Illustration of the sputtering process inside a sputtering chamber.

Figure 7 shows the trend of increase in the areal density (number of
bits in a square-inch) as a function of time. It can be noticed
that the areal density has increased by about 100 million
times in the past five decades.
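As a rough sanity check of the marathon analogy above, the total track length on one disk surface can be estimated from the track count quoted in the text. The short sketch below assumes a 3.5-inch platter with a data band between roughly 22 mm and 46 mm radius; these radii are illustrative assumptions, not figures from the article.

```python
# Rough check of the "marathon" analogy: total length of all tracks on one
# disk surface. The track count is from the article; the radii are assumed.
import math

num_tracks = 150_000                  # tracks per surface (from the article)
r_inner, r_outer = 0.022, 0.046       # metres, assumed data-band radii

# Average circumference times the number of tracks approximates the total length.
r_avg = (r_inner + r_outer) / 2
total_length_km = num_tracks * 2 * math.pi * r_avg / 1000

print(f"Total track length: about {total_length_km:.0f} km "
      f"(a marathon is 42.195 km)")
```

With these assumed dimensions the estimate comes out around 30 km, which is indeed most of a marathon.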
Figure 8 illustrates the recording process in longitudinal recording technology. In this technology, the polarities of the magnets are parallel to the surface of the hard disk. When two identical poles (S-S or N-N) are next to each other, a strong magnetic field emerges from the recording medium, whereas no field emerges when opposite poles (S-N) are next to each other. Therefore, when a magnetic field sensor (a GMR sensor, for example) moves across this surface, a voltage is produced only when the GMR sensor passes over the transitions (regions where like poles meet). This voltage pulse can be synchronized with a clock pulse. If the GMR sensor produces a voltage during the clock window, it represents 1; if no voltage is produced during the clock window, it represents 0. This simple illustration shows how 1s and 0s are stored in hard disk media.
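The clock-window decoding just described can be illustrated with a minimal sketch: the magnetization along a track is modeled as a string of +1/-1 regions, one per clock window, and a bit is read as 1 whenever a transition (like poles meeting) produces a pulse. The pattern below is invented purely for illustration.

```python
# Minimal sketch of longitudinal readback: a pulse (bit 1) appears only where
# the magnetisation changes sign between adjacent clock windows.
magnetisation = [+1, +1, -1, -1, -1, +1, -1, -1, +1, +1]  # one region per clock window

bits = []
for previous, current in zip(magnetisation, magnetisation[1:]):
    # A transition (like poles meeting) produces a voltage pulse -> bit 1.
    bits.append(1 if current != previous else 0)

print(bits)   # [0, 1, 0, 0, 1, 1, 0, 1, 0]
```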
In our previous discussion, groups of bar magnets were used to illustrate the way bits are organized. However, in real hard disk media, the bits are stored in a collection of nano-magnets called grains, for several reasons. Grains (tiny crystallites) are regions of a material within which the atoms are arranged as in a crystal. The grains of current hard disk media have an average size of about 7 nm. Figure 9 shows a close-up view of the bits in a recording medium (bit boundaries are shown by the dark lines). It can be noticed that the bits are not perfectly rectangular. The bit boundaries are not perfectly straight lines; in modern magnetic disks they are mostly determined by the grain boundaries. Also, several grains lie between the bit boundaries. In current technology, about 60 grains are included in a bit.
The reasons for using several grains to store information derive mainly from the fabrication process. The traditional deposition methods of making magnetic disks (such as sputtering) lead to a polycrystalline thin film. If the grains of the polycrystalline material are magnetic, their easy axes (the directions in which their north and south poles naturally align themselves) will point in random directions.

Figure 5. Picture of a sputtering system with different chambers for manufacturing hard disk media. The direction of movement of the disk is marked by arrows.

Figure 6. Organization of data in HDDs.
Moreover, the grains are arranged in random positions within the magnetic media, so relying on one grain to store one bit is not possible. In addition, relying on several grains to store information provides the convenience of storing information in any part of the recording media.
It can be noticed from Fig. 10 that the easy axes point in different directions. The magnetic field, which produces the signal in the read sensor, depends on the component of magnetization parallel to the track. If more grains have their magnetization oriented parallel to the track, the recording medium will exhibit a high signal and low noise. When designing a recording medium, it is important to achieve a high signal-to-noise ratio (SNR) at high densities, as a high SNR enables the bits to be read reliably. Hard disk media make use of a mechanical texturing technology to maximize the number of grains whose easy axes are parallel to the track. However, even with improvements to this technology, not all grains can be arranged with their easy axes parallel to the track. For these reasons, several grains of a magnetic disk are used to store one bit.
Design of a Magnetic Recording Medium
In the previous section, it was highlighted that several grains (tiny crystals) are used for storing information. The SNR, which determines how reliably the bits can be read, depends on the number of grains in a bit. When the areal density of an HDD needs to be increased, the bit size needs to be shrunk. Therefore, to maintain an acceptable SNR, the grain size also needs to be shrunk, so that the number of grains per bit remains more or less the same. Reduction of grain size is therefore one of the areas on which the attention of researchers is always focused (9,10). Figure 11 shows the reduction of grain size in recording media over several generations. As the technology progressed, the grain size decreased. However, significant improvements in other parts of HDDs can also relax the requirement on grain size. Although the hard disk media of a decade ago had about 600 grains per bit, current media have only about 60 grains per bit. This decrease shows that, along with improvements in recording media technologies, the other components (such as the head) and technologies (such as signal processing) have also progressed significantly enough to read a bit reliably from just 60 grains.
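A commonly used first-order rule of thumb, added here as an assumption rather than a statement from the article, is that the grain-counting contribution to the medium SNR scales roughly as 10 log10(N) dB for N grains per bit. Under that assumption, the drop from about 600 to about 60 grains per bit corresponds to roughly a 10-dB penalty that improved heads and signal processing had to recover.

```python
# Hedged rule-of-thumb estimate (an assumption, not from the article):
# the grain-statistics part of the SNR scales roughly as 10*log10(N) dB.
import math

def snr_db(grains_per_bit: int) -> float:
    """Rule-of-thumb SNR contribution from grain-counting statistics."""
    return 10 * math.log10(grains_per_bit)

for n in (600, 60):   # grain counts quoted in the article
    print(f"{n:4d} grains/bit -> about {snr_db(n):.1f} dB")
# Dropping from 600 to 60 grains costs roughly 10 dB under this assumption.
```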
It is also essential to achieve sharper bit boundaries in order to pack bits closer together. The sharpness of the bit boundaries deteriorates in the case of longitudinal recording because identical magnetic poles facing each other at a bit boundary do not favor a sharp magnetic change at the boundary. On the other hand, in the case of perpendicular recording, which will be described in detail later, where the magnetization is formed in the direction perpendicular to the disk plane, opposite magnetic poles facing each other at a bit boundary help to form a sharp magnetic change. Therefore, the sharpness
of the bit boundaries is determined by the nature of the grains in the recording medium.

Figure 9. Close-up view of bits in a recording medium.

Figure 7. Trend in the increase of areal density of HDDs (areal density, in Gb/in², plotted against year from 1956 to 2001).

Figure 8. Recording principle of longitudinal recording.
If the grains are isolated from each other, each grain acts as a magnet by itself, and the bit boundaries will be sharper. On the other hand, if the grains are not well isolated, a few grains could switch together, and the bit boundaries will be broader. It is therefore essential to achieve good inter-grain separation when the recording medium is designed. In the emerging perpendicular recording technology, the hard disk media are based on a CoCrPt:SiO2 alloy. The significant presence of Co makes this alloy magnetic, and Pt helps to improve the magneto-crystalline anisotropy energy. If this energy is greater, reversal of the magnetic direction becomes more difficult, leading to long-term storage without self-erasure or erasure by small external fields. The additions of Cr and SiO2 are responsible for improving the inter-grain separation. When the magnetic layer is formed from CoCrPt:SiO2, oxides of Cr and Si are formed in the grain boundaries. Because these oxides are nonmagnetic, the magnetic grains are isolated from each other, which helps to obtain narrower bit boundaries.
It is also important to achieve a desired crystallographic orientation when a recording medium is designed. Cobalt, the most commonly used hard disk media layer, has a hexagonal close-packed crystal structure. In a cobalt crystal, the north and south poles prefer to lie along the c-axis (the axis through the center of the hexagon). Therefore, when a recording medium is made, it is necessary to orient the c-axis in a desired way. For example, it is necessary to orient the c-axis perpendicular to the disk in the emerging perpendicular recording technology (and parallel to the disk in longitudinal recording technology). If the recording medium is deposited directly on the substrate, it may not be possible to achieve the desired orientation. Therefore, several layers are usually deposited below the recording layer (see Fig. 3) to improve the desired orientation. In recording media design, the choice of materials and the process conditions for the layers play a critical role.
In addition, several requirements such as corrosion resistance and a smooth surface must be met when designing recording media. The designed media must not corrode for a period of at least 10 years. The protection layers that prevent corrosion cannot be thicker than 4 nm (as of the year 2007). In the future, the protection layers will need to be thinned down to closer to 1 nm (about 1/100,000th the thickness of a human hair). The surface of the media should be very smooth, with a roughness of only about 0.2 nm, to enable smooth flying of the head over the media surface without crashing. All requirements must be met while keeping the cost of the hard disk media below about US$5.
OTHER COMPONENTS
The HDD involves several other advanced technologies in its components and mechanisms. For example, the slider that carries the read/write head has to fly at a height of about 7 nm (or lower in the future). This flying condition is achieved when the disk below rotates at a linear velocity of 36 km/h. If a comparison is made between the slider and a Boeing 747, flying the slider at a height of 7 nm is like flying the jumbo jet at a height of 7 mm above the ground. In some modern HDDs, the flying height of the slider can be lowered by about 2 nm when information needs to be read or written. This allows the head to fly at a safer height during the idle state and to fly closer when needed.

Several challenges are associated with almost all components of an HDD. Because the head needs to fly at a very low height over the media surface, the hard disk needs to be in a very clean environment. Therefore, HDDs are sealed with a rubber gasket and protected from the outside environment. The components inside the HDD (including the screws) should not emit any contaminant particles or gases, as these would contaminate the environment and lead to a crash between the head and the media. The motor of the HDD should provide sufficient torque for the heavy platters to rotate. Yet the motor should be quiet, stable in speed, and free from vibrations. In addition, it should be thin enough to fit into smaller HDDs and consume little power.
Figure 10. Magnetic orientations of grains in a recorded bit.

Figure 11. Average size of grains in recording media of different generations (grain diameter, in nm, plotted against areal density from 1 to 1000 Gb/in²).

OTHER TECHNOLOGIES

HDDs also use several advanced technologies to meet the ever-increasing demand for higher capacity and lower cost per gigabyte. For an Olympic sprinter running at a speed of 10 m/s, staying in his track, which is about 1.25 m wide, is not a serious problem. In an HDD, however, the head, which moves at similar speeds, has to stay in its track,
which is only about 160 nm wide (about 8 million times narrower than the Olympic track). Following the track gets more complicated in the presence of airflow. The airflow inside the HDD causes vibrations in the actuator and the suspension that move and carry the head. Therefore, it is necessary to study the airflow and take steps to minimize the vibrations and stay on track. These issues will become more challenging in the future, as the tracks will be packed at even higher densities than before. It was also mentioned that HDDs now use fewer grains per bit than in the recent past, which reduces the SNR and makes signal processing more complicated. It is therefore clear that the HDD involves the creation of new technologies that must advance hand-in-hand; HDD technology is one of the few technologies that requires multidisciplinary research.
RECENT DEVELOPMENTS
In magnetic recording, as mentioned, small grains help to store the information. The energy that keeps the information stable depends on the volume of the grain. When the grains become smaller (as technology progresses), this energy becomes smaller, which leads to a problem called superparamagnetism. In superparamagnetism, the north and south poles of a magnet fluctuate because of thermal agitation, without the application of any external field. Even if only 5% of the grains of a bit are superparamagnetic, data will be lost. It was once reported that an areal density of 35 Gb/in² would be the limit of magnetic recording (11). However, HDD technology has surpassed this limit, and the current areal density of products is at least four times larger. Several techniques are employed to overcome the limit of 35 Gb/in², but describing those techniques is beyond the scope of this article. One technology that can carry recording densities to higher values, at least as high as 500 Gb/in² (roughly three times the areal density of the products of 2006), is called perpendicular recording technology.
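The superparamagnetic problem is often expressed through the thermal stability factor KuV/(kBT), where Ku is the anisotropy energy density, V the grain volume, kB Boltzmann's constant, and T the temperature; a common design rule (an assumption added here, not a figure from the article) is that this ratio should exceed roughly 60 for about ten years of retention. The sketch below uses the article's 7-nm grain size together with an assumed layer thickness and an assumed CoCrPt-class Ku.

```python
# Hedged illustration of the thermal stability factor K_u*V / (k_B*T).
# Grain diameter is from the article; K_u and the layer thickness are assumed.
k_B = 1.38e-23          # Boltzmann constant, J/K
T = 300.0               # room temperature, K
K_u = 5.0e5             # J/m^3, assumed anisotropy energy density (CoCrPt-class)
d = 7e-9                # grain diameter, m (from the article)
thickness = 15e-9       # m, assumed magnetic layer thickness

V = 3.14159 * (d / 2) ** 2 * thickness      # grain volume (cylinder approximation)
ratio = K_u * V / (k_B * T)
print(f"K_u*V / k_B*T = {ratio:.0f}  (rule of thumb: > ~60 for long-term stability)")
```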
Although perpendicular recording technology was proposed in the mid-1970s, it has been implemented successfully in products only since 2006. In this technology, the north and south poles of the magnetic grains that store the information lie perpendicular to the plane of the disk (12). In addition, this technology has two main differences in the head and media design. First, the media design includes a soft magnetic underlayer (a material whose magnetic polarity can be changed with a small field), marked in Fig. 3 as SUL1 and SUL2. Second, the head has a different architecture in that it produces a strong perpendicular magnetic field that goes through the recording layer and returns through another pole with a weaker field via the SUL (see Fig. 12). The field becomes weaker at the return pole because of the difference in the geometry of the two poles. This configuration achieves a higher writing field than is possible with the prevalent longitudinal recording technology. When higher writing fields are possible, smaller grains (whose magnetic polarities are more stable) can be used to store information. When smaller grains are used, the bit sizes can be smaller, leading to higher areal densities. This increase in writability is one reason that gives perpendicular recording technology the edge.
Perpendicular recording technology also offers larger Mr·t values for the media (the product of the remanent moment Mr and the thickness t, that is, the magnetic moment per unit area) without degrading the recording resolution. The resolution and the noise are governed by the size of the magnetic clusters (or, in the extreme case, the grain size). Perpendicular recording also offers increased thermal stability at high linear densities. All these characteristics come from the difference in the demagnetizing effect at the magnetic transitions.
Figure 12. Illustration of perpendicular recording technology.
FUTURE STORAGE: WHAT IS IN STORE?
The recently introduced perpendicular recording technology is expected to continue for the next four to five years without significant hurdles. However, some new technologies will be needed after five years or so to keep increasing the areal density of HDDs. One such technology that is being investigated intensively is heat-assisted magnetic recording (HAMR). In this technology, a fine spot (about 30 nm) of the recording medium is heated to enable the writing operation; without heating, writing or erasure cannot be achieved. Once the writing is carried out, the medium cools to room temperature, at which the data are thermally stable.
Several challenges associated with this technology are being investigated. Another technology that is considered a competitor to HAMR is bit-patterned media recording. In bit-patterned media recording, the media are (lithographically or otherwise) defined to have regions of magnetic and nonmagnetic material. The magnetic material stores the information, whereas the nonmagnetic material defines the boundary and helps to reduce inter-bit magnetic interactions and, hence, noise. Because the magnetic region per bit in this technology is larger than a single grain in perpendicular recording technology, the information is more stable with bit-patterned media. However, to be competitive, feature sizes of about 10 nm need to be achieved lithographically without significantly increasing costs. Achieving a 10-nm feature size over large areas at low cost is a challenge that cannot be tackled soon, and further challenges come from the other components that must meet the requirements of bit-patterned media recording. At this point, it is not clear which technology will take the lead, but it is widely foreseen that far-future HDDs may combine both technologies. Therefore, research is ongoing on both of these technologies.
When the new technologies described above enter HDDs, in about six to seven years from now, the areal density will be about 1000 Gb/in². With this kind of areal density, a desktop HDD could store about 5 TB or more, and a laptop could carry an HDD that stores 1 TB. It is almost certain that these high-capacity devices will be available at the same price as a 400-GB desktop drive today.
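The capacity projection can be sanity-checked from the quoted areal density. The platter geometry and platter count in the sketch below are assumptions chosen for illustration, not values from the article.

```python
# Rough check of the capacity projection: 1000 Gb/in^2 spread over the
# recordable band of a 3.5-inch platter. Radii and platter count are assumed.
import math

areal_density = 1000                     # Gb per square inch (from the article)
r_outer, r_inner = 1.8, 0.9              # assumed recordable-band radii, inches
surfaces = 2 * 3                         # assumed: 3 platters, both sides used

band_area = math.pi * (r_outer ** 2 - r_inner ** 2)             # square inches
capacity_tb = areal_density * band_area * surfaces / 8 / 1000   # Gb -> GB -> TB
print(f"Roughly {capacity_tb:.1f} TB per drive")
```

Under these assumptions the estimate is a little under 6 TB, consistent with the "about 5 TB or more" figure above.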
SUMMARY
The working principle of an HDD has been introduced, with detailed attention given to the magnetic disks that store the information. In an HDD, a head that contains a sensor and a writer is used for reading and writing information, and the information is stored on magnetic disks. To achieve high densities, it is necessary to reduce the size of the grains (tiny magnets) that store the information. Recently, perpendicular recording technology has been introduced in the market to continue the growth of HDD technology. Although perpendicular recording technology may last for five to eight more years, alternative technologies are being sought. Heat-assisted magnetic recording and bit-patterned media recording are two of the candidates for the future.
BIBLIOGRAPHY
1. Available: http://www.phptr.com/content/images/0130130559/samplechapter/0130130559.pdf.
2. Available: http://en.wikipedia.org/wiki/Hard_disk_drive.
3. Available: http://computer.howstuffworks.com/hard-disk1.htm.
4. A. Moser, K. Takano, D. T. Margulies, M. Albrecht, Y. Sonobe, Y. Ikeda, S. Sun, and E. E. Fullerton, J. Phys. D: Appl. Phys. 35: R157-R167, 2002.
5. D. Weller and M. F. Doerner, Annu. Rev. Mater. Sci. 30: 611-644, 2000.
6. S. N. Piramanayagam, J. Appl. Phys. 102: 011301, 2007.
7. R. Sbiaa and S. N. Piramanayagam, Recent Patents on Nanotechnology 1: 29-40, 2007.
8. Available: http://www.oerlikon.com.
9. S. N. Piramanayagam et al., Appl. Phys. Lett. 89: 162504, 2006.
10. M. Zheng et al., IEEE Trans. Magn. 40 (4): 2498-2500, 2004.
11. S. H. Charap, P.-L. Lu, and Y. J. He, IEEE Trans. Magn. 33: 978, 1997.
12. S. Iwasaki, IEEE Trans. Magn. 20: 657, 1984.

S. N. PIRAMANAYAGAM
Data Storage Institute
Singapore
ELECTRONIC CALCULATORS
People have used calculating devices of one type or another
throughout history. The abacus, which uses beads to keep
track of numbers, was invented over 2000 years ago and is
still used today. Blaise Pascal invented a numerical wheel
calculator, a brass box with dials for performing addition, in the seventeenth century (1). Gottfried Wilhelm von
Leibniz soon created a version that could also multiply,
but mechanical calculators were not widely used until the
early nineteenth century, when Charles Xavier Thomas de
Colmar invented a machine that could perform the four
basic functions of addition, subtraction, multiplication, and
division. Charles Babbage proposed a steam-powered
calculating machine around 1822, which included many
of the basic concepts of modern computers, but it was never
built. A mechanical device that used punched cards to
store data was invented in 1889 by Herman Hollerith
and then used to mechanically compile the results of the
U.S. Census in only 6 weeks instead of 10 years. A bulky
mechanical calculator, with gears and shafts, was devel-
oped by Vannevar Bush in 1931 for solving differential
equations (2).
The first electronic computers used technology based on vacuum tubes, resistors, and soldered joints, and thus they were much too large for use in portable devices. The invention of the transistor (replacing vacuum tubes), followed by the invention of the integrated circuit by Jack Kilby in 1958, led to the shrinking of electronic machinery to the point where it became possible to put simple electronic computer functionality into a package small enough to fit into a hand or a pocket.

Logarithms, developed by John Napier around 1600, can be used to solve multiplication and division problems with the simpler operations of addition and subtraction. Slide rules are mechanical, analog devices that are based on the idea of logarithms and use calibrated sticks or disks to perform multiplication and division to three or four significant figures. Slide rules were an indispensable tool for engineers until they were replaced by handheld scientific calculators starting in the early 1970s (3).
CALCULATOR TYPES AND USES
Electronic calculators come in a variety of types: four-function (addition, subtraction, multiplication, and division), desktop, printing, and scientific. Figure 1 shows various calculators with prices ranging from $5 to $150. Scientific calculators can calculate square roots, logarithms and exponents, and trigonometric functions. The scientific category includes business calculators, which have time-value-of-money, amortization, and other money management functions. Graphing calculators are a type of scientific calculator with a display that can show function plots. Advanced scientific and graphing calculators also have user programming capability that allows the user to enter and store programs. These programs can record and automate calculation steps, be used to customize the calculator, or perform complicated or tedious algorithms. Some handheld calculators are solar powered, but most advanced scientific calculators are powered by batteries that last for many months without needing replacement.
Scientific Calculators

Scientific calculators can perform trigonometric functions and inverse trigonometric functions (sin x, cos x, tan x, arcsin x, arccos x, arctan x) as well as hyperbolic and inverse hyperbolic functions (sinh x, cosh x, tanh x, arcsinh x, arccosh x, arctanh x). They can also find natural and common logarithms (ln x, log x), exponential functions (e^x, y^x, y^(1/x)), factorials (n!), and reciprocals (1/x). Scientific calculators contain a representation for the constant π, and they can convert angles between degrees and radians. Most scientific calculators accept numbers with 10 to 12 digits and exponents ranging from -99 to 99, although some allow exponents from -499 to 499.
Graphing Calculators
Graphing calculators were first developed in the late 1980s as larger liquid-crystal displays (LCDs) became available at lower cost. The pixels in an LCD can be darkened individually and so can be used to plot function graphs. The user keys in a real-valued function of the form y = f(x) and makes some choices about the scale to use for the plot and the set of values for x. The calculator then evaluates f(x) for each specified x value and displays the resulting (x, f(x)) pairs as a function graph. Graphing calculators can also plot polar and parametric functions, three-dimensional (3-D) wireframe plots, differential equations, and statistics graphs such as scatter plots, histograms, and box-and-whisker plots. (See Fig. 2.) Once a graph has been displayed, the user can move a small cursor or crosshairs around the display by pressing the arrow or cursor keys and then obtain information about the graph, such as the coordinates of points, the x-intercepts, or the slope of the graph at a certain point. The user can also select an area of interest to zoom in on, and the calculator will replot the graph using a different scale (4).
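A minimal sketch of that plotting loop is shown below: a function is sampled across a chosen x-range and each (x, f(x)) pair is mapped onto a small character grid standing in for the pixel display. The grid size and the sample function are arbitrary choices for illustration.

```python
# Toy version of a graphing calculator's plot loop: sample y = f(x) and
# "darken" one character cell per column on a small text grid.
import math

COLS, ROWS = 40, 16                      # a toy stand-in for the pixel display
x_min, x_max = -2 * math.pi, 2 * math.pi
y_min, y_max = -1.5, 1.5
f = math.sin                             # the user-entered function

grid = [[" "] * COLS for _ in range(ROWS)]
for col in range(COLS):
    x = x_min + (x_max - x_min) * col / (COLS - 1)
    y = f(x)
    if y_min <= y <= y_max:
        row = int((y_max - y) / (y_max - y_min) * (ROWS - 1))
        grid[row][col] = "*"             # darken the pixel

print("\n".join("".join(row) for row in grid))
```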
Programmable Calculators
If a series of steps is to be repeated using various different inputs, it is convenient to be able to record those steps and replay them automatically. Simple programmable calculators allow the user to store a sequence of keystrokes as a program. More complicated programmable calculators provide programming languages with many of the components of high-level computer languages, such as branching and subroutines.

Given all these types and uses of calculators, what is it that defines a calculator? The basic paradigm of a calculator is key per function. For example, one key is dedicated to the square-root function on most scientific calculators. All the user has to do is input a number and then press one key, and the calculator performs a complicated series of steps to obtain an answer that a person could not easily calculate on their own. Another way to say this is that there is an asymmetry of information flow: Given a small amount of input, the calculator does something nontrivial and gives back results that the user could not easily have found in their head or with pencil and paper.
CALCULATOR HARDWARE COMPONENTS
Today's advanced scientific and graphing calculators have many similarities to computers. The block diagram in Fig. 3 shows the system architecture of an advanced scientific graphing calculator (5). The two main components of a calculator are hardware and software. The hardware includes plastic and metal packaging, the display, the keypad, optional additional input/output devices (such as infrared, serial ports, card slots, and beeper parts to produce sound), the power supply circuit, and an electronic subsystem. The electronic subsystem consists of a printed circuit board with attached electronic components and integrated circuits, including a central processing unit (CPU), display controllers, random access memory (RAM), and the read-only memory (ROM) where software programs are stored permanently.

The mechanical design of a calculator consists of subassemblies such as a top case with display and keypad, a bottom case, and a printed circuit or logic assembly. Figure 4 shows the subassemblies of a graphing calculator. A metal chassis in the top case supports the keypad, protects and frames the glass display, and provides a negative battery contact. The metal chassis is also part of the shielding, which protects the electronic circuitry from electrostatic discharge (ESD). The bottom case may contain additional metal shielding, a piezoelectric beeper part, and circuitry for battery power. The subassemblies are connected electrically with flexible circuits (6).
Display
Early calculators used light-emitting diode (LED) displays, but liquid crystal displays (LCDs) are used in most modern calculators because they have low voltage requirements and good visibility in high ambient light conditions, and they can produce a variety of character shapes and sizes (7). An LCD consists of two pieces of glass with a layer of liquid crystal in between, which will darken in specific areas when a voltage signal is applied. These areas can either be relatively large segments that are combined a few at a time to represent a number or character, or they can be a grid of very small rectangles (also called picture elements or pixels) that can be darkened selectively to produce characters, numbers, and more detailed graphics. Basic calculators have one-line displays that show one row of numbers at a time, whereas today's more advanced calculators can display eight or more rows of characters with 22 or more characters per row, using a display with as many as 100 rows and 160 columns of pixels.

Figure 1.

Figure 2.
Keypad
Calculator keypads are made up of the keys that the user presses, an underlying mechanism that allows the keys to be depressed and then to return to their initial state, and circuit traces that allow the system to detect a key press. When a key is pressed, an input register line and an output register line make contact, which causes an interrupt to be generated. This interrupt is a signal to the software to scan the keyboard to see which key is pressed. Keypads have a different tactile feel depending on how they are designed and of what materials they are made. Hinged plastic keys and dome-shaped underlying pads are used to provide feedback to the user with a snap when keys are pressed. An elastomer membrane separating the keys from the underlying contacts helps to protect the electronic system from dust (8).
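The scan that follows such an interrupt can be sketched as follows, assuming a hypothetical 3 x 3 key matrix: the firmware drives each output register line in turn and checks which input register line reports contact. The matrix size, key labels, and helper names below are illustrative assumptions, not details of any particular calculator.

```python
# Sketch of a keypad matrix scan after a key-press interrupt (hypothetical 3x3 matrix).
KEYMAP = [["7", "8", "9"],
          ["4", "5", "6"],
          ["1", "2", "3"]]

def read_input_lines(driven_output_line, pressed):
    """Stand-in for the hardware read: returns the input lines pulled into contact."""
    row, col = pressed
    return [col] if row == driven_output_line else []

def scan_keyboard(pressed):
    # Drive each output register line in turn and look for a closed contact.
    for out_line in range(len(KEYMAP)):
        for in_line in read_input_lines(out_line, pressed):
            return KEYMAP[out_line][in_line]
    return None

print(scan_keyboard(pressed=(1, 2)))   # -> "6"
```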
The keypad is an input device, because it is a way for the user to provide information to the calculator. The display is an output device, because it allows the calculator to convey information to the user. Early calculators, and today's simple calculators, make do with only these input and output devices. But as more and more memory has been added to calculators, allowing more data and more extensive programs to be stored in calculator memory, the keypad and display have become bottlenecks. Different ways to store and input data and programs have been developed to alleviate these input bottlenecks. Infrared and serial cable ports allow some calculators to communicate with computers and other calculators to quickly and easily transfer data and programs.
Circuits
The electronic components of a calculator form a circuit that includes small connecting wires, which allow electric current to flow to all parts of the system. The system is made up of diodes; transistors; passive components such as resistors, capacitors, and inductors; as well as conventional circuits and integrated circuits that are designed to perform certain tasks. One of these specialized circuits is an oscillator that serves as a clock and is used to control the movement of bits of information through the system. Another type of specialized circuit is a logic circuit, or processor, which stores data in registers and performs manipulations on data such as addition.
Printed Circuit Assembly
A printed circuit board (PCB) forms the backbone of a calculator's electronic circuit system, allowing various components to be attached and connected to each other (9). Figure 5 shows a calculator printed circuit assembly with many electronic components labeled. Wires that transmit data and instructions among the logic circuits, the memory circuits, and the other components are called buses. Various integrated circuits may be used in a calculator, including a central processing unit (CPU), RAM, ROM, and Flash ROM memory circuits, memory controllers that allow the CPU to access the memory circuits, a controller for the display, quartz-crystal-controlled clocks, and controllers for optional additional input/output devices such as serial cable connectors and infrared transmitters and receivers. Depending on the design of the calculator and the choice of components, some of these pieces may be incorporated in a single integrated circuit called an application-specific integrated circuit (ASIC).

Figure 3. Block diagram of an advanced graphing calculator: the CPU connects over a bus to the display driver and liquid-crystal display, RAM, ROM, the keypad input and output registers, and other I/O devices (IR port, serial port, card ports).

Figure 4.
Central Processing Unit
The central processing unit (CPU) of a calculator or computer is a complicated integrated circuit consisting of three parts: the arithmetic-logic unit, the control unit, and the main storage unit. The arithmetic-logic unit (ALU) carries out the arithmetic operations of addition, subtraction, multiplication, and division and makes logical comparisons of numbers. The control unit receives program instructions, then sends control signals to different parts of the system, and can jump to a different part of a program under special circumstances such as an arithmetic overflow. The main storage unit stores data and instructions that are being used by the ALU or control unit.

Many calculators use custom microprocessors because commercially available microprocessors designed for larger computers do not take into account the requirements of a small, handheld device. Calculator microprocessors must operate well under low-power conditions, should not require too many support chips, and generally must work with a smaller system bus, because wider buses use more power and require additional integrated circuit pins, which increases part costs. Complementary metal-oxide semiconductor (CMOS) technology is used for many calculator integrated circuits because it is well suited to very low power systems (10). CMOS has very low power dissipation and can retain data even with drastically reduced operating voltage. CMOS is also highly reliable and has good latch-up and ESD protection.
Memory
RAM integrated circuits are made up of capacitors, which represent bits of information. Each bit may be in one of two possible states, depending on whether the capacitor is holding an electric charge. Any bit of information in RAM can easily be changed, but the information is retained only as long as power is supplied to the integrated circuit. In continuous-memory calculators, the information in RAM is retained even when the calculator is turned OFF, because a small amount of power is still being supplied by the system. The RAM circuits used in calculators have very low standby current requirements and can retain their information for short periods of time without power, such as when batteries are being replaced.
ROM circuits contain information that cannot be changed once it is encoded. Calculator software is stored in ROM because it costs less and has lower power requirements than RAM. Because software is encoded in ROM by the manufacturer and cannot be changed, the built-in software in calculators is often called firmware. When calculator firmware is operating, it must make use of some amount of RAM whenever values need to be recorded in memory. The RAM serves as a scratch pad for keeping track of inputs from the user, results of intermediate calculations, and final results. The firmware in most advanced scientific calculators consists of an operating system (which is a control center for coordinating the low-level input and output, memory access, and other system functions), user interface code, and the mathematical functions and other applications in which the user is directly interested.
Some calculators use Flash ROM instead of ROM to store the programs provided by the calculator manufacturer. This type of memory is more expensive than ROM, but it can be erased and reprogrammed, allowing the calculator software to be updated after the calculator has been shipped to the customer. Flash ROM does not serve as a replacement for RAM, because Flash ROM can only be reprogrammed a limited number of times (approximately tens of thousands of times), whereas RAM can accommodate an effectively unlimited number of changes in the value of a variable that may occur while a program is running.

Figure 5. Calculator printed circuit assembly.
OPERATING SYSTEM
The operation of a calculator can be broken down into the three basic steps of input, processing, and output. For example, to find the square root of a number, the user first uses the keypad to enter a number and choose the function to be computed. This input generates electronic signals that are processed by the calculator's electronic circuits to produce a result. The result is then communicated back to the user via the display. The processing step involves storing data using memory circuits and making changes to that data using logic circuits, as well as the general operation of the system, which is accomplished using control circuits.

A calculator performs many tasks at the system level of which the user is not normally aware. These tasks include those that go along with turning the calculator ON or OFF, keeping track of how memory is being used, managing the power system, and all the overhead associated with getting input from the keypad, performing calculations, and displaying results. A calculator's operations are controlled by an operating system, which is a software program that provides access to the hardware computing resources and allows various application software programs to be run on the computer (or calculator).
The operating system deals with memory organization, data structures, and resource allocation. The resources that it controls include CPU time, memory space, and input/output devices such as the keypad and the display. When an application program is executed, the operating system is responsible for running the program by scheduling slices of CPU time that can be used for executing the program steps, and for overseeing the handling of any interrupts that may occur while the program is executing. Interrupts are triggered by events that need to be dealt with in a timely fashion, such as key presses, requests from a program for a system-level service such as refreshing the display, or program errors. Some types of errors that may occur when a program is running are low-power conditions, low-memory conditions, arithmetic overflow, and illegal memory references. These conditions should be handled gracefully, with appropriate information given to the user. Operating systems provide convenience and efficiency: They make it convenient to execute application programs, and they manage system resources to get efficient performance from the computer or calculator (11).
USER INTERFACE
The user interface for one-line-display calculators is very simple, consisting of a single number shown in the display. The user may have some choice about the format of that number, such as how many digits to display to the right of the decimal point, or whether the number should be shown using scientific notation. Error messages can be shown by spelling out short words in the display. For calculators more complicated than the simple four-function calculators, the number of keys on the keypad may not be enough to use one per operation that the calculator can perform. It then becomes necessary to provide a more extensive user interface than just a simple keypad. One way to increase the number of operations that the keypad can control is to add shifted keys. For example, one key may have the square-root symbol on the key and the symbol x² printed just above the key, usually in a second color. If the user presses the square-root key, the square-root function is performed. But if the user first presses the shift key and then presses the square-root key, the x-squared function will be performed instead.
Advanced scientific and graphing calculators provide systems of menus that let the user select operations. These menus may appear as lists of items in the display that the user can scroll through using arrow or cursor keys and then select by pressing the Enter key. Changeable labels in the bottom portion of the display, which correspond to the top row of keys, can also be used to display menu choices. These are called soft keys, and they are much like the function keys on a computer. Methods for the user to enter information into the calculator depend on the type of calculator. On simple, one-line-display calculators, the user presses number keys and can see the corresponding number in the display. Graphing calculators, with their larger displays, can prompt the user for input and then display the input using dialog boxes like the ones used on computers (12). Figure 6 shows a graphing calculator dialog box used to specify the plot scale.
NUMBERS AND ARITHMETIC
The most basic level of functionality that is apparent to the calculator user is the arithmetic functions: addition, subtraction, multiplication, and division. All calculators perform these functions, and some calculators are limited to these four functions. Calculators perform arithmetic using the same types of circuits that computers use. Special circuits based on Boolean logic are used to combine numbers, deal with carries and overdrafts, and find sums and differences. Various methods have been developed to perform efficient multiplication and division with electronic circuits (13).
Binary Numbers
Circuits can be used to represent zeros or ones because they can take on two different states (such as ON or OFF). Calculator (and computer) memory at any given time can be thought of as simply a large collection of zeros and ones. Zeros and ones also make up the binary or base-2 number system. For example, the (base-10) numbers 1, 2, 3, 4 are written in base 2 as 1, 10, 11, 100, respectively. Each memory circuit that can be used to represent a zero or one is called a binary digit or bit. A collection of eight bits is called a byte (or a word). Some calculator systems deal with four bits at a time, called nibbles. If simple binary numbers were used to represent all numbers that could possibly be entered into a calculator, many bits of memory would be needed to represent large numbers. For example, the decimal number 2^n is represented by the binary number consisting of a 1 followed by n zeros, and so requires n + 1 bits of memory storage. To be able to represent very large numbers with a fixed number of bits, and to optimize arithmetic operations for the design of the calculator, floating-point numbers are used in calculators and computers.

Figure 6. A graphing calculator dialog box used to specify the plot scale.
Floating-Point Numbers
Floating-point numbers are numbers in which the location of the decimal point may move, so that only a limited number of digits are required to represent large or small numbers, which eliminates leading or trailing zeros; but their main advantage for calculators and computers is that they greatly increase the range of numbers that can be represented using a fixed number of bits. For example, a number x may be represented as x = (-1)^s × F × b^E, where s is the sign, F is the significand or fraction, b is the base used in the floating-point hardware, and E is a signed exponent. A fixed number of bits is then used to represent each number inside the calculator. The registers in a CPU that is designed for efficient floating-point operations have three fields that correspond to the sign, significand, and exponent, and these fields can be manipulated separately.
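As a small illustration of the sign/significand/exponent split, the sketch below decomposes a value using base b = 2 via Python's math.frexp; calculators typically use base 10, so this is only an analogy, and the helper name is ours.

```python
# Decompose a number into sign, significand, and exponent with base b = 2.
import math

def decompose(x: float):
    s = 0 if x >= 0 else 1
    F, E = math.frexp(abs(x))        # abs(x) = F * 2**E with 0.5 <= F < 1
    return s, F, E

s, F, E = decompose(-6.25)
print(s, F, E)                       # 1 0.78125 3, i.e. -6.25 = (-1)**1 * 0.78125 * 2**3
```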
Two types of errors can appear when a calculator returns an answer. One type is avoidable and is caused by inadequate algorithms. The other type is unavoidable and is the result of using finite approximations for infinite objects. For example, the infinitely repeating decimal representation for 2/3 is displayed as .6666666667 on a 10-decimal-place calculator. A system called binary-coded decimal (BCD) is used on some calculators and computers as a way to deal with rounding. Each decimal digit, 0, 1, 2, 3, . . ., 9, is represented by its four-bit binary equivalent: 0000, 0001, 0010, 0011, . . ., 1001. So rather than convert each base-10 number into the equivalent base-2 number, the individual digits of the base-10 number are each represented with zeros and ones. When arithmetic is performed using BCD numbers, the methods for carrying and rounding follow base-10 conventions.
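A small sketch of the digit-by-digit encoding described above: each base-10 digit is placed in its own four-bit nibble instead of converting the whole value to base 2.

```python
# Binary-coded decimal: one 4-bit nibble per decimal digit.
def to_bcd(number: int) -> str:
    """Return the BCD nibbles of a non-negative integer as a bit string."""
    return " ".join(format(int(digit), "04b") for digit in str(number))

print(to_bcd(2048))    # -> 0010 0000 0100 1000
print(to_bcd(97))      # -> 1001 0111
```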
One way to improve results that are subject to rounding
errors is to use extra digits for keeping track of intermedi-
ate results, and then do one rounding before the result is
returned using the smaller number of digits that the user
sees. For example, some advanced scientic calculators
allow the user to input numbers using up to 12 decimal
places, and return results in this same format, but 15-digit
numbers are actually used during calculation.
Reverse Polish Notation and Algebraic Logic System
The Polish logician Jan Lukasiewicz demonstrated a
way of writing mathematical expressions unambiguously
without using parentheses in 1951. For example, given the
expression (2 × 3) ÷ (7 − 1), each operator can be written before the corresponding operands: ÷ × 2 3 − 7 1. Or each operator can be written after its operands: 2 3 × 7 1 − ÷. The latter method has come to be known as reverse Polish
notation (RPN) (14). Arithmetic expressions are converted
to RPN before they are processed by computers because RPN simplifies the evaluation of expressions. In a non-RPN expression containing parentheses, some operators cannot be applied until after the parenthesized subexpressions are first evaluated. Reading from left to right in an RPN expression, every time an operator is encountered it can be applied immediately, which means less memory and bookkeeping are required to evaluate RPN expressions. Some calculators allow users to input expressions using RPN. This saves the calculator the step of converting the expression to RPN before processing it. It also means fewer keystrokes for the user, because parentheses are never needed with RPN. Algebraic logic system (ALS) calculators require numbers and operators to be entered in the order they would appear in an algebraic expression. Parentheses are used to delimit subexpressions in ALS.
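A minimal stack-based RPN evaluator, written here as a sketch of the idea rather than any particular calculator's firmware, shows why parentheses and extra bookkeeping are unnecessary: each operator is applied as soon as it is read, to the two most recent values on the stack.

```python
# Sketch of RPN evaluation with a last-in-first-out stack.
def evaluate_rpn(tokens):
    stack = []
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    for token in tokens:
        if token in ops:
            b = stack.pop()          # last in, first out
            a = stack.pop()
            stack.append(ops[token](a, b))
        else:
            stack.append(float(token))
    return stack.pop()

print(evaluate_rpn("2 3 + 4 *".split()))   # (2 + 3) * 4 = 20.0
```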
User Memory
On many calculators, the user can store numbers in special memory locations or storage registers and then perform arithmetic operations on the stored values. This process is called register arithmetic. On RPN calculators, memory locations are arranged in a structure called a stack. For each operation that is performed, the operands are taken from the stack and then the result is returned to the stack. Each time a new number is placed on the stack, the previous items that were on the stack are each advanced one level to make room for the new item. Whenever an item is removed from the stack, the remaining items shift back. A stack is a data structure that is similar to a stack of cafeteria trays, where clean trays are added to the top and, as trays are needed, they are removed from the top of the stack. This scheme for placing and removing items is called last-in-first-out or LIFO.
ALGORITHMS
An algorithm is a precise, finite set of steps that describes a method for a computer (or calculator) to solve a particular problem. Many computer algorithms are designed with knowledge of the underlying hardware resources in mind, so that they can optimize the performance of the computer. Numerical algorithms for calculators take into account the way that numbers are represented in the calculator.
Square-Root Algorithm
A simple approximation method is used by calculators to find square roots. The basic steps in finding y = x^(1/2) are to first guess the value of y, calculate y², and then find r = x − y². If the magnitude of r is small enough, return y as the answer. Otherwise, increase or decrease y (depending on whether r is positive or negative, respectively) and repeat the process. The number of intermediate calculations required can be reduced by avoiding finding y² and x − y² for each value of y. This can be done by first finding the value of the largest place digit of y, then the next largest place digit, and so on. For example, when calculating 54756^(1/2), first find 200, then 30, and then 4 to construct the answer y = 234. This method is similar to a method once taught in schools for finding square roots by hand (15).
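One way to realize the place-digit idea in a short sketch is shown below: the answer is built one decimal place at a time, always choosing the largest digit whose square stays within x. This is a simplified illustration, not the exact register-level routine a calculator would use.

```python
# Square root built one place digit at a time (largest place first).
def digit_by_digit_sqrt(x: float, decimal_places: int = 0) -> float:
    y = 0.0
    place = 10.0 ** 5                        # start high enough for these examples
    smallest = 10.0 ** (-decimal_places)
    while place >= smallest:
        # Take the largest digit d such that (y + d*place)^2 does not exceed x.
        for d in range(9, -1, -1):
            if (y + d * place) ** 2 <= x:
                y += d * place
                break
        place /= 10
    return y

print(digit_by_digit_sqrt(54756))            # -> 234.0 (200, then 30, then 4)
print(digit_by_digit_sqrt(2, 6))             # -> 1.414213...
```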
Trigonometric Function Algorithms
The algorithms for computing trigonometric functions depend on using trigonometric identities and relationships to reduce arbitrarily difficult problems to more manageable problems. First, the input angle θ is converted to an angle in radians, which is between 0 and 2π (or in some calculators, between 0 and π/4). Next θ is expressed as a sum of smaller angles. These smaller angles are chosen to be angles whose tangents are powers of 10: tan⁻¹(1) = 45°, tan⁻¹(0.1), tan⁻¹(0.01), etc. A process called pseudodivision is used to express θ in this way: First tan⁻¹(1) is repeatedly subtracted from θ until an overdraft (or carry) occurs, then the angle being subtracted from is restored to the value it had right before the overdraft occurred, then the process is repeated by subtracting tan⁻¹(0.1) until an overdraft occurs, and so forth, until we are left with a remaining angle r, which is small enough for the required level of accuracy of the calculator. Then θ can be expressed as

θ = q₀ tan⁻¹(1) + q₁ tan⁻¹(0.1) + · · · + r    (1)
Vector geometry is the basis for the formulas used to compute the tangent of θ once it has been broken up into the sum of smaller angles. Starting with a vector at angle θ₁ and then rotating it counterclockwise by an additional angle θ₂, Fig. 7 illustrates the following relationships:

X₂ = X₁ cos θ₂ − Y₁ sin θ₂
Y₂ = Y₁ cos θ₂ + X₁ sin θ₂

Figure 7. A vector (X₁, Y₁) at angle θ₁ rotated by an additional angle θ₂ to (X₂, Y₂).

Dividing both sides of these equations by cos θ₂, we obtain:

X₂ / cos θ₂ = X₁ − Y₁ tan θ₂ ≡ X₂′    (2)
Y₂ / cos θ₂ = Y₁ + X₁ tan θ₂ ≡ Y₂′    (3)

As Y₂/X₂ = tan(θ₁ + θ₂), then by Equations (2) and (3) we can see that Y₂′/X₂′ = tan(θ₁ + θ₂). Equations (2) and (3) can be used repeatedly to construct the tangent of θ, because θ has been broken down into a series of smaller angles, as shown in Equation (1). The initial X₁ and Y₁ correspond to the small residual angle r. As r is a very small angle (in radians), sin(r) is close to r and cos(r) is close to 1, so if these values are close enough for our overall accuracy requirements, we can let Y₁ be r and X₁ be 1. Note that Equations (2) and (3) involve finding tangents, but because we expressed θ as a sum of angles of the form tan⁻¹(10⁻ᵏ), and tan(tan⁻¹(10⁻ᵏ)) = 10⁻ᵏ, each evaluation of Equation (2) or (3) will simply involve addition, subtraction, and multiplication by powers of 10. As the only multiplication involved is by powers of 10, the calculations can be accomplished more quickly and simply using a process called pseudomultiplication, which involves only addition and the shifting of the contents of registers to simulate decimal-point shifts that correspond to multiplication by powers of 10. The iterative process of using Equations (2) and (3) generates an X and Y, which are proportional to the sine and cosine of the original angle θ. Then elementary operations can be used to find the values of the various trigonometric functions for θ (16).
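The following Python sketch (an illustrative reconstruction, not HP firmware; it uses ordinary floating-point multiplication where a calculator would use register-shift pseudomultiplication, and the function name is invented) carries out the pseudodivision of Equation (1) and then applies Equations (2) and (3) to rebuild the tangent:

import math

def pseudo_tan(theta, levels=6):
    """Approximate tan(theta) for 0 <= theta < pi/2 by pseudodivision
    followed by repeated application of Equations (2) and (3).

    Illustrative only: real calculators replace the multiplications
    below by register shifts (pseudomultiplication).
    """
    # Pseudodivision: peel off multiples of atan(1), atan(0.1), ...
    counts = []
    for k in range(levels):
        step = math.atan(10.0 ** (-k))
        q = 0
        while theta >= step:            # subtract until an overdraft would occur
            theta -= step
            q += 1
        counts.append(q)
    r = theta                           # small residual angle

    # Start the vector at the residual angle: X1 ~ 1, Y1 ~ r.
    x, y = 1.0, r
    # Apply Equations (2) and (3) once per accumulated rotation.
    for k, q in enumerate(counts):
        t = 10.0 ** (-k)                # tangent of the rotation angle
        for _ in range(q):
            x, y = x - y * t, y + x * t
    return y / x                        # tan of the original angle

print(pseudo_tan(0.7), math.tan(0.7))   # the two values agree closely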
Logarithm Algorithms
Logarithms are found using a process similar to the approximation process used to compute trigonometric functions (17). It is a basic property of logarithms that ln(a₁a₂⋯aₙ) = ln a₁ + ln a₂ + ⋯ + ln aₙ. To find the logarithm of a number x, x is first expressed as a product of factors whose logarithms are known. The number x will be stored in the calculator using scientific notation, x = M × 10ᵏ, where M is called the mantissa and M is greater than or equal to 1 and less than 10. As ln(M × 10ᵏ) = ln M + k ln 10, the problem of finding ln(x) is reduced to the problem of finding ln(M). Let aⱼ be numbers whose natural logarithms are known. Let P = 1/M. Then ln P = −ln M. Then express P as P = Pₙ/r, where Pₙ = a₀^(k₀) a₁^(k₁) ⋯ aⱼ^(kⱼ) and r is a number close to 1. Note that ln P = ln Pₙ − ln r, so now ln M = ln r − ln Pₙ, and for r close to 1, ln(r) is close to 0. Also note that M = 1/P = r/Pₙ implies that M·Pₙ = r. So to find ln(M), we can first find Pₙ such that M·Pₙ is close to 1, where Pₙ is a product of specially chosen numbers aⱼ whose logarithms are known. To optimize this routine for a calculator's specialized microprocessor, values that give good results are aⱼ = 1 + 10⁻ʲ. Thus, for example, a₀, a₁, a₂, a₃, and a₄ would be 2, 1.1, 1.01, 1.001, and 1.0001. It turns out that M must first be divided by 10 to use these aⱼ choices. This choice of the aⱼ terms allows intermediate multiplications by each aⱼ to be accomplished by an efficient, simple shift of the digits in a register, similar to the pseudomultiplication used in the trigonometric algorithm.
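The following Python sketch (illustrative only, not calculator firmware; ordinary floats stand in for the calculator's registers and the function name is invented) shows how ln(M) can be recovered by multiplying M/10 by factors aⱼ = 1 + 10⁻ʲ until the product is close to 1 and then summing the pre-tabulated logarithms:

import math

def ln_from_factors(M, levels=7):
    """Approximate ln(M) for 1 <= M < 10 using the factoring scheme in the text.

    M is first divided by 10, then repeatedly multiplied by factors
    a_j = 1 + 10**-j while the running product stays at or below 1.
    The answer is assembled from the pre-tabulated values ln(a_j).
    Illustrative sketch only.
    """
    ln_a = [math.log(1.0 + 10.0 ** (-j)) for j in range(levels)]  # known logs
    p = M / 10.0                   # now 0.1 <= p < 1
    ln_Pn = 0.0                    # accumulates ln(P_n)
    for j in range(levels):
        a = 1.0 + 10.0 ** (-j)
        while p * a <= 1.0:        # multiply while we stay at or below 1
            p *= a
            ln_Pn += ln_a[j]
    ln_r = p - 1.0                 # p = r is close to 1, so ln(r) ~ r - 1
    # ln(M) = ln 10 + ln(M/10) = ln 10 + ln(r) - ln(P_n)
    return math.log(10.0) + ln_r - ln_Pn

print(ln_from_factors(2.0), math.log(2.0))   # both are close to 0.6931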
CALCULATOR DESIGN CHOICES AND CHALLENGES
The requirements for a handheld calculator to be small,
portable, inexpensive, and dedicated to performing compu-
tational tasks have driven many design choices. Custom
integrated circuits and the CMOS process have been used
because of low power requirements. Calculator software
has been designed to use mostly ROM and very little RAM
because of part cost and power constraints. Specialized
algorithms have been developed and refined to be optimized
for calculator CPUs. As calculators become more compli-
cated, ease-of-use becomes an important design challenge.
As memory becomes less expensive and calculators have
more storage space, the keypad and display become bottle-
necks when it comes to transferring large amounts of data.
Improved input/output devices such as pen input, better
displays, and character and voice recognition could all help
to alleviate bottlenecks and make calculators easier to use.
A desktop PC does not fit the needs of personal port-
ability. Desktop or laptop PCs are not very convenient to
use as a calculator for quick calculations. Also, a PC is a
generic platform rather than a dedicated appliance. The
user must take the time to start up an application to perform calculations on a PC, so a PC does not have the back-of-the-envelope type of immediacy of a calculator. Handheld PCs and palmtop PCs also tend to be generic platforms, only
in smaller packages. So they are as portable as calculators,
but they still do not have dedicated calculating function-
ality. The user must go out of their way to select and run a
calculator application on a handheld PC. The keypad of a
handheld PC has a QWERTY keyboard layout, and so it
does not have keys dedicated to calculator functions like
sine, cosine, and logarithms. Handheld organizers and
personal digital assistants (PDAs) are closer to the calcu-
lator model, because they are personal, portable, battery-
operated electronic devices that are dedicated to particular
functionality, but theycurrently emphasize organizer func-
tionality rather than mathematics functionality.
COMMUNICATION CAPABILITY
Some calculators have already had communication cap-
ability for many years, using infrared as well as serial cable
and other types of cable ports. These have allowed calculators to communicate with other calculators, computers, printers, overhead display devices (which allow an image of the calculator screen to be enlarged and projected for a roomful of people), data collection devices, bar code readers, external memory storage, and other peripheral devices.
Protocols are standard formats for the exchange of electro-
nic data that allow different types of devices to commu-
nicate with each other. For example, Kermit is a file
transfer protocol developed at Columbia University. By
coding this protocol into a calculator, the calculator can
communicate with any of a number of different types of
computers by running a Kermit program on the computer.
By programming appropriate protocols into calculators,
they could work with modems and gain access to the
Internet. Calculations could then be performed remotely
on more powerful computers, and answers could be
sent back to the calculator. Or calculators could be used
for the delivery of curriculum material and lessons over
the Internet.
TECHNOLOGY IN EDUCATION
Curriculum materials have changed with the increased use
of graphing calculators in mathematics and science class-
rooms. Many precalculus and calculus textbooks and
science workbooks now contain exercises that incorporate
the use of calculators, which allows the exercise to be more
complicated than the types of problems that could be easily
solved with pencil and paper in a few minutes. With the
use of calculators, more realistic and thus more interesting
and extensive problems can be used to teach mathematics
and science concepts. Calculators are becoming a require-
ment in many mathematics classes and on some standar-
dized tests, such as the Scholastic Aptitude Test taken by
most U.S. high-school seniors who plan to attend college.
Educational policy has, in turn, influenced the design of
graphing calculators. In the United States, the National
Council of Teachers of Mathematics promoted the use of the symbolic, graphic, and numeric views for teaching mathematics. These views are reflected in the design of graphing
calculators, which have keys dedicated to entering a sym-
bolic expression, graphing it, and showing a table of func-
tion values. Figure 8 shows a graphing calculator display of
the symbolic, graphic, and numeric views of sin(x) (18).
Figure 8. Symbolic, graphic, and numeric views of sin(x) on a graphing calculator display.
FUTURE NEED FOR CALCULATORS
Technical students and professionals will always need to
do some back-of-the-envelope-type calculations quickly
and conveniently. The key-per-function model of a calculator fits in nicely with this need. So does a device that is dedicated, personal, portable, and low cost, and has long battery life. Users' expectations will be influenced by improve-
ments in computer speed and memory size. Also, video
game users have higher expectations for interactivity,
better controls, color, animation, quick responses, good
graphic design, and visual quality. For the future, calcu-
lators can take advantage of advances in computer tech-
nology and the decreasing cost of electronic components to
move to modern platforms that have the benefits of
increased speed, more memory, better displays, color dis-
plays, more versatile input devices (such as pen and voice),
and more extensive communication capability (such as
wireless communication).
BIBLIOGRAPHY
1. A. Ralston and E. D. Reilly, Jr. (eds.), Encyclopedia of Computer Science and Engineering, 2nd ed. New York: Van Nostrand Reinhold, 1983.
2. Jones Telecommunications and Multimedia Encyclopedia, Jones Digital Century. Available: http://www.digitalcentury.com.
3. G. C. Beakley and R. E. Lovell, Computation, Calculators, and Computers. New York: Macmillan, 1983.
4. T. W. Beers, D. K. Byrne, G. L. Eisenstein, R. W. Jones, and P. J. Megowan, HP 48SX interfaces and applications, Hewlett-Packard J., 42(3): 13–21, 1991.
5. P. D. Brown, G. J. May, and M. Shyam, Electronic design of an advanced technical handheld calculator, Hewlett-Packard J., 38(8): 34–39, 1987.
6. M. A. Smith, L. S. Moore, P. D. Brown, J. P. Dickie, D. L. Smith, T. B. Lindberg, and M. J. Muranami, Hardware design of the HP 48SX scientific expandable calculator, Hewlett-Packard J., 42(3): 25–34, 1991.
7. C. Maze, The first HP liquid crystal display, Hewlett-Packard J., 31(3): 22–24, 1980.
8. T. Lindberg, Packaging the HP-71B handheld computer, Hewlett-Packard J., 35(7): 17–20, 1984.
9. B. R. Hauge, R. E. Dunlap, C. D. Hoekstra, C. N. Kwee, and P. R. Van Loan, A multichip hybrid printed circuit board for advanced handheld calculators, Hewlett-Packard J., 38(8): 25–30, 1987.
10. D. E. Hackleman, N. L. Johnson, C. S. Lage, J. J. Vietor, and R. L. Tillman, CMOSC: Low-power technology for personal computers, Hewlett-Packard J., 34(1): 23–28, 1983.
11. J. L. Peterson and A. Silberschatz, Operating System Concepts, 2nd ed. Reading, MA: Addison-Wesley, 1985.
12. D. K. Byrne, C. M. Patton, D. Arnett, T. W. Beers, and P. J. McClellan, An advanced scientific graphing calculator, Hewlett-Packard J., 45(4): 6–22, 1994.
13. N. R. Scott, Computer Number Systems and Arithmetic. Englewood Cliffs, NJ: Prentice-Hall, 1985.
14. T. M. Whitney, F. Rode, and C. C. Tung, The powerful pocketful: An electronic calculator challenges the slide rule, Hewlett-Packard J., 23(10): 2–9, 1972.
15. W. E. Egbert, Personal calculator algorithms I: Square roots, Hewlett-Packard J., 28(9): 22–23, 1977.
16. W. E. Egbert, Personal calculator algorithms II: Trigonometric functions, Hewlett-Packard J., 28(10): 17–20, 1977.
17. W. E. Egbert, Personal calculator algorithms IV: Logarithmic functions, Hewlett-Packard J., 29(8): 29–32, 1978.
18. T. W. Beers, D. K. Byrne, J. A. Donnelly, R. W. Jones, and F. Yuan, A graphing calculator for mathematics and science classes, Hewlett-Packard J., 47(3): 1996.
DIANA K. BYRNE
Corvallis, Oregon
FAULT-TOLERANT COMPUTING
INTRODUCTION
Fault-tolerant computing can be defined as the process by which a computing system continues to perform its specified tasks correctly in the presence of faults. These faults could be transient, permanent, or intermittent faults. A fault is said to be transient if it occurs for a very short duration of time. Permanent faults are the faults that continue to exist in the system, and intermittent faults repeatedly appear for a period of time. They could be either hardware or software faults caused by errors in specification, design, or implementation, or faults caused by manufacturing defects. They also could be caused by external disturbances or simply from the aging of the components.
The goal of fault-tolerant computing is to improve the dependability of a system, where dependability can be defined as the ability of a system to deliver service at an acceptable level of confidence. Among all the attributes of dependability, reliability, availability, fault coverage, and safety are commonly used to measure the dependability of a system. Reliability of a system is defined as the probability that the system performs its tasks correctly throughout a given time interval. Availability of a system is the probability that the system is available to perform its tasks correctly at time t. Fault coverage, in general, implies the ability of the system to recover from faults and to continue to operate correctly given that it still contains a sufficient complement of functional hardware and software. Finally, the safety of a system is the probability that the system either will operate correctly or switch to a safe mode if it operates incorrectly in the event of a fault.
The concept of improving dependability of computing
systems by incorporating strategies to tolerate faults is not
new. Earlier computers, such as the Bell Relay Computer
built in 1944, used ad hoc techniques to improve reliability.
The first systematic approach for fault-tolerant design to improve system reliability was published by von Neumann
in 1956. Earlier designers improved the computer system
reliability by using fault-tolerance techniques to compen-
sate for the unreliable components that included vacuum
tubes and electromechanical relays that had high propen-
sity to fail. With the advent of more reliable transistor
technology, the focus shifted from improving reliability
to improving performance. Although component reliability
has drastically improved over the past 40 years, increases
in device densities have led to the realization of complex
computer systems that are more prone to failures. Further-
more, these systems are becoming more pervasive in all
areas of daily life, from medical life support systems where
the effect of a fault could be catastrophic to applications
where the computer systems are exposed to harsh environ-
ments, thereby increasing their failure rates. Researchers
in industry and academe have proposed techniques to
reduce the number of faults. However, it has been recog-
nizedthat suchanapproachis not sufcient andthat a shift
in design paradigm to incorporate strategies to tolerate
faults is necessary to improve dependability. Moreover,
advances in technologies, such as very large-scale integra-
tion (VLSI), have led to manufacturing processes that are
not only complex but also sensitive, which results in lower
yields. Thus, it is prudent to develop novel fault-tolerance
techniques for yield enhancement.
Research in this area in the last 50 years has resulted in
numerous highly dependable systems that span a wide
range of applications from commercial transactions sys-
tems and process control to medical systems and space
applications. However, the ever increasing complexity of
computer systems and the ever increasing need for error-
free computation results, together with advances in
technologies, continue to create interesting opportunities
and unique challenges in fault-tolerant computing.
PRINCIPLES OF FAULT-TOLERANT COMPUTER SYSTEMS
Although the goal of fault-tolerant computer design is to
improve the dependability of the system, it should be noted
that a fault-tolerant system does not necessarily imply a
highly dependable system and that a highly dependable
system does not necessarily imply a fault-tolerant system.
Any fault-tolerance technique requires the use of some
form of redundancy, which increases both the cost and the
development time. Furthermore, redundancy also can
impact the performance, power consumption, weight,
and size of the system. Thus, a good fault-tolerant design is a trade-off between the level of dependability provided and the amount of redundancy used. The redundancy could be in the form of hardware, software, information, or temporal redundancy.
One of the most challenging tasks in the design of a
fault-tolerant system is to evaluate the design for the
dependability requirements. All the evaluation methods
can be divided into two main groups, namely the qualitative and the quantitative methods. Qualitative techniques, which typically are subjective, are used when certain parameters that are used in the design process cannot be quantified. For example, a typical user should not be expected to know about fault-tolerance techniques used in the system to use the system effectively. Thus, the level of transparency of the fault-tolerance characteristics of the system to the user could determine the effectiveness of the system.
As the name implies, in the quantitative methods, numbers are derived for a certain dependability attribute of the system, and different systems can be compared with respect to this attribute by comparing the numerical values. This method requires the development of models. Several probabilistic models have been developed based on combinatorial techniques and Markov and semi-Markov stochastic processes. Some disadvantages of the combinatorial models are that it is very difficult to model complex systems and difficult to model systems where repairing a faulty module is feasible.
Fundamental to all these models is the failure rate, defined as the expected number of failures over a given time interval. Failure rates of most electronic devices follow a bathtub curve, shown in Fig. 1. Usually, the useful life phase is the most important phase in the life of a system, and during this time, the failure rate is assumed to be constant and denoted typically by λ. During this phase, the reliability of a system and the failure rate are related by the following equation, which is known as the exponential failure law:

R(t) = e^(−λt)

Figure 2 shows the reliability function with the constant failure rate. It should be noted that the failure rate λ is related to the mean time to failure (MTTF) as MTTF = 1/λ, where MTTF is defined as the expected time until the occurrence of the first failure of the system.
Researchers have developed computer-aided design (CAD) tools based on quantitative methods, such as CARE III (the NASA Langley Research Center), SHARPE (Duke University), DEPEND (University of Illinois at Urbana-Champaign), and HIMAP (Iowa State University).
In the last several years, a considerable interest has developed in the experimental analysis of computer system dependability. In the design phase, CAD tools are used to evaluate the design by extensive simulations that include simulated fault injection; during the prototyping phase, the system is allowed to run under a controlled environment. During this time, physical fault injection is used for evaluation purposes. Several CAD tools are available, including FTAPE, developed at the University of Illinois at Urbana-Champaign, and Ballista, developed at Carnegie Mellon University. In the following subsections, several common redundancy strategies for fault tolerance will be described briefly.
Hardware Redundancy
Hardware redundancy is one of the most common forms of redundancy, in which a computer system is partitioned into different subsystems or modules, and these are replicated so that a faulty module can be replaced by a fault-free module. Because each subsystem acts as a fault-containment region, the size of the fault-containment region depends on the partitioning strategy.
Three basic types of hardware redundancy are used, namely, passive, active, and hybrid. In the passive techniques, also known as static techniques, the faults are masked or hidden by preventing them from producing errors. Fault masking is accomplished by voting mechanisms, typically majority voting, and the technique does not require any fault detection or system reconfiguration. The most common form of this type of redundancy is triple modular redundancy (TMR). As shown in Fig. 3, three identical modules are used, and majority voting is done to determine the output. Such a system can tolerate one fault. However, the voter is the single point of failure; that is, the failure of the voter leads to the complete failure of the system.
A generalization of the TMR technique is N-modular redundancy (NMR), where N modules are used and N typically is an odd number. Such a system can mask up to ⌊(N − 1)/2⌋ faulty modules.
Figure 1. Bathtub curve: failure rate versus time, showing the infant mortality, useful life, and wear-out phases.

Figure 2. Reliability function with constant failure rate.

Figure 3. Basic triple modular redundancy.

In general, the NMR system is a special case of an M-of-N system that consists of N modules, out of which at least M of them should work correctly in order for the system to operate correctly. Thus, the system reliability of an M-of-N
system is given by

R(t) = Σ(i=M to N) (N choose i) Rⁱ(t) [1 − R(t)]^(N−i)

For a TMR system, N = 3 and M = 2, and if a nonideal voter is assumed (R_voter < 1), then the system reliability is given by

R(t) = { Σ(i=2 to 3) (3 choose i) Rⁱ(t) [1 − R(t)]^(3−i) } R_voter(t) = [3R²(t) − 2R³(t)] R_voter(t)
Figure 4 shows reliability plots of simplex, TMR, and NMR systems with N = 5 and 7. It can be seen that higher redundancy leads to higher reliability in the beginning. However, the system reliability falls sharply at the end for higher redundancies.
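As a quick numerical check of these formulas, the short Python sketch below (illustrative only; math.comb supplies the binomial coefficient) evaluates the M-of-N reliability for simplex, TMR, and 5MR configurations under the exponential failure law:

from math import comb, exp

def m_of_n_reliability(R, N, M, R_voter=1.0):
    """Reliability of an M-of-N system with module reliability R.

    Sums the probability that at least M of the N modules work,
    then multiplies by the (possibly nonideal) voter reliability.
    """
    core = sum(comb(N, i) * R**i * (1.0 - R)**(N - i) for i in range(M, N + 1))
    return core * R_voter

lam, t = 0.1, 5.0                  # assumed failure rate (per hour) and mission time
R = exp(-lam * t)                  # exponential failure law for one module
print(R)                           # simplex reliability
print(m_of_n_reliability(R, 3, 2)) # TMR with ideal voter: 3R^2 - 2R^3
print(m_of_n_reliability(R, 5, 3)) # 5MR with ideal voter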
In the active hardware redundancy, also known as dynamic hardware redundancy, some form of reconfiguration process is performed to isolate the fault after it
is detected and located. This process typically is followed by
a recovery process to undo the effects of erroneous results.
Unlike passive redundancy, faults can produce errors that
are used to detect faults in dynamic redundancy. Thus, this
technique is used for applications where consequences of
erroneous results are not disastrous as long as the system
begins to operate correctly within an acceptable time
period. In general, active redundancy requires less redun-
dancy than passive techniques. Thus, this approach is
attractive for environments where the resources are lim-
ited, such as limited power and space. One major disad-
vantage of this approach is the introduction of considerable
delays during the fault location, reconfiguration, and recovery processes. Some active hardware redundancy schemes include duplication with comparison, the pair-and-a-spare technique, and standby sparing.
The hybrid hardware redundancy approach incorpo-
rates the desirable features of both the passive and active
strategies. Fault masking is used to prevent the generation
of erroneous results, and dynamic techniques are used
when fault masking fails to prevent the production of
erroneous results. Hybrid approaches usually are most
expensive in terms of hardware. Examples of hybrid hard-
ware redundancy schemes include self-purging redundancy, N-modular redundancy with spares, triple-duplex, sift-out modular redundancy, and the triple-triplex architecture used in the Boeing 777 primary flight computer.
Figure 5 shows a TMR system with a hot spare. The first fault is masked, and if this faulty module is replaced by the spare module, then the second fault also can be masked. Note that for a static system, masking two faults requires five modules. The reliability of the TMR system with a hot spare can be determined based on the Markov model shown in Fig. 6. Such models assume that (1) the system starts in the perfect state, (2) a single mode of failure exists, that is, only a single failure occurs at a time, and (3) each module follows the exponential failure law with a constant failure rate λ. A transition probability is associated with each state transition. It can be shown that if the module was operational at time t, then the probability that the module has failed at time t + Δt is

1 − e^(−λΔt)

For λΔt ≪ 1,

1 − e^(−λΔt) ≈ λΔt

Figure 4. Reliability (from 0 to 1.0) versus time for NMR systems with N = 1, 3, 5, and 7. Note that N = 1 implies a simplex system.

Figure 5. TMR system with hot spare.

Figure 6. The Markov model of the TMR system with a hot spare.
The equations of the Markov model of the TMR system with a hot spare can be written in the matrix form as

P₁(t+Δt)      1−4λΔt        0           0             0        0      P₁(t)
P₂(t+Δt)      (3C+1)λΔt   1−3λΔt        0             0        0      P₂(t)
P₃(t+Δt)   =  3(1−C)λΔt     0         1−3λΔt          0        0   ×  P₃(t)
P₄(t+Δt)         0        3λΔt       (2C+1)λΔt     1−2λΔt      0      P₄(t)
P_F(t+Δt)        0          0        2(1−C)λΔt       2λΔt      1      P_F(t)

The above discrete-time equations can be written in a compact form as

P(t + Δt) = A P(t)

and the solution as

P(nΔt) = Aⁿ P(0)

where

P(0) = [1, 0, 0, 0, 0]ᵀ
It can be seen that the system reliability R(t) is given by

R(t) = 1 − P_F(t) = Σ(i=1 to 4) Pᵢ(t)

where P_F(t) is the probability that the system has failed.
The continuous-time equations can be derived from the above equations by letting Δt approach zero; that is,

P₁′(t) = lim(Δt→0) [P₁(t + Δt) − P₁(t)] / Δt = −4λP₁(t)

Similarly,

P₂′(t) = (3C + 1)λP₁(t) − 3λP₂(t)
P₃′(t) = 3(1 − C)λP₁(t) − 3λP₃(t)
P₄′(t) = 3λP₂(t) + (2C + 1)λP₃(t) − 2λP₄(t)
P_F′(t) = 2(1 − C)λP₃(t) + 2λP₄(t)
Using Laplace transforms,

sP₁(s) − P₁(0) = −4λP₁(s)
P₁(s) = 1 / (s + 4λ)  ⇒  P₁(t) = e^(−4λt)

sP₂(s) − P₂(0) = (3C + 1)λP₁(s) − 3λP₂(s)
P₂(s) = (3C + 1)λ / [(s + 3λ)(s + 4λ)] = (3C + 1) [1/(s + 3λ) − 1/(s + 4λ)]
P₂(t) = (3C + 1) [e^(−3λt) − e^(−4λt)]

Proceeding in the same way for the remaining states gives

P₃(t) = 3(1 − C) [e^(−3λt) − e^(−4λt)]
P₄(t) = 3(2C + 1 − C²) [e^(−4λt) + e^(−2λt) − 2e^(−3λt)]

R(t) = 1 − P_F(t) = Σ(i=1 to 4) Pᵢ(t)
It can be seen that when the failure rate λ is 0.1 failures/hour and the fault coverage is perfect (C = 1), the reliability of the TMR system with a hot spare is greater than that of the static TMR system after one hour. Moreover, it can
be noted that the system reliability can be improved by
modifying the system so that it switches to simplex mode
after the second failure.
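To sanity-check the expressions above, the following Python sketch (illustrative; it uses NumPy only for the matrix arithmetic, and the state ordering and rates follow the transition matrix given earlier) steps the discrete-time Markov model forward for one hour with λ = 0.1 and C = 1 and compares the result against the static TMR formula:

import numpy as np

lam, C, dt, hours = 0.1, 1.0, 0.001, 1.0    # failure rate per hour, coverage

# One-step transition matrix A for the TMR-with-hot-spare Markov model.
A = np.array([
    [1 - 4*lam*dt,       0,               0,                0,             0],
    [(3*C + 1)*lam*dt,   1 - 3*lam*dt,    0,                0,             0],
    [3*(1 - C)*lam*dt,   0,               1 - 3*lam*dt,     0,             0],
    [0,                  3*lam*dt,        (2*C + 1)*lam*dt, 1 - 2*lam*dt,  0],
    [0,                  0,               2*(1 - C)*lam*dt, 2*lam*dt,      1],
])

p = np.array([1.0, 0, 0, 0, 0])             # start in the all-good state
for _ in range(int(hours / dt)):
    p = A @ p                                # P(t + dt) = A P(t)

R_spare = 1.0 - p[4]                         # reliability = 1 - P_F(t)
R_module = np.exp(-lam * hours)              # exponential failure law
R_tmr = 3*R_module**2 - 2*R_module**3        # static TMR for comparison
print(R_spare, R_tmr)                        # the hot-spare system is higher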
Software Redundancy
Unlike hardware faults, software faults are caused by mis-
takes in software design and implementation. Thus, a soft-
ware fault in two identical modules cannot be detected by
usual comparison. Many software fault-tolerance techni-
ques have used approaches similar to that of the hardware
redundancy. Most of these schemes assume fault indepen-
dence and full fault coverage and fall under the software
replication techniques, including N-version programming,
recovery blocks, and N self-checking programming.
The basic idea of N-version programming is to design
and code N different software modules; the voting is per-
formed, typically majority voting, on the results produced
by these modules. Each module is designed and implemen-
ted by a separate group of software engineers, based on the
same specications. Because the design and implementa-
tion of each module is done by an independent group, it is
expected that a mistake made by a group will not be
repeated by another group. This scheme complements N-
modular hardware redundancy. A TMR system where all
three processors are running the same copy of the software
module will be rendered useless if a software fault will
affect all the processors in the same way. N-version pro-
gramming can avoid the occurrence of such a situation.
This scheme is prone to faults from specification mistakes. Thus, a proven methodology should be employed to verify the correctness of the specifications.
In an ultra-reliable system, each version should be made as diverse as possible. Differences may include software developers, specifications, algorithms, programming languages, and verification and validation processes. Usually
it is assumed that such a strategy would make different
versions fail independently. However, studies have shown
that although generally this is true, instances have been
reported where the failures were correlated. For example,
certain types of inputs, such as division by zero, that need
special handling and potentially are overlooked easily by
different developers could lead to correlated failures.
In the recovery block schemes, one version of the pro-
gram is considered to be the primary version and the
remaining N-1 versions are designated as the secondary
versions or spares. The primary version normally is used if
it passes the acceptance tests. Failure to pass the acceptance tests prompts the use of the first secondary version. This process is continued until one version is found that passes the acceptance tests. Failure of the acceptance tests by all the versions indicates failure of the system. This approach is analogous to the cold standby sparing approach
of the hardware redundancy, and it can tolerate up to N-1
faults.
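A minimal Python sketch of the recovery-block idea follows (illustrative only; the square-root task and its acceptance test are invented for the example). The primary version runs first, and alternates are tried in turn until one passes the acceptance test:

import math

def recovery_block(versions, acceptance_test, x):
    """Run alternate versions in order until one passes the acceptance test.

    Raises RuntimeError if every version fails, which corresponds to
    failure of the whole recovery block.
    """
    for run in versions:                 # primary first, then the spares
        try:
            y = run(x)
        except Exception:
            continue                     # a crash counts as a failed version
        if acceptance_test(x, y):
            return y
    raise RuntimeError("all versions failed the acceptance test")

# Example task: compute a square root two independent ways.
primary   = lambda x: x ** 0.5
secondary = lambda x: math.sqrt(x)
accept    = lambda x, y: abs(y * y - x) < 1e-9 * max(1.0, x)

print(recovery_block([primary, secondary], accept, 2.0))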
In N self-checking programming, N different versions of
the software, together with their acceptance tests, are
written. The output of each program, together with its
acceptance tests results, are input to a selection logic
that selects the output of the program that has passed
the acceptance tests.
Information Redundancy
Hardware redundancy is very effective but expensive. In
information redundancy, redundant information is added to enable fault detection, and sometimes fault tolerance, by correcting the affected information. For certain systems, such as memory and bus, error-correcting codes are cost effective and efficient. Information redundancy includes all
error-detecting codes, such as various parity codes, m-of-n
codes, duplication codes, checksums, cyclic codes, arith-
metic codes, Berger codes, and Hamming error-correcting
codes.
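As a small illustration of information redundancy, the Python sketch below implements a generic single-bit even-parity check (a simple example, not tied to any particular code named above): a parity bit is appended when the word is stored and verified when it is read back:

def add_even_parity(bits):
    """Append a parity bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def check_even_parity(word):
    """Return True if no single-bit error is detected."""
    return sum(word) % 2 == 0

word = add_even_parity([1, 0, 1, 1, 0, 1, 0])
print(check_even_parity(word))        # True: the stored word is consistent

word[3] ^= 1                          # flip one bit to model a fault
print(check_even_parity(word))        # False: the single-bit error is caught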
Time Redundancy
All the previous redundancy methods require a substantial
amount of extra hardware, which tends to increase size,
weight, cost, and power consumption of the system. Time
redundancy attempts to reduce the hardware overhead by
using extra time. The basic concept is to repeat execution of
a software module to detect faults. The computation is
repeated more than once, and the results are compared.
The existence of a discrepancy suggests a fault, and the com-
putations are repeated once more to determine if the dis-
crepancy still exists. Earlier, this scheme was used for
transient fault detection. However, with a bit of extra
hardware, permanent faults also can be detected. In one
of the schemes, the computation or transmission of data is
done in the usual way. In the second round of computation
or transmission, the data are encoded and the results are
decoded before comparing them with the previous results.
The encoding scheme should be such that it enables the
detection of the faults. Such schemes may include comple-
mentation and arithmetic shifts.
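The Python sketch below illustrates the recomputation idea with a simple encode/decode round (a generic recompute-with-shifted-operands check, offered as one possible instance of the arithmetic-shift encoding mentioned above; the stuck-bit adder is simulated purely for the example):

def faulty_add(a, b):
    """Adder with a simulated permanent fault: bit 2 of the result stuck at 0."""
    return (a + b) & ~0b100

def checked_add(a, b, add=faulty_add):
    """Add twice: once normally, once with operands shifted left one bit.

    A permanent fault at a fixed physical bit position corrupts different
    logical bits in the two runs, so the decoded results disagree and the
    fault is detected.
    """
    plain = add(a, b)
    recomputed = add(a << 1, b << 1) >> 1   # encode, compute, decode
    return plain, recomputed, plain == recomputed

print(checked_add(3, 2))   # results disagree, so the stuck bit is detected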
Coverage
The coverage probability C used in the earlier reliability-
modeling example was assumed to be a constant. In reality, of course, it may well vary, depending on the circumstances.
In that example (TMR with a spare), the coverage prob-
ability is basically the probability that the voting and
switching mechanism is functional and operates in a timely
manner. This probability in actuality may be dependent on
the nature of the fault, the time that it occurs, and whether
it is the first or the second fault.
In systems in which the failure rate is dominated by the
reliability of the hardware modules being protected
through redundancy, the coverage probability may be of
relatively minor importance and treating it as a constant
may be adequate. If extremely high reliability is the goal,
however, it is easy to provide sufficient redundancy to guarantee that the probability of a system failure from the exhaustion of all hardware modules is arbitrarily small. In this case,
failures from imperfect coverage begin to dominate and
more sophisticated models are needed to account for this
fact.
It is useful to think of fault coverage as consisting of
three components: the probability that the fault is detected,
the probability that it is isolated to the defective module,
and the probability that normal operation can be resumed
successfully on the remaining healthy modules. These
probabilities often are time dependent, particularly in
cases in which uninterrupted operation is required or in
which loss of data or program continuity cannot be toler-
ated. If too much time elapses before a fault is detected, for
example, erroneous data may be released with potentially
unacceptable consequences. Similarly, if a database is cor-
rupted before the fault is isolated, it may be impossible to
resume normal operation.
Markov models of the sort previously described can be
extended to account for coverage effects by inserting addi-
tional states. After a fault, the systemtransitions to a state
in which a fault has occurred but has not yet been detected,
from there to a state in which it has been detected but not
yet isolated, from there to a state in which it has been
isolated but normal operation has not yet resumed success-
fully, and nally to a successful recovery state. Transitions
out of each of those states can be either to the next state in
the chain or to a failed state, with the probability of each
of these transitions most likelydependent onthe time spent
in the state in question. State transition models in which
the transition rates are dependent on the state dwelling-
time are called semi-Markov processes. These models can
be analyzed using techniques similar to those illustrated in
the earlier Markov analysis, but the analysis is complicated
by the fact that differences between the transition rates
that pertain to hardware or software faults and those that
are associated with coverage faults can be vastly different.
The mean time between hardware module failures, for
example, typically is on the order of thousands of hours,
whereas the time between the occurrence of a fault and its
detection may well be on the order of microseconds. The
resulting state-transition matrix is called stiff, and any
attempt to solve it numerically may not converge. For this
reason, some of the aforementioned reliability models use a
technique referred to as behavioral decomposition. Specifically, the model is broken down into its component parts, with one subset of models used to account for coverage failures and a second incorporating the outputs of those models into one accounting for hardware and software faults.
FAULT-TOLERANT COMPUTER APPLICATIONS
Incorporation of fault tolerance in computers has tremen-
dous impact on the design philosophy. A good design is a
trade-off between the cost of incorporating fault tolerance
and the cost of errors, including losses from downtime and
the cost of erroneous results.
Over the past 50 years, numerous fault-tolerant compu-
ters have been developed. Historically, the applications of
the fault-tolerant computers were conned to military,
communications, industrial, and aerospace applications
where the failure of a computer could result in substantial
financial losses and possibly loss of life. Most of the fault-tolerant computers developed can be classified into five general categories.
General-Purpose Computing
General-purpose computers include workstations and per-
sonal computers. These computers require a minimum
level of fault tolerance because errors that disrupt the
processing for a short period of time are acceptable, pro-
vided the system recovers from these errors. The goal is to
reduce the frequency of such outages and to increase relia-
bility, availability, and maintainability of the system.
Because the attributes of each of the main components,
namely the processor, memory, and input/output, are
unique, the fault-tolerant strategies employed for these
components are different, in general. For example, the
main memory may use a double-error detecting code on
data and on address and control information, whereas
cache may use parity on both data and address information.
Similarly, the processor may use parity on data paths and
control store and duplication and comparison of control
logic, whereas input/output could use parity on data and
control information.
Moreover, it is well known that as feature sizes decrease,
the systems are more prone to transient failures than other
failures and it is desirable to facilitate rapid recovery. One
of the most effective approaches for rapid error recovery for
transient faults is instruction retry, in which the appro-
priate instruction is retried once the error is detected. The
instruction retry concept has been extended to a wide
variety of other systems, including TMR systems and
Very Large Instruction Word (VLIW) architectures.
Long-Life Applications
Applications in which manual intervention is impractical
or impossible, such as unmanned space missions, demand
systems that have a high probability of surviving unat-
tended for long periods of time. Typical specifications require a 95% or better probability that the system is still functional at the end of a five- or ten-year mission. Com-
puters in these environments are expected to use the hard-
ware in an efficient way because of limited power availability and constraints on weight and size. Long-life
systems sometimes can allow extended outages, provided
the system becomes operational again. The first initiative
for the development of such computers was taken by NASA
for the Orbiting Astronomical Observatory, in which basic
fault masking at the transistor level was used. Classic
examples of systems for long-life applications include the
Self-Testing And Repairing (STAR) computer, the Fault-
Tolerant Space-borne Computer (FTSC), the JPL Voyager
computers, and the Fault-Tolerant Building Block Compu-
ter (FTBBC).
The STAR computer was the first computer to use dynamic recovery strategies, with extensive instrumentation for fault detection and error signaling. The Galileo Jupiter system is a distributed system based on 19 microprocessors with 320 kilobytes of ROM. Block redundancy is used for all the subsystems, and they operate as standby pairs, except the command and data subsystem, which operates as an active pair. Although special fault protection software is used to minimize the effects of faults, fault identification and isolation are performed by ground intervention.
Critical-Computation Applications
In these applications, errors in computations can be cata-
strophic, affecting such things as human safety. Applica-
tions include certain types of industrial controllers, aircraft flight control systems, and military applications. A
typical requirement for such a system is to have a reliability of 0.9999999 at the end of a 3-hour time period.
An obvious example of a critical-computation applica-
tion is the space shuttle. Its computational core employs
software voting in a five-computer system. The system is responsible for guidance, navigation, and preflight check-
outs. Four computers are used as a redundant set during
mission-critical phases, and the fifth computer performs noncritical tasks and also acts as a backup for the primary
four-computer system. The outputs of the four computers
are voted at actuators while each computer compares its
output with the outputs of the other three computers. A disagreement is voted on by the redundancy-management circuitry of each computer, which will isolate its own computer if the vote is positive. Up to two computer failures can be tolerated in the voting mode of operation. A second failure will force the system to function as a duplex system that uses comparison and self-tests, and it can tolerate one more failure in this mode. The fifth computer contains the flight
software package developed by another vendor to avoid common software errors.
Another interesting system that was developed recently is the flight control computer system of the Boeing 777. In avionics, system design diversity is an important design criterion to tolerate common-mode failures, such as design flaws. One of the specifications calls for a mean time between maintenance actions of 25,000 hours. The system can tolerate certain Byzantine faults, such as disagreement between the replicated units, power and component failures, and electromagnetic interference. The system uses triple-triplex redundancy, in which design diversity is incorporated by using three dissimilar processors in each triplex node. Each triplex node is isolated physically and electrically from the other two nodes. However, recent studies suggest that a single point of failure exists in the software system. Instead of using different teams to generate different versions of the source code in Ada, Boeing decided to use the same source code but three different Ada compilers to generate the object code.
High-Availability Applications
The applications under this category include time-shared
systems, such as banking and reservation systems, where
providing prompt services to the users is critical. In these
applications, although system-wide outages are unaccep-
table, occasional loss of service to individual users is
acceptable provided the service is restored quickly so
that the downtime is minimized. In fact, sometimes the
downtime needed to update software and upgrade hard-
ware may not be acceptable, and so well-coordinated
approaches are used to minimize the impact on the users.
Some examples of early highly available commercial sys-
tems include the NonStop Cyclone, Himalaya K10000,
and Integrity S2, all products of Tandem Computers, now part of Hewlett-Packard; the Series 300 and Series
400 systems manufactured by Sequoia Systems, now
owned by Radisys Corporation; and the Stratus XA/R
Series 300 systems of the Stratus family.
Figure 7 shows the basic Tandem NonStop Cyclone
system. It is a loosely coupled, shared-bus multiprocessor
system that was designed to prevent a single hardware or software failure from disabling the entire system.

Figure 7. Tandem NonStop Cyclone system: Cyclone processors, each with dual I/O subsystems, interconnected by the Dynabus.

The processors are independent and use Tandem's GUARDIAN 90 fault-tolerant operating system with load-balancing
capabilities. The NonStop Cyclone system employs both
hardware and software fault-tolerant strategies. By exten-
sively using hardware redundancy in all the subsystems (multiple processors, mirrored disks, multiple power supplies, redundant buses, and redundant input/output controllers), a single component failure will not disable the system. Error propagation is minimized by making sure
that fault detection, location, and isolation is done quickly.
Furthermore, Tandem has incorporated other features that
also improve the dependability of the system. For example,
the system supports on-site replaceable, hot-pluggable
cards known as field replaceable units (FRUs) to minimize
any downtime.
In the newer S-series NonStop Himalaya servers (Figure 8), the processor and I/O buses are replaced by a proprietary network architecture called ServerNet that incorporates numerous data integrity schemes. Further-
more, redundant power and cooling fans are provided to
ensure continuity in service. A recent survey suggests that
more than half of the credit card and automated teller
machine transactions are processed by NonStop Himalaya
computers.
Stratus Technologies, Inc. has introduced fault-tolerant
servers with the goal of providing high availability at lower complexity and cost. For example, its V Series server systems, which are designed for high-volume applications, use a TMR hardware scheme to provide at least a 99.999% probability of surviving all hardware faults. These systems use Intel Xeon processors and run the Stratus VOS operating system, which provides a highly secured environment.
A single copy of the operating system manages a module that consists of tightly coupled processors.
The Sequoia computers were designed to provide a high
level of resilience to both hardware and software faults and
to do so with a minimum of additional hardware. To this
end, extensive use was made of efficient error detection and correction mechanisms to detect and isolate faults, combined with the automatic generation of periodic system-level checkpoints to enable rapid recovery without loss of data or program continuity. Checkpoints are established typically every 40 milliseconds, and fault recovery is accomplished in well under a second, making the occurrence of a
fault effectively invisible to most users.
Recently Marathon Technologies Corporation has
released software solutions, everRun FT and everRun
HA, that enable any two standard Windows servers to
operate as a single, fully redundant Windows environment,
protecting applications from costly downtime due to fail-
ures within the hardware, network, and data.
Maintenance Postponement Applications
Applications where maintenance of computers is either
costly or difficult or perhaps impossible to perform, such
as some space applications, fall under this category. Usually
the breakdown maintenance costs are extremely high for systems that are located in remote areas. A maintenance crew can visit these sites at a frequency that is cost-effective, and fault-tolerance techniques are used between these visits to
ensure that the system is working correctly. Applications
where the systems may be remotely located include tele-
phone switching systems, certain renewable energy genera-
tion sites, and remote sensing and communication systems.
One of the most popular systems in the literature in this
group is AT&T's Electronic Switching System (ESS). Each subsystem is duplicated. While one set of these subsystems performs all the functions, the other set acts as a backup.
PRINCIPLES OF FAULT-TOLERANT MULTIPROCESSORS
AND DISTRIBUTED SYSTEMS
Although the common perception is that parallel systems are inherently fault-tolerant, they suffer from high
failure rates. Most of the massively parallel systems are
built based on performance requirements. However, their
dependability has become a serious issue; numerous tech-
niques have been proposed in the literature, and several
vendors have incorporated innovative fault-tolerance
schemes to address the issue. Most of these schemes are
hierarchical in nature incorporating both hardware and
software techniques. Fault-tolerance techniques are
designed for each of the following levels such that faults that are not detected at lower levels are handled at higher levels.
At the technology and circuit levels, decisions are made to
select the appropriate technology and circuit design tech-
niques that will lead to increased dependability of the modules.

Figure 8. Tandem S-series NonStop server architecture: dual CPUs, each with memory and ServerNet transfer engines, two ServerNet fabrics, SCSI storage, and modular ServerNet expansion boards for communications and external I/O.

At the node level, the architecture involves the design of
VLSI chips that implement the processor, whereas at the
internode architecture level, considerations are given to
how the nodes are connected and effective reconfiguration in the presence of faulty processors, switches, and links. At the operating system level, recovery of the system is addressed, which may involve checkpointing and rollback after a certain part of the system has been identified as faulty, and allocating tasks of the faulty processor to the remaining operating processors. Finally, at the application level, the
user uses mechanisms to check the results.
Two of the most popular approaches to fault tolerance in multiprocessor systems are static (masking) redundancy and dynamic (standby) redundancy.
Static Redundancy
Static redundancy is used in multiprocessor systems for
reliability and availability, safety, and to tolerate nonclas-
sic faults. Although redundancy for reliability and avail-
ability uses the usual NMR scheme, redundancy for safety
usually requires a trade-off between reliability and safety.
For example, one can use a k-out-of-n system where a voter produces the output y only if at least k modules agree on the output y. Otherwise, the voter asserts the unsafe flag. The
reliability refers to the probability that the system produces
correct output, and safety refers to the probability that the
system either produces the correct output or the error is
detectable. Thus, high reliability implies high safety. How-
ever, the reverse is not true.
Voting of the NMR system becomes complicated when
nonclassic faults are present. In systems requiring ultra-
high reliability, a voting mechanism should be able to
handle arbitrary failures, including the malicious faults
where more than one faulty processor may collaborate to
disable the system. The Byzantine failure model was pro-
posed to handle such faults.
Dynamic Redundancy
Unlike static redundancy, dynamic redundancy in multi-
processor systems includes built-in fault detection capabil-
ity. Dynamic redundancy typically follows three basic steps:
(1) fault detection and location, (2) error recovery, and (3)
reconfiguration of the system. When a fault is detected and a
diagnosis procedure locates the fault, the system is usually
reconfigured by activating a spare module; if the spare module is not available, then the reconfigured system
may have to operate under a degraded mode with respect
to the resources available. Finally, error recovery is per-
formed in which the spare unit takes over the functionality
of the faulty unit from where it left off. Although various
approaches have been proposed for these steps, some of
them have been successfully implemented on the real sys-
tems. For example, among the recovery strategies, the most
widely adopted strategy is the rollback recovery using
checkpoints. The straightforward recovery strategy is to
terminate execution of the program and to re-execute the
complete program from the beginning on all the processors,
which is known as the global restart. However, this leads to
severe performance degradation. Thus, the goal of all the
recovery strategies is to perform effective recovery from the
faults with minimum overheads. The rollback recovery
using checkpoints involves storing an adequate amount of processor state information at discrete points during the execution of the program so that the program can be rolled back to these points and restarted from there in the event of
a failure. The challenge is when and how these checkpoints
are created. For interacting processes, inadequate check-
pointing may lead to a domino effect where the system is
forced to go to a global restart state. Over the years,
researchers have proposed strategies that effectively
address this issue.
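As a toy illustration of checkpoint-based rollback recovery (generic Python, not modeled on any specific system named in this article; the transient fault is injected at random), the sketch below saves the loop state periodically and rolls back to the last checkpoint whenever a fault is detected:

import copy, random

def run_with_checkpoints(n_steps, checkpoint_every=10, fault_rate=0.05):
    """Accumulate a running sum with periodic checkpoints and rollback.

    A detected transient fault discards only the work done since the last
    checkpoint instead of forcing a global restart from step 0.
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)          # initial checkpoint
    while state["step"] < n_steps:
        state["total"] += state["step"]        # the "computation"
        state["step"] += 1
        if random.random() < fault_rate:       # simulated transient fault
            state = copy.deepcopy(checkpoint)  # roll back to last checkpoint
            continue
        if state["step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)  # commit a new checkpoint
    return state["total"]

print(run_with_checkpoints(100))               # 4950 despite injected faults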
Although rollback recovery schemes are effective in most
systems, they all suffer from substantial inherent latency
that may not be acceptable for real-time systems. For such
systems, forward recovery schemes are typically used. The
basic concept common to all the forward recovery schemes is
that when a failure is detected, the system discards the
current erroneous state and determines the correct state
without any loss of computation. These schemes are either
based on hardware or software redundancy. The hardware-based schemes can be classified further as static or dynamic
redundancy. Several schemes based on dynamic redun-
dancy and checkpointing that avoid rollback have been
proposed in the literature. Like rollback schemes, when a
failure is detected, the roll-forward checkpointing schemes
(RFCS) attempt to determine the faulty module and the
correct state of the system is restored once the fault diag-
nosis is done, although the diagnostic methods may differ.
For example, in one RFCS, the faulty processing module is identified by retrying the computation on a spare processing module. During this retry, the duplex modules continue to operate normally. Figure 9 shows execution of two copies, A and B, of a task. Assume that B fails in checkpointing interval i. The checkpoints of A and B will not match at the end of checkpointing interval i at time tᵢ. This mismatch will trigger a concurrent retry of checkpoint interval i on a spare module, while A and B continue to execute into checkpointing interval i+1. The previous checkpoint taken
at tᵢ₋₁, together with the executable code, is loaded into the spare, and checkpoint interval i, in which the fault occurred, is retried on the spare module. When the spare module completes the execution at time tᵢ₊₁, the state of the spare module is compared with the states of A and B at time tᵢ, and B is identified as faulty. The state of A is then copied to B. A rollback is required only when both A and B have failed. In this scheme, one fault is tolerated without paying the penalty of the rollback scheme. Several versions of this scheme have been proposed in the literature.

Figure 9. An RFCS scheme: copies A and B execute across checkpoint intervals i and i+1 (checkpoint times tᵢ₋₁, tᵢ, tᵢ₊₁); step 1 copies the state and executable code to the spare S, step 2 compares the states of A and S and of B and S, step 3 copies the state from A to B, and X marks a fault.
FAULT-TOLERANCE IN VLSI CIRCUITS
The advent of VLSI circuits has made it possible to incor-
porate fault-tolerant techniques that were too expensive to
implement earlier. For example, it has enabled schemes to
provide redundant processors and fault detection and loca-
tion capabilities within the chip. This is mainly due to increasing packaging densities and decreasing cost and
power consumption. However, VLSI also has several dis-
advantages. Some of these disadvantages include a higher
percentage of internal faults compared with previous
small-scale integration (SSI) and large-scale integration
(LSI) during the manufacturing phase, leading to low yield;
increased complexity of the system has led to considerable
increase in design mistakes, and imperfections can lead to
multiple faults from smaller feature sizes. Furthermore,
VLSI circuits are more prone to common-mode failures, and
decreased feature sizes and lower operating voltages have
made the circuits more susceptible to external distur-
bances, such as alpha particles and variations in the supply
voltages, respectively, which lead to an increase in tran-
sient faults. Thus, the main motivations for incorporating fault tolerance in VLSI circuits are yield enhancement and enabling real-time and compile-time reconfiguration with the goal of isolating a faulty functional unit, such as a processor or memory cells, during field operation.
Strategies for manufacture-time and compile-time reconfiguration are similar since they do not affect the normal operation of the system and no real-time constraints are imposed on reconfiguration time. Real-time reconfiguration schemes are difficult to design since the reconfiguration has to be performed without affecting the operation of the system. If erroneous results are not acceptable at all, then static or hybrid techniques are typically used; otherwise, cheaper dynamic techniques would suffice. The effectiveness of any reconfiguration scheme is measured by the probability that a redundant unit can replace a faulty unit and the amount of reconfiguration overhead involved.
Numerous schemes have been proposed for all three types of reconfiguration in the literature. One of the first schemes, proposed for an array of processing elements, allowed each processing element to be converted into a connecting element for signal passing, so that a failed processor resulted in converting the other processors in the corresponding row and column to connecting elements, and they ceased to perform any computation. This simple scheme is not practical for multiple faults since an entire row and column must be disabled for each fault. This problem has been addressed by several reconfiguration schemes that either use spare columns and rows or disperse redundant processing elements throughout the chip, such as the interstitial redundancy scheme.
Advances in nanometer-scale geometries have made
devices prone to new types of defects that greatly limit
the yields. Furthermore, the existing external infrastructure, such as automatic test equipment (ATE), is not capable of dealing with the new defect levels that nanometer-scale technologies create in terms of fault diagnosis, test time, handling, and the amount of test data generated. In addition, continual development of ATE facilities that can deal with the new manufacturing issues is not realistic. This has considerably slowed down the design of various system-on-a-chip efforts. For example, although embedded memory typically occupies half of the IC area, its defect densities tend to be twice that of logic. Thus, achieving and maintaining cost advantages requires improving memory yields. As described, one commonly used strategy to increase yield is to incorporate redundant elements that can replace the faulty elements during the repair phase. The existing external infrastructure cannot perform these tasks in a cost-effective way. Thus, a critical need exists for an on-chip (internal) infrastructure to resolve these issues effectively. The semiconductor industry has attempted to address this problem by introducing embedded intellectual property blocks called the infrastructure intellectual property (IPP). For embedded memory, the IPP may consist of various types of test and repair capabilities.
TRENDS AND FUTURE
In the last few years, several developments have occurred
that would define the future activities in the fault-tolerance
area in general.
Conventional fault-tolerance techniques do not render
themselves well in the area of mobile computing. For
example, most schemes for checkpoint coordination and message logging are usually complex and difficult to control because of the mobility of the hosts.
Similarly, in the area of reconfigurable computing, including field programmable gate arrays (FPGAs), new fault-
tolerance strategies are being developed based on the
specific architectures.
Recent advances in nanoscience and nanometer-scale
devices, such as carbon nanotubes and single electron
transistors, will drastically increase device density and
at the same time reduce power needs, weight, and size.
However, these nanosystems are expected to suffer from
high failure rates, both transient and permanent faults,
because of the probabilistic behavior of the devices and lack
of maturity in the manufacturing processes that will reduce
the process yield. In fact, it is expected that the fault
densities will be orders of magnitude greater than the
current technology. This new circuit design paradigm is
needed to address these issues. However, this will be an
evolving area as the device fabrication processes are being
scaled-up from the laboratory to the manufacturing level.
A viable alternative approach to fault tolerance would be to mimic biological systems. There is some activity in this area, but it is still in its infancy. For example, an approach to designing complex computing systems with inherent fault-tolerance capabilities based on embryonics was proposed recently. Similarly, a fault detection/location and reconfiguration strategy based on cell signaling, membrane trafficking, and cytokinesis has also been proposed.
Furthermore, there is an industry trend in space and
aerospace applications to use commercial off-the-shelf
(COTS) components for affordability. Many COTS components have
dependability issues, and they are known to fail when
operated near their physical limits, especially in military
applications. Thus, innovative fault-tolerance techniques
are needed that can strike an acceptable balance between
dependability and cost.
Moreover, although the hardware and software costs
have been dropping, the cost of computer downtime has
been increasing because of the increased complexity of the
systems. Studies suggest that the state-of-the-art fault-
tolerant techniques are not adequate to address this pro-
blem. Thus, a shift in the design paradigm may be needed.
This paradigm shift is driven in large part by the recogni-
tion that hardware is becoming increasingly reliable for
any given level of complexity, because, based on empirical
evidence, the failure rate of an integrated circuit increases
roughly as the 0.6th power of the number of its equivalent
gates. Thus, for a given level of complexity, the system fault
rate decreases as the level of integration increases. If, say,
the average number of gates per device in a system is
increased by a factor of 1000, the fault rate for that system
decreases by a factor of roughly 16. In contrast, the rapid
increase in the complexity of software over the years,
because of the enormous increase in the number and variety of applications implemented on computers, has not been
accompanied by a corresponding reduction in software
bugs. Significant improvements have indeed been made
in software development, but these are largely offset by the
increase in software complexity and in the increasing
opportunity for software modules to interact in unforeseen
ways as the number of applications escalates. As a result, it
is expected that the emphasis in future fault-tolerant com-
puter design will shift from techniques for circumventing
hardware faults to mechanisms for surviving software
bugs.
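The arithmetic behind the factor of 16 quoted above can be spelled out explicitly. This is only a sketch under the stated empirical assumption that the failure rate of a single device grows as the 0.6th power of its gate count g; for a system containing a fixed total of G equivalent gates built from G/g such devices,

\[
\lambda_{\mathrm{system}} \;\propto\; \frac{G}{g}\, g^{0.6} \;=\; G\, g^{-0.4},
\qquad
\frac{\lambda_{\mathrm{system}}(g)}{\lambda_{\mathrm{system}}(1000\,g)} \;=\; 1000^{0.4} \;\approx\; 15.8,
\]

so increasing the level of integration by a factor of 1000 reduces the system fault rate by a factor of roughly 16.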
BHARAT JOSHI
University of
North Carolina-Charlotte
Charlotte, North Carolina
DHIRAJ PRADHAN
University of Bristol
Bristol, United Kingdom
JACK STIFFLER
Reliable Technologies, Inc.
Weston, Massachusetts
FIBER OPTICAL COMMUNICATION NETWORKS
INTRODUCTION
The continuing growth of traffic and demand for bandwidth has changed the telecommunication market and industry dramatically, and has created new opportunities and challenges. The proliferation of DSL and cable modems for residential homes and of Ethernet links for businesses has spurred the need to upgrade network backbones with newer and faster technologies. It is estimated that Internet traffic doubles every 12 months, driven by the popularity of Web services and the leveraged business model for data centers and centralized processing. The Telecom Deregulation Act of 1996 brought competition to the marketplace and altered the traditional traffic profile. Competition brought about technological advances that reduced the cost of bandwidth and allowed the deployment of data services and applications that in turn fueled higher demand for bandwidth. Coupled with the fact that most networks were optimized for carrying voice traffic, the telecommunication providers have had to change their business models and the way they build networks. As a result, voice and data traffic are now carried over separate networks, which doubles the cost and renders service delivery and network operation and management inefficient.
Optical networking seems to be the fix for all problems and limitations, despite the downturn in the telecommunications industry in general and optical networking in particular, because all-optical or photonic switching is the only core networking technology capable of supporting the increasing amount of traffic and the demands for higher bandwidth brought about by emerging applications such as grid computing and tele-immersion. Needless to say, optical fibers offer far more bandwidth than copper cables and are less susceptible to electromagnetic interference. Several fibers are bundled easily into a fiber cable and support transmissions with ultra-low bit error rates. As a result, optical fibers are currently used to alleviate the bandwidth bottleneck in the network core, whereas traditional copper cables are limited to the last mile. Advances in wavelength-division multiplexing (WDM) have provided abundant capacity, in effect dividing the optical fiber into several channels (or wavelengths), where each wavelength can support upwards of 10 Gbps. In the first generation of optical networks, deployed primarily to support SONET/SDH, these wavelengths are provisioned manually and comprise static connections between source and destination nodes. However, as the wavelength data rate increases, the intermediate electronic processing at every node becomes problematic and limits the available bandwidth. The second generation of optical networks integrates intelligent switching and routing at the optical layer to achieve very high data throughput. All-optical transparent connections, or lightpaths, are dynamically provisioned and routed, where wavelength conversion and fiber delay lines can be used to provide spatial and temporal reuse of channels. These networks are referred to as wavelength-routed networks (WRNs), where several optical switching technologies exist to leverage the available wavelengths among various traffic flows while using a common signaling and control interface such as generalized multi-protocol label switching (GMPLS).
The rest of this article first discusses the SONET transport, which is used heavily in core networks, and presents some of the proposed enhancements for reducing SONET's inefficiencies when supporting Ethernet traffic. The WRNs are then introduced, and the concepts of lightpath, traffic grooming, and waveband switching are discussed. Several switching technologies are then presented, highlighting their advantages and disadvantages as well as their technical feasibility. Finally, the various control and signaling architectures in optical networks are discussed, emphasizing the interaction between the IP and optical layers as well as path protection and restoration schemes.
CURRENT TRANSPORTS
The Synchronous Optical Network (SONET) (1) has been the predominant transport technology used in today's wide area networks and forms the underlying infrastructure for most voice and data services. SONET was designed to provide carrier-class (99.999%) reliability, physical layer connectivity, and self-healing. Time-division multiplexing (TDM) standards are used to multiplex different signals that are transported optically and switched electronically, using timeslots to identify destinations. SONET is deployed generally in ring networks to provide dedicated point-to-point connections between any two source and destination node pairs. Two common ring architectures with varying degrees of protection exist: unidirectional path switched rings (UPSRs) and bi-directional line switched rings (BLSRs). These rings are often built using add/drop multiplexers (ADMs) to terminate connections and multiplex/demultiplex client signals onto/from the SONET frames, which implies that the optical signal undergoes optical-electronic-optical (OEO) conversions at every intermediate hop.
The fundamental frame in traditional and optical SONET is the Synchronous Transport Signal level-1 (STS-1), which consists of 810 bytes, often depicted as a 90-column by 9-row structure. With a frame length of 125 μs (or 8000 frames per second), STS-1 has a bit rate of 51.84 Mbps. The STS-1 frame can be divided into two main areas: the transport overhead and the synchronous payload envelope (SPE). A SONET connection is provisioned by the contiguous concatenation (CC) of STS-1 frames to achieve the desired rate. SONET framing has proven inefficient for transporting bursty data traffic, because additional protocols, such as asynchronous
transfer mode (ATM), are required to adapt the various traffic flows, such as constant bit rate (voice) and unspecified bit rate (data) connections, to the SPE containers. The layering of ATM on SONET has worked adequately but proved to be inefficient when carrying data traffic; the ATM layering incurs about 15% overhead. In addition, because of the SPE's fixed size, SONET connections are provisioned to support the peak data rates, which often leads to over-allocation, especially when the traffic load is low.
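The STS-1 frame arithmetic above can be checked directly. The short Python sketch below is illustrative only (the function name and constants are assumptions, not part of the SONET standard); it reproduces the 51.84-Mbps figure and the rates of a few contiguously concatenated STS-Nc signals.

# STS-1 frame: 90 columns x 9 rows of bytes, sent 8000 times per second.
COLS, ROWS, FRAMES_PER_SEC = 90, 9, 8000

def sts_rate_mbps(n: int) -> float:
    """Line rate of an STS-Nc signal in Mbps (payload plus overhead)."""
    bytes_per_frame = n * COLS * ROWS            # 810 bytes for STS-1
    bits_per_second = bytes_per_frame * 8 * FRAMES_PER_SEC
    return bits_per_second / 1e6

print(sts_rate_mbps(1))    # 51.84   (STS-1)
print(sts_rate_mbps(3))    # 155.52  (STS-3c)
print(sts_rate_mbps(48))   # 2488.32 (STS-48c)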
A more recent approach is the next-generation SONET (NG-SONET) specification (2), which provides carriers with the capability of optimizing the bandwidth allocation and using any unused and fragmented capacity across the ring network to better match the client data rate, as shown in Table 1.
The new protocols required in NG-SONET are the generic framing procedure (GFP), virtual concatenation (VC), and link capacity adjustment scheme (LCAS). GFP is an encapsulation mechanism used to adapt different services to the SONET frames. The GFP specification provides for single- and multi-bit header error correction capabilities and channel identifiers for multiplexing client signals and data packets onto the same SONET SPE. VC is the main component of NG-SONET. It is used to distribute traffic into several smaller containers, which are not necessarily contiguous but can be concatenated into virtual concatenation groups (VCGs) to form the desired rate. The benefits of VC include:
1. Better network utilization: Low-order data connec-
tions can be provisioned in increments of 1.5-Mbps
virtual tributary (VT-1.5) instead of STS-1 frames.
2. Noncontiguous frame allocation: This facilitates the
provisioning of circuits and improves the network
utilization, as available but fragmented capacity can be allocated rather than wasted.
3. Load balancing and service resiliency: Through the use of diverse paths, the source can balance the traffic load among different paths, which provides some degree of resiliency; a fiber cut will only affect some VCs, thereby degrading the service without complete interruption.
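To make the savings reported in Table 1 concrete, the rough Python sketch below estimates the bandwidth wasted when a client rate is mapped onto a single contiguously concatenated STS-Nc container versus a virtually concatenated group of small members. The container payload values and helper names are illustrative assumptions, not figures from the NG-SONET specification, so the results only approximate the table.

import math

# Approximate payload capacities in Mbps (illustrative values).
VT15_PAYLOAD = 1.6
CC_CONTAINERS = {"STS-1": 48.384, "STS-3c": 149.76, "STS-12c": 599.04, "STS-48c": 2396.16}

def waste_cc(client_mbps: float) -> float:
    """Waste when the smallest single STS-Nc that fits the client is used."""
    fit = min(c for c in CC_CONTAINERS.values() if c >= client_mbps)
    return 1.0 - client_mbps / fit

def waste_vcat(client_mbps: float, member: float = VT15_PAYLOAD) -> float:
    """Waste when the client is spread over a VCG of small members."""
    members = math.ceil(client_mbps / member)
    return 1.0 - client_mbps / (members * member)

for rate in (10.0, 100.0):
    # Roughly matches Table 1: ~0.80 vs ~0.12 for 10 Mbps, ~0.33 vs ~0.01 for 100 Mbps.
    print(rate, round(waste_cc(rate), 2), round(waste_vcat(rate), 2))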
LCAS is a two-way handshake protocol that allows for
dynamic provisioning and reconfiguration of VCGs to
enlarge or shrink incrementally the size of a SONET con-
nection. The basic approach is to have the Network Man-
agement System instruct the source node to resize an
existing connection. The source and destination nodes
coordinate the addition, deletion, and sequencing of the
new or existing VC by using the H4 control bytes in the
frame headers.
WAVELENGTH-ROUTED NETWORKS
Wavelength-division multiplexing offers an efficient way to exploit the enormous bandwidth of an optical fiber; WDM allows the transmission of multiple optical signals simultaneously on a single fiber by dividing the fiber transmission spectrum into several nonoverlapping wavelengths (or channels), with each wavelength capable of supporting a single communication channel operating at peak electronic processing speed. With the rapid advance and use of dense WDM (DWDM) technology, 100 or more wavelength channels can be multiplexed onto a single fiber, each operating at 10 to 100 Gbps, leading to Tbps aggregate bandwidth.
WDM optical network topologies can be divided into three basic classes based on their architecture: (1) point-to-point, (2) broadcast-and-select star, and (3) wavelength-routed. In point-to-point and broadcast-and-select networks, no routing function is provided, and as such, they are simple but inflexible and only suitable for small networks. Wavelength-routed networks (WRNs), on the other hand, support true routing functionality and hence are highly efficient, scalable, and more appropriate for wide area networks (3). A wavelength-routed optical network is shown in Fig. 1. The network can be built using switching nodes, wavelength routers, optical cross-connects (OXCs), or optical add/drop multiplexers (OADMs) connected by fiber links to form an arbitrary physical topology. Some routing nodes are attached to access stations where data from several end users (e.g., IP and ATM users) can be multiplexed or groomed onto a single optical channel. The access station also provides optical-to-electronic conversion and vice versa to interface the optical network with the conventional electronic network using transceivers.
Lightpath
In a WRN, end users communicate with one another via
optical channels, which are referred to as lightpaths. A
lightpath is an optical path (data channel) established
between the source node and the destination node by using
a dedicated wavelength along multiple intermediate links. Because of the limited number of wavelength channels, wavelength reuse is necessary when the network has to support a large number of users. Wavelength reuse allows the same wavelength to be reused in spatially disjoint parts of the network. In particular, any two lightpaths can use the same wavelength as long as they do not share any common links. For example, Fig. 1 shows the lightpaths A->1->7->G and D->4->3->C, which can use the same wavelength λ1. However, two or more lightpaths traversing the same fiber link must be on different wavelengths so that they do not interfere with one another. Hence, the lightpaths A->1->7->G and B->2->1->7->G cannot use the same wavelength simultaneously because they have common links.

Table 1. Wasted Bandwidth
Application         Data rate    NG-SONET    SONET
Ethernet            10 Mbps      12%         80%
Fast Ethernet       100 Mbps     1%          33%
Gigabit Ethernet    1 Gbps       5%          58%

Figure 1. Structure of WRNs: access stations and wavelength routers interconnected by fiber links, carrying example lightpaths on wavelengths λ1 and λ2.
Without wavelength conversion, a lightpath is required to be on the same wavelength channel throughout its path in the network. This requirement is known as the wavelength-continuity constraint. Another important constraint is the wavelength-capacity constraint, which is caused by the limited number of wavelength channels and transmitters/receivers in a network. Given these two constraints, a challenging and critical problem in WRNs is to determine the route that a lightpath should traverse and which wavelength should be assigned to it. This is commonly referred to as the routing and wavelength assignment (RWA) problem, which has been shown to be NP-complete for static traffic demands.
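Because the exact RWA problem is NP-complete, practical systems typically rely on heuristics. The sketch below is a rough illustration only (not an algorithm taken from this article): it applies first-fit wavelength assignment along a precomputed route while enforcing the wavelength-continuity constraint; the link names and data structures are hypothetical.

# Each fiber link offers W wavelengths; a lightpath needs one wavelength index
# that is free on every link of its route (wavelength-continuity constraint).
W = 4
links = {("A", "1"), ("1", "7"), ("7", "G"), ("D", "4"), ("4", "3"), ("3", "C")}
used = {link: set() for link in links}    # wavelengths already taken on each link

def assign_first_fit(route):
    """Return the first wavelength free on every link of the route, or None if blocked."""
    for wavelength in range(W):
        if all(wavelength not in used[link] for link in route):
            for link in route:
                used[link].add(wavelength)
            return wavelength
    return None   # blocked: no common free wavelength (and no converters assumed)

print(assign_first_fit([("A", "1"), ("1", "7"), ("7", "G")]))   # 0
print(assign_first_fit([("D", "4"), ("4", "3"), ("3", "C")]))   # 0 (reuse on disjoint links)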
Optical Cross-Connects
A key node in WRNs is the OXC, which allows dynamic
setup and tear down of lightpaths as needed (i.e., without
having to be provisioned statically). The OXCs can connect
(i.e., switch) any input wavelength channel from an input
fiber port to any one of the output fiber ports in optical form. An OXC consists of optical switches preceded by wavelength demultiplexers and followed by wavelength multiplexers. Thus, in an OXC, incoming fibers are demultiplexed into individual wavelengths, which are switched to corresponding output ports and are then multiplexed onto outgoing fibers. By appropriately configuring the OXCs along the physical path, lightpaths can be established between any pair of nodes (4).
Wavelength Conversion
The function of a wavelength converter is to convert signals from an input wavelength to a different output wavelength. Wavelength conversion has been proven to improve the channel utilization in WRNs because the wavelength-continuity constraint can be relaxed if the OXCs are equipped with wavelength converters. For example, in Fig. 1, if node 4 has wavelength converters, lightpath E->5->4->3->C can be established using different wavelengths on links 5->4 and 4->3 (e.g., using wavelength λ1 on links E->5->4 and a different wavelength, say λ2, on links 4->3->C).
Wavelength conversion can be implemented by (1) optoelectronic (OEO) wavelength conversion and (2) all-optical wavelength conversion. When using OEO wavelength conversion, the optical signal is first converted into the electronic domain. The electronic signal is then used to drive the input of a tunable laser, which is tuned to the desired output wavelength. This method provides only opaque data transmission (i.e., it is data bit rate and data format dependent), is very complex, and consumes a large amount of power. On the other hand, no optical-to-electronic conversion is involved in all-optical wavelength conversion techniques, and the optical signal remains in the optical domain throughout the conversion process. Hence, all-optical wavelength conversion using techniques such as wave-mixing and cross-modulation is more attractive. However, all-optical technologies for wavelength conversion are still not mature, and all-optical wavelength converters are likely to remain very costly in the near future. Therefore, a lot of attention has been focused on compromise schemes such as limited-number, limited-range, and sparse wavelength conversion to achieve high network performance (5,6).
Traffic Grooming
Although WDM transmission equipment and OXCs enable the establishment of lightpaths operating at a high rate (currently at 10 Gb/s, or OC-192, expected to grow to 40 Gb/s, or OC-768), only a fraction of end users are expected to need bandwidth so high that it uses a full wavelength. Most users typically generate lower speed traffic, such as OC-12 (622 Mbps), OC-3 (155 Mbps), and OC-1 (51.84 Mbps), using SONET framing. Hence, the effective packing of these sub-wavelength tributaries onto high-bandwidth full-wavelength channels (i.e., by appropriate routing or wavelength and time-slot assignment) is a very important problem and is known as the traffic grooming problem. For example, 64 OC-3 circuits can be groomed onto a single OC-192 wavelength.
Traffic grooming has received considerable attention recently (7,8). In WDM SONET, an ADM is used to multiplex (combine) low-speed SONET streams into a higher speed traffic stream before transmission. Similarly, an ADM receives a wavelength from the ring and demultiplexes it into several low-speed streams. Usually, an ADM for a wavelength is required at a particular node only when the wavelength must be added or dropped at that particular node. However, the cost of ADMs often makes up a significant portion of the total cost, and as such, an objective of traffic grooming is to achieve efficient utilization of network resources while minimizing the network cost and the number of ADMs required. Similarly, in mesh network topologies, the number of electronic ADMs and the network cost can be reduced by carefully grooming the low-speed connections and using OXCs for bypass traffic.
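As a rough illustration of grooming, the sketch below packs sub-wavelength tributaries (expressed in OC-1 units) onto OC-192 wavelengths with a first-fit heuristic. Real grooming algorithms also consider routing, time-slot assignment, and ADM placement, so this is only a toy model with hypothetical names.

# Capacities in OC-1 units: an OC-192 wavelength = 192, OC-12 = 12, OC-3 = 3.
WAVELENGTH_CAPACITY = 192

def groom_first_fit(tributaries):
    """Pack tributary demands (in OC-1 units) onto wavelengths, first fit."""
    wavelengths = []                      # remaining free capacity of each lit wavelength
    for demand in tributaries:
        for i, free in enumerate(wavelengths):
            if demand <= free:
                wavelengths[i] -= demand
                break
        else:
            wavelengths.append(WAVELENGTH_CAPACITY - demand)   # light a new wavelength
    return len(wavelengths)

print(groom_first_fit([3] * 64))               # 1: 64 OC-3 circuits fill one OC-192 exactly
print(groom_first_fit([12] * 20 + [3] * 20))   # 2: 300 OC-1 units need two wavelengths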
Waveband Switching
The rapid advance in dense WDM technology and worldwide fiber deployment have brought about a tremendous increase in the size (i.e., number of ports) of OXCs, as well as the cost and difficulty associated with controlling such large OXCs. In fact, despite the remarkable technological advances in building photonic cross-connect systems and associated switch fabrics, the high cost (both capital and operating expenditures) and unproven reliability of large switches (e.g., with 1000 ports or more) have not justified their deployment. Recently, waveband switching (WBS) in conjunction with new multigranular optical cross-connects (MG-OXCs) has been proposed to reduce this cost and complexity (9–11). The main idea of WBS is to group several wavelengths together as a band, switch the band using a single port whenever possible (e.g., as long as it carries only bypass or express traffic), and demultiplex the band to switch the individual wavelengths only when some traffic needs to be added or dropped. The complementary hardware is an MG-OXC that not only can switch traffic at multiple levels, such as fiber, waveband, and individual wavelength (or even sub-wavelength), but also can add and drop traffic at multiple levels, as well as multiplex and demultiplex traffic from one level to another. By using WBS in conjunction with MG-OXCs, the total number of ports required in such a network to support a given amount of traffic is much lower than in a traditional WRN that uses ordinary OXCs (which switch traffic only at the wavelength level). The reason is that 60% to 80% of traffic simply bypasses the nodes in the backbone, and hence the wavelengths carrying such transit traffic do not need to be switched individually in WBS networks (as opposed to WRNs, wherein every such wavelength still has to be switched using a single port). In addition to reducing the port count, which is a major factor contributing to the overall cost of switching fabrics, the use of bands can reduce complexity, simplify network management, and provide better scalability.
WBS differs from wavelength routing and traditional traffic grooming in many ways. For example, techniques developed for traffic grooming in WRNs, which are useful mainly for reducing the electronic processing and/or the number of wavelengths required, cannot be applied directly to grouping wavelengths into wavebands effectively. This restriction arises because in WRNs one can multiplex just about any set of lower bit rate (i.e., sub-wavelength) traffic, such as OC-1s, into a wavelength, subject only to the wavelength-capacity constraint. However, in WBS networks, at least one more constraint exists: Only the traffic carried by a fixed set of wavelengths (typically consecutive) can be grouped into a band.
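A back-of-the-envelope port count illustrates why banding helps. The sketch below is an illustrative model only; the band size, wavelength count, and bypass fraction are assumptions rather than values from this article.

def ports_per_node(total_wavelengths, bypass_fraction, band_size):
    """Switching ports needed with an ordinary OXC vs. an MG-OXC that bands express traffic."""
    ordinary = total_wavelengths                         # one port per wavelength
    express = int(total_wavelengths * bypass_fraction)   # wavelengths that only pass through
    local = total_wavelengths - express                  # wavelengths added/dropped locally
    banded = -(-express // band_size) + local            # ceil(express / band_size) band ports + local ports
    return ordinary, banded

# 160 wavelengths, 70% express traffic, bands of 8 wavelengths.
print(ports_per_node(160, 0.7, 8))   # (160, 62): 14 band ports plus 48 wavelength ports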
SWITCHING
Similar to the situation in electronic networks, three underlying switching technologies for optical networks exist: optical circuit switching (usually referred to as wavelength routing in the literature), optical packet switching, and optical burst switching, as shown in Fig. 2.
Optical Circuit Switching
Optical circuit switching (OCS) takes an approach similar to that of circuit-switched telecommunication networks. To transport traffic, lightpaths are set up among client nodes
(such as IP routers or ATM switches) across the optical
network, with each lightpath occupying one dedicated
wavelength on every traversed fiber link. The lightpaths are treated as logical links by client nodes, and the user data are then transported over these lightpaths. Figure 2(a) illustrates the lightpath setup procedure and the transport of data packets.

Figure 2. Optical switching: (a) OCS, with lightpath request/ACK signaling followed by packet transport over wavelengths; (b) OPS, where the packet header undergoes O-E conversion and processing in a control unit while the payload is delayed; and (c) OBS, where bursts are assembled at the ingress router and each burst is preceded by a control packet.
The major advantage of OCS is that data traffic is transported purely in the optical domain and the intermediate hop-by-hop electronic processing is eliminated, in addition to other benefits such as protocol transparency, quality of service (QoS) support, and traffic engineering. However, OCS is suitable only for large, smooth, long-duration traffic flows, not for bursty data traffic. Moreover, OCS does not scale well; to connect N client nodes, O(N²) lightpaths are needed. In addition, the total number of lightpaths that can be set up over an optical network is limited, because of the limited number of wavelengths per fiber and the dedicated use of wavelengths by lightpaths. These issues have driven the research community to study alternative switching technologies.
Optical Packet Switching
To address the poor scalability and the inefficiency of transporting bursty data traffic with OCS, optical packet switching (OPS) has been introduced and studied extensively in the literature. OPS is motivated by the (electronic) packet switching used in IP networks, where data packets are statistically multiplexed onto the transmission medium without establishing a dedicated connection. The major difficulty of OPS is that the optical layer is a dumb layer, unlike the intelligent electronic layer. Specifically, a data packet cannot be processed in the optical layer; it has to be processed in the electronic layer to perform the routing and forwarding functionality. To reduce processing overhead and delay, only the packet header is converted from the optical layer to the electronic layer for processing (12).
As illustrated in Fig. 2(b), an OPS node typically has a control unit, input and output interfaces, and a switch fabric. The input interface receives incoming packets from input ports (wavelengths) and performs wavelength-division multiplexing in addition to many other functions, such as 3R (reamplification, reshaping, and retiming) regeneration to restore incoming signal quality, packet delineation to identify the beginning and end of the header and payload of a packet, and optical-to-electronic (O/E) conversion of the packet header. The control unit processes the packet header, looks up the routing/forwarding table to determine the next hop, and configures the switch fabric to switch the payload, which has been delayed in fiber delay lines (FDLs) while the header is being processed. The updated packet header is then converted back to the optical layer and combined with the payload at the output interface, which performs wavelength-division multiplexing, 3R regeneration, and power equalization before transmitting the packet on an output port (wavelength) to the next hop.
Packet loss and contention may occur in the control unit when multiple packets arrive simultaneously. In this case, the packet headers are generally buffered and processed one by one. Furthermore, contention may also happen on the payloads when multiple packets from different input ports need to be switched to the same output wavelength. Note that a payload contention results in a packet header contention but not vice versa. Payload contentions can be resolved in the optical layer using FDLs, wavelength conversion, or deflection routing. FDLs delay a payload for deterministic periods of time; thus multiple payloads going to the same output port simultaneously can be delayed by different periods of time to resolve the contention. Wavelength conversion can switch multiple payloads to different output wavelengths. Deflection routing deflects a contending payload to an alternative route at a different output port. Note that deflection routing should be attempted last because it may cause contentions on the alternative paths and may lead to out-of-order packet reception (13).
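The sketch below illustrates the order of contention-resolution attempts described above (FDL buffering, then wavelength conversion, then deflection, and finally dropping); the data structures and availability checks are hypothetical simplifications, not an actual switch design.

def resolve_contention(payload, free_fdl_slots, free_wavelengths, alt_route_free):
    """Try FDLs first, then wavelength conversion, then deflection routing."""
    if free_fdl_slots:                   # delay the payload in a fiber delay line
        return ("fdl", free_fdl_slots.pop())
    if free_wavelengths:                 # shift the payload to a free output wavelength
        return ("convert", free_wavelengths.pop())
    if alt_route_free:                   # last resort: deflect to an alternative route
        return ("deflect", payload)
    return ("drop", payload)             # no resource available

print(resolve_contention("pkt-1", [2, 1], {3}, True))   # ('fdl', 1)
print(resolve_contention("pkt-2", [], {3}, True))       # ('convert', 3)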
OPS may take a synchronous or an asynchronous approach. The former approach is more feasible technically (most research efforts have focused on it): packets are of fixed length (like ATM cells), and the input/output interfaces perform packet synchronization in addition to their other functions. In the asynchronous approach, packets are of variable length and packet processing in the input/output interfaces is asynchronous as well. This task is highly demanding with existing technologies and is expected to be viable only in the very long term. A recent development of OPS is optical label switching, which is a direct application of multi-protocol label switching (MPLS) technology. MPLS associates a label with a forwarding equivalence class (FEC), which, as far as the optical network is concerned, can be considered informally as a pair of ingress/egress routers. The packet header contains a label instead of the source/destination address, and data forwarding is performed based on the label instead of the destination address. A major benefit of label switching is the speedup of the routing/forwarding table lookup, because the label is of fixed length and is easier to handle than the variable-length destination address prefix.
Optical Burst Switching
Although OPS may be beneficial in the long run, it is difficult to build cost-effective OPS nodes using current technologies, primarily because of the lack of optical random access memory and the strict synchronization requirement (asynchronous OPS is even more difficult). Optical burst switching (OBS) has been proposed as a novel alternative that combines the benefits of OPS and OCS while eliminating their disadvantages (14–16). Through statistical multiplexing of bursts, OBS significantly improves bandwidth efficiency and scalability over OCS. In addition, compared with OPS, optical random access memory and/or fiber delay lines are not required in OBS (although having them would result in better performance), and the synchronization is less strict. Instead of processing, routing, and forwarding each packet, OBS assembles multiple packets into a burst at the ingress router, and the burst is switched in the network core as one unit. In other words, multiple packets are bundled with only one control packet (or header) to be processed, resulting in much less control overhead. The burst assembly at the ingress router may employ a simple scheme, for example, assembling packets (going to the same egress router) that have arrived during a fixed period into one burst, or simply assembling packets into one burst until a certain burst length is reached; or it may use more sophisticated schemes, for example, capturing a higher-layer protocol data unit (PDU), such as a TCP segment, into a burst.
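A minimal burst assembler along the lines just described might combine a length threshold with a timeout. The sketch below is only a schematic illustration with hypothetical parameters, not a standardized assembly algorithm.

class BurstAssembler:
    """Close a burst when either a byte threshold or a timeout is reached."""
    def __init__(self, max_bytes=64_000, timeout=0.005):
        self.max_bytes, self.timeout = max_bytes, timeout
        self.packets, self.size, self.start = [], 0, None

    def add(self, packet_bytes, arrival_time):
        if not self.packets:
            self.start = arrival_time
        self.packets.append(packet_bytes)
        self.size += packet_bytes
        if self.size >= self.max_bytes or arrival_time - self.start >= self.timeout:
            burst, self.packets, self.size = self.packets, [], 0
            return burst      # hand the completed burst to the OBS control plane
        return None           # keep accumulating

asm = BurstAssembler(max_bytes=3000, timeout=0.005)
for t, size in [(0.000, 1500), (0.001, 1500), (0.002, 500)]:
    done = asm.add(size, t)
    if done:
        print("burst of", sum(done), "bytes")   # burst of 3000 bytes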
Different OBS approaches fall primarily into three categories: reserve-a-fixed-duration (RFD), tell-and-go (TAG), and in-band-terminator (IBT). Figure 2(c) illustrates an RFD-based protocol, called just enough time (JET), in which each ingress router assembles incoming data packets that are destined to the same egress router into a burst, according to some burst assembly scheme (14). For each burst, a control packet is first sent out on a control wavelength toward the egress router, and the burst follows the control packet (on a separate data wavelength) after an offset time. This offset time can be made no less than the total processing delay to be encountered by the control packet, to reduce the need for fiber delay lines, and at the same time it is much less than the round-trip propagation delay between the ingress and egress routers. The control packet, which goes through optical-electronic-optical conversion at every intermediate node, just as the packet header in OPS does, attempts to reserve a data wavelength for just enough time (specifically, between the time that the burst arrives and the time it departs) to accommodate the succeeding burst. If the reservation succeeds, the switch fabric is configured to switch the data burst. However, if the reservation fails because no wavelength is available at the time of the burst arrival on the outgoing link, the burst is dropped (retransmissions are handled by higher layers such as TCP). When a burst arrives at the egress router, it is disassembled into data packets that are then forwarded toward their respective destinations.
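In JET, the offset lets the control packet finish its per-hop processing before the burst arrives, and each link is reserved only for the interval the burst actually occupies it. The small sketch below computes these quantities under assumed per-hop delays; it ignores propagation delay, and the names and numbers are illustrative rather than taken from the JET specification.

def jet_schedule(send_time, burst_duration, hop_processing_delays):
    """Offset time and reservation window for a JET burst (propagation delay ignored)."""
    offset = sum(hop_processing_delays)     # at least the total control-packet processing delay
    burst_start = send_time + offset
    # Reserve the outgoing link only while the burst traverses it ("just enough time").
    return offset, (burst_start, burst_start + burst_duration)

# Three hops, 10 microseconds of control processing per hop, 1-ms burst.
offset, window = jet_schedule(send_time=0.0, burst_duration=1e-3,
                              hop_processing_delays=[10e-6, 10e-6, 10e-6])
print(offset, window)   # offset ~3e-05 s, reservation window ~(3e-05, 1.03e-03)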
In IBT-based OBS, the burst contains an in-band terminator (e.g., the silence of a voice circuit), and the control packet may be sent in-band preceding the burst or out-of-band over a control wavelength (before the burst). At an intermediate node, the bandwidth (wavelength) is reserved as soon as the control packet is received and released when the IBT of the burst is detected. Hence, a key issue in IBT-based OBS is detecting the IBT optically. TAG-based OBS is similar to circuit switching; a setup control packet is first sent over a control wavelength to reserve bandwidth for the burst. The burst is then transmitted without waiting for an acknowledgment that the bandwidth has been reserved successfully at intermediate nodes. A release control packet can be sent afterward to release the bandwidth, or alternatively, the intermediate nodes automatically release the bandwidth after a timeout interval if they have not received a refresh control packet from the ingress router.
Similar to OPS, contentions may happen in OBS when multiple bursts need to go to the same output port or when multiple control packets from different fibers arrive simultaneously. The contention resolution techniques of OPS can be employed for OBS. Nevertheless, compared with OPS, the contention in OBS (particularly in terms of control packets) is expected to be much lighter because of the bundling of packets into bursts. As far as performance is concerned, RFD-based OBS (e.g., using JET) can achieve higher bandwidth utilization and lower burst dropping than IBT- and TAG-based OBS, because it reserves the bandwidth for just enough time to switch the burst and can reserve the bandwidth well in advance to reduce burst dropping.
Comparison of Optical Switching Technologies
A qualitative comparison of OCS, OPS, and OBS is presented in Table 2, where it is evident that OBS combines the benefits of OCS and OPS while eliminating their shortcomings. However, today's optical networks primarily employ OCS because of its implementation feasibility using existing technologies. OBS will be the next progression in the evolution of optical networks, whereas OPS is expected to be the long-term goal.

Table 2. Comparison of OCS, OPS, and OBS
Technology   Bandwidth utilization   Setup latency   Implementation difficulty   Overhead   Adaptivity to bursty traffic and faults   Scalability
OCS          Low                     High            Low                         Low        Low                                       Poor
OPS          High                    Low             High                        High       High                                      Good
OBS          High                    Low             Medium                      Low        High                                      Good

CONTROL AND SIGNALING
Optical networks using WDM technology provide an enormous network capacity to satisfy and sustain the exponential growth in Internet traffic. However, all end-user communication today uses the IP protocol, and hence it has become clear that the IP protocol is the common convergence layer in telecommunication networks. It is therefore important to integrate the IP layer with WDM to transport end-user traffic over optical networks efficiently.
Control Architecture
General consensus exists that the optical network control plane should use IP-based protocols for dynamic provisioning and restoration of lightpaths within and across optical networks. As a result, it has been proposed to reuse or adapt the signaling and routing mechanisms developed for IP traffic engineering in optical networks, to create a common control plane capable of managing IP routers as well as optical switches.
Two general models have been proposed to operate an IP network over an optical network. Under the domain services model, the optical network primarily offers high-bandwidth connectivity in the form of lightpaths. Standardized signaling across the user-network interface (UNI) is used to invoke the
services of the optical network for lightpath creation, deletion, modification, and status query, whereas the network-to-network interface (NNI) provides a method of communication and signaling among subnetworks within the optical network. Thus, the domain services model is essentially a client (i.e., IP layer)/server (i.e., optical layer) network architecture, wherein the different layers of the network remain isolated from each other (17). It is also known as the overlay model and is well suited for an environment that consists of multiple administrative domains, which is prevalent in most carrier networks today. On the other hand, in the unified services model, the IP and optical networks are treated as a single integrated network from a control plane view. The optical switches are treated just like any other router, and in principle, no distinction exists between UNIs and NNIs for routing and signaling purposes. This model is also known as the peer-to-peer model, wherein the services of the optical network are obtained in a more seamless manner than in the overlay model. It allows a network operator to create a single network domain composed of different network elements, thereby allowing greater flexibility than in the overlay model. The peer model does, however, present a scalability problem because of the amount of information to be handled by any network element within an administrative domain. In addition, nonoptical devices must know the features of optical devices and vice versa, which can present significant difficulties in network operation. A third, augmented model has also been proposed (18), wherein separate routing instances exist in the IP and optical domains, but information from one routing instance is passed to the other routing instance. This model is also known as the hybrid model, representing a middle ground between the overlay and peer models; the hybrid model supports multiple administrative domains as in the overlay model and supports heterogeneous technologies within a single domain as in the peer model.
Signaling
The GMPLS framework (19) has been proposed as the control plane for the various architectures. Similar to traditional MPLS, GMPLS extends the IP layer routing and signaling information to the optical layer for dynamic path setup. In its simplest form, labels are assigned to wavelengths to provide mappings between the IP layer addresses and the optical wavelengths. Several extensions have been added for time slots and sets of contiguous wavelengths to support sub-wavelength and multiwavelength bandwidth granularities.
GMPLS signaling protocols such as the resource reservation protocol (RSVP) and the constraint-based routing label distribution protocol (CR-LDP) map the IP routing information into preconfigured label switched paths (LSPs) in the optical layer. These LSPs are generally set up on a hop-by-hop basis or specified explicitly when traffic engineering is required. GMPLS labels can be stacked to provide a hierarchy of LSP bundling and explicit routing. Despite their similar functionalities, RSVP and CR-LDP operate differently. RSVP uses PATH and RESV messages to signal LSP setup and activation. PATH messages travel from source to destination nodes and communicate classification information. RESV messages travel from destination to source nodes to reserve the appropriate resources. RSVP uses UDP/IP packets to distribute labels, and as such, it can survive hardware or software failures through IP rerouting. CR-LDP, on the other hand, assumes that the network is reliable and uses TCP/IP packets instead. It has much lower overhead than RSVP, but it cannot recover from network failures quickly. The advantages and disadvantages of RSVP and CR-LDP have long been discussed and compared in the literature without a clear winner. It seems, however, that RSVP is the industry's preferred protocol because it is coupled more tightly with IP-based signaling protocols.
Protection and restoration in GMPLS involve the computation of primary and backup LSPs as well as fault detection and localization. Primary and backup LSP computations consider traffic engineering requirements and network constraints. LSP protection includes dedicated and shared mechanisms with varying degrees of restoration speed. In dedicated protection (1+1), data are transmitted on the primary and backup LSPs simultaneously, and as a result, 1+1 protection offers fast restoration and recovery from failures. In shared protection (m:n), m backup LSPs are preconfigured to provide protection for n primary LSPs. Data traffic is switched onto the backup LSPs at the source only after a failure has been detected. As a result, m:n schemes are slower than dedicated protection but use considerably less bandwidth. Fault detection and management are handled by the link management protocol (LMP), which is also used to manage the control channels and to verify the physical connectivity.
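The bandwidth trade-off between dedicated and shared protection can be expressed roughly as the ratio of backup to primary capacity; the sketch below is a simplified illustration that ignores routing and sharing constraints.

def protection_overhead(m_backup, n_primary):
    """Fraction of extra capacity reserved for protection (backup over primary)."""
    return m_backup / n_primary

print(protection_overhead(1, 1))    # 1.0  -> 1+1 dedicated protection: 100% extra capacity
print(protection_overhead(2, 10))   # 0.2  -> 2:10 shared protection: 20% extra capacity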
SUMMARY
Optical networks can sustain a much higher throughput than can be achieved by pure electronic switching/routing techniques, despite the impressive progress made in electronics and electronic switching. As such, optical networking holds the key to a future of unimagined communication services in which true broadband connectivity and converged multimedia can be realized. Other anticipated services, such as grid computing, video on demand, and high-speed Internet connections, will require large amounts of bandwidth over wide-scale deployments that will undoubtedly push optical networking well into the last mile. However, exploiting the phenomenal data capacity of optical networks is anything but trivial. Both traditional architectures, such as SONET, and emerging paradigms, such as WRNs, require complex architectures and protocols while providing various degrees of statistical multiplexing. Although SONET is well entrenched and understood, it is very expensive, and its static nature limits its scalability and, therefore, its reach into the last mile. WRNs, on the other hand, provide far more efficient and dynamic topologies that support a significantly larger number of all-optical connections.
Clearly, WRNs present several technological challenges, the most pressing of which is how to exploit this vast optical bandwidth efficiently while supporting bursty data traffic. Techniques such as waveband switching and traffic grooming help the realization of WRNs by reducing the overall cost. Switching technologies such as OBS and OPS provide statistical multiplexing and enhance the elasticity of WRNs in support of varying traffic volumes and requirements. Common and standardized signaling interfaces such as GMPLS allow for dynamic provisioning and ease of operation and maintenance. As a result, optical networks are more scalable and flexible than their electronic counterparts and can support general-purpose and special-purpose networks while providing a high degree of protection and restoration against failures.
Given today's available technologies, optical networks are realized using OCS techniques, which are less scalable and efficient than OBS and OPS when supporting data traffic. However, as optical technologies mature, the next generation of optical networks will employ OBS and OPS techniques that are far more efficient at leveraging network resources. As such, much of the current research is focused on developing switching techniques that are fast and reliable, routing protocols that are scalable and have fast convergence, and signaling protocols that exploit network-wide resources efficiently while integrating the optical and data layers seamlessly.
BIBLIOGRAPHY
1. S. Gorshe, ANSI T1X1.5, 2001062 SONET base standard, 2001. Available: http://www.t1.org.
2. T. Hills, Next-Gen SONET, Lightreading report, 2002. Available: http://www.lightreading.com/document.asp?doc id 14781.
3. B. Mukherjee, Optical Communication Networks, New York: McGraw-Hill, 1997.
4. B. Mukherjee, WDM optical communication networks: Progress and challenges, IEEE J. Selected Areas Commun., 18: 1810-1824, 2000.
5. K. C. Lee and V. O. K. Li, A wavelength-convertible optical network, IEEE/OSA J. Lightwave Technology, 11: 962-970, 1993.
6. M. Kovacevic and A. Acampora, Benefits of wavelength translation in all-optical clear-channel networks, IEEE J. Selected Areas Commun., 14(5): 868-880, 1996.
7. X. Zhang and C. Qiao, An effective and comprehensive approach for traffic grooming and wavelength assignment in SONET/WDM rings, IEEE/ACM Trans. Networking, 8(5): 608-617, 2000.
8. E. Modiano, Traffic grooming in WDM networks, IEEE Commun. Mag., 38(7): 124-129, 2001.
9. X. Cao, V. Anand, and C. Qiao, Waveband switching in optical networks, IEEE Commun. Mag., 41(4): 105-112, 2003.
10. Pin-Han Ho and H. T. Mouftah, Routing and wavelength assignment with multigranularity traffic in optical networks, IEEE/OSA J. Lightwave Technology, (8): 2002.
11. X. Cao, V. Anand, Y. Xiong, and C. Qiao, A study of waveband switching with multi-layer multi-granular optical cross-connects, IEEE J. Selected Areas Commun., 21(7): 1081-1095, 2003.
12. D. K. Hunter et al., WASPNET: A wavelength switched packet network, IEEE Commun. Mag., 120-129, 1999.
13. M. Mahony, D. Simeonidou, D. Hunter, and A. Tzanakaki, The application of optical packet switching in future communication networks, IEEE Commun. Mag., 39(3): 128-135, 2001.
14. C. Qiao and M. Yoo, Optical burst switching (OBS) - a new paradigm for an optical Internet, J. High Speed Networks, 8(1): 69-84, 1999.
15. M. Yoo and C. Qiao, Just-Enough-Time (JET): A high speed protocol for bursty traffic in optical networks, IEEE Annual Meeting on Lasers and Electro-Optics Society (LEOS 1997), Technologies for a Global Information Infrastructure, 1997, pp. 26-27.
16. J. Turner, Terabit burst switching, J. High Speed Networks, 8(1): 3-16, 1999.
17. A. Khalil, A. Hadjiantonis, G. Ellinas, and M. Ali, A novel IP-over-optical network interconnection model for the next-generation optical Internet, Global Telecommunications Conference, GLOBECOM'03, 7: 15, 2003.
18. C. Assi, A. Shami, M. Ali, R. Kurtz, and D. Guo, Optical networking and real-time provisioning: An integrated vision for the next-generation Internet, IEEE Network, 15(4): 36-45, 2001.
19. A. Banerjee, L. Drake, L. Lang, B. Turner, D. Awduche, L. Berger, K. Kompella, and Y. Rekhter, Generalized multiprotocol label switching: An overview of signaling enhancements and recovery techniques, IEEE Commun. Mag., 39(7): 144-151, 2001.
V. ANAND
SUNY College at Brockport
Brockport, New York
X. CAO
Georgia State University
Atlanta, Georgia
S. SHEESHIA
American University of Science
and Technology
Beirut, Lebanon
C. XIN
Norfolk State University
Norfolk, Virginia
C. QIAO
SUNY Buffalo
Buffalo, New York
HIGH-LEVEL SYNTHESIS
INTRODUCTION
Over the years, digital electronic systems have progressed from vacuum tubes to complex integrated circuits, some of which contain millions of transistors. Electronic circuits
can be separated into two groups, digital and analog cir-
cuits. Analog circuits operate on analog quantities that are
continuous in value and in time, whereas digital circuits
operate on digital quantities that are discrete in value and
time (1). Examples of analog and digital systems are shown
in Fig. 1.
Digital electronic systems (technically referred to as
digital logic systems) represent information in digits. The digits used in digital systems are the 0 and 1, which belong to the binary mathematical number system. In logic, the 0 and 1 values can be interpreted as True and False. In circuits, True and False can be thought of as a High voltage and a Low voltage. These correspondences set the relations among logic (True and False), binary mathematics (0 and 1), and circuits (High and Low).
Logic, in its basic shape, deals with reasoning that checks the validity of a certain proposition; a proposition can be either True or False. The relation among logic, binary mathematics, and circuits enables a smooth transition of processes expressed in propositional logic to binary mathematical functions and equations (Boolean algebra), and then to digital circuits. A great wealth of scientific information exists that strongly supports the relations among these three different branches of science, which lead to the foundation of modern digital hardware and logic design.
Boolean algebra uses three basic logic operations: AND, OR, and NOT. Truth tables and symbols of the logic operators AND, OR, and NOT are shown in Fig. 2. Digital circuits implement the logic operations AND, OR, and NOT as hardware elements called gates that perform logic operations on binary inputs. The AND-gate performs an AND operation, an OR-gate performs an OR operation, and an inverter performs the negation operation NOT. The actual internal circuitry of gates is built with transistors; two different circuit implementations of inverters are shown in Fig. 3. Examples of AND, OR, and NOT gates in integrated circuits (ICs, also known as chips) are shown in Fig. 4. Besides the three essential logic operations, four other important operations exist: NOR (NOT-OR), NAND (NOT-AND), Exclusive-OR (XOR), and Exclusive-NOR (XNOR).
A logic circuit is usually created by combining gates together to implement a certain logic function. A logic function could be a combination of logic variables (such as A, B, and C) with logic operations; logic variables can take only the values 0 or 1. The created circuit could be implemented using a suitable gate structure. The design process usually starts from a specification of the intended circuit; for example, consider the design and implementation of a three-variable majority function. The function F(A, B, C) will return a 1 (High or True) whenever the number of 1s in the inputs is greater than or equal to the number of 0s. The truth table of F is shown in Fig. 5(a). The terms that make the function F return a 1 are F(0, 1, 1), F(1, 0, 1), F(1, 1, 0), and F(1, 1, 1). This could be alternatively formulated as in the following equation:
F = A'BC + AB'C + ABC' + ABC
In Fig. 5(b), the implementation using a standard AND-OR-Inverter gate structure is shown. Some other specifications might require functions with a larger number of inputs and, accordingly, a more complicated design process.
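As an illustration of how such a specification can be checked before implementation, the short Python sketch below enumerates the truth table of the majority function and verifies that the sum-of-products expression above agrees with the majority condition and with the commonly cited minimized form AB + AC + BC (the minimized form is a standard simplification offered here for illustration, not one taken from the figures of this article).

from itertools import product

def majority(a, b, c):
    return int(a + b + c >= 2)            # 1 when the 1s equal or outnumber the 0s

def sum_of_products(a, b, c):
    # F = A'BC + AB'C + ABC' + ABC, with ' denoting NOT
    return ((1 - a) & b & c) | (a & (1 - b) & c) | (a & b & (1 - c)) | (a & b & c)

def minimized(a, b, c):
    return (a & b) | (a & c) | (b & c)    # AB + AC + BC

for a, b, c in product((0, 1), repeat=3):
    assert majority(a, b, c) == sum_of_products(a, b, c) == minimized(a, b, c)
print("All three forms of F agree on all 8 input combinations.")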
The complexity of the digital logic circuit that corre-
sponds to a Boolean function is directly related to the
complexity of the base algebraic function. Boolean functions may be simplified by several means. The simplification step is usually called optimization or minimization, as
it has direct effects on the cost of the implemented circuit
and its performance. The optimization techniques range
from simple (manual) to complex (automated using a
computer).
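For instance, the majority function above can be minimized by hand using only basic Boolean identities (duplicating the term ABC, since X + X = X):
F = A'BC + AB'C + ABC' + ABC = (A' + A)BC + A(B' + B)C + AB(C' + C) = AB + AC + BC,
which reduces the two-level implementation from four 3-input AND gates and a 4-input OR gate (plus inverters) to three 2-input AND gates and a 3-input OR gate.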
The basic hardware design steps can be summarized in
the following list:
1. Specification of the required circuit.
2. Formulation of the specification to derive algebraic equations.
3. Optimization of the obtained equations.
4. Implementation of the optimized equations using a suitable hardware (IC) technology.
The above steps are usually accompanied by an essential verification procedure that ensures the correctness and completeness of each design step.
Basically, three types of IC technologies can be used to implement logic functions (2); these are full-custom, semi-custom, and programmable logic devices (PLDs). In full-custom implementations, the designer cares about the realization of the desired logic function down to the deepest details, including gate-level and transistor-level optimizations, to produce a high-performance implementation. In semi-custom implementations, the designer uses ready logic-circuit blocks and completes the wiring to achieve an acceptable-performance implementation in a shorter time than full-custom procedures. In PLDs, both the logic blocks and the wiring are ready; in implementing a function on a PLD, the designer only decides which wires and blocks to use. This step is usually referred to as programming the device.
The task of manually designing hardware tends to be
extremely tedious, and sometimes impossible, with the
Figure 1. A simple analog system (microphone, analog amplifier, speaker) and a simple digital system (e.g., a personal digital assistant or a mobile phone); the analog system amplifies the input signal using analog electronic components. The digital system can still include analog components like a speaker and a microphone; the internal processing is digital.
Figure 2. (a) Truth tables for AND, OR, and Inverter. (b) Truth tables for AND, OR, and Inverter in binary numbers. (c) Symbols for AND, OR, and Inverter with their operation.
Figure 3. Complementary metal-oxide semiconductor (CMOS) and transistor-transistor logic (TTL) inverters.
increasing complexity of modern digital circuits. Fortu-
nately, the demand for large digital systems has been
accompanied with a fast advancement in IC technolo-
gies. Indeed, IC technology has been growing faster than
the ability of designers to produce hardware designs.
Hence, a growing interest has occurred in developing
techniques and tools that facilitate the process of hard-
ware design.
The task of making hardware design simpler has been inspired largely by the success story of facilitating the programming of traditional computers. This success has motivated hardware designers to follow closely in the footsteps of software designers, leading to a synergy between the two disciplines known as hardware/software codesign.
SOFTWARE DESIGN
A computer is basically composed of a computational unit made of logic components whose main task is to perform arithmetic and logic operations; this unit is usually called the arithmetic and logic unit (ALU). The computations performed by the ALU are usually controlled by a neighboring unit called the control unit (CU). The ALU and the CU constitute the central processing unit (CPU), which is usually attached to a storage unit, a memory unit, and input and output units to build a typical digital computer. A simplified digital computer is shown in Fig. 6.
To perform an operation using the ALU, the computer should provide a sequence of bits (a machine instruction) that includes signals to enable the appropriate operation, the inputs, and the destination of the output. To run a whole program (a sequence of instructions), the computations are provided sequentially to the computer. As program sizes grow, dealing with 0s and 1s becomes difficult. Efforts to facilitate dealing with computer programs concentrated on the creation of translators that hide the complexity of programming with 0s and 1s. An early proposed translator produced the binary sequence of bits (machine instructions) from easy-to-handle instructions written using letters and numbers, called assembly language instructions. The translator performing this job is called an assembler (see Fig. 7).
Before long, the limitations of assembly instructions
became apparent for programs consisting of thousands of
Figure 4. The 74LS21 (AND), 74LS32 (OR), and 74LS04 (Inverter) TTL ICs.
Input A  Input B  Input C  |  Output F
   0        0        0     |     0
   0        0        1     |     0
   0        1        0     |     0
   0        1        1     |     1
   1        0        0     |     0
   1        0        1     |     1
   1        1        0     |     1
   1        1        1     |     1
Figure 5. (a) Truth table of the three-variable majority function. (b) Standard AND-OR-Inverter implementation of the majority function.
Figure 6. A typical organization of a digital computer (a CPU consisting of the CU and ALU, connected to memory, storage, and I/O units).
instructions. The solution came in favor of translation again; this time the translator is called a compiler. Compilers automatically translate sequential programs, written in a high-level language like C or Pascal, into equivalent assembly instructions (see Fig. 7). Translators like assemblers and compilers helped software designers ascend to higher levels of abstraction. With compilers, a software designer can code with fewer lines that are easy to understand. The compiler then does the remaining job of translation, hiding all the low-level complex details from the software designer.
TOWARD AUTOMATED HARDWARE DESIGN
Translation from higher levels of abstraction for software
has motivated the creation of automated hardware design
(synthesis) tools. The idea of hardware synthesis is very similar to that of software compilation: A designer can produce hardware circuits by automatically synthesizing an easy-to-understand description of the required circuit, together with a list of performance-related requirements.
Several advantages exist to automating part or all of the
hardware design process and to moving automation to
higher levels of abstraction. First, automation assures a
much shorter design cycle and time to market for the
produced hardware. Second, automation allows for more
exploration of different design styles because different
designs can be synthesized and evaluated within a short
time. Finally, well-developed design automation tools may outperform human designers in generating high-quality designs.
HARDWARE DESIGN APPROACHES
Two different approaches emerged from the debate over
ways to automate hardware design. On the one hand, the
capture-and-simulate proponents believe that human
designers have good design experience that cannot be
automated. They also believe that a designer can build a
design in a bottom-up style from elementary components
such as transistors and gates. Because the designer is
concerned with the deepest details of the design, optimized
and cheap designs could be produced. On the other hand,
the describe-and-synthesize advocates believe that synthesizing algorithms can outperform human designers. They also believe that a top-down fashion is better suited for designing complex systems. In the describe-and-synthesize methodology, the designers first describe the design; then, computer-aided design (CAD) tools generate the physical and electrical structure. This approach describes the intended designs using special languages called hardware description languages (HDLs). Some HDLs are very similar to traditional programming languages like C and Pascal (3).
Both design approaches may be correct and useful at some point. For instance, circuits made from replicated small cells (like memory) tend to perform efficiently if the cell is captured, simulated, and optimized down to the deepest-level components (such as transistors). In contrast, a complicated heterogeneous design that will be developed and mapped onto a ready prefabricated device, like a PLD, where no optimizations are possible at the electronics level, can be described and synthesized automatically. However, modern synthesis tools are well equipped with powerful automatic optimization capabilities.
HIGH-LEVEL HARDWARE SYNTHESIS
Hardware synthesis is a general term used to refer to the processes involved in automatically generating a hardware design from its specification. High-level synthesis (HLS) can be defined as the translation from a behavioral description of the intended hardware circuit into a structural description, similar to the compilation of programming languages (such as C and Pascal) into assembly language. The behavioral description represents an algorithm, equation, and so on, whereas a structural description represents the hardware components that implement the behavioral description. Despite the general similarity between hardware and software compilation, hardware synthesis is a multilevel and complicated task. In software compilation, you translate from a high-level language to a low-level language, whereas in hardware synthesis, you step through a series of levels.
High-level C-language code:
void swap (int &a, int &b)
{
    int temp;
    temp = a;
    a = b;
    b = temp;
}
Intel assembly language code (produced from the C code by the compiler):
Swap MACRO a, b
    MOV AX, a
    MOV BX, b
    MOV a, BX
    MOV b, AX
    RET
ENDM
Binary machine code (produced from the assembly code by the assembler):
0110111010110011001010010010101
1110111010110011001010000010101
1111111010110011001010000010101
0111111010110011000000010010101
0100001010110010000111011010101
0110111010110011111010010010101
0111000101100110010000000010101
Figure 7. The translation process of high-level programs.
To explain more about behavior, structure, and their correspondences, Fig. 8 shows Gajski's Y-chart. In this chart, each axis represents a type of description (behavioral, structural, and physical). On the behavioral side, the main concern is for algorithms, equations, and functions, but not for implementations. On the structural side, implementation constructs are shown; the behavior is implemented by connecting components with known behavior. On the physical side, circuit size, component placement, and wire routes on the developed chip (or board) are the main focus.
The chained synthesis tasks at each level of the design process include system synthesis, register-transfer synthesis, logic synthesis, and circuit synthesis. System synthesis starts with a set of processes communicating through either shared variables or message passing. It generates a structure of processors, memories, controllers, and interface adapters from a set of system components. Each component can be described using a register-transfer language (RTL). RTL descriptions model a hardware design as circuit blocks and interconnecting wires. Each of these circuit blocks could be described using Boolean expressions. Logic synthesis translates Boolean expressions into a list of logic gates and their interconnections (a netlist). The gates used could be components from a given library, such as NAND or NOR. In many cases, a structural description using one library must be converted into one using another library (usually referred to as technology mapping). Based on the produced netlist, circuit synthesis generates a transistor schematic from a set of input/output current, voltage, and frequency characteristics, or equations. The synthesized transistor schematic contains transistor types, parameters, and sizes.
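As a simple hypothetical illustration of technology mapping, a two-level AND-OR expression can be retargeted to a NAND-only library by applying De Morgan's law twice:
F = AB + CD = ((AB + CD)')' = ((AB)' (CD)')',
so F is realized with two 2-input NAND gates feeding a third NAND gate; the logic is unchanged and only the library cells used to express it differ.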
Early contributions to HLS were made in the 1960s. The ALERT system (4), developed at IBM (Armonk, NY), automatically translates behavioral specifications written in APL (5) into logic-level implementations. The MIMOLA system (1976) generated a CPU from a high-level input specification (6). HLS has witnessed considerable growth since the early 1980s, and currently it plays a key role in modern hardware design.
HIGH-LEVEL SYNTHESIS TOOLS
A typical modern hardware synthesis tool includes HLS, logic synthesis, placement, and routing steps, as shown in Fig. 9. In terms of Gajski's Y-chart vocabulary, these modern tools synthesize a behavioral description into a
Figure 8. Gajski's Y-chart. Behavioral domain (from lowest to highest level): transistor functions, Boolean expressions, register transfers, flowcharts and algorithms. Structural domain: transistors; gates and flip-flops; registers, ALUs, and MUXs; processors, memories, and buses. Physical domain: transistor layouts, cells, chips, boards and multichip modules. The corresponding synthesis tasks are circuit synthesis, logic synthesis, register-transfer synthesis, and system synthesis.
Figure 9. The process of describe-and-synthesize for hardware development: a behavioral description is processed by high-level synthesis (allocation, binding, and scheduling) into a register-transfer level description, by logic synthesis (combinational and sequential logic optimization and technology mapping) into a netlist, and by placement and routing into the hardware implementation.
structural network of components. The structural network is then synthesized further, optimized, placed physically in a certain layout, and then routed. The HLS step includes, first, allocating the necessary resources for the computations needed in the provided behavioral description (the allocation stage). Second, the allocated resources are bound to the corresponding operations (the binding stage). Third, the order of execution of the operations is scheduled (the scheduling stage). The output of the high-level synthesizer is an RT-level description. The RT-level description is then synthesized logically to produce an optimized netlist. Gate netlists are then converted into circuit modules by placing cells of physical elements (transistors) into several rows and connecting input/output (I/O) pins through routing in the channels between the cells. The following example illustrates the HLS stages (allocation, binding, and scheduling).
Consider a behavioral specification that contains the statement s = a^2 + b^2 + 4b, where the variables a and b are predefined. Assume that the designer has allocated two multipliers (m1 and m2) and one adder (ad) for s. However, to compute s directly, a total of three multipliers and two adders could be used, as shown in the dataflow graph in Fig. 10.
A possible binding and schedule for the computations of s are shown in Fig. 11. In the first step, the multiplier m1 is bound to the computation of a^2, and the multiplier m2 is bound to the computation of b^2. In the second step, m1 is reused to compute 4b; also, the adder (ad) is used to perform a^2 + b^2. In the third and last step, the adder is reused to add 4b to (a^2 + b^2). Different bindings and schedules are possible. Bindings and schedules could be carried out to satisfy a certain optimization goal, for example, to minimize the number of computational steps, routing, or multiplexing.
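A minimal RT-level Verilog sketch of the schedule of Fig. 11 is given below; the module, register, and controller names are illustrative and are not the output of any particular synthesis tool. Registers r_m1, r_m2, and r_ad hold the results of the bound resources m1, m2, and ad, and the three case branches correspond to the three scheduled steps.
// Illustrative three-step datapath for s = a^2 + b^2 + 4b with resource reuse.
module s_datapath (
    input  wire        clk, rst, start,
    input  wire [7:0]  a, b,
    output reg  [17:0] s,
    output reg         done
);
    reg [1:0]  step;
    reg [15:0] r_m1, r_m2;   // multiplier result registers (m1, m2)
    reg [16:0] r_ad;         // shared-adder result register (ad)
    always @(posedge clk) begin
        if (rst) begin
            step <= 2'd0;
            done <= 1'b0;
        end else begin
            done <= 1'b0;
            case (step)
                2'd0: if (start) step <= 2'd1;   // idle, wait for start
                2'd1: begin                      // step 1: m1 = a*a, m2 = b*b
                    r_m1 <= a * a;
                    r_m2 <= b * b;
                    step <= 2'd2;
                end
                2'd2: begin                      // step 2: reuse m1 for 4*b, use ad for a^2 + b^2
                    r_m1 <= b * 4;
                    r_ad <= r_m1 + r_m2;
                    step <= 2'd3;
                end
                2'd3: begin                      // step 3: reuse ad to add 4*b and finish
                    s    <= r_ad + r_m1;
                    done <= 1'b1;
                    step <= 2'd0;
                end
            endcase
        end
    end
endmodule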
HARDWARE DESCRIPTION LANGUAGES
HDLs, like traditional programming languages, are often categorized according to their level of abstraction. Behavioral HDLs focus on algorithmic specifications and support constructs commonly found in high-level imperative programming languages, such as assignments and conditionals.
Verilog (7) and VHDL (Very High Speed Integrated Circuit Hardware Description Language) (8) are by far the most commonly used HDLs in industry. Both of these HDLs support different styles for describing hardware, for example, a behavioral style and a structural gate-level style. VHDL became IEEE Standard 1076 in 1987, and Verilog became IEEE Standard 1364 in December 1995.
Figure 10. A possible allocation, binding, and scheduling of s = a^2 + b^2 + 4b (using three multipliers, m1, m2, and m3, and two adders, ad1 and ad2).
Figure 11. Another possible allocation, binding, and scheduling of s = a^2 + b^2 + 4b (using multipliers m1 and m2 and a single adder ad, with m1 and ad reused across the three steps).
The Verilog language uses the module construct to
declare logic blocks (with several inputs and outputs). In
Fig. 12, a Verilog description of a half-adder circuit is
shown.
In VHDL, each structural block consists of an interface description and an architecture. VHDL enables behavioral descriptions in dataflow and algorithmic styles. The half-adder circuit of Fig. 12 has a dataflow behavioral VHDL description as shown in Fig. 13; a structural description is shown in Fig. 14.
Efforts toward tools with higher levels of abstraction led to the production of many powerful modern hardware design tools. Ian Page and Wayne Luk (9) developed a compiler that transformed a subset of Occam into a netlist. Nearly 10 years later, we have witnessed the development of Handel-C (9), the first commercially available high-level language for targeting programmable logic devices (such as field-programmable gate arrays, FPGAs).
Handel-C is a parallel programming language based on the theory of communicating sequential processes (CSP) and Occam, with a C-like syntax familiar to most programmers. This language is used for describing computations that are to be compiled into hardware. A Handel-C program is not compiled into machine code but into a description of gates and flip-flops, which is then used as an input to FPGA design software. Investment in research into rapid development of reconfigurable circuits using Handel-C has largely been made at Celoxica (Oxfordshire, United Kingdom) (10). The Handel-C compiler comes packaged with the Celoxica DK Design Suite.
Almost all ANSI-C types are supported in Handel-C. Handel-C also supports all ANSI-C storage class specifiers and type qualifiers except volatile and register, which have no meaning in hardware. Handel-C offers additional types for creating hardware components, such as memory, ports, buses, and wires. Handel-C variables can only be initialized if they are global or if they are declared as static or constant.
Figure 15 shows C and Handel-C types and objects in addition to the design flow of Handel-C. Types are not limited to a fixed width in Handel-C because, when targeting hardware, there is no need to be tied to a certain width. Variables can be of different widths, thus minimizing the hardware usage. For instance, if we have a variable a that can hold a value between 1 and 5, then it is enough to use only 3 bits (declared as int 3 a).
The notion of time in Handel-C is fundamental. Each
assignment happens in exactly one clock cycle; everything
module Half_Adder (a, b, c, s);
  input a, b;
  output c, s;          // output carry (c) and sum (s)
  and Gate1 (c, a, b);  // an AND gate with inputs a and b and output c
  xor Gate2 (s, a, b);  // an XOR gate with inputs a and b and output s
endmodule
Figure 12. A Verilog description of a half-adder circuit.
library ieee;
use ieee.std_logic_1164.all;
entity Half_Adder is
  port (
    a: in  STD_LOGIC;
    b: in  STD_LOGIC;
    c: out STD_LOGIC;
    s: out STD_LOGIC);
end Half_Adder;
architecture behavioral of Half_Adder is
begin
  s <= (a xor b) after 5 ns;
  c <= (a and b) after 5 ns;
end behavioral;
Figure 13. A behavioral VHDL description of a Half_Adder.
entity Half_Adder is
  port (
    a, b: in  bit;
    c, s: out bit);
end Half_Adder;
architecture structural of Half_Adder is
  component AND2
    port (x, y: in bit; o: out bit);
  end component;
  component EXOR2
    port (x, y: in bit; o: out bit);
  end component;
begin
  Gate1 : AND2  port map (a, b, c);
  Gate2 : EXOR2 port map (a, b, s);
end structural;
Figure 14. A structural VHDL description of a Half_Adder.
else is free. An essential feature in Handel-C is the par
construct, which executes instructions in parallel.
Figure 16 provides an example showing the effect of using
par.
Building on the work carried out in Oxford's Hardware Compilation Group by Page and Luk, Saul at Oxford's Programming Research Group (12) introduced a different codesign compiler for FPGA-based systems. This compiler provides a cosynthesis and cosimulation environment for mixed FPGA and processor architectures. It compiles a C-like description to a solution containing both processors and custom hardware.
Luk and McKeever (13) introduced Pebble, a simple language designed to improve the productivity and effectiveness of hardware design. The language improves productivity by adopting reusable word-level and bit-level descriptions that can be customized by different parameter values, such as design size and the number of pipeline stages. Such descriptions can be compiled without flattening into various VHDL dialects. Pebble improves design effectiveness by supporting optional constraint descriptions, such as placement attributes, at various levels of abstraction; it also supports run-time reconfigurable designs.
Todman and Luk (14) proposed a method that combines declarative and imperative hardware descriptions. They investigated the use of the Cobble language, which allows abstractions to be performed in an imperative setting. Designs created in Cobble benefit from efficient bit-level implementations developed in Pebble. Transformations are suggested to allow declarative Pebble blocks to be used in Cobble's imperative programs.
Weinhardt (15) proposed a high-level language programming approach for reconfigurable computers. This approach automatically partitions the design between hardware and software and synthesizes pipelined circuits from parallel for loops.
Figure 15. C and Handel-C types and objects, together with the Handel-C design flow (C legacy code, data refinement, data parallelism, code optimization, Handel-C compiler, netlist, implementation refinement). Types and keywords found only in conventional C: double, float, union. Found in both: int, unsigned, char, long, short, enum, register, struct, static, extern, volatile, void, const, auto, signed, typedef. Handel-C only (architectural, compound, and special types): chan, chanin, chanout, ram, rom, wom, mpram, signal, sema, typeof, undefined, <>, inline, interface. Handel-C types can be classified into common logic types, architectural types, compound types, and special types.
Najjar et al. (16) presented a high-level algorithmic language and optimizing compiler for the development of image-processing applications on reconfigurable computing (RC) systems. SA-C, a single-assignment variant of the C programming language, was designed for this purpose.
A prototype HDL called Lava was developed by Satnam Singh at Xilinx, Inc. (San Jose, CA) and by Mary Sheeran and Koen Claessen at Chalmers University in Sweden (17). Lava allows circuit tiles to be composed using powerful higher-order combinators. The language is embedded in the Haskell lazy functional programming language. Xilinx's implementation of Lava is designed to support the rapid representation, implementation, and analysis of high-performance FPGA circuits.
Besides the above advances in the area of high-level hardware synthesis, the current market has other tools employed to aid programmable hardware implementations. These tools include the Forge compiler from Xilinx, the SystemC language, the Nimble compiler for the Agileware architecture from Nimble Technology (now Actuate Corporation, San Mateo, CA), and Superlog.
Forge is a tool for developing reconfigurable hardware, mainly on FPGAs. Forge uses Java with no changes to its syntax and requires no hardware design skills. The Forge design suite compiles into Verilog, which is suitable for integration with standard HLS and simulation tools.
SystemC is based on a methodology that can be used effectively to create a cycle-accurate model of a system consisting of software, hardware, and their interfaces in C++. SystemC is easy to learn for people who already use C/C++. SystemC produces an executable specification, which helps avoid inconsistencies and errors. The executable specification helps to validate the system functionality before it is implemented. The momentum behind building the SystemC language and modeling platform is to find a proper solution for representing functionality, communication, and software and hardware implementations at various levels of abstraction.
The Nimble compiler is an ANSI-C-based compiler for a particular type of architecture called Agileware. The Agileware architecture consists of a general-purpose CPU and a dynamically configurable datapath coprocessor with a memory hierarchy. The compiler can parallelize and compile code into hardware and software without user intervention. Nimble can extract computationally intensive loops, turn them into dataflow graphs, and then compile them into a reconfigurable datapath.
Superlog is an advanced version of Verilog. It adds more abstract features to the language to allow designers to handle large and complex chip designs without getting too deep into the details. In addition, Superlog adds many object-oriented features as well as advanced programming constructs to Verilog.
Other well-known HLS and hardware design tools include Altera's Quartus (San Jose, CA), Xilinx's ISE, and Mentor Graphics's HDL Designer, Leonardo Spectrum, Precision Synthesis, and ModelSim (Wilsonville, OR).
HIGHER LEVEL HARDWARE DESIGN METHODOLOGIES
The area of deriving hardware implementations from high-level specifications has been witnessing continuous growth. The aim has always been to reach higher levels of abstraction through correct, well-defined refinement steps. Many frameworks for developing correct hardware have been described in the literature (18-20).
Hoare and colleagues (20), in the Provably Correct Systems (ProCoS) project, suggested a mathematical basis for the development of embedded and real-time computer systems. They used FPGAs as back-end hardware for realizing their developed designs. The framework included novel specification languages and verification techniques for four levels of development:
A halt instruction.
Early computers could do little more than this basic instruction set. As machines evolved and changed, greater hardware capability was added, e.g., the addition of multiplication and division units, floating-point units, multiple registers, and complex instruction decoders. Instruction sets expanded to reflect the additional hardware capability by combining frequently occurring instruction sequences into a single instruction. The expanding CISCs continued well into the 1980s until the introduction of RISC machines changed this pattern.
Classes of Instruction Set Architectures
Instruction sets are often classified according to the method used to access operands. ISAs that support memory-to-memory operations are sometimes called SS (storage-storage) architectures, whereas ISAs that support basic arithmetic operations only in registers are called RR (register-register) architectures.
Consider an addition, A = B + C, where the values of A, B, and C have been assigned memory locations 100, 200, and 300, respectively. If an instruction set supports three-address memory-to-memory instructions, a single instruction,
ADD A, B, C
would perform the required operation. This instruction would cause the contents of memory locations 200 and 300 to be moved into registers in the arithmetic logic unit (ALU), the add to be performed in the ALU, and then the result to be stored into location 100.
However, it is unlikely that an instruction set would provide this three-address instruction. One reason is that the instruction requires many bytes of storage for all the operand information and, therefore, is slow to load and interpret. Another reason is that later operations might need the result of the operation (for example, if B + C were a subexpression of a later, more complex expression), so it is advantageous to retain the result for subsequent instructions to use.
A two-address register-to-memory alternative might be as follows:
LOAD  R1, B    ; R1 := B
ADD   R1, C    ; R1 := R1 + C
STORE A, R1    ; A  := R1
whereas a one-address alternative would be similar, with the references to R1 (register 1) removed. In the latter scheme, there would be only one hardware register available for use and, therefore, no need to specify it in each instruction. (Example machines are the IBM 1620 and 7094.)
Most modern ISAs belong to the RR category and use
general-purpose registers (organized either independently
or as stacks) as operands. Arithmetic instructions require
that at least one operand is in a register while load and
store instructions (or push and pop for stack-based
machines) copy data between registers and memory. ISAs for RISC machines require both operands to be in registers for arithmetic instructions. If the ISA defines a register file of some number of registers, the instruction set will have commands that access, compute with, and modify all of those registers. If certain registers have special uses, such as a stack pointer, instructions associated with those registers will define the special uses.
The various alternatives that ISAs make available,
such as
Instruction pipelines
Load/store architecture
Table. Intel IA-32 instructions and their mnemonics (partial listing):
Adjust RPL Field of Selector: ARPL | Loop Count (with condition): LOOP
Call Procedure (in different segment): CALL | Pop Word/Register(s) from Stack: POP
Convert Byte to Word: CBW | Push Word/Register(s) onto Stack: PUSH
Convert Doubleword to Quadword: CDQ | Rotate through Carry Left: RCL
Clear Carry Flag: CLC | Rotate through Carry Right: RCR
Clear Direction Flag: CLD | Read from Model-Specific Register: RDMSR
Clear Interrupt Flag: CLI | Read Performance-Monitoring Counters: RDPMC
Clear Task-Switched Flag in CR0: CLTS | Read Time-Stamp Counter: RDTSC
Complement Carry Flag: CMC | Input String: REP INS
Conditional Move: CMOVcc | Load String: REP LODS
Compare Two Operands: CMP | Move String: REP MOVS
Compare String Operands: CMPS[W/D] | Output String: REP OUTS
Compare/Exchange: CMPXCHG | Store String: [REP] STOS
Compare/Exchange 8 Bytes: CMPXCHG8B | Compare String: REP[N][E] CMPS
CPU Identification: CPUID | Scan String: [REP][N][E] SCAS
Convert Word to Doubleword: CWD | Return from Procedure: RET
Convert Word to Doubleword: CWDE | Rotate Left: ROL
Decimal Adjust AL after Addition: DAA | Rotate Right: ROR
Decimal Adjust AL after Subtraction: DAS | Resume from System Management Mode: RSM
Decrement by 1: DEC | Store AH into Flags: SAHF
Unsigned Divide: DIV | Shift Arithmetic Left: SAL
Make Stack Frame for Procedure: ENTER | Shift Arithmetic Right: SAR
Halt: HLT | Subtract with Borrow: SBB
Signed Divide: IDIV | Set Byte on Condition: SETcc
Signed Multiply: IMUL | Store Global Descriptor Table Register: SGDT
Input from Port: IN | Shift Left [Double]: SHL[D]
Increment by 1: INC | Shift Right [Double]: SHR[D]
Input from DX Port: INS | Store Interrupt Descriptor Table Register: SIDT
Interrupt Type n: INT n | Store Local Descriptor Table Register: SLDT
Single-Step Interrupt: INT 3 | Store Machine Status Word: SMSW
Interrupt 4 on Overflow: INTO | Set Carry Flag: STC
Invalidate Cache: INVD | Set Direction Flag: STD
Invalidate TLB Entry: INVLPG | Set Interrupt Flag: STI
Interrupt Return: IRET/IRETD | Store Task Register: STR
Jump if Condition is Met: Jcc | Integer Subtract: SUB
Jump on CX/ECX Zero: JCXZ/JECXZ | Logical Compare: TEST
Unconditional Jump (same segment): JMP | Undefined Instruction: UD2
Load Flags into AH Register: LAHF | Verify a Segment for Reading: VERR
Load Access Rights Byte: LAR | Wait: WAIT
Load Pointer to DS: LDS | Write Back and Invalidate Data Cache: WBINVD
Load Effective Address: LEA | Write to Model-Specific Register: WRMSR
High-Level Procedure Exit: LEAVE | Exchange and Add: XADD
Load Pointer to ES: LES | Table Look-up Translation: XLAT[B]
Load Pointer to FS: LFS | Logical Exclusive OR: XOR
Mask operations
The scan signals F_i(t) are constructed to be mutually orthogonal over the addressing period T:
\int_0^T F_i(t)\,F_j(t)\,dt =
\begin{cases}
F^2, & i = j \\
0,   & i \neq j
\end{cases}
\qquad (1)
The column functions (data signals) are constructed by
combining the scan signals properly. Their value at a given
time depends on the ON/OFF states of the pixels they are
meant to activate, as follows:
G_j(t) = \frac{1}{N}\sum_{i=1}^{p} a_{ij}\,F_i(t), \qquad i, j = 1, \ldots, p \qquad (2)
Orthogonality of the scan signals ensures that individual pixels remain unaffected by the state of the others along the same column. The advantages of MLA include the power savings achievable because of the moderate supply voltages required by the technique and the possibility of reducing the frame frequency without fearing frame response. In LC jargon, frame response refers to the relaxation of the liquid crystal directors over a frame time, which leads to contrast lowering and image flickering. By means of MLA, the LC is pushed several times within a frame so that virtually no relaxation occurs: The image contrast can be preserved, the flicker restrained, and artifacts like smearing on moving objects (e.g., while scrolling) suppressed by eventually adopting a faster-responding LC material. On the downside, all of this comes at the expense of an increase in the number of driving voltages to be generated (three row voltages and p + 1 column voltages are needed) and of more complex driver realizations.
Alternative MLA Schemes
The sets of orthogonal scan signals used in MLA are assigned through matrices. Each signal is described along a row, whereas each column shows the constant normalized value (+1, -1, or 0) assumed by the resulting waveform at the corresponding time slot. Theoretically, the matrices can be rectangular, where the number of rows indicates the number of concurrently addressed display rows (p), and the number of columns indicates the number of MLA scans needed to cover the whole panel (in turn equal to the number of time slots composing each orthogonal signal).
Different types of matrices that meet the orthogonality constraints are available and are used diversely in practical implementations of MLA schemes (5). A first class is the set of Walsh functions, arising as 2^s orthogonal functions derived from Hadamard matrices. The class of the so-called Walking-1 functions is also used extensively: They are built up of p - 1 positive pulses (+1) and 1 negative pulse (-1) shifting right or left from one function to the other. Hadamard and Walking-1 matrices, like many other standardized patterns, contain only +1 and -1 entries. However, since the number of nonzero entries in the matrix columns plus one corresponds to the number of column voltage levels used for waveform generation, it is desirable to introduce zeroes into the matrices. In this respect, conference matrices with one zero per row represent a smart solution.
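As an illustration, the rows of the standard 4 x 4 Hadamard matrix, from which Walsh-type scan functions for p = 4 can be derived, are mutually orthogonal sequences of +1/-1 values:
H_4 =
[ +1 +1 +1 +1 ]
[ +1 -1 +1 -1 ]
[ +1 +1 -1 -1 ]
[ +1 -1 -1 +1 ]
Any two distinct rows agree in exactly two time slots and disagree in the other two, so their products sum to zero over the four slots, which is precisely the orthogonality condition of Equation (1) in discrete form.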
Once the matrix structure has been chosen, the value of p must also be set, i.e., the number of rows addressed in parallel. The choice is dictated by optical performance considerations: It is widely agreed that for p up to 8, frame response is effectively suppressed, so p is usually set to 2, 3, or 4 in commercial mobile display drivers (for ease of reference, 2-MLA, 3-MLA, or 4-MLA, respectively). For 2-MLA and 4-MLA, either Walsh or Walking-1 functions can be used, whereas 3-MLA works with 4 x 4 conference matrices (the 0 entry corresponding to one nonselected row out of 4). Also, a particularly interesting solution exists, sometimes referred to as virtual-3-MLA, which arises as a modification of the 4-MLA scheme. In fact, virtual-3-MLA is identical to 4-MLA as far as the driver operation is concerned, but the fourth row-driver output out of every set of four rows is not connected to the display and is simply left open. On the display side, row number 4 is connected to the fifth driver row output, and so on for the remaining ones. It can be calculated that with virtual-3-MLA, only two voltage levels are required to construct the data signals, which represents a significant improvement, making the driver less complex and reducing the power dissipation.
The rationale of MLA does not entail any built-in measure to guarantee optical uniformity in the brightness of the pixels. Extended features are therefore commonly added on top of the basic MLA scheme. Drive pattern switching is a popular example, consisting of the periodic interchange of the set of orthogonal functions used; the change usually is scheduled to occur before every p-row addressing sequence.
ACTIVE-MATRIX LCDS
Addressing Concept
In AMLCDs, each pixel within the display matrix can be accessed only when the corresponding switch is closed. Pixel driving is realized by means of three separate electrodes, as shown in Fig. 5, where a circuit model of the pixel has been worked out. The concept of AMLCD driving is theoretically simple with respect to PMLCDs: Adjacent rows are addressed sequentially by applying a positive pulse to the row electrodes. The row electrodes are connected directly to the switches' gates and are thus used to switch ON/OFF all pixels along a given row at the same time. In turn, the column electrodes are used to build up the voltage across the LC cell according to the pixel data. In practical scenarios, row-interlaced patterns are used in place of the rudimentary row-sequential addressing, with beneficial effects on the driver's overall power consumption. Even if the addressing principle is straightforward, various dedicated countermeasures usually are exploited to target AMLCD-specific drawbacks, which complicate the realization of the driving schemes.
Effects of Non-Idealities in Active-Matrix Addressing
Parasitic gate-source and gate-drain capacitances of the switching transistors have been drawn explicitly in Fig. 5. They are responsible for several undesired consequences that affect the display's dynamic operation. The gate-drain capacitance, for instance, brings in an appreciable overshoot that distorts the driving voltage across the pixel cell at the switching point of the row driving pulses. The mechanism is shown in Fig. 6 for a typical voltage waveform: The effect is known as the kickback effect and is caused in practice by extra charge loaded into the pixel capacitance. The overshoot, unless mitigated purposely, results in visible display flickering. A common technique used to fight the kickback effect is common electrode modulation: In essence, the common electrode is driven so as to compensate for the additional
Figure 5. Pixel model and display matrix addressing for AMLCDs (row, column, and common electrodes; switch with parasitic capacitances C_GS and C_GD; row and column waveforms and pixel data).
spurious charge injection affecting the pixel capacitance at
critical time points within the frame.
The kickback phenomenon is not the sole effect of the active-matrix non-idealities. The quality of the display is also influenced by the sensitivity of the LC capacitance to the applied voltage and by different forms of leakage currents, generally depending on either the liquid crystal itself or the technology of the active devices. The LC capacitance variability often nullifies the advantages of common electrode modulation. On the other hand, leakage determines a loss in contrast and compromises the visual uniformity of the displayed pictures, as well as inducing vertical crosstalk among pixels. Both LC capacitance variations and leakage can be controlled by using a storage capacitor. The storage capacitor is connected between the switch drain and the row electrode of the previous or next display row (or to a separate electrode), so as to work in parallel with the pixel's real capacitance. Kickback correction is then made more effective, as the increase in the overall capacitance makes the cell less sensitive to any parasitic current. Another specific measure against leakage currents in the active device is driving-voltage polarity inversion, a technique that belongs to a broad class of polarity inversion methods used extensively to mitigate different forms of optical artifacts.
GRAY TONES AND COLOR GENERATION TECHNIQUES
The concept of ON/OFF pixel driving is suitable for mere black-and-white displays. Extending this principle to gray-scale panels requires that different gray levels be constructed through a sequence of black/white states driven over consecutive frames. Frame rate control (FRC) serves this purpose: Sequences of N frames (N-FRC) are grouped together to compose a superframe. Over each superframe, the sequence of black-and-white states created consecutively at each pixel is perceived as a homogeneous gray tone thanks to the persistence of the human visual system (3). Proper operation only requires that the superframe frequency, forced to equal the frame frequency divided by N, be above a minimum admissible value: A superframe frequency of over 50 Hz usually is agreed on for flicker-free image visualization. A refinement of the FRC method is frame length control, where different gray tones are generated over a superframe by varying the time-length ratio between adjacent frames in the superframe (and thereby the duration of individual black/white phases).
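As a rough numerical illustration of this constraint (using the roughly 50-Hz figure quoted above): since the superframe frequency equals the frame frequency divided by N, f_frame = N x f_superframe >= N x 50 Hz, so a 4-FRC scheme, for example, calls for a frame frequency of about 200 Hz.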
FRC is a very basic gray-shading solution, so cooperating schemes usually are combined in industrial drivers to enhance the color gamut: A common design choice is to modulate the data signal pulses to enrich the color resolution (pulse width modulation, PWM). As for the hardware, joint PWM-FRC is costly in terms of extra chip complexity, but it successfully cuts the power consumption. The programmable gray shades are defined by means of a gray-scale table (GST), which specifies the sequence of ON/OFF states scheduled to produce a given tone. An important concern in designing the color generation mechanism is the smart configuration of such a table: For instance, when applying a mixed FRC-PWM approach, pure-FRC gray tones (i.e., gray tones obtained without in-frame shade modulation) should be strictly avoided in the GST to prevent awkward perceptible artifacts.
Generalization of the gray-shading techniques to color displays is straightforward, as color channels are created by diverse color filtering at each subpixel, without any extra complexity on the driver side.
OPTICAL PERFORMANCE IN LCD DRIVING
Frequency Dependence of the Electro-Optical Characteristic
When thinking of optical performance, the main aspect to be considered is the non-negligible dependence of the LCD electro-optical transmission curve on the frequency of the applied signals, which can be modeled in different ways based on the particular view one may want to stress. At the physical level, the frequency dependence of the LC
Figure 6. Driving waveforms affected by the kickback effect in AMLCD panels.
characteristic can be ascribed to frequency drifts of the threshold voltage. More deeply, threshold frequency variations can be traced back to frequency shifts of the liquid crystal dielectric constant, which will most likely show up as discrepancies between the programmed rms voltage across the LC cells and the actual driving level. Extensive data are catalogued by LCD manufacturers, which helps in selecting the most suitable LC technology. In fact, the dynamic operation of an LCD is jointly determined by the cross-correlation among several constructive and material-dependent parameters (6,7). Not only must LC-specific factors be regarded, but information about the display module arrangement is equally important. As an example, desired-versus-actual voltage mismatches may also arise from the capacitive coupling between row and column electrodes, both within individual pixels and between adjacent pixels.
For design purposes, all frequency dependencies can be translated into an input-output frequency-selective relationship between the applied LCD driving waveforms and the effective voltage signals controlling the LC cells. This approach is the most common, and it also suits the electrical modeling of the LC cell as a simplified passive RC network. By studying the display response, several design rules can be worked out to serve as a reference when designing the driver architecture. First and foremost, it is desirable that the frequency band of the drive signals be narrow, to prevent uneven frequency filtering and optical distortion. MLA guarantees band collimation, since most of the spectrum becomes concentrated around p times the frame frequency. However, experimental results show that when other schemes (such as PWM) are mounted on top of MLA, frequency multiplication may turn out to reinforce rather than suppress artifacts. Polarity inversion is another established methodology for bandwidth reduction, which entails that the signs of both the row and the column signals be periodically inverted, with an inversion period set on a superframe, frame, or block-of-N-lines basis. Dot inversion is also possible in AMLCDs, where the inversion takes place from one pixel location to the adjacent one. Whatever inversion is used, the lowest frequency of the spectrum is upshifted, and the DC offset in the driving signals is suppressed. The latter is another important result, since the DC component is a primary cause of LC degeneration and of panel lifetime reduction.
Keeping the waveform frequency spectrum under control is also vital with respect to energy awareness. Curbing the chip power consumption is possible when frame/superframe frequencies are low enough: In fact, the superframe frequency can be made even lower than 50 Hz if phase mixing is supported.
Phase Mixing. Phase mixing exploits the spatial low-pass filtering capability of the human eye to construct the same gray level in adjacent pixels (or blocks of pixels) by driving the same number of ON/OFF states (phases) but through different sequences over the superframe. If the individual ON/OFF states out of the GST are referred to as phases for each gray tone, phase mixing implies scheduling different phases during the same frame for each pixel out of a region of adjacent ones. Phase mixing better distributes the voltage switching activity over the columns and produces a lowering of the global frequency. To yield the best optical performance, phase mixing is typically applied on an RGB-subpixel basis (subpixel blocks instead of pixel blocks), and the phase pattern (which phases are driven at which position) is switched from one pixel (respectively, subpixel) block to another. A designer wanting to implement phase mixing arranges a phase-mixing table holding the basic phase sequencing for pixel blocks within the display matrix, together with the related phase-switching rule (which phase follows a given one at any location). The setting of the phase-switching table has a strong impact on the chip functionality, so it must be granted particular care.
Flicker
The meaning of flicker in the scope of LCD artifacts is familiar; however, the reasons why flicker affects display visualization and the countermeasures needed to remove it might be less clear. Flicker generally stems from an incorrect distribution of the driving waveforms' frequency spectrum: Time-uneven or amplitude-unbalanced contributions to the rms value over a frame are likely to bring about flicker. Hence, common solutions to suppress flicker are part of the general measures used to regulate the frequency spectrum of the voltage waveforms: Lowering the frame frequency, using MLA-based driving (for flicker caused by uneven rms contributions over a frame), and embedding some smart phase-mixing scheme (for flicker caused by uneven rms contributions over a superframe) currently are deployed in industrial PMLCD modules. As for AMLCDs, flicker always represents a crucial concern because of the kickback effect, unless dedicated measures are taken, like the common electrode modulation technique or the use of an additional storage capacitor as described above.
Crosstalk
By crosstalk we denote all pattern-dependent effects of mutual interference among the gray-scale values of pixels (3,8). Those effects tend to grow worse with increasing display size, higher resolution, and faster-responding LC, all of which unfortunately are features that the display market's evolution is heading toward more and more. Diverse mechanisms are responsible for crosstalk, although they can generally be connected to the frequency-selectiveness of the LC response.
Static Crosstalk Artifacts. It is commonly agreed that static crosstalk comprises all sorts of crosstalk-related visual artifacts affecting the display of single (in this sense, static) pictures. Frame switching over time, such as in video streaming, is therefore not considered. Static crosstalk appears as differences in the brightness of theoretically equal gray-scale pixels and manifests itself in multiple manners. Simply stated, at least three types can be distinguished: vertical crosstalk, horizontal crosstalk, and block shadowing. Vertical crosstalk usually hinges on different frequency contents of different column waveforms. It is growing more and more important with the increasing steepness of the LC transmission-voltage curve required for acceptable contrast in visualization. Horizontal crosstalk occurs when differences in the LC dielectric constant for black and white pixels induce spatially asymmetrical capacitive coupling between rows and columns. The amount of perceptible artifacts depends on the width of dark/bright horizontal blocks along a row. Finally, when current spikes result from symmetrically and simultaneously changing column waveforms in areas where sizable blocks of darker pixels determine different coupling between rows and columns, vertical block shadowing is likely.
Dynamic Crosstalk Artifacts. In conjunction with static artifacts, LCD modules supporting video streaming may be affected by dynamic crosstalk. The exact characterization of dynamic crosstalk often turns out to be difficult, since many cooperating causes contribute to it. Loosely speaking, dynamic crosstalk (also called splicing) can be associated with uneven voltage contributions to the perceived rms value on switching from one frame to another. This view of the problem generally allows for quantifying the impact of improper driving schemes and for putting into action concrete measures to oppose splicing.
Crosstalk Minimization. The problem of reducing crosstalk has been attacked in diverse ways. Apart from technological advances in display manufacturing, either dedicated and more sophisticated hardware in the driver integrated circuits (such as built-in voltage-correction facilities) or specialization of the addressing schemes has been devised. Beyond the particular features of the huge number of available alternatives, it is nevertheless possible to outline some basic design guidelines that help to identify the essential concepts underlying crosstalk suppression.
Uniformity of the rms contributions over time, for instance, can be pursued through smart selection of the MLA driving mode (e.g., virtual-3-MLA) or the adoption of specific schemes such as the so-called self-calibrating driving method (SCDM) described in the literature. Both approaches effectively eliminate static crosstalk and significantly reduce, although do not entirely suppress, dynamic crosstalk. In particular, virtual-3-MLA also makes the overall optical performance insensitive to asymmetries or inaccuracies in the column voltage levels and at the same time allows for reducing the number of such levels, which is unquestionably beneficial with respect to static artifacts. Similar results can be attained by using rectangular instead of square phase-mixing tables and by enabling drive pattern switching. However, joint application of those methods should be supported by extensive back-end performance evaluation activities to diagnose potential side effects that may stem from their reciprocal interaction for particular image patterns. High-accuracy simulations usually serve this goal.
The above-mentioned polarity inversion modes are also commonly employed as good measures to alleviate static crosstalk artifacts, along with all other shortcomings of the nonlinear response of LC cells. However, an important point must be made in selecting the most appropriate inversion strategies. For instance, although separate row or column inversion effectively reduces the impact of horizontal or vertical crosstalk, respectively, vertical or horizontal crosstalk is likely to manifest if either is used independently of the other, with limited advantages in terms of power consumption. Simultaneous suppression of both vertical and horizontal crosstalk is possible with dot inversion in AMLCDs at the cost of extra driving energy. Finally, frame inversion promotes low-power operation, but its efficiency in crosstalk reduction is minimal.
The selection of unconventional FRC and PWM schemes, with particular choices of the number of frames in the superframe and of the signal modulation pulses in the scan signals, frequently leads to some controllable reshaping of the frequency spectrum with favorable effects on static crosstalk. It must be noted, however, that all spectral manipulations are only effective when they concretely match the frequency characteristic of the particular liquid: Early assessment of possible drawbacks is mandatory, and customized top-level simulators are valuable in this respect.
Useful suggestions can also be drawn when looking into the technology side of the matter. Interactions between display module parameters and static crosstalk can be tracked down easily: It is well established that static crosstalk is hampered when the resistivity of the ITO tracks is lowered, when a less frequency-dependent liquid is used, or when the deviations in the LC cell capacitances are restrained (the cell capacitance basically depends on the inter-cell gap).
As a final remark, from a theoretical perspective, the phenomenology behind static crosstalk is more easily kept under control than dynamic effects. Experimental verification or system-level simulations are often the sole viable approaches for working through the issues of dynamic artifacts.
Gray-Tone Visualization Artifacts. When GSTs are too simplistic, artifacts altering the visualization of particular color patterns may develop. For instance, with patterns characterized by (although not limited to) the incremental distribution over the pixels of the full gray-tone gamut itself from one side of the display to the other, spurious vertical dim lines may occur. Popular solutions rest essentially on some clever redefinition of the GST, e.g., the elimination of FRC-only gray tones, the use of redundancies in the PWM scheme for generating identical gray levels, or the shift of every gray tone one level up with respect to the default GST. A key factor is that the color alteration only affects some particular and well-known gray tones, so the problem usually can be confined. Because problematic gray tones commonly cause static crosstalk artifacts, their removal yields added value with respect to crosstalk suppression.
LCD DRIVER DESIGN OVERVIEW
Architectural Concept
In a display module, the driver electronics is responsible for the generation of the proper panel driving waveforms depending on the pixel RGB levels within the image to be displayed. Therefore, optimal driver design is imperative for the realization of high-quality display-based appliances. Display drivers generally are developed as application-specific integrated circuits (ASICs). When considering the hardware architecture, drivers targeting PMLCDs and AMLCDs can be treated jointly in that they share the same building units.
At the architecture level, analog subsystems play a central role in generating the high voltage levels required for driving the rows and the columns, and they usually occupy most of the on-chip electronics area. On the other hand, digital blocks do not commonly require any massive area effort; yet a great deal of vital functionality takes place in digital logic. The digital part accommodates all units involved in instruction decoding (the driver is usually fed with commands from an on-board microcontroller) and interface-data handling, as well as display-specific functions responsible for orthogonal function and scan signal generation, timing, and switching scheduling. Finally, many display drivers also are equipped and shipped with some sort of built-in video memory for local pixel data storage; this facilitates independence of operation from the host system.
LCD Driver Design Flow
A typical industrial LCD driver design flow includes all the steps needed for mixed analog–digital, very-large-scale-integration (VLSI) ASIC design, usually structured
throughout the following milestones:
1. Project proposal: analysis of system level require-
ments based on the application needs.
2. Feasibility study: architecture definition based on
system level requirements, preliminary evaluation
based on a dedicated LCD module simulator, and
project planning.
3. Project start approval.
4. Architecture implementation: block level design of
both analog and digital parts (design front-end). Ana-
log circuit design and digital register-transfer-level
description and coding are included. Top-level simu-
lation can be also performed at this stage, usually by
means of behavioral modeling for the analog parts.
5. Analog and digital place & route and timing verification (design back-end). Cosimulation of functional
testbenches with hardware description, including
back-annotated timing information, is frequently
employed at this stage. Top-level verification and checks follow.
6. Prototype-based on-chip evaluation.
All the steps outlined above describe some very general
design phases that are common to the implementations of
both active-matrix and passive-matrix addressing display
drivers.
Front-End Driving Scheme Selection and Performance
Evaluation: Simulation Tools
Considering the structure of a typical LCD design flow, in which testable prototypes become available only at the very end of the design chain, the interest in tools devoted to early-stage evaluation of the driver-display module performance should become plain. As a matter of fact, although the
hardware development environments comprise mostly
standard analog and digital ASIC design tools (those for
circuit design, register-transfer-level description and
synthesis, technology mapping, cell place & route, and
chip verification), specialized tools should be featured to
simulate all functional interactions between the front-end
driver and the back-end panel. Indeed, merely relying on
the chip prototype evaluation phase for performance
assessment is often impractical because of the severe costs
associated with hardware redesigns on late detection of
operational faults. Proprietary simulation tools are likely to be embedded into advanced industrial LCD design flows,
and they usually allow for:
1. Functional simulation of the hardware-embedded or
software-programmable driving schemes.
2. Functional simulation of all targeted color generation
modes.
3. Functional simulations of all embedded measures to
prevent visual artifact generation (for testing and
evaluation purposes).
4. Highly reconfigurable input modes (such as single-
picture display mode, or multiframe driving mode for
video streaming evaluation).
5. Sophisticated LCD-modeling engine (including the
frequency model of the LC cell response).
6. Reconfigurability with respect to all significant physical display parameters (such as material conductivity, material resistivity, LC cell geometry, and
electrode track lengths).
7. Multiple and flexible output formats and performance figures.
The availability of an LCD circuit model is a particularly important aspect, as it opens the possibility of performing reliable evaluation of the module dynamic operation within different frequency ranges.
Display Modeling for LCD Design. Any simulation engine used for driver-display joint modeling cannot function without some form of electrical characterization of an LC cell to work as the electrical load of the driving circuits. A satisfactorily accurate model of the frequency behavior of the LC cell broadly used in practice (9) treats the cell as a simplified RC network whose resistances and capacitances can be calculated as functions of some very basic process and material parameters for the liquid crystal, the polyimide layers, and the connection tracks (e.g., LC conductivity and ITO sheet resistivity). The resulting typical frequency response of the LC cell turns out to be that of a 2-pole, 1-zero network. Consequently, apart from the requirement that the driving signals' bandwidth be narrow for crosstalk minimization, their lowest frequency bound must also be high enough to prevent distortion induced by the first pole's attenuation.
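As an illustrative aid only (the time constants below are invented placeholders, not parameters of any particular display or of the proprietary simulators mentioned above), a 2-pole, 1-zero response of this kind can be sketched and swept over frequency in a few lines of Python:

import cmath
import math

# Illustrative 2-pole, 1-zero transfer function for an LC-cell lumped RC model.
# The time constants below are assumed placeholders, not measured display data.
TAU_ZERO = 1.0e-4   # zero time constant (s), assumed
TAU_POLE1 = 5.0e-4  # first (dominant) pole time constant (s), assumed
TAU_POLE2 = 2.0e-6  # second pole time constant (s), assumed

def cell_response(freq_hz):
    """Return the complex gain of the simplified LC-cell network at freq_hz."""
    s = 1j * 2.0 * math.pi * freq_hz
    return (1 + s * TAU_ZERO) / ((1 + s * TAU_POLE1) * (1 + s * TAU_POLE2))

if __name__ == "__main__":
    # Sweep a few decades to see where the first pole starts attenuating
    # the low-frequency components of the driving waveforms.
    for exponent in range(1, 7):
        f = 10.0 ** exponent
        gain_db = 20.0 * math.log10(abs(cell_response(f)))
        print(f"{f:10.0f} Hz : {gain_db:7.2f} dB")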
BIBLIOGRAPHY
1. P. Yeh and C. Gu, Optics of Liquid Crystal Displays, 1st ed., John Wiley & Sons, 1999.
2. E. Lueder, Liquid Crystal Displays: Addressing Schemes and Electro-Optical Effects, John Wiley & Sons, 2001.
3. T. J. Scheffer and J. Nehring, Supertwisted nematic LCDs, SID Int. Sym. Dig. Tech. Papers, M-12, 2000.
4. T. N. Ruckmongathan, Addressing techniques for RMS responding LCDs - A review, Proc. 12th Int. Display Res. Conf. Japan Display '92, 77–80, 1992.
5. M. Kitamura, A. Nakazawa, K. Kawaguchi, H. Motegi, Y. Hirai, T. Kuwata, H. Koh, M. Itoh, and H. Araki, Recent developments in multi-line addressing of STN-LCDs, SID Int. Sym. Dig. Tech. Papers, 355–358, 1996.
6. H. Seiberle and M. Schadt, LC-conductivity and cell parameters; their influence on twisted nematic and supertwisted nematic liquid crystal displays, Mol. Cryst. Liq. Cryst., 239, 229–244, 1994.
7. K. Tarumi, H. Numata, H. Prücher, and B. Schuler, On the relationship between the material parameters and the switching dynamics of twisted nematic liquid crystals, Proc. 12th Int. Display Res. Conf. Japan Display '92, 587–590, 1992.
8. L. MacDonald and A. C. Lowe, Display Systems: Design and Applications, John Wiley & Sons, 1997.
9. H. Seiberle and M. Schadt, Influence of charge carriers and display parameters on the performance of passively and actively addressed LCDs, SID Int. Sym. Dig. Tech. Papers, 25–28, 1992.
FURTHER READING
J. A. Castellano, Liquid Gold: The Story of Liquid Crystal Displays and the Creation of an Industry, World Scientific Publishing Company, 2005.
P. A. Keller, Electronic Display Measurement: Concepts, Techniques, and Instrumentation, 1st ed., Wiley-Interscience, 1997.
M. A. Karim, Electro-Optical Displays, CRC, 1992.
P. M. Alt and P. Pleshko, Scanning limitations of liquid crystal displays, IEEE Trans. El. Dev., ED-21(2), 146–155, 1974.
K. E. Kuijk, Minimum-voltage driving of STN LCDs by optimized multiple-row addressing, J. Soc. Inf. Display, 8(2), 147–153, 2000.
M. Watanabe, High resolution, large diagonal color STN for desktop monitor application, SID Int. Sym. Dig. Tech. Papers, 34, M-81–87, 1997.
S. Nishitani, H. Mano, and Y. Kudo, New drive method to eliminate crosstalk in STN-LCDs, SID Int. Sym. Dig. Tech. Papers, 97–100, 1993.
SIMONE SMORFA
MAURO OLIVIERI
La Sapienza, University
of Rome
Rome, Italy
ROBERTO MANCUSO
Philips Semiconductors
Zurich, Switzerland
LOGIC SYNTHESIS
The design process for an electronics system begins when an idea is transformed into a set of specifications to be verified by the future system. These specifications become the basis for a series of steps or design tasks that eventually will produce a circuit that represents the physical expression of the original idea. The process of generating a final circuit from the initial specifications is known as circuit synthesis.
The design flow for a digital system is composed of a
series of stages in which system models are established in
accordance with different criteria. Each stage corresponds
to a level of abstraction.
To illustrate how these levels of abstraction may be
classified, we might, for example, consider three levels: the system level, the RT (register transfer) level, and the logic level. At the system level, the architecture and algorithms necessary to verify the required performance are specified. The RT level represents the system specification as an RT model, in this case establishing an architecture for data flow between registers subject to functional transformations. Finally, the logic level determines the system's functionality using logic equations and descriptions of finite state machines (FSMs). The data handled is logic data with values such as 0, 1, X, Z, etc.
Design tasks in each of the levels usually are supported
by different computer aided design (CAD) tools. In each
level, the design process basically involves two stages: (1)
description of the system at the corresponding level and (2) verification of the description's behavior via simulation.
The synthesis process consists of obtaining the system
structure from a description of the behavior. Depending
on the level of abstraction in which the work is being carried
out, the synthesis will be high level synthesis, logic synth-
esis, etc. This article addresses logic synthesis, which
involves the generation of a circuit at the logic level based
on an RT level design specification. The automation of the
synthesis process has allowed the development of several
tools that facilitate the tasks involved. Automatic synthesis
tools offer several advantages when implementing an elec-
tronic circuit. First, automation allows the design flow to be
completed in less time, which is relevant particularly today
because of the high competitiveness and the requirements
to solve demands in a short period of time. Second, auto-
matic synthesis also makes the exploration of design space
more viable because it enables different requirements, such
as cost, speed, and power, to be analyzed. Third, a funda-
mental aspect of the whole design process is its robustness,
that is, its certainty that the product is free from any errors
attributable to the designer. In this regard, the use of
automatic synthesis tools guarantees the correct construc-
tion of the system being designed.
The following section of this article deals with aspects
associated with logic design such as data types, system
components, and modes of operation. Next, the hardware
description languages will be presented as tools to specify
digital systems. Two standard languages (VHDL and
Verilog) will be examined in detail, and the use of VHDL
for synthesis will be explained to illustrate specific aspects
of logic synthesis descriptions. The article ends with an
illustrative example of the principal concepts discussed.
LOGIC DESIGN ORGANIZATION: DATAPATH AND
CONTROL UNIT
The RT level is the level of abstraction immediately above
the logic level (1,2). In contrast with the logic level, gen-
erally concerned with bitstreams, the RT level handles
data. Data is a binary word of n bits. Data are processed through arithmetic or logic operations that normally affect one or two data words: A + B, NOT(A), and so on.
Data are stored in registers, which constitute the
electronic component for storing n bits. Source data must be read from its registers, and the result of the operation is then written into another register to be stored there. The
data operation is performed in a functional unit (for
example, an arithmetic-logic unit). The writing operation
is sequential and, in a synchronous system, for example, is
therefore executed while the clock is active. The operations
of the functional unit and the reading of the register are
combinational functions.
Data is transmitted from the source registers toward the
functional unit and from there to the target register via
buses, which are basically n cables with an associated
protocol to allow their use.
The core operation is data transfer between registers, hence the name given to this level. It includes both reading and writing operations. Description techniques suitable at the logic level (FSMs and switching theory) are not sufficient at the RT level. One of the simplest ways to describe
these operations is as follows:
writing:  R_target ← DataA ∗ DataB
reading:  DataOut ← R_source

where R_target and R_source denote the destination and source registers, DataA and DataB are the data read from the source registers, and ∗ stands for the operation performed by the functional unit.
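Purely as an illustration of these reading and writing semantics (the register names, word width, and the chosen operation below are hypothetical, not part of any standard notation), such a transfer can be mimicked in a few lines of Python:

# Toy register-transfer sketch: registers are named storage for n-bit data,
# the functional unit is a combinational function, and writing is clocked.
WIDTH = 8
MASK = (1 << WIDTH) - 1

registers = {"R0": 0x3A, "R1": 0x15, "R2": 0x00}  # hypothetical register file

def alu_add(a, b):
    """Combinational functional unit: n-bit addition (carry discarded)."""
    return (a + b) & MASK

def transfer(target, source_a, source_b, operation):
    """Read the source registers, apply the operation, write the target register."""
    data_a = registers[source_a]      # combinational read
    data_b = registers[source_b]
    registers[target] = operation(data_a, data_b)  # sequential write (on clock)

transfer("R2", "R0", "R1", alu_add)   # R2 <- R0 + R1
print(registers)                      # {'R0': 58, 'R1': 21, 'R2': 79}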
Instruction set
Instruction format
Operation codes
Addressing modes
Operation: Arithmetic/Logical
Branch
Given the simplicity of the pipeline, the control part of RISC is implemented in hardware, unlike its CISC counterpart, which relies heavily on the use of microcoding.
However, this is the most misunderstood part of the RISC architecture, which has even resulted in the inappropriate
name: RISC. Reduced instruction set computing implies
that the number of instructions in RISC is small. This has
created a widespread misunderstanding that the main
feature characterizing RISC is a small instruction set.
This is not true. The number of instructions in the instruc-
tion set of RISC can be substantial. This number of RISC
instructions can grow until the complexity of the control
logic begins to impose an increase in the clock period. In
practice, this point is far beyond the number of instructions
commonly used. Therefore, we have reached a possibly paradoxical situation, namely, that several representative RISC machines known today have an instruction set
larger than that of CISC.
For example, the IBM PC-RT instruction architecture contains 118 instructions, while the IBM RS/6000 (PowerPC) contains 184 instructions. This should be contrasted with the IBM System/360, containing 143 instructions, and the IBM System/370, containing 208. The first two are representatives of RISC architecture, while the latter two are not.
Fixed Format Instructions
What really matters for RISC is that the instructions have a fixed and predetermined format, which facilitates decoding in one cycle and simplifies the control hardware. Usually the size of RISC instructions is also fixed to the size of the word (32 bits); however, there are cases where RISC can contain two sizes of instructions, namely 32 bits and 16 bits. Such is the case of the IBM ROMP processor used in the first commercial RISC machine, the IBM PC/RT. The fixed-format feature is very important because RISC must decode its instruction in one cycle. It is also very valuable for superscalar implementations (12). Fixed-size instructions allow the Instruction Fetch Unit to be efficiently pipelined (by
being able to determine the next instruction address with-
out decoding the current one). This guarantees only single
I-TLB access per instruction.
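To see why a fixed format makes one-cycle decode straightforward, consider the sketch below: with fixed field positions, every field is extracted by constant shifts and masks, and the next sequential instruction address is known without decoding. The 32-bit layout is a generic, assumed example and does not reproduce the encoding of any specific machine named in this article.

def decode(instr):
    """Decode a 32-bit instruction with fixed field positions (assumed layout:
    6-bit opcode | 5-bit rs | 5-bit rt | 16-bit immediate)."""
    return {
        "opcode":    (instr >> 26) & 0x3F,
        "rs":        (instr >> 21) & 0x1F,
        "rt":        (instr >> 16) & 0x1F,
        "immediate": instr & 0xFFFF,
    }

# Because the instruction length is fixed, the next sequential instruction
# address is known without decoding the current instruction at all.
def next_pc(pc, instr_size_bytes=4):
    return pc + instr_size_bytes

word = 0x8C8B0010  # example bit pattern only
print(decode(word), hex(next_pc(0x1000)))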
One-cycle decode is especially important so that the outcome of the Branch instruction can be determined in one cycle, in which the new target instruction address will be issued as well. The operation associated with detecting and processing a Branch instruction during the Decode cycle is illustrated in Fig. 4. In order to minimize the number of lost cycles, Branch instructions need to be resolved during the Decode stage as well. This requires a separate address adder as well as a comparator, both of which are used in the Instruction Decode Unit. Even in the best case, one cycle is lost when a Branch instruction is encountered.
Figure 3. The operation of Load/Store pipeline.
Simple Addressing Modes
Simple addressing modes are a requirement of the pipeline: in order to perform the address calculation in the same predetermined number of pipeline cycles, the address computation needs to conform to the other modes of computation. It is a fortunate fact that in real programs the requirements for address computation favor three relatively simple addressing
modes:
1. Immediate
2. Base + Displacement
3. Base + Index
Those three addressing modes account for over 80% of all addressing modes used, according to Ref. (3): (1) 30% to 40%, (2) 40% to 50%, and (3) 10% to 20%. The process
of calculating the operand address associated with Load
and Store instructions is shown in Fig. 3.
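As a concrete illustration of the three modes (the register names and contents below are hypothetical), the effective-address calculation can be sketched as follows:

# Hypothetical register file contents for the example.
regs = {"base": 0x2000, "index": 0x0010}

def effective_address(mode, displacement=0):
    """Effective-address calculation for the three common RISC modes."""
    if mode == "immediate":
        # The operand itself is held in the instruction; no address arithmetic.
        return displacement
    if mode == "base+displacement":
        return regs["base"] + displacement
    if mode == "base+index":
        return regs["base"] + regs["index"]
    raise ValueError(mode)

for m, d in [("immediate", 42), ("base+displacement", 0x0100), ("base+index", 0)]:
    print(f"{m:18s} -> {effective_address(m, d):#06x}")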
Separate Instruction and Data Caches
One of the often overlooked but essential characteristics of
RISC machines is the existence of cache memory. The
second most important characteristic of RISC (after pipe-
lining) is its use of the locality principle. The locality
principle is established on the observation that, on average, the program spends 90% of the time in 10% of the code. The instruction selection criteria in RISC are also based on that very same observation: 10% of the instructions are responsible for 90% of the code. Often the principle of locality is referred to as the 90–10 rule (13).
In the case of the cache, this locality can be spatial and temporal. Spatial locality means that the most likely location in memory to be referenced next will be a location in the neighborhood of the location that was just referenced
previously. On the other hand, temporal locality means
that the most likely location to be referenced next will be
from the set of memory locations that were referenced just
recently. The cache operates on this principle.
Figure 4. Branch instruction.
The RISC machines are based on the exploitation of that principle as well. The first level in the memory hierarchy is the general-purpose register file (GPR), where we expect to find the operands most of the time; otherwise, the register-to-register operation feature would not be very effective. However, if the operands are not to be found in the GPR, the time to fetch them should not be excessive. This requires the existence of a fast memory next to the CPU: the cache. The cache access should also be fast so that the time allocated for Memory Access in the pipeline is not exceeded. A one-cycle cache is a requirement for a RISC machine, and the performance is seriously degraded if the cache access requires two or more CPU cycles. In order to maintain the required one-cycle cache bandwidth, the data and instruction accesses should not collide. It is for this reason that the separation of instruction and data caches, the so-called Harvard architecture, is a must-have feature for RISC.
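A minimal sketch of the Harvard arrangement follows: two independent direct-mapped caches, so that an instruction fetch and a data access issued in the same cycle never contend for the same structure. The cache geometry is an assumption chosen only for illustration.

class DirectMappedCache:
    """Very small direct-mapped cache model: tag check only, no data storage."""
    def __init__(self, num_lines=64, line_bytes=16):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines

    def access(self, address):
        """Return True on a hit, False on a miss (and fill the line)."""
        line = (address // self.line_bytes) % self.num_lines
        tag = address // (self.line_bytes * self.num_lines)
        if self.tags[line] == tag:
            return True
        self.tags[line] = tag
        return False

# Separate instruction and data caches: the fetch and the load below can be
# serviced in the same cycle because they use different structures.
icache, dcache = DirectMappedCache(), DirectMappedCache()
print(icache.access(0x1000), dcache.access(0x8000))   # both cold misses
print(icache.access(0x1004), dcache.access(0x8004))   # hits, thanks to spatial locality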
Branch and Execute Instruction
The Branch and Execute or Delayed Branch instruction is a new feature of the instruction architecture that was introduced and fully exploited in RISC. When a Branch instruction is encountered in the pipeline, one cycle will inevitably be lost. This is illustrated in Fig. 5.
RISC architecture solves the lost-cycle problem by introducing the Branch and Execute instruction (5,9) (also known as the Delayed Branch instruction), which consists of an instruction pair: the Branch and the Branch Subject instruction, which is always executed. It is the task of the compiler to find an instruction that can be placed in that otherwise wasted pipeline cycle.
The subject instruction can be found in the instruction
stream preceding the Branch instruction, in the target
instruction stream, or in the fall-through instruction
stream. It is the task of the compiler to find such an instruction and to fill in this execution cycle (14).
Given the frequency of Branch instructions, which varies from 1 out of 5 to 1 out of 15 (depending on the nature of the code), the number of those otherwise lost cycles can be substantial. Fortunately, a good compiler can fill in 70% of those cycles, which amounts to up to a 15% performance improvement (13). This is the single most performance-contributing instruction in the RISC instruction architecture.
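The quoted figures can be checked with a small back-of-the-envelope calculation; the sketch below assumes a base CPI of 1 and uses the branch frequencies and the 70% fill rate cited in the text.

# Rough check of the delayed-branch benefit quoted above (assumed base CPI = 1).
base_cpi = 1.0
branch_penalty = 1.0          # one lost cycle per branch
fill_rate = 0.70              # fraction of delay slots the compiler can fill

for branch_freq in (1 / 5, 1 / 10, 1 / 15):
    cpi_without = base_cpi + branch_freq * branch_penalty
    cpi_with = base_cpi + branch_freq * branch_penalty * (1 - fill_rate)
    speedup = cpi_without / cpi_with - 1
    print(f"branch freq 1/{round(1 / branch_freq):2d}: "
          f"CPI {cpi_without:.3f} -> {cpi_with:.3f}  (+{speedup:.1%} performance)")

With a branch every fifth instruction, the filled delay slots recover roughly 13% of performance, consistent with the "up to 15%" improvement cited above.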
However, in the later generations of superscalar RISC
machines (which execute more than one instruction in the
pipeline cycle), the Branch and Execute instructions have been abandoned in favor of Branch Prediction (12,15).
The Load instruction can also exhibit this lost pipeline
cycle as shown in Fig. 6.
The same principle of scheduling an independent
instruction in the otherwise lost cycle, which was applied for Branch and Execute, can be applied to the Load instruction. This is also known as a delayed load.
An example of what the compiler can do to schedule instructions and utilize those otherwise lost cycles is shown
in Fig. 7 (13,14).
Optimizing Compiler
A close coupling of the compiler and the architecture is one
of the key and essential features in RISC that was used in
order to maximally exploit the parallelism introduced by
pipelining. The original intent of the RISC architecture was to create a machine that is visible only through the compiler (5,9). All the programming was to be done in a high-level language and only a minimal portion in assembler. The notion of the Optimizing Compiler was introduced in RISC (5,9,14). This compiler was capable of producing code that was as good as code written in assembler (hand code). Though strict attention was given to the architecture principle (1,3) of keeping implementation details out of the principles of operation, this is perhaps the only place where this principle was violated. Namely, the optimizing compiler needs to know
Figure 5. Pipeline flow of the Branch instruction.
Figure 6. Lost cycle during the execution of the load instruction.
Figure 7. An example of instruction scheduling by compiler.
the details of the implementation, the pipeline in particular, in order to be able to efficiently schedule the instructions. The work of the optimizing compiler is illustrated in
Fig. 7.
One Instruction per Cycle
The objective of one instruction per cycle (CPI = 1) execution was the ultimate goal of RISC machines. This goal can theoretically be achieved in the presence of infinite-size caches and thus no pipeline conflicts, which is not attainable in practice. Given the frequent branches in the program and their interruption of the pipeline, Loads and Stores that cannot be scheduled, and finally the effect of finite-size caches, the number of lost cycles adds up, bringing the CPI further away from 1. In real implementations the CPI varies; a CPI of 1.3 is considered quite good, while a CPI between 1.4 and 1.5 is more common in single-instruction-issue implementations of the RISC architecture.
However, once the CPI was brought close to 1, the next
goal in implementing RISC machines was to bring CPI
below 1 in order for the architecture to deliver more per-
formance. This goal requires an implementation that can
execute more than one instruction in the pipeline cycle, a so-called superscalar implementation (12,16). A substantial
effort has been made on the part of the leading RISC
machine designers to build such machines. However,
machines that execute up to four instructions in one cycle
are common today, and a machine that executes up to six
instructions in one cycle was introduced in 1997.
Pipelining
Finally, the single most important feature of RISC is pipe-
lining. The degree of parallelism in the RISC machine is
determined by the depth of the pipeline. It could be stated
that all the features of RISC (that were listed in this article) could easily be derived from the requirements for pipelining and maintaining an efficient execution model. The sole purpose of many of those features is to support an efficient execution of the RISC pipeline. It is clear that without pipelining, the goal of CPI = 1 is not possible. An example of the
instruction execution in the absence of pipelining is shown
in Fig. 8.
We may be led to think that by increasing the number of
pipeline stages (the pipeline depth), thus introducing more
parallelism, we may increase the RISC machine perfor-
mance further. However, this idea does not lead to a simple
and straightforward realization. The increase in the num-
ber of pipeline stages introduces not only an overhead in
hardware (needed to implement the additional pipeline
registers), but also the overhead in time due to the delay
of the latches used to implement the pipeline stages as well
as the cycle time lost due to the clock skews and clock jitter.
This could very soon bring us to the point of diminishing
returns where further increase in the pipeline depth would
result in less performance. An additional side effect of deeply pipelined systems is the hardware complexity necessary to resolve all the possible conflicts that can occur between the increased number of instructions residing in the pipeline at one time. The number of pipeline stages is mainly
determined by the type of the instruction core (the most
frequent instructions) and the operations required by those
instructions. The pipeline depth depends, as well, on the
technology used. If the machine is implemented in a very
high speed technology characterized by the very small
number of gate levels (such as GaAs or ECL), and a very
good control of the clock skews, it makes sense to pipeline
the machine deeper. The RISC machines that achieve
performance through the use of many pipeline stages are
known as superpipelined machines.
Today the most common number of pipeline stages encountered is five (as in the examples given in this
text). However, 12 or more pipeline stages are encountered
in some machine implementations.
The features of RISC architecture that support pipelin-
ing are listed in Table 1.
Figure 8. Instruction execution in the absence of pipelining.
HISTORICAL PERSPECTIVE
The architecture of RISC did not come about as a planned or sudden development. It was rather a long and evolutionary process in the history of computer development in which we learned how to build better and more efficient computer systems. From the first definition of the architecture in 1964 (1), three main branches of computer architecture evolved over the years. They are shown
in Fig. 9.
The CISC development was characterized by (1) the
PDP-11 and VAX-11 machine architecture that was devel-
oped by Digital Equipment Corporation (DEC) and (2) all
the other architectures that were derived from that devel-
opment. The middle branch is the IBM 360/370 line of
computers, which is characterized by a balanced mix of
CISC and RISC features. The RISC line evolved from the
development line characterized by Control Data Corpora-
tion CDC 6600, Cyber, and ultimately the CRAY-I super-
computer. All of the computers belonging to this branch
were originally designated as supercomputers at the time of
their introduction. The ultimate quest for performance and
excellent engineering was a characteristic of that branch.
Almost all of the computers in the line preceding RISC carry the signature of one man: Seymour Cray, who is credited by many with the invention of RISC.
History of RISC
The RISC project started in 1975 at the IBM Thomas
J. Watson Research Center under the name of the 801.
801 is the number used to designate the building in which
the project started (similar to the 360 building). The origi-
nal intent of the 801 project was to develop an emulator for
System/360 code (5). The IBM 801 was built in ECL tech-
nology and was completed by the early 1980s (5,8). This
project was not known to the world outside of IBM until the early 1980s, and the results of that work are mainly unpublished. The idea of a simpler computer, especially one that could be implemented on a single chip in a university environment, was appealing; two other projects with similar objectives started in the early 1980s at the University of California, Berkeley, and Stanford University (6,7). These two academic projects had much more influence on the industry than the IBM 801 project. Sun Microsystems developed its own architecture, currently known as SPARC, as a result of the University of California Berkeley work. Similarly, the Stanford University work was directly transferred to MIPS (17).
The chronology of RISC development is illustrated in Fig. 10.
The features of some contemporary RISC processors are
shown in Table 2.
Figure 9. Main branches in development of computer architec-
ture.
Figure 10. History of RISC development.
BIBLIOGRAPHY
1. G. M. Amdahl, G. A. Blaauw, and F. P. Brooks, Architecture of the IBM System/360, IBM J. Res. Develop., 8: 87–101, 1964.
2. D. P. Siewiorek, C. G. Bell, and A. Newell, Computer Structures: Principles and Examples, Advanced Computer Science Series, New York: McGraw-Hill, 1982.
3. G. A. Blaauw and F. P. Brooks, The structure of System/360, IBM Syst. J., 3: 119–135, 1964.
4. R. P. Case and A. Padegs, Architecture of the IBM System/370, Commun. ACM, 21: 73–96, 1978.
5. G. Radin, The 801 Minicomputer, IBM Thomas J. Watson Research Center, Rep. RC 9125, 1981; also in SIGARCH Comput. Archit. News, 10 (2): 39–47, 1982.
6. D. A. Patterson and C. H. Sequin, A VLSI RISC, IEEE Comput. Mag., 15 (9): 8–21, 1982.
7. J. L. Hennessy, VLSI processor architecture, IEEE Trans. Comput., C-33: 1221–1246, 1984.
8. J. Cocke and V. Markstein, The evolution of RISC technology at IBM, IBM J. Res. Develop., 34: 4–11, 1990.
9. M. E. Hopkins, A perspective on the 801/reduced instruction set computer, IBM Syst. J., 26: 107–121, 1987.
10. L. J. Shustek, Analysis and performance of computer instruction sets, PhD thesis, Stanford Univ., 1978.
11. D. Bhandarkar and D. W. Clark, Performance from architecture: Comparing a RISC and a CISC with similar hardware organization, Proc. 4th Int. Conf. ASPLOS, Santa Clara, CA, 1991.
12. G. F. Grohosky, Machine organization of the IBM RISC System/6000 processor, IBM J. Res. Develop., 34: 37, 1990.
13. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, San Mateo, CA: Morgan Kaufmann.
14. H. S. Warren, Jr., Instruction scheduling for the IBM RISC System/6000 processor, IBM J. Res. Develop., 34: 37, 1990.
15. J. K. F. Lee and A. J. Smith, Branch prediction strategies and branch target buffer design, Comput., 17 (1): 6–22, 1984.
16. J. Cocke, G. Grohosky, and V. Oklobdzija, Instruction control mechanism for a computing system with register renaming, MAP table and queues indicating available registers, U.S. Patent No. 4,992,938, 1991.
17. G. Kane, MIPS RISC Architecture, Englewood Cliffs, NJ: Prentice-Hall, 1988.
READING LIST
D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo, The IBM 360 Model 91: Machine philosophy and instruction handling, IBM J. Res. Develop., 11: 8–24, 1967.
Digital RISC Architecture Technical Handbook, Digital Equipment Corporation, 1991.
V. G. Oklobdzija, Issues in CPU-coprocessor communication and synchronization, EUROMICRO 88, 14th Symp. Microprocessing and Microprogramming, Zurich, Switzerland, 1988, p. 695.
R. M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, IBM J. Res. Develop., 11: 25–33, 1967.
VOJIN G. OKLOBDZIJA
Integration Corporation
Berkeley, California
PEN-BASED COMPUTING
INTRODUCTION
An Overview of Pen-Computing
Pen-based computing first came under the mainstream spotlight in the late 1980s, when GO Computers developed the first computer operating system (OS) customized for pen/stylus input, called PenPoint, which was used in early tablet PCs by companies such as Apple Computer (Cupertino, CA) and IBM (Armonk, NY). A pen-based com-
puter replaces the keyboard and the mouse with a pen,
with which the user writes, draws, and gestures on the
screen that effectively becomes digital paper. The value
proposition of pen-based computing is that it allows a user
to leverage familiarity and skills already developed for
using the pen and paper metaphor. Thus, pen-based
computing is open to a wider range of people (essentially
everybody that can read and write) than conventional
keyboard and mouse-based systems, and is inline with the
theme of Ubiquitous Computing, as such a computer is
perceived as an electronic workbook, and thus provides a
work environment resembling that which exists without
computers. pen-based computers exist primarily in two
forms, as mentioned above; tablet PCs, which often have a
clip board-like prole and personal digital assistants
(PDAs) that have a portable/handheld prole. Both forms
(particularly the PDA) lend themselves very well to appli-
cations such as on site data entry/manipulation where the
conventional approach is pen and paper-form based (1).
The Digitizing Tablet
The function of the digitizing tablet within a pen-based
computer is to detect and to measure the position of the
pen on the writing surface at its nominal sampling rate. Typically, this sampling rate varies between 50 and 200 Hz depending on the application; a higher sampling rate gives a finer resolution of cursor movement and lets the computer measure fast strokes accurately. The digitizing tablet is combined with the display to give the user a high level of interactivity and a short cognitive feedback time between their pen stroke/gesture and the corresponding digital ink mark. The user perceives the digitizing tablet
and screen as one, which makes it a direct-manipulation
input device and gives the user a similar experience to that
of writing/drawing using the conventional pen and paper
method. To enable this high level of interactivity, the digi-
tizer must also operate with speed and precision. The display must be a flat panel display for the integrated unit to provide the user with the optimum writing surface. The display and digitizer can be combined in two ways. If the digitizer is mounted on top of the display (as illustrated below in Fig. 1), the digitizing tablet must be transparent; since it can never be perfectly so, the display's contrast is reduced and the user sees a blurred image.
The other way to combine the two is to mount the display
on top of the digitizer; although it does not have to be
transparent, the accuracy of the digitizer is reduced
because of the greater distance between its surface and
the tip of the pen when it is in contact with the writing
surface.
The two types of digitizers are active and passive. Active
digitizers are the most common type used in pen-based
computers. These digitizers measure the position of the pen using an electromagnetic/RF signal. This signal is either transmitted by a two-dimensional grid of conducting wires or coils within the digitizer or transmitted by the pen.
When the digitizer transmits the signal, it is sensed in one of two ways: either the signal is induced in a coil in the pen and conducted through a tether back to the computer, or, as is more commonly seen, the pen reflects the signal back to the digitizer or disturbs the magnetic field generated by the set of coils at the exact location of the pen tip. This reflection or disturbance is subsequently detected by the digitizer. Figure 2 depicts the latter configuration, where a magnetic field is transmitted and received by a set of coils and is disturbed by the inductance in the tip of a pen/stylus.
In this type of configuration, the horizontal and vertical position is encoded in the original transmitted signal/magnetic field by either signal strength, frequency, or timing, where each wire or coil in the grid carries a higher value than its neighboring counterpart. The position of the pen is evaluated using the values reflected back or disturbed.
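One plausible way (a simplified assumption, not a description of any particular digitizer) to turn such readings into a coordinate is to take the amplitude-weighted centroid across the grid of coils, as in the sketch below; the coil pitch and amplitude values are invented for illustration.

# Hypothetical amplitudes sensed on one row of coils after the pen disturbs
# the transmitted field; the coil pitch is assumed to be 5 mm.
COIL_PITCH_MM = 5.0
amplitudes = [0.02, 0.05, 0.40, 0.95, 0.55, 0.08, 0.01]  # strongest near the pen

def estimate_position(samples, pitch):
    """Amplitude-weighted centroid of the coil readings (one axis)."""
    total = sum(samples)
    centroid = sum(i * a for i, a in enumerate(samples)) / total
    return centroid * pitch

print(f"estimated pen x-position: {estimate_position(amplitudes, COIL_PITCH_MM):.2f} mm")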
The configuration where the pen reflects or disturbs the signal is found more commonly in modern-day pen-based computers, as it allows for a lighter pen that is not attached to the computer, which provides the user with an experience closer to that of using conventional pen and paper. This method also allows more advanced information about the user's pen strokes to be measured, which can be used to provide more sophisticated and accurate handwriting recognition. The pressure being exerted on the screen/tablet can be measured using a capacitive probe within the tip of the pen, whose capacitance changes as it closes (as a result of being pressed against the screen/tablet), which changes the strength of the signal being reflected back. The angle of the pen can be measured using electronic switches that change the frequency of the signal reflected back; these switches operate in a similar manner to tilt switches. However, these advanced features require the pen to be powered by a battery or by the computer via a cord.
Typically, passive digitizers consist of two resistive
sheets separated by a grid of dielectric spacers, with a
voltage applied across the sheets in the horizontal and vertical directions. When the pen makes contact with the screen/tablet, it connects the two sheets together at the point of impact, which acts to change the resistance (which is proportional to the length of the resistive material) across the sheet in the two directions. In turn, this changes the two voltages across the sheets. Essentially, these voltages represent a coordinate pair of the pen's position.
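The final voltage-to-coordinate step can be sketched as a simple linear mapping; the supply voltage, panel dimensions, and readings below are placeholders rather than data from any real panel.

# Assumed four-wire resistive-panel parameters (placeholders, not vendor data).
V_SUPPLY = 3.3                      # volts applied across each sheet in turn
PANEL_W_MM, PANEL_H_MM = 160.0, 120.0

def voltages_to_xy(v_x, v_y):
    """Map the two measured voltages to a coordinate pair.

    Resistance is proportional to length along the sheet, so the voltage
    divider formed at the contact point is linear in pen position.
    """
    x = (v_x / V_SUPPLY) * PANEL_W_MM
    y = (v_y / V_SUPPLY) * PANEL_H_MM
    return x, y

print(voltages_to_xy(1.1, 2.475))   # -> roughly (53.3 mm, 90.0 mm)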
Being sensitive to pressure, passive digitizers also can receive input from the user's fingers, and thus the computer
can provide the user with a more natural/familiar mechan-
ism for pressing on-screen (virtual) buttons. A disadvan-
tage of this pressure-based operation is an increase in
errors when using the pen, which are caused by the user
resting against or holding the screen/tablet. However, pas-
sive digitizers can also be configured to be sensitive only to pen input by placing the dielectric spacers between the two resistive sheets closer together, so that the higher amount of pressure needed to force the two sheets together can only be exerted through the small area of the pen tip.
Handwriting Recognition
Handwriting recognition (HWX) gives the user the ability
to interface to a computer through the already familiar
activity of handwriting, where the user writes on a digitiz-
ing tablet and the computer then converts this to text. Most
users, especially those without prior knowledge of comput-
ing and the mainstream/conventional interface that is the
keyboard and mouse, initially see this as a very intuitive
and attractive human-computer interface as it allows them
to leverage a skill that has already been acquired and
developed. A major disadvantage of pen-based computers, however, is that current HWX methods are not completely accurate. Characters can be recognized incorrectly, and these recognition errors must subsequently be corrected by the user. Another aspect of current HWX technology that impacts the user's productivity and experience in a negative way is the fact that characters have to be entered one at a time, as the technology required to support the recognition of cursive handwriting is still a long way off and possibly requires a new programming paradigm (4). Handwriting varies immensely among individuals, and an individual's handwriting tends to change over time in both the short term and the long term. This change is apparent in how people tend to shape and draw the same letter differently depending on where it is written within a word and on what the preceding and subsequent letters are, all of which complicates the HWX process.
The two types of handwriting recognition are on-line (real time) and off-line (performed on stored handwritten data). When using conventional pen and paper, the user sees the ink appear on the paper instantaneously; thus, on-line recognition is employed in pen-based computers so that the user is presented with the text form of their handwriting immediately. The most common method for implementing a HWX engine (and recognition engines in general) is a neural network, which excels at classifying input data into one of several categories. Neural networks are also ideal for HWX as they cope very well with noisy input data. This quality is required, as each user writes each letter slightly differently each time and, of course, altogether differently from other users, as explained previously. Another asset of neural networks useful in the HWX setting is their ability to learn over time through back-propagation, where each time the user rejects a result (letter) by correcting it,
Figure 1. A digitizing tablet shown in the configuration where the digitizer is mounted on top of the screen (2).
Figure 2. A digitizing tablet shown in the configuration where a set of coils generates a magnetic field, which is disturbed by the presence of the pen tip at that exact location (3).
the neural network adjusts its weights such that the probability of the HWX engine making the same error again is reduced. This methodology can be used to reduce the probability of recognition errors for a particular user; when met with another user, however, the probability of a recognition error for each letter may be higher than it would have been before the HWX engine was trained to recognize the handwriting of a specific user.
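A full neural network is beyond a short sketch, but the adapt-on-correction idea can be illustrated with a much simpler nearest-prototype recognizer whose stored model moves toward the user's sample whenever a correction is made. The features, prototypes, and learning rate below are all invented for illustration and merely stand in for the network weights described above.

# Toy adapt-on-correction recognizer: each letter is a prototype feature vector,
# and a user correction nudges that prototype toward the rejected sample.
prototypes = {"c": [0.2, 0.8], "o": [0.9, 0.9]}   # invented 2-D feature vectors
LEARNING_RATE = 0.5                               # assumed adaptation step

def recognize(sample):
    """Return the letter whose prototype is closest to the sample."""
    def dist(letter):
        return sum((a - b) ** 2 for a, b in zip(sample, prototypes[letter]))
    return min(prototypes, key=dist)

def correct(sample, true_letter):
    """User rejected the result: move the true letter's prototype toward the sample."""
    proto = prototypes[true_letter]
    prototypes[true_letter] = [p + LEARNING_RATE * (s - p) for p, s in zip(proto, sample)]

sample = [0.6, 0.85]
print(recognize(sample))   # misrecognized as 'o' with the initial prototypes
correct(sample, "c")       # the user corrects the result to 'c'
print(recognize(sample))   # after the correction, 'c' is recognized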
These problems are overcome to a good extent by using a
character set where all the letters consist of a single stroke
and their form is designed for ease of recognition (i.e., a specific HWX engine has been pre-trained for use with that particular character set). One such character set is shown in Fig. 3.
As can be seen in Fig. 3, the characters are chosen to
resemble their alphabetic equivalents as much as possible.
The use of such a character set forces different users to
write largely the same way and to not form their own
writing style, which is the tendency when using more
than one stroke per letter.
The User Interface
The overall objective of a user interface that supports
handwriting recognition is to provide the user with vir-
tually the same experience as using conventional pen and paper. Thus, most of its requirements can be derived from the pen and paper interface. This means that, ideally, no constraints should exist as to where on the digitizing tablet/screen (paper) the user writes, the size of their writing, or even when they write and when on-line recognition can take place. Obviously, conventional pen and paper does not impose any restrictions on special characters with accents, so ideally the user could write such characters and they should be recognized and converted to text form. However, even in pen-based computing, the standard character set used is ASCII as opposed to Unicode, the standardized character set that contains all characters of every language throughout the world.
As the pen is the sole input device, it is used for all the user's interfacing actions. As well as text entry, it is also used for operations that a conventional mouse would be used for, such as selecting menu options, clicking on icons, and so on. The action of moving the mouse cursor over a specific icon/menu option and clicking on it is replaced entirely with simply tapping the pen on the item that would have been clicked on. A dragging operation is performed by tapping the pen on the item to be dragged twice in quick succession (corresponding to the double clicking associated with the mouse), keeping the pen in contact with the tablet after the second tap and then dragging the selected item, and finally lifting the pen off the tablet to drop the item. A pen-based computer has no function keys or control keys (such as return, space bar, etc.) like a keyboard does, and as such a pen-based computer's user interface must support something called gesture recognition, where the functions and the operations associated with these keyboard keys are activated by the user drawing a gesture corresponding to that
function/operation. Typical examples common among most
pen-based computing user interfaces are crossing a word
out to delete it, circling a word to select it for editing, and
drawing a horizontal line to insert a space within text.
The user will enter textual, as well as gesture and
graphic (drawing/sketching) input, and thus the user
interface is required to distinguish between these forms
of input. The user interface of some pen-based computing
OS, especially those that are extensions of conventional
OS, separate the different types of input by forcing the
user to set the appropriate mode, for example pressing an
on-screen button to enter the drawing/sketching mode.
This technique is uncomfortable for the user as it detracts
from the conventional pen and paper experience, and so
the preferred method is for the OS to use semantic context
information to separate the input. For example, if the user
is in the middle of writing a word and they write the
character O, the OS would consider the fact that they
were writing a word and so would not confuse the char-
acter O with a circle or the number zero. This latter
method is typical of OS written specifically for pen-based computing (pen-centric OS).
The causes of recognition errors fall into two main
categories. One is where the errors are caused by indistinguishable pairs of characters (for example, the inability of the recognizer to distinguish between 2 and Z). The best solution in this case is for the OS to make use of semantic context information. The other main source of errors is when some of the user's character forms are unrecognizable. As explained previously, this situation can be improved in two ways. One is for the user to adapt their handwriting style so that their characters are a closer match to the recognizer's pre-stored models; the other way is to adapt the recognizer's pre-stored models to be a closer match with the user's character forms (trained recognition) (5).
An Experiment Investigating the Dependencies of User
Satisfaction with HWX Performance
A joint experiment carried out by Hewlett-Packard (Palo Alto, CA) and the University of Bristol Research Laboratories (Bristol, UK) in 1995 investigated the influence of HWX performance on user satisfaction (5). Twenty-four
subjects with no prior knowledge of pen-based computing
carried out predetermined tasks using three test applica-
tions as well as text copying tasks, after being given a brief
tutorial on pen-based computing. The applications were
run on an IBM-compatible PC with Microsoft Windows for
Pen Software and using a Wacom digitizing tablet (which
Figure 3. The Graffiti character set developed by Palm Computing. The dot on each character shows its starting point.
did not present a direct-manipulation input device as it was
not integrated with the screen). The three applications
were Fax/Memo, Records, and Diary, and they were
devised to contrast with each other in the amount of
HWX required for task completion, tolerance to erroneous
HWX text, and the balance between use of the pen for text
entry and other functions performed normally using a
mouse. The mean recognition rate for lowercase letters
was found to be 90.9%, and 76.1% for uppercase letters, with the lower recognition rate for uppercase letters caused mainly by identical upper- and lowercase forms of letters such as C, O, S, and V. Pen-based computing OSs attempt to deal with this problem by comparing the size of drawn letters relative to each other or relative to comb guides when the user input is confined to certain fields/boxes.
As can be observed in Fig. 4, the application that required the least amount of text recognition (Records) was rated as most appropriate for use with pen input, and the application with the most text recognition (Diary) was rated as the least appropriate in this respect. Figure 4 also shows that higher recognition accuracy is met with a higher appropriateness rating by the user, and the more dependent an application is on text recognition, the stronger this relationship is.
An indication of this last point can be observed from the plots shown in Fig. 4, as the average gradient of an application's plot increases as it becomes more dependent on text recognition. The results shown in Fig. 4 also suggest that the pen interface is most effective in performing the nontextual functions normally associated with the mouse.
Thus, improving recognition accuracy would increase the
diversity of applications in pen-based computing, as those
more dependent on text entry would be made more effective
and viable.
COGNITIVE MODELLING
Introduction to Cognitive Models
Cognitive models as applied to human-computer interaction represent the mental processes that occur in the mind of the user as they perform tasks by way of interacting with a user interface. A range of cognitive models exists, and each models a specific level within the user's goal hierarchy, from high-level goal and task analysis to low-level analysis of motor-level (physical) activity. Cognitive models fall into two broad categories: those that address how a user acquires or formulates a plan of activity and those that
address how a user executes that plan. Considering appli-
cations that support pen input, whether they are exten-
sions of conventional applications that support only
keyboard and mouse input or special pen-centric applica-
tions, the actual tasks and associated subtasks that need
to be performed are often the same as those in conventional
applications. Only the task execution differs because of the
different user interface. Thus, only cognitive models that
address the users execution of a plan once they have
acquired/formulated it shall be considered, as a means of
evaluating the pen interface.
Adaptation of the Keystroke-Level Model for Pen-based
Computers
The keystroke-level model (KLM) (6) is detailed in Table 1.
This model was developed and validated by Card, Newell,
and Moran, and is used to make detailed predictions about
user performance with a keyboard and mouse interface in
terms of execution times. It is aimed at simple command
sequences and low-level unit tasks within the interaction
hierarchy and is regarded widely as a standard in the field
of human-computer interaction.
KLM assumes a user first builds up a mental representation of a task, working out exactly how they will accomplish the task using the facilities and functionality offered by the system. This assumption means that no high-level mental activity is considered during the second phase of completing a task, which is the actual execution of the plan acquired; it is this execution phase that KLM focuses on. KLM consists of seven operators: five physical motor operators, a mental operator, and a system response operator. The KLM model of the task execution by a user
consists of interleaved instances of the operators. Table 2
details the penstroke-level model (PLM), which represents
a corresponding set of operators for pen-based systems.
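As a worked illustration of how such operator models yield time predictions, the sketch below sums assumed per-operator times for a small point-and-click task under both operator sets; the times and the task sequence are rough placeholders, not values from the cited studies.

# Rough, assumed per-operator times in seconds (illustrative placeholders only).
KLM_TIMES = {"K": 0.28, "B": 0.10, "P": 1.10, "H": 0.40, "M": 1.35}
PLM_TIMES = {"K'": 0.60, "B'": 0.10, "P'": 0.90, "M'": 1.35}   # no homing operator

def predict(sequence, times):
    """Sum the operator times for a task described as an operator sequence."""
    return sum(times[op] for op in sequence)

# Hypothetical task: mentally prepare, point at an item, press, point to target, release.
klm_task = ["M", "P", "B", "P", "B"]
plm_task = ["M'", "P'", "B'", "P'", "B'"]   # pen: no hand switch between devices
print(f"KLM prediction: {predict(klm_task, KLM_TIMES):.2f} s")
print(f"PLM prediction: {predict(plm_task, PLM_TIMES):.2f} s")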
As stated previously, the actual tasks that need to be
performed to achieve specic goals are largely the same
with pen-based and keyboard and mouse-based systems,
Figure 4. Plots of appropriateness rating (scale 1-10) against recognition accuracy (bins 80-83.9, 84-87.9, 88-91.9, and 92-95.9%) for the fax, records, and diary applications.
Table 1. A description of the KLM model's seven operators
K: Keystroking; actually striking keys, including shifts and other modifier keys.
B: Pressing a mouse button.
P: Pointing; moving the mouse (or similar device) to a target.
H: Homing; switching the hand between mouse and keyboard.
D: Drawing lines using the mouse.
M: Mentally preparing for a physical action.
R: System response, which may be ignored if the user does not have to wait for it, as in copy typing.
and I have thus adapted the KLM model into the PLM
model for application to pen-based systems.
From Table 2, it can be observed that, at worst, pen-based systems have one less physical motor operator than keyboard and mouse-based systems, and at best two of their four physical motor operators have quicker execution times compared with the corresponding operators of keyboard and mouse-based systems for each user. This observation suggests that pen-based systems are faster to use
than keyboard and mouse-based systems, and the pre-
vious discussion would suggest that this is especially true
for applications that contain many pointing and dragging
tasks.
Experiment to Compare the Performance of the Mouse, Stylus and Tablet, and Trackball in Pointing and Dragging Tasks
An experiment by Buxton et al. in 1991 (7) compared the
performance of the mouse, stylus (and digitizing tablet),
and the trackball (which is essentially an upside-down
mouse, with a button by the ball) in elemental pointing
and dragging tasks. Performance was measured in terms
of mean execution times of identical elemental tasks.
However, the digitizing tablet was used in a form where it was not integrated with the display and sat on the user's
desktop, and thus it could not be considered a direct-
manipulation interface as is the case when it is integrated
with the screen. The participants of the experiment were
12 paid volunteers who were computer-literate college
students. During both the pointing and the dragging
task, two targets were on either side of the screen as
shown in Fig. 5.
In the experiment, an elemental pointing action was
considered to be moving the cursor over a target and then
clicking the mouse/trackball button or pressing the pen down onto the tablet to close its tip switch (as opposed to the tapping action of modern stylus and digitizer combinations), which terminated an action and initiated the next action. An elemental dragging action was considered
to be selecting an object within one target and holding down the mouse/trackball button or maintaining the pressure of the stylus on the tablet, dragging it to within the other target, and then releasing the mouse/trackball button or pressure on the tablet to drop the object in that target, which terminated an action and initiated the next action, where each time the new object would appear halfway
between the two targets. Fitts' law provides a formula [shown below in its most common form in Equation (1)] to
Table 2. The PLM model: a set of operators for a pen-based user interface corresponding to those of the KLM model
K': Striking a key or any combination of keys is replaced in pen-based systems with writing a single character or drawing a single gesture. It is widely accepted in the field of HCI that a reasonably good typist can type faster than they can write, although holding down a combination of keys is effectively one individual key press for each key, as the position of each key is stored as one chunk in human memory. Some keys are not as familiar to a user as the character keys, as they are used less frequently. Thus, if the user has to look for a key (e.g., the shift key), then it is reasonable to assume that the action of drawing a single gesture would be of a comparable speed.
B': This operation is replaced in pen-based systems with tapping the pen once on the screen, which is essentially the same action as marking a full stop/period or dotting an "i" or "j", and it is a better-developed motor skill. No reason or evidence exists either way to suggest that B or B' is faster than the other for a specific user.
P': This operation is replaced in pen-based systems with the action of moving the pen over the screen, but the pen does not have to be in continuous contact with the digitizing tablet. Thus, the cursor can be moved from one side of the screen to the other simply by lifting the pen from one side and placing it on the other side. This makes the pen interface more ergonomic than the mouse in this type of situation, as the user's hand need no longer be in a state of tension while performing this action. This action is a prime example of the benefits of a direct-manipulation interface.
H: The homing operator is one for which no corresponding operator exists for pen-based systems, as the pen is the sole input device. This decreases the overall execution time when using a pen interface compared with a keyboard and mouse.
D': This operation is replaced in pen-based systems with the action of drawing with the pen, where again the benefits of direct manipulation are observed, as it is just like drawing using pen and paper, which draws on a much better developed set of motor skills, especially when drawing the curved lines/strokes that compose a sketch/drawing.
M': Because the KLM model assumes the user has formulated a plan and worked out how to execute it, the mental preparation modelled by the M and M' operators is simply the time taken by the user to recall what to do next. Thus, it is reasonable to assume that the time taken up by an occurrence of an M or M' operator is the same for pen-based systems as it is for keyboard and mouse-based systems for each user, although it is reasonable to assume that if one were the slower of the two it would be M, because with keyboard and mouse-based systems the user has to recall which input device (keyboard or mouse) they need to use.
R': Assuming the two systems were of a similar overall capability, it is reasonable to assume that the system response time for a pen-based system and a keyboard and mouse-based system would be the same. However, the process of HWX is more computationally intensive than reading keyboard input, so a pen-based system's processor would need to be faster than that of a keyboard and mouse-based system to be perceived by the user as being of comparable speed.
Figure 5. The on-screen user-interface used for dragging tasks.
calculate the time taken for a specific user to move the cursor to an on-screen target:

Movement Time = a + b log2(Distance/Size + 1)    (1)
As can be observed from Equation (1), according to Fitts' law the time taken to move to the target depends on the distance the cursor needs to be moved and the size of the target. Constants a and b are determined empirically for each user. As Equation (1) shows, a greater distance is moved in a greater time, and a smaller target is more difficult to acquire and thus also increases movement time. The distance between the targets shown in Fig. 5 was varied over a range of discrete values (A = 8, 16, 32, and 64 units), where a unit refers to eight pixels. The size parameter of Equation (1) was represented by the width of the targets, as the movement of the cursor would largely be side-to-side, and this too was varied over a discrete range (W = 1, 2, 4, and 8 units). All values of the distance between targets A were fully crossed with all values of the width of the targets W for both the pointing and the dragging tasks, and each A-W combination was used for a block of 10 elemental tasks (pointing or dragging), where the user's objective was to carry out the ten tasks in succession as quickly and as accurately as possible. Sixteen blocks were ordered randomly into a session, and five sessions were completed for each device for each of the two types of tasks.
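A minimal sketch of how Equation (1) can be applied is given below. The constants a and b are placeholder values standing in for the empirically fitted ones, and the unit conversion follows the experimental setup described above (1 unit = 8 pixels).

    import math

    def fitts_movement_time(a, b, distance, size):
        """Fitts' law as in Equation (1): MT = a + b * log2(distance / size + 1)."""
        return a + b * math.log2(distance / size + 1)

    # Placeholder constants (normally fitted per user from experimental data).
    a, b = 0.1, 0.15  # seconds, seconds per bit

    # One A-W combination from the experiment: A = 64 units, W = 2 units.
    distance, width = 64, 2
    print("predicted MT: %.3f s" % fitts_movement_time(a, b, distance, width))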
The results showed that subjects occasionally would drop the object a long way from the target. This was not because of normal motor variability but occurred because of the difficulty in sustaining the tension in the hand required to perform dragging. It was particularly evident with the trackball, where the ball has to be rolled with the fingers while holding the button down with the thumb. The results were adjusted to remove these errors by eliminating elemental task executions within each block that were terminated (by a click or release) a horizontal distance from the mean termination distance greater than three standard deviations. This adjustment was made separately for each subject, A-W combination, device, and task type. Elemental task executions immediately after those that were terminated erroneously (which the user was notified of via a beep) were also eliminated, as many people who had investigated repetitive, self-paced, and serial tasks concluded that erroneous executions were disruptive to the user and could cause an abnormally long response time for the following trial, which would have skewed the average execution time. Analysis also showed a significant reduction in execution times after the first session, and thus the entire first session for each subject, for each device-task type combination, was also eliminated. Figure 6 and Table 3 show the mean movement times (execution times) over all blocks for each device and for the two task types, after the adjustments mentioned above were made.
As can be seen in Fig. 6 and in Table 3, the pen and digitizing tablet was fastest in both task types, although the performance of the mouse was comparable in the pointing task. The results show that the performance of each device was better in the pointing task than in the dragging task. This seems reasonable, as when dragging, the hand (and the forearm in the case of the pen and digitizing tablet) is in a state of greater tension compared with when pointing. The biggest difference in performance between the mouse and the pen and digitizing tablet was in the dragging task; in the pointing task the gap was the smallest. Error rates for each device-task type were also evaluated and are shown in Fig. 7.
As with mean movement time, error rates were worse for the dragging task than they were for pointing. The unadjusted results were evaluated before making the modifications described above. The adjustment of eliminating errors greater than three standard deviations from the mean
Figure 6. Graph showing execution times (mean movement time, ms) of the three devices (mouse, tablet, and trackball) for elemental pointing and dragging tasks.
Table 3. Tabulated form of the plots shown in Figure 6

Input device      Average MT: elemental pointing    Average MT: elemental dragging
Stylus & Tablet   665 ms                            802 ms
Mouse             674 ms                            916 ms
Trackball         1101 ms                           1284 ms
Figure 7. Graphs showing the mean percentage errors (adjusted and unadjusted) of the three devices (mouse, tablet, and trackball) during pointing and dragging tasks.
termination distance from the target (as described above) was also applied to the pointing task, although dropping errors could not have occurred during the pointing task. The results shown in Fig. 7 show that the mouse had a lower (but comparable) error rate compared with the pen and digitizing tablet in the pointing task, and a much lower error rate in the dragging task. The 12 participants were computer literate, and because the mouse is the standard interface device for pointing and dragging, this finding suggests that on average they would have been more familiar with the mouse than they were with the pen and digitizing tablet (in the nondirect-manipulation form that was employed in this experiment). Thus, they would have had better developed motor skills for pointing and dragging using a mouse. It is reasonable to assume that the pen and digitizing tablet interface in its direct-manipulation form would have yielded fewer errors all round, but almost certainly in pointing, as the user would no longer have to track the position of the cursor between the targets. The user could simply perform a dotting action on the targets in an alternating fashion, which, as well as eliminating errors, would have boosted the speed of elemental pointing tasks.
CONCLUSIONS

The results of the experiment conducted by Hewlett Packard and The University of Bristol research labs (5) discussed previously suggest that the pen interface is very effective for pointing and dragging tasks, although subject feedback gave a lower appropriateness rating (for use with pen input) to applications in which handwriting recognition functionality was more significant and vital to task execution. Overall, the results of the experiment conveyed the impression that the pen was comparable with the mouse for pointing and dragging tasks, but not as good as the keyboard for text entry because of relatively high recognition error rates.
In my attempt at a direct comparison of pen-based and keyboard and mouse-based systems, it was shown that pen-based systems are the simpler of the two in a cognitive sense, because the overhead associated with switching the hand between the mouse and the keyboard is removed with pen-based systems, as the pen is the sole input device. The other operators in the model most likely have quicker execution times than their KLM equivalents for a specific user, as a result of the direct-manipulation input device that is the integrated digitizer and screen, and because of the more advanced motor skills that most users will have for using a pen than for using a mouse. The only exception to this is the B-operator and its pen equivalent (pressing a mouse button/tapping the pen on the digitizing tablet), for which no evidence suggests that one is faster than the other; but usually this operator is combined with the P-operator (pointing), and reasons exist to believe this combination is faster using pen-based systems than keyboard and mouse-based systems.
The results of the experiment conducted by Buxton et al. (7) discussed previously support these conclusions by showing the pen to be faster than the mouse (and the trackball) for both pointing and dragging tasks, although the results showed higher error rates during both pointing and dragging for the pen than for the mouse. It is reasonable to assume that this was because the experiment used the indirect-manipulation form of the digitizing tablet (where it is not integrated with the screen). With this configuration of the digitizing tablet, the user is confined to using a low interactivity-level input mechanism just as they are when using a mouse, but with a less familiar input device, as this configuration of the digitizing tablet does not resemble the conventional pen and paper metaphor like the direct-manipulation configuration does.
BIBLIOGRAPHY

1. PDA vs. Laptop: a comparison of two versions of a nursing documentation application. Center for Computer Research Development, Puerto Rico University, Mayaguez, Puerto Rico.
2. N-trig. http://www.n-trig.com.
3. http://msdn2.microsoft.com/en-us/library/ms811395.aspx.
4. Multimodal Integration for Advanced Multimedia Interfaces, ESPRIT III Basic Research Project 8579. Available: http://hwr.nici.kun.nl/~miami/taxonomy/node1.html.
5. Recognition accuracy and user acceptance of pen interfaces. University of Bristol Research Laboratories; Hewlett Packard. CHI 95 Proceedings and Papers. Available: http://www.acm.org/sigs/sigchi/chi95/Electronic/documnts/.
6. A. Dix, J. Finlay, G. Abowd, and R. Beale, Human-Computer Interaction, 2nd ed. Englewood Cliffs, NJ, 1998.
7. A Comparison of Input Devices in Elemental Pointing and Dragging Tasks. Available: http://www.billbuxton.com/tts91.html.
DR. SANDIP JASSAR
Leicester, United Kingdom
PROGRAMMABLE LOGIC ARRAYS
INTRODUCTION
Programmable logic arrays (PLAs) are widely used tradi-
tional digital electronic devices. The term digital is
derived from the way digital systems process information, that is, by representing information in digits and operating
on them. Over the years, digital electronic systems have
progressed from vacuum-tube circuits to complex inte-
grated circuits, some of which contain millions of transis-
tors. Currently, digital systems are included in a wide
range of areas, such as communication systems, military
systems, medical systems, industrial control systems, and
consumer electronics.
Electronic circuits can be separated into two groups, digital and analog. Analog circuits operate on analog quantities that are continuous in value, whereas digital circuits operate on digital quantities that are discrete in value and limited in precision. Analog signals are continuous in time and continuous in value. Most measurable quantities in nature are in analog form, for example, temperature. Measurements of temperature changes are continuous in value and in time; the temperature can take any value at any instant of time, with precision limited only by the capability of the measurement tool. Fixing the measurement of temperature to one reading per interval of time and rounding each recorded value to the nearest integer yields discrete values at discrete intervals of time that can easily be coded into digital quantities.
by-nature quantity could be converted to digital by taking
discrete-valued samples at discrete intervals of time and
then coding each sample. The process of conversion usually
is known as analog-to-digital conversion (A/D). The oppo-
site scenario of conversion is also valid and known as
digital-to-analog conversion (D/A). The representation of
information in a digital form has many advantages over
analog representation in electronic systems. Digital data
that is discrete in value, discrete in time, and limited in precision can be efficiently stored, processed, and transmitted. Digital systems are more noise-immune than analog electronic systems because of the physical nature of analog signals. Accordingly, digital systems are more reliable than their analog counterparts. Examples of analog and digital systems are shown in Fig. 1.
A BRIDGE BETWEEN LOGIC AND CIRCUITS
Digital electronic systems represent information in digits. The digits used in digital systems are the 0 and 1 that belong to the binary mathematical number system. In logic, the 0 and 1 values correspond to true and false. In circuits, true and false correspond to high voltage and low voltage. These correspondences set the relations among logic (true and false), binary mathematics (0 and 1), and circuits (high and low).
Logic, in its basic shape, checks the validity of a certain proposition; a proposition could be either true or false. The relation among logic, binary mathematics, and circuits enables a smooth transition of processes expressed in propositional logic to binary mathematical functions and equations (Boolean algebra) and to digital circuits. A great scientific wealth exists to support strongly the relations among the three different branches of science that led to the foundation of modern digital hardware and logic design.
Boolean algebra uses three basic logic operations: AND, OR, and NOT. If joined with a proposition P, the NOT operation works by negating it; for instance, if P is True then NOT P is False, and vice versa. The operations AND and OR should be used with two propositions, for example, P and Q. The logic operation AND, if applied on P and Q, would mean that P AND Q is True only when both P and Q are True. Similarly, the logic operation OR, if applied on P and Q, would mean that P OR Q is True when either P or Q is True. Truth tables of the logic operators AND, OR, and NOT are shown in Fig. 2a. Figure 2b shows an alternative representation of the truth tables of AND, OR, and NOT in terms of 0s and 1s.
Digital circuits implement the logic operations AND, OR, and NOT as hardware elements called gates that perform logic operations on binary inputs. The AND-gate performs an AND operation, an OR-gate performs an OR operation, and an inverter performs the negation operation NOT. Figure 2c shows the standard logic symbols for the three basic operations. With analogy from electric circuits, the functionality of the AND and OR gates is captured as shown in Fig. 3. The actual internal circuitry of gates is built using transistors; two different circuit implementations of inverters are shown in Fig. 4. Examples of AND, OR, and NOT gate integrated circuits (ICs) are shown in Fig. 5.
Besides the three essential logic operations, four other important operations exist: the NOR, NAND, exclusive-OR (XOR), and exclusive-NOR.
A logic circuit usually is created by combining gates
together to implement a certain logic function. A logic
function could be a combination of logic variables (such
as A, B, C, etc.) with logic operations; logic variables can
take only the values 0 or 1. The created circuit could be
implemented using AND-OR-Inverter gate-structure or
using other types of gates. Figure 6 shows an example
combinational implementation of the following logic func-
tion F(A, B, C):
F(A, B, C) = ABC + A′BC + AB′C′
In this case, F(A, B, C) could be described as a sum-of-products (SOP) function according to the analogy that exists between OR and addition (+) and between AND and product (·).
Truth tables in True/False (Fig. 2a):

X      Y      X AND Y        X      Y      X OR Y         X      NOT X
False  False  False          False  False  False          False  True
False  True   False          False  True   True           True   False
True   False  False          True   False  True
True   True   True           True   True   True

Truth tables in binary (Fig. 2b):

X  Y  X AND Y        X  Y  X OR Y        X  NOT X
0  0  0              0  0  0             0  1
0  1  0              0  1  1             1  0
1  0  0              1  0  1
1  1  1              1  1  1

Figure 2. (a) Truth tables for AND, OR, and Inverter. (b) Truth tables for AND, OR, and Inverter in binary numbers. (c) Symbols for AND, OR, and Inverter with their operation.
Figure 3. A suggested analogy between AND and OR gates and electric circuits.
Figure 1. A simple analog system and a digital system; the analog system amplifies the input signal (from a microphone to a speaker) using analog electronic components. The digital system (e.g., a personal digital assistant or a mobile phone) can still include analog components like a speaker and a microphone, but the internal processing is digital.
The NOT operation is indicated by an apostrophe that follows the variable name.
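The sketch below evaluates the example SOP function F(A, B, C) = ABC + A'BC + AB'C' over all input combinations, mirroring what the AND-OR-Inverter structure of Fig. 6 computes. It is an illustration of the logic only, not of any particular hardware.

    from itertools import product

    def F(a, b, c):
        """F(A, B, C) = ABC + A'BC + AB'C' using AND (and), OR (or), NOT (not)."""
        return (a and b and c) or ((not a) and b and c) or (a and (not b) and (not c))

    # Print the truth table of the sum-of-products function.
    for a, b, c in product([0, 1], repeat=3):
        print(a, b, c, "->", int(F(a, b, c)))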
PROGRAMMABLE LOGIC
Basically, three types of IC technologies can be used to implement logic functions (1): full-custom, semi-custom, and programmable logic devices (PLDs). In full-custom implementations, the designer is concerned with the realization of the desired logic function down to the deepest details, including transistor-level optimizations, to produce a high-performance implementation. In semi-custom implementations, the designer uses ready logic-circuit blocks and completes the wiring to achieve an acceptable-performance implementation in a shorter time than full-custom procedures. In PLDs, the logic blocks and the wiring are ready. To implement a function on a PLD, the designer decides which wires and blocks to use; this step usually is referred to as programming the device.
Figure 4. Complementary metal-oxide semiconductor (CMOS) and transistor-transistor logic (TTL) inverters.
Figure 5. The 74LS21 (AND), 74LS32 (OR), and 74LS04 (Inverter) TTL ICs.
Figure 6. AND-OR-Inverter implementation of the function F(A, B, C) = ABC + A′BC + AB′C′.
Obviously, the development time using a PLD is shorter than with the full-custom and semi-custom implementation options. The performance of a PLD varies according to its type and complexity; however, a full-custom circuit is optimized to achieve higher performance. The key advantage of modern programmable devices is their reconfiguration without rewiring or replacing components (reprogrammability). Programming a modern PLD is as easy as writing a software program in a high-level programming language.
The first programmable device that achieved widespread use was the programmable read-only memory (PROM) and its derivatives, the mask-programmable and field-programmable (erasable or electrically erasable) versions. Another step forward led to the development of PLDs. Programmable array logic (PAL), programmable logic array (PLA), and generic array logic (GAL) are commonly used PLDs designed for small logic circuits and are referred to as simple PLDs (SPLDs). Other types, such as the mask-programmable gate arrays, were developed to handle larger logic circuits. Complex PLDs (CPLDs) and field-programmable gate arrays (FPGAs) are more complicated devices that are fully programmable and instantaneously customizable. Moreover, FPGAs and CPLDs can implement very complex computations, with devices containing millions of gates currently in production. A classification of PLDs is shown in Fig. 7.
PLAs
A PLA is an SPLD that is used to implement combinational logic circuits. A PLA has a set of programmable AND gates, which link to a set of programmable OR gates to produce an output (see Fig. 8). Implementing a certain function using a PLA requires determining which connections among wires to keep. The unwanted routes can be eliminated by burning the switching device (possibly a fuse or an antifuse) that connects different routes. The AND-OR layout of a PLA allows logic functions that are in an SOP form to be implemented.
Technologies used to implement programmability in
PLAs include fuses or antifuses. A fuse is a low resistive
element that could be blown (programmed) to result in an
open circuit or high impedance. An antifuse is a high
Figure 7. Typical PLD device classification: SPLDs (PLA, PAL, GAL) and high-density PLDs (CPLDs, FPGAs).
Figure 8. A 3-input, 2-output PLA with its AND arrays and OR arrays. An AND array is equivalent to a standard multiple-input AND gate, and an OR array is equivalent to a standard multiple-input OR gate.
resistive element (initially high impedance) and is pro-
grammed to be low impedance.
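To make the AND-plane/OR-plane structure concrete, the following sketch models a small PLA as two connection tables: each product term lists which (possibly complemented) inputs feed its AND gate, and each output lists which product terms feed its OR gate. The programming shown implements the example function F(A, B, C) above; it is an illustrative assumption for the sketch, not a description of any specific device.

    def pla_eval(inputs, and_plane, or_plane):
        """Evaluate a PLA. and_plane[t] lists the literals of product term t as
        (input_index, complemented); or_plane[o] lists the terms ORed into output o."""
        terms = []
        for literals in and_plane:
            term = 1
            for idx, complemented in literals:
                bit = inputs[idx]
                term &= (1 - bit) if complemented else bit
            terms.append(term)
        return [int(any(terms[t] for t in out_terms)) for out_terms in or_plane]

    # Programming for F(A, B, C) = ABC + A'BC + AB'C'  (inputs ordered A, B, C).
    AND_PLANE = [
        [(0, False), (1, False), (2, False)],   # ABC
        [(0, True),  (1, False), (2, False)],   # A'BC
        [(0, False), (1, True),  (2, True)],    # AB'C'
    ]
    OR_PLANE = [[0, 1, 2]]                      # one output ORing all three terms

    print(pla_eval([1, 1, 1], AND_PLANE, OR_PLANE))  # -> [1]
    print(pla_eval([0, 1, 0], AND_PLANE, OR_PLANE))  # -> [0]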
Boolean expressions can be represented in either of two standard forms, the SOP and the product-of-sums (POS). For example, the equations for F (an SOP) and G (a POS) are as follows:

F(A, B, C) = ABC + A′BC + AB′C′
G(A, B, C) = (A + B + C) · (A′ + B + C) · (A + B′ + C′)
L_in(P, ω_in) cos θ_in dω_in
In general, the BRDF is different for each point on a surface; it is a function of position: f_r(P, ω_in, ω_out). The position dependence of the BRDF creates the surface's characteristic visual texture.
The BRDF defines the surface appearance and, therefore, is at the heart of all rendering algorithms. It has to be evaluated for each visible point in the scene because it converts the light arriving at that point from the light sources (direction ω_in) into the light directed toward the eye (direction ω_out). Renderers make use of a small routine, called a shader, which uses the BRDF to compute the
Figure 4. Various light source models are used for lighting simulation: (a) omnidirectional point light, (b) spot light, (c) directional light, (d) area light, and (e) environment map.
amount of light reflected along an outgoing direction due to the light incident from one or more directions. Each object in the scene has a shader assigned to it depending on the desired reflectance properties.
BRDF Properties
A physically plausible BRDF must satisfy three fundamental properties. First, the BRDF is a non-negative function. Second, the BRDF is reciprocal, that is to say, the BRDF value does not change if the incoming and outgoing directions are interchanged [i.e., f_r(ω_in, ω_out) = f_r(ω_out, ω_in)]. Third, the BRDF is energy conserving (i.e., a surface cannot reflect more than the amount of incident light).
BRDF Types
The BRDF functions are as wildly varying as are the appearances of objects in the real world. A diffuse (Lambertian) BRDF describes matte objects that reflect light uniformly in all directions. No matter where the light comes from, it is reflected uniformly along all directions in the hemisphere above the surface point. The diffuse BRDF is constant, f_r^diffuse(ω_in, ω_out) = k_d / π, where 0 ≤ k_d ≤ 1.
A specular BRDF describes a mirror surface, which only reflects light according to the law of reflection (the angle of reflection is equal to the angle of incidence), and the reflected radiance is a constant fraction of the incident radiance. Apart from those two special cases, there is a wide range of glossy or directional diffuse BRDFs that can have completely arbitrary properties. Most commonly, the light is directed around the direction of perfect specular reflection, but it can also be reflected back toward the direction of incidence (retro-reflective materials). Real BRDFs are usually modeled by a sum of diffuse, specular, and glossy components (Fig. 5b).
BRDF Models
For the purpose of lighting simulation, general BRDFs are usually described by a mathematical model, parameterized by a small number of user-settable parameters. The most common is the Phong model, whose physically plausible form is f_r^phong(ω_in, ω_out) = k_d + k_s cos^n α, where α is the angle between the mirror reflection direction ω_r of ω_in and the outgoing direction ω_out. The user sets k_d (diffuse reflectivity, which determines the surface color), k_s (specular reflectivity, which determines the intensity of specular highlights), and n (specular exponent, or shininess, which determines the sharpness of specular highlights). Many other BRDF models exist; see Refs. 1 or 3 for some examples.
BRDF Extensions
The BRDF describes the reflection of light at one point. The model assumes that all light energy incident at a point is reflected from the same point, which is not true in the case of subsurface scattering, where the light energy enters the surface at one point, is scattered multiple times inside the material, and leaves the surface at a different point. This reflection behavior is exhibited by human skin (think of the frightening red color in one's face if he or she positions a flashlight under his or her chin), marble, milk, plant tissues, and so on. The reflection behavior of those materials is described by the bidirectional surface scattering reflectance distribution function (BSSRDF).
DIRECT ILLUMINATION
Once the emission, geometry, and reflection functions of scene objects are specified, the light transport in the scene can be simulated. Real-time graphics (e.g., in video games) use a lighting simulation that disregards most of the interactions of light with scene objects. It only takes into account the so-called direct illumination. For each visible surface point in the scene, only the light energy coming directly from the light source is reflected toward the view point. The direct lighting simulation algorithm proceeds as follows.
For each image pixel, the scene point visible through that pixel is determined by a hidden surface removal algorithm [most commonly the z-buffer algorithm (4)]. For a visible
surface point P, the outgoing radiance L_out(P, ω_out) in the direction of the virtual camera is computed by summing the incoming radiance contributions from the scene light sources, multiplied by the BRDF at P:

L_out(P, ω_out) = Σ_{i=1}^{#lights} L(Q_i → P) f_r(P, Q_i → P, ω_out) cos θ_in     (1)

In this equation, Q_i is the position of the ith light source, Q_i → P denotes the direction from the ith light source toward the illuminated point P, and L(Q_i → P) is the radiance due to the ith light source, arriving at P. The equation assumes point light sources. An area light source can be approximated by a number of point lights with positions randomly picked from the surface of the area light source.

Figure 5. (a) The geometry of light reflection at a surface. (b) A general BRDF is a sum of diffuse, specular, and glossy components.
In the specific example of the physically plausible Phong BRDF, the direct illumination would be computed using a formula similar to the following:

L_out(P, ω_out) = Σ_{i=1}^{#lights} (I_i / A_i) (k_d + k_s cos^n α) cos θ_in

Here, I_i denotes the intensity of the light source. A_i is the light attenuation term, which should amount to the squared distance between the light source position and the illuminated point. In practical rendering, the attenuation might be only linear with distance, or there might be no attenuation at all. The cosine term, cos θ_in, is computed as the dot product between the surface normal and the direction ω_in (i.e., a unit vector in the direction of the light source).
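A minimal sketch of this direct-lighting computation for point lights with the physically plausible Phong BRDF is given below. The scene geometry, light intensities, and material constants are made-up inputs, and occlusion (shadowing) is ignored, as in the basic local-illumination algorithm described above.

    import math

    def normalize(v):
        n = math.sqrt(sum(c * c for c in v))
        return tuple(c / n for c in v)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def reflect(d, n):
        """Mirror the direction d about the surface normal n (both unit vectors)."""
        k = 2.0 * dot(d, n)
        return tuple(k * nc - dc for dc, nc in zip(d, n))

    def direct_illumination(p, normal, view_dir, lights, kd, ks, shininess):
        """Sum (I_i / A_i) * (kd + ks*cos^n(alpha)) * cos(theta_in) over point lights,
        with squared-distance attenuation A_i. lights: list of (position, intensity)."""
        total = 0.0
        for light_pos, intensity in lights:
            to_light = tuple(l - q for l, q in zip(light_pos, p))
            dist2 = dot(to_light, to_light)                  # attenuation term A_i
            w_in = normalize(to_light)
            cos_theta = max(0.0, dot(normal, w_in))          # cosine term
            w_r = reflect(w_in, normal)                      # mirror direction of w_in
            cos_alpha = max(0.0, dot(w_r, view_dir))         # angle to outgoing direction
            brdf = kd + ks * (cos_alpha ** shininess)        # physically plausible Phong
            total += (intensity / dist2) * brdf * cos_theta
        return total

    # Made-up scene: one point light above a horizontal surface at the origin.
    L = direct_illumination(p=(0, 0, 0), normal=(0, 0, 1),
                            view_dir=normalize((0, -1, 1)),
                            lights=[((0, 0, 2), 10.0)],
                            kd=0.6, ks=0.3, shininess=20)
    print("outgoing radiance:", L)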
In the basic form of the algorithm, all the light sources are assumed to illuminate all scene points, regardless of possible occlusion between the illuminated point and the light source. It means that the determination of the outgoing radiance at a point is a local computation, as only the local BRDF is taken into account. Therefore, the algorithm is said to compute local illumination. All surfaces are illuminated exclusively by the light coming directly from the light sources, thus direct illumination. Direct illumination is very fast; using today's graphics hardware, it is possible to generate hundreds of images per second of directly illuminated, reasonably complex scenes.
To include shadow generation in a direct illumination algorithm, the occlusion between the light source and the illuminated point has to be determined. To detect possible occluders, one can test all the scene objects for intersection with the ray between the illuminated point and the light source. However, this process is slow even with a specialized spatial indexing data structure. For real-time lighting simulation, shadows are generated using shadow maps or shadow volumes (4).
GLOBAL ILLUMINATION
The simplistic direct illumination algorithm is generally not capable of simulating the interesting lighting effects mentioned in the introduction, such as specular reflections, color bleeding, or caustics. The two main reasons are as follows. First, those effects are due to light reflected multiple times on its way from the source to the camera. In direct illumination, however, only a single reflection of light is taken into account (the reflection at the visible point P). Second, the point P under consideration is not only illuminated by light sources but also by the light reflected from all other surfaces visible from P. The light arriving at a point from other scene surfaces is called indirect illumination, as opposed to direct illumination, which arrives directly from the light sources.
The total amount of light reflected by a surface at a point P in the direction ω_out is given by the following hemispherical integral, referred to as the Reflectance Equation or Illumination Integral:
L_out(P, ω_out) = ∫_{Ω_in} L_in(P, ω_in) f_r(P, ω_in, ω_out) cos θ_in dω_in

Note the similarity between this equation and the direct illumination sum in Equation (1). The direct illumination sum takes into account only those directions leading to the point light sources, whereas the Reflectance Equation takes into account all directions in the hemisphere, thereby translating the sum into an integral.
To compute the outgoing radiance L_out at the point P using the Reflectance Equation, one has to determine the incoming radiance L_in from every incoming direction ω_in in the hemisphere above P, multiply L_in by the BRDF, and integrate over the whole hemisphere. The incoming radiance L_in from the direction ω_in is equal to the radiance leaving the point Q (visible from P in direction ω_in) in the direction −ω_in. That is to say, the incoming radiance at the point P from a single direction is given by the outgoing radiance at a completely different point Q in the scene.
The outgoing radiance at Q is computed by evaluating the Reflectance Equation at Q, which involves evaluating the outgoing radiance at many other surface points R_i for which the Reflectance Equation has to be evaluated again. This behavior corresponds to the aforementioned multiple reflections of the light in the scene. The light transport is global in the sense that the illumination of one point is given by the illumination of all other points in the scene.
By substituting the incoming radiance L_in(P, ω_in) at P by the outgoing radiance L_out at the point directly visible from P in the direction ω_in, we get the Rendering Equation describing the global light transport in a scene:

L(P, ω_out) = L_e(P, ω_out) + ∫_{Ω_in} L(C(P, ω_in), −ω_in) f_r(P, ω_in, ω_out) cos θ_in dω_in

The self-emitted radiance L_e is nonzero only for light sources and is given by the Emission Distribution Function. C(P, ω_in) is the ray casting function that represents the surface point visible from P in the direction ω_in. As all radiances in the Rendering Equation are outgoing, the subscript "out" was dropped. Global illumination algorithms compute an approximate solution of the Rendering Equation for points in the scene.
RAY TRACING
The simplest algorithm that computes some of the global illumination effects, namely perfect specular reflections and refractions, is ray tracing (5). The elementary operation in ray tracing is the evaluation of the ray casting function C(P, ω) (also called ray shooting): Given a ray defined by its origin P and direction ω, find the closest intersection of the ray with the scene objects. With an accelerating spatial indexing data structure, the query can be evaluated in O(log N) time, where N is the number of scene objects.
Ray tracing generates images as follows. For a given pixel, a primary ray from the camera through that pixel is cast into the scene and the intersection point P is found. At the intersection point, shadow rays are sent toward the light sources to check whether they are occluded. Incoming radiance contributions due to each unoccluded light source are multiplied by the BRDF at P and summed to determine direct illumination. So far we only have direct illumination as described above. The bonus with ray tracing is the treatment of reflection on specular surfaces.
If the surface is specular, a secondary ray is cast in the direction of the perfect specular reflection (equal angles of incidence and reflection). The secondary ray intersects the scene at a different point Q, where the outgoing radiance in the direction from Q toward P is evaluated and returned to the point P, where it is multiplied by the BRDF and added to the pixel color. The procedure to compute the outgoing radiance at Q is exactly the same as at P. Another secondary ray may be initiated at Q that intersects the surface at a point R, and so on. The whole procedure is recursive and is able to simulate multiple specular reflections. Refractions are handled similarly, but the secondary ray is cast through the object surface according to the index of refraction. This algorithm is also referred to as classical or Whitted ray tracing.
Ray tracing is a typical example of a view-dependent lighting computation algorithm: light, to be precise the outgoing radiance, is computed only for the scene points visible from the current viewpoint of the virtual camera. In this way, ray tracing effectively combines image generation with lighting simulation.
MONTE CARLO
Ray tracing as described in the previous section handles indirect lighting on perfectly specular or transmitting surfaces, but it still is not able to compute indirect lighting on diffuse or glossy ones. On those surfaces, one really has to evaluate the Reflectance Equation at each point. The problem on those surfaces is how to numerically evaluate the involved hemispherical integral. One possibility is to use Monte Carlo (MC) quadrature. Consider MC for evaluating the integral I = ∫_0^1 f(x) dx. It involves taking a number of uniformly distributed random numbers ξ_i from the interval [0, 1], evaluating the function f at each of those points, and taking the arithmetic average to get an unbiased estimator of the value of the integral: ⟨I⟩ = (1/N) Σ_i f(ξ_i). The unbiased nature of the estimator means that the expected value of the estimator is equal to the value of the integral. The larger the number of samples f(ξ_i), the more accurate the estimate is, but the expected error only decreases with the square root of the number of samples.
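A tiny sketch of this estimator is shown below, using ∫_0^1 x² dx = 1/3 as a stand-in integrand; the slow, roughly 1/√N improvement of the estimate can be observed by increasing N.

    import random

    def mc_estimate(f, n):
        """Unbiased Monte Carlo estimate of the integral of f over [0, 1]."""
        return sum(f(random.random()) for _ in range(n)) / n

    f = lambda x: x * x            # exact integral over [0, 1] is 1/3
    for n in (10, 1000, 100000):
        print(n, mc_estimate(f, n))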
MONTE CARLO RAY TRACING
To apply MC to evaluating the hemispherical integral in the Reflectance Equation, one generates a number of random, uniformly distributed directions over the hemisphere.
For each random direction, a secondary ray is cast in that
direction exactly in the same way as in classic ray tracing.
The secondary ray hits a surface at a point Q, the outgoing
radiance in the direction of the secondary ray is computed
(recursively using exactly the same procedure) and
returned back. The contributions from all secondary rays
are summed together and give the estimate of the indirect
lighting. MC is only used to compute indirect illumination
(the light energy coming from other surfaces in the scene).
Direct illumination is computed using the shadow rays in
the same way as in classic ray tracing. Figure 6 shows an
image rendered with MC ray tracing.
Simply put, MC ray tracing casts the secondary rays in
randomized directions, as opposed to classic ray tracing,
which only casts them in the perfect specular reection
direction. By sending multiple randomized rays for each
pixel and averaging the computed radiances, an approx-
imation to the correct global illumination solution is
achieved. A big disadvantage of MC ray tracing is the
omnipresent noise, whose level only decreases with the
square root of the number of samples.
Figure 6. Global illumination (a) is composed of the direct illumination (b) and indirect illumination (c). The indirect illumination is
computed with MC quadrature, which may leave visible noise in images. Overall, 100 samples per pixel were used to compute the indirect
illumination.
Classic ray tracing was able to compute indirect lighting only on specular surfaces. By applying MC to evaluating the Reflectance Equation, MC ray tracing can compute indirect lighting on surfaces with arbitrary BRDFs.
IMPORTANCE SAMPLING
Consider again evaluating the integral I = ∫_0^1 f(x) dx, but imagine this time that the function f is near zero in most of the interval and only has a narrow peak around 0.5. If MC quadrature is run as described above, by choosing the samples uniformly from the interval [0, 1], many of the samples will not contribute to the integral estimate because they will miss the narrow peak, and the computational power put into them will be wasted. If no information is available about the behavior of f, this is probably the best one can do. However, if some information on the behavior of the integrand f is known, concentrating the samples around the peak will improve the estimate accuracy without having to generate excessive samples. Importance sampling is a variant of MC quadrature that generates more samples in the areas where the integrand is known to have larger values and reduces the variance of the estimate significantly. If the samples are generated with the probability density p(x), the unbiased estimator is ⟨I⟩ = (1/N) Σ_i f(ξ_i)/p(ξ_i), where the ξ_i are the generated random numbers. Importance sampling significantly reduces the estimator variance if the probability density p(x) follows the behavior of the integrand f(x). Going back to evaluating the Reflectance Equation, the integrand is f_r(ω_in, ω_out) L_in(ω_in) cos θ_in. The incoming radiance L_in(ω_in) is completely unknown (we are sending the secondary rays to sample it), but the BRDF is known. The direction ω_out is fixed (the integration is over the incoming directions), so we can use the known function f_r^{ω_out}(ω_in) cos θ_in as the probability density for generating the samples. For highly directional (glossy) BRDFs, the noise reduction is substantial, as shown in Fig. 7.
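The following sketch contrasts uniform sampling with importance sampling on a one-dimensional integrand that has a narrow peak, using a hand-picked Gaussian density p(x); both estimators are unbiased, but the importance-sampled one has far lower variance. The peak position and the density are assumptions made only for this illustration.

    import math, random

    def f(x):
        """Illustrative integrand with a narrow peak around 0.5."""
        return math.exp(-((x - 0.5) ** 2) / (2 * 0.01 ** 2))

    def uniform_estimate(n):
        return sum(f(random.random()) for _ in range(n)) / n

    def importance_estimate(n):
        """Sample x from a normal density centered on the peak and weight by f(x)/p(x).
        The density's support outside [0, 1] is negligible for this f."""
        sigma = 0.02
        total = 0.0
        for _ in range(n):
            x = random.gauss(0.5, sigma)
            p = math.exp(-((x - 0.5) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
            total += f(x) / p
        return total / n

    print("uniform:   ", uniform_estimate(10000))
    print("importance:", importance_estimate(10000))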
PHOTON MAPS
Although importance sampling reduces the image noise a great deal, MC ray tracing still takes too long to converge to a noise-free image. Photon mapping (6) is a two-pass probabilistic lighting simulation method that performs significantly better.
In the first (photon tracing) pass, light sources emit photons in randomized directions. Brighter lights emit more photons, whereas dimmer ones emit fewer. Each time an emitted photon hits a surface, the information about the hit (hit position, photon energy, and incoming direction) is stored in the photon map, which is a spatial data structure specifically designed for fast nearest-neighbor queries. After storing the photon in the map, the tracing continues by reflecting the photon off the surface in a probabilistic manner (mostly using importance sampling according to the surface BRDF). The result of the photon tracing pass is a filled photon map, which roughly represents the distribution of indirect illumination in the scene.
In the second pass, lighting reconstruction and image formation are done together. For each pixel, a primary ray is sent into the scene. At its intersection with a scene surface, the direct illumination is computed by sending the shadow rays as described above. Indirect illumination is reconstructed from the photon map. The map is queried for the k nearest photons (k ≈ 50-200), and the indirect illumination is computed from the energy and density of those photons with a procedure called the radiance estimate. In this manner, the incoming radiance L_in(ω_in) in the Reflectance Equation is determined from the photon map without ever having to send any secondary rays.
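A very small sketch of the nearest-neighbor radiance estimate is given below: the stored photons closest to the query point are found and their total power is divided by the area of the disc that encloses them. Real photon maps use a kd-tree and per-photon BRDF weighting; the brute-force search and the diffuse-only weighting here are simplifications assumed for the illustration.

    import math

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def radiance_estimate(photons, point, k, kd):
        """Indirect radiance at 'point' from the k nearest photons.
        photons: list of (position, power); kd: diffuse reflectance."""
        nearest = sorted(photons, key=lambda ph: dist2(ph[0], point))[:k]
        if not nearest:
            return 0.0
        r2 = dist2(nearest[-1][0], point)            # radius^2 of the enclosing disc
        total_power = sum(power for _, power in nearest)
        return (kd / math.pi) * total_power / (math.pi * r2)

    # Made-up photon map on a plane around the origin.
    photons = [((0.1, 0.0, 0.0), 0.02), ((0.0, 0.2, 0.0), 0.02),
               ((-0.15, 0.1, 0.0), 0.02), ((0.3, -0.2, 0.0), 0.02)]
    print(radiance_estimate(photons, (0.0, 0.0, 0.0), k=3, kd=0.7))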
Unlike ray tracing, photon mapping separates the lighting simulation stage from image generation. The first, photon tracing, pass is an example of a view-independent
Figure 7. Importance sampling can substantially reduce the noise level in the image. Both the left and right images were computed with MC ray tracing using 100 samples per pixel. Uniform hemisphere sampling was used for the image on the left; importance sampling was used for the image on the right.
lighting computation algorithm. The photons are traced
and stored independently from the position of the camera.
The image generationitself is performedinthe secondpass.
FINAL GATHERING, IRRADIANCE CACHING
As the photon map holds a very rough representation of illumination, the images generated with the reconstruction pass as described above do not usually have good quality, because the indirect illumination randomly fluctuates over surfaces. Final gathering inserts one level of MC ray tracing into the lighting reconstruction pass of photon mapping and achieves high-quality images. For each primary ray hitting a surface, many secondary rays are sent (500-6000) to sample the indirect incoming radiance. At the points where the secondary rays hit a scene surface, the radiance estimate from the photon map is used to determine indirect illumination. Because as many rough radiance estimates as there are secondary rays are averaged for each pixel, the image quality is good (the uneven indirect radiance is averaged out).
Final gathering is still very costly, because many secondary rays have to be sent for each pixel. However, on diffuse surfaces, the surface shading due to indirect illumination tends to change very slowly as we move over the surface, which is exploited in the irradiance caching (7) algorithm, where the costly final gathering is computed only at a few spots in the scene and interpolated for points in a neighborhood. Photon mapping with a form of final gathering is currently the most commonly used method for computing global illumination in production rendering software.
PARTICIPATING MEDIA: FOG, CLOUDS, AND SMOKE

All methods in the previous sections were built on one fundamental assumption: the value of radiance along a straight ray does not change until the ray hits a surface. However, if the space between the scene surfaces is filled with a volumetric medium like fog, cloud, or smoke, radiance can be altered by the interaction with the medium.
At each point of the medium, the value of radiance along a straight line can be modified by a number of events. Absorption decreases the radiance due to conversion of the light energy into heat and is described by the volume absorption coefficient σ_a [m⁻¹]. Scattering, described by the volume scattering coefficient σ_s [m⁻¹], is a process where a part of the light energy is absorbed and then re-emitted in various directions. The angular distribution of the scattered light is described by the phase function. Scattering in a volume element is similar to reflection on a surface, and the phase function corresponds to a surface BRDF. The total radiance loss due to absorption and scattering is described by Beer's exponential extinction law: Radiance decreases exponentially with the optical depth of the medium. The optical depth is the line integral of σ_a + σ_s along the ray.
The value of radiance in a volume element does not only decrease; it can be increased due to emission and in-scattering. In-scattering is the radiance gain caused by the energy scattered into the ray direction from the neighboring volume elements. By integrating all the differential radiance changes along the ray, we get the integral form of the Volume Light Transfer Equation (see Ref. 3).
The light transfer in volumes is much more complicated than on surfaces and takes longer to compute. Fortunately, humans are not very skillful observers of illuminated volumes, and large errors can be tolerated in the solution, which is often exploited by simplifying the transport: the most common simplification is the single scattering approximation, where only one scattering event is assumed on the way between the light source and the eye. It corresponds to direct illumination in surface lighting.
RADIOSITY METHOD
Radiosity (8) is a very different approach to computing global illumination on surfaces than MC ray tracing. It is based on the theory of radiative heat transfer, and the solution is found with a finite element method. In the basic radiosity method, the scene surfaces are discretized into small surface elements (patches) of constant radiosity. Only diffuse BRDFs are considered. The energy exchange in the scene is described by the radiosity equation

B_i = B_i^e + ρ_i Σ_{j=1}^{N} F_{i,j} B_j

where B_i is the total radiosity of the ith patch, B_i^e is the radiosity due to self-emission (nonzero only for light sources), ρ_i is the surface reflectance, and F_{i,j} is the form (or configuration) factor. The form factor F_{i,j} is the proportion of the total power leaving the ith patch that is received by the jth patch. The whole radiosity equation means that the radiosity leaving the ith patch is the sum of the patch's self-emitted radiosity and the reflection of the radiosities emitted by all other surface patches toward the ith patch. The radiosity emitted by a surface patch j toward the patch i is the form factor F_{i,j} times its total radiosity B_j. Knowing the form factors, the radiosity equations corresponding to all the patches of the scene can be cast into a system of linear equations with unknown patch radiosities and solved by standard linear algebra techniques. The details of the algorithm are given in a separate article.
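As an illustration of how the resulting linear system can be solved, the sketch below applies simple Jacobi-style iteration to a made-up three-patch scene; the reflectances and form factors are placeholder numbers, and the computation of real form factors (the hard part) is omitted.

    def solve_radiosity(E, rho, F, iterations=100):
        """Iteratively solve B_i = E_i + rho_i * sum_j F_ij * B_j (Jacobi iteration)."""
        n = len(E)
        B = list(E)
        for _ in range(iterations):
            B = [E[i] + rho[i] * sum(F[i][j] * B[j] for j in range(n)) for i in range(n)]
        return B

    # Placeholder 3-patch scene: patch 0 is the only emitter.
    E   = [1.0, 0.0, 0.0]
    rho = [0.0, 0.7, 0.5]
    F   = [[0.0, 0.4, 0.4],
           [0.4, 0.0, 0.3],
           [0.4, 0.3, 0.0]]   # each row sums to no more than 1

    print(solve_radiosity(E, rho, F))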
In spite of a great deal of research that has gone into radiosity, the method did not prove to be practical. The most important disadvantage is that the method works correctly only with clean models (no overlapping triangles, no inconsistent normals, etc.), which is rarely the case in practice. Among the computational disadvantages are large memory complexity due to the surface discretization, difficulty of computing form factors, and inability to handle general BRDFs.
Similar to the first pass of the photon mapping algorithm, radiosity gives a view-independent solution. It computes radiosity values for all surface patches regardless of the position of the camera. To generate the final image from a radiosity solution, final gathering is used because it masks the artifacts due to discretizing surfaces into patches.
IMAGE-BASED LIGHTING, HIGH DYNAMIC RANGE
IMAGES
In spite of the success of modeling and rendering techniques in producing many realistic-looking images of synthetic scenes, it is still not possible to model and render every naturally occurring object or scene. A compromise to this problem is to capture real-world images and combine them with computer-rendered images of synthetic objects. This approach is commonly used for special effects in movies, in high-quality games, and in content created for simulation and training. To achieve a seamless and convincing fusion of the virtual object with the images of the real world, it is essential that the virtual object be illuminated by the same light that illuminates other objects in the real scene. Thus, the illumination has to be measured from the real environment and applied during the lighting simulation on the synthetic object.
Image-based lighting (9) is the process of illuminating objects with images of light from the real world. It involves two steps. First, real-world illumination at a point is captured and stored as a light probe image (also called an environment map) (Fig. 8), an omnidirectional image in which each pixel corresponds to a discrete direction around the point. The image spans the whole range of intensities of the incoming light; it is a high dynamic range (HDR) image. An omnidirectional image can be taken by photographing a metallic sphere reflecting the environment or with a specialized omnidirectional camera. Multi-exposure photography is used to capture the high dynamic range. Each exposure captures one zone of intensities, and the exposures are then merged to produce the full dynamic range.
Once the light probe image is captured, it can be used to illuminate the object. To this end, the light probe might be replaced by a number (300-1000) of directional lights corresponding to the brightest pixels of the probe image.
IMAGE DISPLAY, TONE MAPPING
Once the physical radiance values are computed for a rectangular array of pixels, they have to be mapped to image intensity values (between 0 and 1) for display. One of the main problems that must be handled in this mapping process is that the emitting ranges of display pixels are limited and fixed, whereas the computed radiance values can vary by orders of magnitude. Another problem is that the displayed images are normally observed in illumination conditions that may be very different from those of the computed scene. Tone reproduction or tone mapping algorithms attempt to solve these problems. Reference 10 describes a number of such algorithms.
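As one illustration of the problem, the sketch below applies a simple global operator, L/(1+L), to compress arbitrary radiance values into the displayable [0, 1) range; this is only a toy mapping, not one of the specific algorithms surveyed in Ref. 10.

    def tone_map(radiances, exposure=1.0):
        """Map scene radiances to display intensities in [0, 1) with L/(1+L)."""
        return [(exposure * L) / (1.0 + exposure * L) for L in radiances]

    print(tone_map([0.01, 1.0, 50.0, 10000.0]))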
DISCUSSION
Lighting computation is an important branch of computer
graphics. More than 30 years have been devoted to the
research in lighting computation. Lighting engineers are
Figure 8. Example light probe images from the Light Probe Gallery at http://www.debevec.org/Probes. (Images courtesy of Paul Debevec.)
regularly using lighting computation to accurately predict
the lighting distribution in architectural models. The
entertainment industry is using lighting computation for
producing convincing computer-generated images of vir-
tual scenes.
Unfortunately, the computation time for global illumination simulation is still very high, prohibiting its full use in the entertainment industry, where thousands of images must be rendered within tight deadlines. Still, the industry is able to create images of stunning quality, mostly thanks to artists who know how light behaves and use their knowledge to fake the global illumination. As the indirect lighting would be missing in a direct illumination-only solution, lighting artists substitute indirect lighting with a number of fill lights to achieve a natural look of the object without having to resort to computationally expensive global illumination techniques. However, it takes a great deal of expertise to achieve a natural-looking globally illuminated scene using direct illumination-only techniques, and sometimes it is not possible at all. It is, therefore, desirable to make global illumination available to all computer graphics artists. To that end, most of the current lighting research is being devoted to developing approximate but very fast global illumination algorithms, accurate enough that their results are imperceptibly different from the correct global illumination solution.
In the research community, MC techniques have been established as the method of choice for computing global illumination. Still, there are some important topics for future work. Efficient handling of light transport on glossy surfaces and adaptive sampling techniques exploiting a priori knowledge such as illumination smoothness are some examples. Finally, computer-generated images will continue to look computer generated unless we start using reflection models that closely match reality. For this reason, more accessible and precise acquisition of BRDFs of real-world objects will be a topic of intensive research in the future.
BIBLIOGRAPHY

1. A. S. Glassner, Principles of Digital Image Synthesis, San Francisco, CA: Morgan Kaufmann, 1995.
2. P. Dutre, P. Bekaert, and K. Bala, Advanced Global Illumination, Natick, MA: A. K. Peters Ltd., 2003.
3. M. Pharr and G. Humphreys, Physically Based Rendering: From Theory to Implementation, San Francisco, CA: Morgan Kaufmann, 2004.
4. T. Akenine-Moller and E. Haines, Real-Time Rendering, 2nd ed., Natick, MA: A. K. Peters Ltd., 2002.
5. P. Shirley and R. K. Morley, Realistic Ray Tracing, 2nd ed., Natick, MA: A. K. Peters Ltd., 2003.
6. H. W. Jensen, Realistic Image Synthesis Using Photon Mapping, Natick, MA: A. K. Peters Ltd., 2001.
7. G. J. Ward and P. Heckbert, Irradiance gradients, Proc. Eurographics Workshop on Rendering, 85-98, 1992.
8. M. F. Cohen and J. R. Wallace, Radiosity and Realistic Image Synthesis, San Francisco, CA: Morgan Kaufmann, 1993.
9. P. Debevec, A tutorial on image-based lighting, IEEE Comp. Graph. Appl., Jan/Feb: 2002.
10. E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging, San Francisco, CA: Morgan Kaufmann, 2005.
JAROSLAV KŘIVÁNEK
Czech Technical University
in Prague
Prague, Czech Republic
SUMANTA PATTANAIK
University of Central Florida
Orlando, Florida
PARAMETRIC SURFACE RENDERING
INTRODUCTION
A parametric surface (1) is a surface defined by a tensor product of parametric curves, which are represented by parametric equations. Because such a surface comes with a well-defined mathematical definition, it has become one of the most common geometric tools for computer-aided geometric design and for object modeling in many computer graphics applications. Typical examples of such surfaces include Bezier surfaces (1) and non-uniform rational B-Spline (NURBS) surfaces (2). Unfortunately, because most existing graphics hardware only handles polygons, extra procedures must be carried out to convert a parametric surface into a polygon representation for rendering. This process introduces a significant bottleneck for graphics applications that involve parametric surfaces and need to assure interactiveness in terms of rendering performance.
ABOUT PARAMETRIC SURFACES
In the past, many different parametric surfaces have been developed. They include Bezier surfaces and NURBS surfaces, Bezier triangles, and multivariate objects. Such surfaces are mainly used to model smooth and curved objects and, particularly, to provide good support for modeling deformable objects. For instance, Bezier surfaces and NURBS surfaces are typically used in computer-aided geometric design, whereas multivariate objects are mainly used in scientific visualization and in free-form deformation (3). A unique feature of the parametric surface is that its shape can be changed by modifying the position of its control points.
A parametric surface is modeled by taking a tensor product of parametric curves, which are formulated by parametric equations. In a parametric equation, each three-dimensional coordinate of a parametric curve is represented separately as an explicit function of an independent parameter u:

C(u) = (x(u), y(u), z(u))     (1)

where C(u) is a vector-valued function of the independent parameter u, with a ≤ u ≤ b. The boundary of the parametric equation is defined explicitly by the parametric interval [a, b].
Bezier Surface
The Bezier surface (1) was developed by a French engineer named Pierre Bezier and was used to design Renault automobile bodies. The Bezier surface possesses several useful properties, such as endpoint interpolation, the convex hull property, and global modification. Such properties make the Bezier surface highly useful and convenient for designing curved and smooth objects. Hence, this surface is adopted widely in various CAD/CAM and animation applications and in general graphics programming packages, such as
OpenGL and Performer. A Bezier surface is defined as

S_Bez(u, v) = Σ_{i=0}^{n} Σ_{j=0}^{m} B_i^n(u) B_j^m(v) P_{ij},  for 0 ≤ u, v ≤ 1     (2)

where n and m are the degrees of the Bezier surface along the u and v parametric directions, respectively, and P_{ij} forms the control net. B_i^n(u) and B_j^m(v) are the basis functions, each of which is defined by a Bernstein polynomial:
which each is dened by a Bernstein polynomial:
B
n
i
u
n!
i!n i!
u
i
1 u
n1
(3)
To evaluate a Bezier surface, we can apply the de Casteljau subdivision algorithm (4) to its Bernstein polynomials in both the u and v parametric directions. For instance, in the u direction, we have

P_i^r(u) = (1 − u) P_i^{r−1}(u) + u P_{i+1}^{r−1}(u)     (4)

for r = 1, ..., n and i = 0, ..., n − r, where n is the degree of the surface. Bernstein polynomials of the other parametric direction are also evaluated through a similar recursion process.
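A short sketch of the de Casteljau recursion of Equation (4) for a single parametric direction is shown below; evaluating a Bezier surface applies the same recursion first along each row of the control net and then to the resulting column of points. The control net in the example is a made-up bilinear patch.

    def de_casteljau(points, u):
        """Evaluate a Bezier curve at u by repeated linear interpolation (Eq. 4)."""
        pts = [tuple(p) for p in points]
        while len(pts) > 1:
            pts = [tuple((1 - u) * a + u * b for a, b in zip(p0, p1))
                   for p0, p1 in zip(pts, pts[1:])]
        return pts[0]

    def bezier_surface_point(control_net, u, v):
        """Tensor-product evaluation: collapse each row in u, then the column in v."""
        column = [de_casteljau(row, u) for row in control_net]
        return de_casteljau(column, v)

    # Degree (1, 1) example: a bilinear patch over the unit square.
    net = [[(0, 0, 0), (1, 0, 0)],
           [(0, 1, 0), (1, 1, 1)]]
    print(bezier_surface_point(net, 0.5, 0.5))   # -> (0.5, 0.5, 0.25)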
Although the Bezier surface provides a powerful tool in
shape design, it has some limitations. Particularly, when
modeling an object with complex shape, one may either
choose to use a Bezier surface with prohibitively high
degrees or a composition of pieces of low-degree Bezier
surface patches by imposing an extra continuity constraint
between the surface patches.
B-Spline Surface
The B-Spline surface (2) is formed by taking a tensor product of B-Spline curves in two parametric directions, in which each of the B-Spline curves is a combination of several piecewise polynomials. The connecting points of the polynomial segments of a B-Spline curve are maintained automatically with C^(p−1) continuity, where p is the degree of the B-Spline curve. In addition, a B-Spline curve is composed of several spans. In the parametric domain, these spans are defined by a knot vector, which is a sequence of knots, or parameter values u_i, along the domain. A B-Spline polynomial is specified by a scaled sum of a set of basis functions. A B-Spline surface is defined as follows:
functions. A B-Spline surface is dened as follows:
S
Bs p
u; v
X
n
i0
X
m
j0
N
p
i
uN
q
j
vP
i j
(5)
where nand mare the numbers of control points, and p and
q are the degrees of surface in the u and v parametric
1
Wiley Encyclopedia of Computer Science and Engineering, edited by Benjamin Wah.
Copyright # 2008 John Wiley & Sons, Inc.
directions, respectively. P
ij
forms the control net. N
p
i
v and
N
q
j
v (v) are the basis functions. Abasis function is dened
by
N_i^p(u) = 1 if u_i ≤ u < u_{i+1}, and 0 otherwise,  for p = 1     (6)

N_i^p(u) = [(u − u_i) / (u_{i+p} − u_i)] N_i^{p−1}(u) + [(u_{i+p+1} − u) / (u_{i+p+1} − u_{i+1})] N_{i+1}^{p−1}(u),  for p > 1     (7)
The knot vectors of the B-Spline surface are defined as follows:

U = {0, ..., 0, u_{p+1}, ..., u_{r−p−1}, 1, ..., 1}, with p + 1 leading zeros and p + 1 trailing ones, for r = n + p + 1     (8)

V = {0, ..., 0, u_{q+1}, ..., u_{s−q−1}, 1, ..., 1}, with q + 1 leading zeros and q + 1 trailing ones, for s = m + q + 1     (9)
where U and V are the knot vectors along the u and v
parametric directions, respectively; U has r knots, and V
has s knots. The de Boor algorithm (5) was proposed to
evaluate a B-Spline surface in parameter space with a
recurrence formula for the B-Spline basis functions. Other
methods have been proposed to accelerate the evaluation, such
as Shantz's adaptive forward differencing algorithm (6) and
Silbermann's high-speed implementation for B-Splines (7).
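The recurrence of Equations (6) and (7) translates directly into code. The short sketch below is illustrative only; the function name and the knot-vector layout are our assumptions, it uses the base case at p = 1 as in Equation (6), and 0/0 terms are treated as 0 by convention.

# Minimal sketch of the B-Spline basis recurrence, Equations (6) and (7).
def bspline_basis(i, p, u, knots):
    """Return N_i^p(u) for the nondecreasing knot sequence `knots`."""
    if p == 1:                                   # base case, Equation (6)
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    denom_l = knots[i + p - 1] - knots[i]        # first denominator in Equation (7)
    if denom_l > 0.0:
        left = (u - knots[i]) / denom_l * bspline_basis(i, p - 1, u, knots)
    denom_r = knots[i + p] - knots[i + 1]        # second denominator in Equation (7)
    if denom_r > 0.0:
        right = (knots[i + p] - u) / denom_r * bspline_basis(i + 1, p - 1, u, knots)
    return left + right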
Generally speaking, the B-Spline surface has properties similar
to those of the Bezier surface, and the two surfaces
are tightly related. For instance, a B-Spline surface
reduces to a Bezier surface if both of its knot vectors have the
following format:

{0, ..., 0, 1, ..., 1},  with p + 1 zeros and p + 1 ones, where p is the degree of the B-Spline   (10)

In addition, a B-Spline surface can be converted into several
Bezier surface patches through knot insertion (8).
NURBS Surface
A NURBS surface (2) is a rational generalization of the
tensor-product nonrational B-Spline surface. It is
defined by applying the B-Spline surface equation [Equation (5)] to
homogeneous coordinates rather than to the normal 3-D
coordinates. The equation of a NURBS surface is defined as
follows:
S_Nrb(u, v) = [ Σ_{i=0}^{n} Σ_{j=0}^{m} w_{i,j} P_{i,j} R_{i,j}(u, v) ] / [ Σ_{i=0}^{n} Σ_{j=0}^{m} w_{i,j} R_{i,j}(u, v) ]   (12)
where w_{i,j} are the weights, P_{i,j} form a control net, and
R_{i,j}(u, v) are the basis functions. The NURBS surface not only
shares all properties of the nonrational B-Spline surface but
also possesses the following two useful features:
F_{i→j} = (1 / A_i) ∫_{A_i} dA_x ∫_{A_j} dA_y [cos φ_x cos φ_y / (π r_xy²)] V(x, y)   (3)
where φ_x and φ_y are the angles between the normals of the
differential patches dA_x and dA_y, respectively, and the line
connecting the two differential patches; r_xy is the distance
between dA_x and dA_y; and A_i and A_j are the areas of patches i
and j. Figure 4 illustrates these parameters. V(x, y) is the
visibility between the two differential patches and is
(Footnote 1: Refer to the article on Lighting for the definition of various
radiometric quantities. Footnote 2: For a Lambertian surface, color is a constant times its radiosity.)
expressed in Equation (4):

V(x, y) = 0 if there is occlusion between dA_x and dA_y, and 1 otherwise   (4)
If we exchange the roles of the patches i and j
and wish to write an expression for F_{j→i}, then it is as
follows:
F_{j→i} = (1 / A_j) ∫_{A_j} dA_y ∫_{A_i} dA_x [cos φ_x cos φ_y / (π r_xy²)] V(x, y)   (5)
From these two form factor equations we can derive a useful
property of form factors:

A_i F_{i→j} = A_j F_{j→i},   or   F_{i→j} / F_{j→i} = A_j / A_i

In other words, the ratio of the two form factors between
any two patches is inversely proportional to the ratio
between the areas of the two patches.
From the definition of the form factors itself, we get
another property: The sum of all form factors from any
patch is no greater than 1 (i.e., Σ_j F_{i→j} ≤ 1).
RADIOSITY EQUATION
In the beginning of this article, we defined radiosity as the
flux per unit area. If we use the symbol B to denote radiosity
and assume that the light is distributed uniformly over the
area of the patch, then we have B_i = Φ_i / A_i, where Φ_i is the flux
of patch i. Using this definition
of radiosity and the property of form factors, we derive the
radiosity transport equation (or radiosity equation for
short) in Equation (6) from the flux transport equation
given in Equation (2):
Φ_i / A_i = Φ_{e,i} / A_i + (ρ_i / A_i) Σ_{j=1}^{n} Φ_j F_{j→i}
          = Φ_{e,i} / A_i + (ρ_i / A_i) Σ_{j=1}^{n} (Φ_j / A_j) A_j F_{j→i}
          = Φ_{e,i} / A_i + ρ_i Σ_{j=1}^{n} (Φ_j / A_j) F_{i→j}

B_i = E_i + ρ_i Σ_{j=1}^{n} B_j F_{i→j}   (6)
The derivation of Equation (6) uses the inverse area form
factor relationship A_j F_{j→i} = A_i F_{i→j} described in the previous
section.
Figure 5 summarizes the various steps of the radiosity
computation algorithm. The process begins by inputting
the model, followed by computing form factors and solving a
linear system, and finally by displaying the scene from any
arbitrary viewpoint. The input geometry is often discretized
into smaller patches in the first step. The last step is
often a hardware walk-through of the scene. Note that in
this walk-through step, a change in the view direction does
not require the re-evaluation of any other step of the
algorithm. A change in the surface properties (diffuse reflectance
and emissivity), however, does require repeating the
linear system solution step. Form factors are not
affected by such a change and hence do not have to be
recomputed.
Figure 1. Light interaction.
Figure 2. Light propagation.
Figure 3. Form factors.
Figure 4. Form factor parameters.
SOLVING RADIOSITY EQUATION
By rearranging Equation (6), we can get the following
matrix form of the radiosity system:
(I − M) B = E   (7)
where I is the identity matrix,

M = [ ρ_1 F_{1→1}   ρ_1 F_{1→2}   ...   ρ_1 F_{1→n}
      ρ_2 F_{2→1}   ρ_2 F_{2→2}   ...   ρ_2 F_{2→n}
      ...
      ρ_n F_{n→1}   ρ_n F_{n→2}   ...   ρ_n F_{n→n} ],

B = [ B_1, B_2, ..., B_n ]^T,  and  E = [ E_1, E_2, ..., E_n ]^T.
For convenience, we may denote (I − M) in Equation (7)
by K, a square matrix with coefficients k_{ij}:

K = I − M = [k_{ij}]_{n×n}   (8)

where

k_{ij} = 1 − ρ_i F_{i→i}   for i = j,
k_{ij} = −ρ_i F_{i→j}      for i ≠ j
(1 / (1 − ρ̄)) ΔB̄^k   (14)

where

ΔB̄^k = ( Σ_{j=1}^{n} ΔB_j^k A_j ) / ( Σ_{j=1}^{n} A_j )   and   ρ̄ = ( Σ_{j=1}^{n} ρ_j A_j ) / ( Σ_{j=1}^{n} A_j )
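For physically valid reflectances, the system K B = E of Equations (7) and (8) is commonly solved with a simple iterative scheme. The sketch below is an illustrative example only, not taken from the article; it assumes the full form factor matrix F and the reflectances rho are already available, and applies Gauss-Seidel style sweeps.

# Minimal iterative sketch for the radiosity system (I - M) B = E, Equation (7).
def solve_radiosity(F, rho, E, iterations=50):
    """F[i][j] = form factor F_{i->j}, rho[i] = reflectance, E[i] = emission."""
    n = len(E)
    B = list(E)                                  # start from the emitted radiosities
    for _ in range(iterations):
        for i in range(n):
            gathered = sum(F[i][j] * B[j] for j in range(n))  # sum_j F_{i->j} B_j
            B[i] = E[i] + rho[i] * gathered      # B_i = E_i + rho_i * gathered light
    return B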
FORM FACTOR COMPUTATION

Form factors must be computed before solving the radiosity
system. Form factor computation is the most complex step
of the radiosity method; in general, it takes 60-80% of the
total computation time. There are two classes of methods
for computing form factors: analytical methods and numerical
methods.
Analytical Methods

The analytical methods are useful only in the absence of
occlusion between two patches. The most often used
analytical method is from Nishita and Nakamae (3). It
formulates the form factor from a differential patch
dS_i of the patch S_i to a patch S_j as a line integral. The
formulation is given in Equation (15). This method does
not consider occlusion between dS_i and S_j.
F_{dS_i → S_j} = (1 / 2π) Σ_{k=1}^{m} β_k cos α_k   (15)
where

cos β_k = (QV_{k+1} / |QV_{k+1}|) · (QV_k / |QV_k|),
cos α_k = (N_i / |N_i|) · (QV_k × QV_{k+1}) / |QV_k × QV_{k+1}|,

the V_k's are the vertices of patch S_j, and Q is the center of the
differential patch dS_i (see Fig. 6 for an illustration).
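A direct transcription of Equation (15) is sketched below. It is illustrative only: the vector helpers and the routine name are ours, and the polygon vertices are assumed to be ordered counterclockwise as seen from dS_i.

# Minimal sketch of the contour form factor of Equation (15).
import math

def sub(a, b):   return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b):   return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
def cross(a, b): return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def norm(a):     return math.sqrt(dot(a, a))

def form_factor_dsi_to_sj(Q, N_i, vertices):
    """F_{dS_i -> S_j}: Q is the center of dS_i, N_i its normal, vertices the polygon S_j."""
    total = 0.0
    m = len(vertices)
    for k in range(m):
        v0 = sub(vertices[k], Q)                  # QV_k
        v1 = sub(vertices[(k + 1) % m], Q)        # QV_{k+1}
        beta = math.acos(max(-1.0, min(1.0, dot(v0, v1) / (norm(v0) * norm(v1)))))
        plane_n = cross(v0, v1)                   # normal of the triangle Q V_k V_{k+1}
        cos_alpha = dot(N_i, plane_n) / (norm(N_i) * norm(plane_n))
        total += beta * cos_alpha
    return total / (2.0 * math.pi)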
Numerical Methods

Numerical methods are often used in form factor computation.
The hemicube method (4) is one of the first proposed
and most popular numerical methods. This method
computes the form factors F_{dS_i → S_j} from a differential patch
dS_i to all the patches S_j in the scene. In this method, a
virtual unit hemicube is set up around dS_i. The faces of the
hemicube are discretized into a number of rectangular
pixels (see Fig. 7 for an illustration). The form factor ΔF_q
from dS_i to each pixel q on the virtual hemicube is
precomputed analytically. All the patches of the scene are
scan converted with visibility resolution onto the five faces
of the hemicube. F_{dS_i → S_j} is finally computed by summing
up the ΔF_q's of all the pixels on the hemicube that are
occupied by the patch S_j.
This method can take advantage of hardware Z-buffer
rendering to speed up the form factor computation;
Z-buffering handles the visibility of the patches.
The patch-to-patch form factor F_{i→j} is either approximated
as F_{dS_i → S_j} computed from the center of the
patch S_i or approximated as the average of the F_{dS_i → S_j}'s at
the vertices of the patch S_i.
The ΔF_q's are precomputed analytically using the equations
given below. For a pixel q on the top face of the hemicube, the
equation for ΔF_q is

ΔF_q = [cos φ_i cos φ_j / (π r_{ij}²)] ΔA_top = [1 / (π (x² + y² + 1)²)] ΔA_top   (16)
Figure 6. dS_i and S_j.
In this equation, ΔA_top is the area of the pixel. Given the
local coordinate system set up as shown in Fig. 8, the
coordinate of the center of the pixel is (x, y, 1), and we also
have cos φ_i = cos φ_j = 1 / r_{ij} and r_{ij} = sqrt(1 + x² + y²).
For a pixel on the left side face of the hemicube, ΔF_q is
computed as

ΔF_q = [z / (π (1 + y² + z²)²)] ΔA_side   (17)
where (1, y, z) is the coordinate of the center of the pixel
and ΔA_side is the area of the pixel. The equations for pixels
on the other side faces are similar to Equation (17), except
for the denominator, which depends on the coordinates of the
pixel.
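The per-pixel factors of Equations (16) and (17) are cheap to tabulate once for a given hemicube resolution. The sketch below is an illustrative example with our own function names; it assumes a hemicube whose top face spans x, y in [-1, 1] at height z = 1.

# Minimal sketch of the hemicube delta form factors, Equations (16) and (17).
import math

def delta_ff_top(x, y, dA):
    """Delta form factor of a top-face pixel centered at (x, y, 1)."""
    return dA / (math.pi * (x * x + y * y + 1.0) ** 2)

def delta_ff_side(y, z, dA):
    """Delta form factor of a side-face pixel centered at (1, y, z), z >= 0."""
    return z * dA / (math.pi * (1.0 + y * y + z * z) ** 2)

def hemicube_top_table(res):
    """Tabulate delta form factors for a res x res top face; each pixel has area (2/res)^2."""
    dA = (2.0 / res) ** 2
    table = []
    for i in range(res):
        x = -1.0 + (i + 0.5) * 2.0 / res          # pixel center coordinate in [-1, 1]
        row = [delta_ff_top(x, -1.0 + (j + 0.5) * 2.0 / res, dA) for j in range(res)]
        table.append(row)
    return table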
EXTENSIONS TO CLASSIC RADIOSITY

Classic radiosity methods assume that the radiosity is
uniform over the area of each surface. This condition is rarely
true over the larger surfaces of a scene, so the very first step
of the radiosity method is scene discretization. Regular
discretization of scene surfaces causes light and shadow
leakage at discontinuities due to shadow boundaries and
touching objects. A discontinuity meshing technique (5) has
been proposed to address this issue. This technique identifies
discontinuities and splits the surfaces along the discontinuities
first. The split surfaces are further discretized
into smaller patches.
The radiosity computed for the patches of a surface gives a
discrete approximation of a smooth radiosity function over
the surface. Direct rendering of such a function gives a
faceted appearance to the surface. Hence, a smooth reconstruction
of the radiosity function over the surface is desirable
before it is used for display. A commonly used
reconstruction method is to first compute a weighted
average radiosity at the vertices of the scene using the radiosities
of adjacent patches, with the relative areas of the
patches as weights. The radiosity at any point on a patch is then
interpolated from the vertices of the patch, giving a smooth
appearance to the scene surfaces.
The assumptions imposed in the radiosity formulation often
limit the application of the classic radiosity method. The
assumptions are: Lambertian surfaces, uniform light distribution
over each patch, planar patches, a nonparticipating
medium, and so on. There has been much research aimed at
overcoming these limitations. Wavelet (6) and finite element
methods (7) have been proposed to remove the uniform light
distribution assumption and thus to reduce the amount of
discretization. The hierarchical radiosity method (8,9) has
been proposed to accelerate the computation by reducing
the number of form factor computations with very little loss
of accuracy. Extensions to support nondiffuse surfaces (10),
textured surfaces (11), curved surfaces (12), nondiffuse
light sources (13), participating media (14), furry surfaces
(15), and fractals (16) have also been proposed.
The radiosity method will remain the primary physically
based method for computing realistic lighting in synthetic
scenes. More than 15 years of research has been devoted to
solving problems related to this method. Findings of this
research are used regularly in many rendering software
solutions developed for commercial and noncommercial
use. Despite so many years of research effort, scene
discretization and the extension to nondiffuse environments
remain hard problems.
BIBLIOGRAPHY

1. C. M. Goral, K. E. Torrance, D. P. Greenberg, and B. Battaile,
Modeling the interaction of light between diffuse surfaces,
Comput. Graph., 18(3): 213-222, 1984.
2. M. Cohen, S. E. Chen, J. R. Wallace, and D. P. Greenberg, A
progressive refinement approach to fast radiosity image generation,
Comput. Graph., 22(4): 75-84, 1988.
3. T. Nishita and E. Nakamae, Continuous tone representation of
three-dimensional objects taking account of shadows and interreflections,
Comput. Graph., 19(3): 23-30, 1985.
4. M. Cohen and D. P. Greenberg, The hemi-cube: A radiosity
solution for complex environments, Comput. Graph., 19(3):
31-40, 1985.
Figure 7. Hemicube.
Figure 8. Coordinate system of ΔF_q.
5. D. Lischinski, F. Tampieri, and D. P. Greenberg, Combining
hierarchical radiosity and discontinuity meshing, ACM SIGGRAPH
93 Proc., 1993, pp. 199-208.
6. F. Cuny, L. Alonso, and N. Holzschuch, A novel approach
makes higher order wavelet really efficient for radiosity,
Comput. Graph. Forum (Proc. of Eurographics 2000), 19(3):
99-108, 2000.
7. R. Troutman and N. L. Max, Radiosity algorithm using higher
order finite element methods, ACM SIGGRAPH 1993 Proc.,
1993, pp. 209-212.
8. M. Cohen, D. P. Greenberg, D. S. Immel, and P. J. Brock, An
efficient radiosity approach for realistic image synthesis, IEEE
Comput. Graph. Appl., 6(3): 26-35, 1986.
9. P. Hanrahan, D. Salzman, and L. Aupperle, A rapid hierarchical
radiosity algorithm, Comput. Graph., 25(4): 197-206, 1991.
10. J. R. Wallace, M. F. Cohen, and D. P. Greenberg, A two-pass
solution to the rendering equation: A synthesis of ray tracing
and radiosity methods, Comput. Graph., 21(4): 311-320, 1987.
11. H. Chen and E. H. Wu, An efficient radiosity solution for bump
texture generation, Comput. Graph., 24(4): 125-134, 1990.
12. H. Bao and Q. Peng, A progressive radiosity algorithm for
scenes containing curved surfaces, Comput. Graph. Forum
(Eurographics 93), 12(3): C399-C408, 1993.
13. E. Languenou and P. Tellier, Including physical light sources
and daylight in global illumination, Third Eurographics Workshop
on Rendering, pp. 217-225, 1992.
14. H. E. Rushmeier and K. E. Torrance, The zonal method for
calculating light intensities in the presence of a participating
medium, Comput. Graph., 21(4): 293-302, 1987.
15. H. Chen and E. Wu, Radiosity for furry surfaces, Proc. of
EUROGRAPHICS 91, in F. H. Post and W. Barth (eds.),
North-Holland: Elsevier Science Publishers B.V., 1991, pp.
447-457.
16. E. Wu, A radiosity solution for illumination of random fractal
surfaces, J. Visualization Comput. Animation, 6(4): 219-229,
1995.
FURTHER READING

I. Ashdown, Radiosity: A Programmer's Perspective, New York:
John Wiley & Sons, Inc., 1994.
M. F. Cohen and J. R. Wallace, Radiosity and Realistic Image
Synthesis, Boston, MA: Academic Press Professional, 1993.
F. X. Sillion, Radiosity and Global Illumination, San Francisco,
CA: Morgan Kaufmann Publishers, 1994.
RUIFENG XU
SUMANTA N. PATTANAIK
University of Central Florida
Orlando, Florida
RENDERING
INTRODUCTION

In the real world, light sources emit photons that normally
travel in straight lines until they interact with a surface or a
volume. When a photon encounters a surface, it may be
absorbed, reflected, or transmitted. Some of these
photons may hit the retina of an observer, where they are
converted into a signal that is then processed by the brain,
thus forming an image. Similarly, photons may be caught
by the sensor of a camera. In either case, the image is a 2-D
representation of the environment.
The formation of an image as a result of photons interacting
with a 3-D environment may be simulated on the
computer. The environment is then replaced by a 3-D
geometric model, and the interaction of light with this model
is simulated with one of a large number of algorithms. The
process of image synthesis by simulating light behavior is
called rendering.
As long as the environment is not altered, the interaction
of light and surfaces gives rise to a distribution of light in the
scene that is in equilibrium (i.e., the environment does not
get lighter or darker).
As all rendering algorithms model the same process, it is
possible to summarize most rendering algorithms with a
single equation, which is known as the rendering equation.
The underlying principle is that each point x on a surface
receives light from the environment. The light that falls on
a point on a surface may be coming directly from a light
source, or it may have been reflected one or more times by
other surfaces.
Considering a point x on some surface that receives light
from all directions, the material of the surface determines
how much of this light is reflected, and in which directions.
The reflective properties of a surface are also dependent on
wavelength, which gives each surface its distinctive color. A
material may therefore be modeled using a function that
describes how much light incident on a point on a surface is
reflected for each incoming and each outgoing direction.
Such functions are generally known as bidirectional reflectance
distribution functions (BRDFs) and are denoted here
as f_r(x, Θ_i, Θ_o). This function depends on the position x
on the surface, as well as on the angle of incidence Θ_i and the
outgoing direction Θ_o.
To determine how much light a surface reflects into a
particular direction, we can multiply the BRDF for each
angle of incidence with the amount of incident light L_i(x, Θ_i).

L(x, Θ_o) = L_e(x, Θ_o)
    + Σ_L ∫_{x_i ∈ L} v(x, x_i) f_{r,d}(x) L_e(x_i, Θ_i) cos Θ_i dω_i
    + ∫_{Θ_s ∈ Ω_s} f_{r,s}(x, Θ_s, Θ_o) L(x_s, Θ_s) cos Θ_s dω_s
    + ρ_d(x) L_a(x)
The four terms on the right-hand side are the emission term,
followed by a summation of samples shot toward the light
sources. The visibility term v(x, x_i) is 1 if position x_i on the
light source is visible from point x and 0 otherwise. The
integration in the second term is over all possible positions
on each light source. The third term accounts for specular
reflections, and the fourth term is the ambient term, which is
added to account for everything that is not sampled directly.
Thus, in ray tracing, only the most important directions
are sampled, namely the contributions of the light sources
and mirror reflections, which represents a vast reduction in
computational complexity over a full evaluation of the
rendering equation, albeit at the cost of a modest loss of
visual quality. Finally, ray tracing algorithms can now run
at interactive rates and, under limited circumstances, even
in real time.
RADIOSITY

Both ray tracing and the local illumination models discussed
earlier are viewpoint-dependent techniques.
Thus, for each frame, all illumination will be recomputed,
which is desirable for viewpoint-dependent effects, such as
specular reflection. However, diffuse reflection does not
visibly alter if a different viewpoint is chosen. It is therefore
possible to preprocess the environment to compute the light
interaction between diffuse surfaces, which may be
achieved by employing a radiosity algorithm (Fig. 4). The
result can then be used to create an image, for instance, by
ray tracing or with a projective algorithm.
The surfaces in the environment are first subdivided
into small patches. For computational efficiency, it is normally
assumed that the light distribution over each patch is
constant. For small enough patches, this assumption is fair.
Each patch can receive light from other patches and then
diffusely reflect it. A common way to implement radiosity is
to select the patch with the most energy and distribute this
energy over all other patches, which then gain some energy.
This process is repeated until convergence is reached. As all
patches are assumed to be diffuse reflectors, the rendering
equation [Equation (1)] can be simplified so that it does not
include the dependence on the outgoing direction Θ_o:
L(x) = L_e(x) + ρ_d(x) ∫_{x'} L(x') [cos Θ_i cos Θ'_o / (π ||x' − x||²)] v(x, x') dA'
where v(x, x') is the visibility term, as before. This equation
models light interaction between pairs of points in the
environment. As radiosity operates on uniform patches
rather than on points, this equation can be rewritten to
include a form factor F_{ij}, which approximates the fraction
of energy leaving one patch and reaching another patch:
L_i = L_e^i + ρ_d^i Σ_j L_j F_{ij}

F_{ij} = (1 / A_i) ∫_{A_i} ∫_{A_j} [cos Θ_i cos Θ_j / (π r²)] d_{ij} dA_j dA_i
where the visibility term v between points is replaced with
d_{ij}, which denotes the visibility between patches i and j.
The form factor depends on the distance between the two
patches, as well as on their spatial orientation with respect to
one another (Fig. 5). In practice, the computation of form
factors is achieved by ray casting.
The radiosity algorithm can be used to model diffuse
inter-reflection, which accounts for visual effects such as
color bleeding (the colored glow that a surface takes on when
near a bright surface of a different color, as shown in Fig. 6).
Figure 4. An early example of a scene preprocessed by a radiosity
algorithm. Image courtesy of Michael Cohen (2).
Figure 5. Geometric relationship between two patches.
MONTE CARLO SAMPLING

The integral in the rendering equation [Equation (1)] may be
evaluated numerically. However, as both the domain Ω_i and
the integrand [Equation (1)] are complex functions, a very
large number of samples would be required to obtain an
accurate estimate. To make this sampling process more
efficient, a stochastic process called Monte Carlo sampling
may be employed. The environment is then sampled randomly
according to a probability density function (pdf) p(ω_i):
∫_{Ω_i} g(ω_i, Θ_i, Θ_o) dω_i ≈ (1 / N) Σ_{i=1}^{N} g(ω_i, Θ_i, Θ_o) / p(ω_i)
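The estimator above is straightforward to implement. The sketch below is illustrative only; the integrand g and the cosine-weighted pdf are our example choices, not from the article. It estimates a hemispherical integral by drawing N samples from the pdf and averaging g/p.

# Minimal Monte Carlo sketch of the estimator (1/N) * sum g(w_i) / p(w_i).
import math
import random

def sample_cosine_weighted():
    """Draw a direction from the cosine-weighted pdf p(w) = cos(theta) / pi."""
    u1, u2 = random.random(), random.random()
    theta = math.asin(math.sqrt(u1))            # cosine-weighted polar angle
    phi = 2.0 * math.pi * u2
    return theta, phi

def estimate_integral(g, n_samples=10000):
    """Estimate the hemispherical integral of g(theta, phi) over solid angle."""
    total = 0.0
    for _ in range(n_samples):
        theta, phi = sample_cosine_weighted()
        p = math.cos(theta) / math.pi           # pdf value of the drawn sample
        total += g(theta, phi) / p
    return total / n_samples

# Example: integrating g = cos(theta) over the hemisphere should give about pi.
print(estimate_integral(lambda th, ph: math.cos(th)))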
UZ AKYU
Z
University of Central Florida
Orlando, Florida
SOLID MODELING
Solid modeling is the technique for representing and manipulating
a complete description of physical objects. A complete
description of a solid object is a representation of the
object that is sufficient for answering geometric queries
algorithmically. This requires a mathematical formulation
that defines rigorously the characteristics of a solid object.
Based on this mathematical formulation, different schemes
for representing physical objects are developed.

MATHEMATICAL FORMULATION

A solid is described using the concept of point-set topology,
whereas the boundary of a solid is characterized by using
the concept of algebraic topology, as discussed below. A solid
is a closed point set in the three-dimensional Euclidean
space E³. For example, a unit cube is denoted as
S = {p : p = (x, y, z), such that x ∈ [0, 1], y ∈ [0, 1], z ∈ [0, 1]}. A point
set in E³ denoting a solid is rigid and regular (1). A point
set is rigid, which implies that it remains the same when
being moved from one location to another. A point set is
regular, which implies that the point set does not contain
isolated points, edges, or faces with no material around
them. To ensure a point set is regular, a regularization
process is applied to the point set as defined below.
Definition 1. Regularization of a Point Set. Given a point
set S, the regularization of S is defined as r(S) = c(i(S)),
where c(S) and i(S) are, respectively, the closure and interior
of S.
The regularization process discards isolated parts of a
point set, which is then enclosed with a tight boundary,
resulting in a regular point set. A point set S is regular if
r(S) = S, and a regular set S is usually referred to as an r-set.
BOOLEAN OPERATIONS ON POINT SETS

Boolean operations on r-sets may result in a nonregular
point set. For example, if A = {p : p = (x, y, z), such that x ∈ [0, 1],
y ∈ [0, 1], z ∈ [0, 1]} and B = {p : p = (x, y, z), such that
x ∈ [1, 2], y ∈ [1, 2], z ∈ [1, 2]}, then

A ∩ B = {p : p = (x, y, z), such that x ∈ [0, 1] ∩ [1, 2], y ∈ [0, 1] ∩ [1, 2], z ∈ [0, 1] ∩ [1, 2]}

or

A ∩ B = {p : p = (x, y, z), such that x = 1, y = 1, z = 1}.

A ∩ B is thus a single point and is not regular. To ensure that the
results of Boolean operations on point sets are regular point
sets, the concept of regularized Boolean operations is
adopted. In the above example, A ∩ B = {(1, 1, 1)}; the interior
of A ∩ B is empty, i.e., i(A ∩ B) = ∅. As the closure of ∅ is
empty, applying regularization to A ∩ B gives r(A ∩ B) =
c(i(A ∩ B)) = ∅. Based on the concept of regularization of
point sets, the regularized Boolean operations are defined as
follows.
Definition 2. Regularized Boolean Operations. Denote
∪*, ∩*, and −* as the regularized union, intersection, and
difference, respectively:

A ∪* B = c(i(A ∪ B))   (1a)
A ∩* B = c(i(A ∩ B))   (1b)
A −* B = c(i(A − B))   (1c)

A point set describing a solid must be finite and well
defined. The surfaces of a solid are thus restricted to be
algebraic or analytic surfaces.
THE BOUNDARY OF A SOLID

The boundary of a solid is a collection of faces connected to
form a closed skin of the solid. The concept of algebraic
topology (2) is usually adopted for characterizing the
boundary of a solid.
The surface composing the boundary of a solid is a
2-manifold. A 2-manifold
is a topological space in which the neighborhood of a point
on the surface is topologically equivalent to an open disk of
E². However, there are cases in which the boundary of a solid
is not a 2-manifold, as shown in Fig. 1. In addition, some
2-manifolds do not constitute a valid object in E³ (e.g., the
Klein bottle). To ensure that the boundary of an object encloses
a valid solid, the faces of the boundary must be orientable
with no self-intersection. An object is invalid (or is a
nonmanifold object) if its boundary does not satisfy the
Euler–Poincare characteristic as described below.
THE EULER–POINCARE CHARACTERISTIC

The Euler–Poincare characteristic states that an object is
a nonmanifold object if its boundary does not satisfy the
equation

v − e + f = 2(s − h) + r

where v is the number of vertices, e the number of edges, f the
number of faces, s the number of shells, h the number of through
holes, and r the number of rings.
The Euler–Poincare formula is a necessary but not a
sufficient condition for a valid solid. An object whose
boundary does not satisfy the Euler–Poincare formula
is an invalid solid. On the contrary, an object satisfying the
Euler–Poincare formula may not be a valid solid. Figure 2
illustrates an invalid solid that satisfies the Euler–Poincare
formula.
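As a quick sanity check of the formula (a minimal illustration; the function name and the example counts are ours), a cube has v = 8, e = 12, f = 6, one shell, and no holes or rings, and indeed 8 − 12 + 6 = 2(1 − 0) + 0.

# Minimal sketch: test whether entity counts satisfy v - e + f = 2(s - h) + r.
def satisfies_euler_poincare(v, e, f, s, h, r):
    return v - e + f == 2 * (s - h) + r

print(satisfies_euler_poincare(8, 12, 6, 1, 0, 0))    # cube -> True
print(satisfies_euler_poincare(16, 24, 10, 1, 1, 2))  # block with a square through hole -> True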
REPRESENTATION SCHEMES

Denote M as the modeling space of solids; that is, a solid
X is an element of M. A solid representation (1) of an object
is a collection of symbols specifying the solid object. There
are various representation schemes for modeling solids
defined as point sets in E³. In general, the properties of a
solid representation scheme can be described as follows:

Geometric coverage: the objects that can be described
using the representation scheme.
Validity: a representation scheme must designate a
valid solid (Fig. 3).
Completeness: a representation must provide enough
data for any geometric computation. For example,
given a point p, an algorithm must exist for deciding if
p is in, on, or out of the solid.
Uniqueness: a representation scheme is unique if
there is only one representation for a given solid
object.
Unambiguousness: a representation scheme is unambiguous
if a representation designates exactly one solid
object.
Conciseness: the space (computer storage) required for
a valid representation.
Closure of operations: whether operations on solids
preserve the validity of the representation.
Computational requirements and applicability: the
algorithms that can be applied to the representation
scheme and the complexity of these algorithms.

Among the various solid representation schemes, the
constructive solid geometry and the boundary representation
are the most popular schemes, as discussed in the
following.
CONSTRUCTIVE SOLID GEOMETRY (CSG)

Constructive solid models consider solids as point sets of
E³. The basic elements of a CSG model are simple point
sets that can be represented as simple half-spaces. Given
a mapping f : E³ → R that maps the Euclidean space to the
real axis, the function f(p), where p is a point in E³, divides
the three-dimensional space into two halves: the
space f(p) > 0 and its complement f(p) ≤ 0. More complex
objects are obtained by combining half-spaces with Boolean
operations. Figure 4 shows a cylinder constructed with
three half-spaces. An object modeled with constructive
solid geometry can be represented with a CSG tree. A
CSG tree is a binary tree structure with half-spaces or
primitive solids at the leaf nodes. Figure 5 shows a set of
solid primitives commonly used in a CSG modeler. All
internal nodes are Boolean operations or transformations.
An example is shown in Figure 6, where an L-shaped block
is constructed with a CSG modeler. Figure 7 gives an
example of modeling a more complicated object using a
CSG modeler.
Figure 1. Nonmanifold object.
Figure 2. An invalid solid satisfying the Euler characteristic (v = 10, e = 15, f = 7).
Figure 3. An invalid solid.
Figure 4. A cylinder constructed with three half-spaces.

POINT MEMBERSHIP CLASSIFICATION
AND BOOLEAN OPERATIONS

A basic characteristic of a solid representation scheme is to
be capable of providing sufficient information for deciding
if a given point p is in, on, or out of a solid using a suitable
point membership classification (PMC) (3,4) algorithm.
Using a CSG representation scheme, point membership
classification is performed with a divide-and-conquer
approach. Starting from the root node of the CSG tree,
the point p is passed down the tree. Whenever a Boolean
operation node is encountered, p is passed to both the left
and right subnodes of the current node. If a transformation
node is encountered, the inverse transformation is applied
to p and the result is passed to the left subnode. Whenever a
leaf node is encountered, the point (possibly transformed)
is classified against the half-space or primitive solid of the
node. The result is propagated upward to the parent node.
In this upward propagation, the results of the left and right
subtrees of a Boolean operation node are combined. The
upward propagation ends at the root node, where the result
obtained is the result of classifying p against the whole
object.
Combining the PMC results at a Boolean operation node
requires special consideration. Assume R_L and R_R are the
PMC results of the left and right subtrees, respectively. If
R_L ≠ on or R_R ≠ on, then the result of R_L .op. R_R (where
op ∈ {∪*, ∩*, −*}) can be obtained easily according to Table 1.
In all three cases, if a point is on both A and B, the classification
result is undetermined. For instance, in Fig. 8, the
point p is on both A and B. However, p may be in or on A ∪ B
depending on the relative position of A and B. To obtain an
accurate classification result, information regarding the
local geometry of the solids in the vicinity of the point is
required for the classification.
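The divide-and-conquer classification maps naturally onto a recursive tree walk. The following sketch is an illustrative example, not from the article; the node classes, the sign convention f(p) < 0 for "inside", and the combine table are our assumptions, and the ambiguous on/on case of Table 1 is left unresolved.

# Minimal PMC sketch over a CSG tree of half-spaces.
IN, ON, OUT = "in", "on", "out"

# Combine table for a regularized union node, following Table 1; None marks the
# ambiguous on/on case that needs a neighborhood test to resolve.
UNION_TABLE = {(IN, IN): IN, (IN, ON): IN, (IN, OUT): IN,
               (ON, IN): IN, (ON, ON): None, (ON, OUT): ON,
               (OUT, IN): IN, (OUT, ON): ON, (OUT, OUT): OUT}

class HalfSpace:
    """Leaf node: classify a point against the surface f(p) = 0."""
    def __init__(self, f, eps=1e-9):
        self.f, self.eps = f, eps
    def classify(self, p):
        v = self.f(p)
        return ON if abs(v) < self.eps else (IN if v < 0 else OUT)

class Union:
    """Internal node: combine the classifications of the two subtrees."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def classify(self, p):
        combined = UNION_TABLE[(self.left.classify(p), self.right.classify(p))]
        if combined is None:
            raise ValueError("on/on case: neighborhood information required")
        return combined

# Usage: the union of two unit spheres centered at (0, 0, 0) and (1, 0, 0).
sphere_a = HalfSpace(lambda p: p[0]**2 + p[1]**2 + p[2]**2 - 1.0)
sphere_b = HalfSpace(lambda p: (p[0]-1.0)**2 + p[1]**2 + p[2]**2 - 1.0)
solid = Union(sphere_a, sphere_b)
print(solid.classify((0.5, 0.0, 0.0)))   # -> "in"
print(solid.classify((3.0, 0.0, 0.0)))   # -> "out"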
Figure 5. Commonly used solid primitives.
Figure 6. The CSG tree of an L-shaped block.
Figure 7. Object modeling with constructive solid geometry.

Table 1. Classifications in Boolean operations

A ∪* B:    B = in   B = on   B = out
A = in      in       in       in
A = on      in       ?        on
A = out     in       on       out

A ∩* B:    B = in   B = on   B = out
A = in      in       on       out
A = on      on       ?        out
A = out     out      out      out

A −* B:    B = in   B = on   B = out
A = in      out      on       in
A = on      out      ?        on
A = out     out      out      out

NEIGHBORHOOD

The neighborhood N(p, S) of a point p = (x, y, z) with
respect to a solid S is the intersection of an open ball
with S, where the open ball is a sphere with an infinitesimal
radius ε centered at p; i.e.,

N(p, S) = S ∩ {(x', y', z') : (x' − x)² + (y' − y)² + (z' − z)² < ε²}   (2)
Using the notion of neighborhood, the following three cases
can be established (Fig. 9):
1. A point p is in S iff N(p, S) is a full ball.
2. A point p is out of S iff N(p, S) is empty.
3. A point p is on S iff N(p, S) is not full and not empty.
If p lies in the interior of a face, its neighborhood can
be represented by the equation of the surface and is
oriented such that the face normal points to the exterior
of S (Fig. 10a). If p lies in the interior of an edge, the
neighborhood is represented by a set of sectors in the plane
containing p that is perpendicular to the edge at p
(Fig. 10b). A vertex neighborhood is inferred from the set
of face normals of the faces incident at p (Fig. 10c). In a
Boolean operation between two objects A and B, the neighborhoods
of a point p relative to A and B are combined.
Whether the combined neighborhood is full, empty, or neither
full nor empty determines whether p is in, out of, or on the
combined solid. In addition, by interpreting whether the
neighborhood is a face neighborhood, edge neighborhood, or
vertex neighborhood, the point p can be classified as
lying on a face, lying on an edge, or being a vertex. For example, at a
union node, if the neighborhoods N_L and N_R of p relative to both
the left and the right subtrees are face neighborhoods with
face normals n_L and n_R, respectively, then the union N of
N_L and N_R is an edge neighborhood when |n_L · n_R| ≠ 1. If
n_L · n_R = 1, then N is a face neighborhood. If n_L · n_R = −1, N
is full (Fig. 11). If N_L and N_R are edge neighborhoods,
the union of N_L and N_R may be a vertex neighborhood,
an edge neighborhood, a face neighborhood, or it is full
(Fig. 12).
Figure 8. A point lying on both A and B may be on or in A ∪ B.
Figure 9. Three possible cases of the neighborhood at a point.
Figure 10. Representation of a neighborhood.
Figure 11. Combining face neighborhoods at a union node.
Figure 12. The union of edge neighborhoods.
BOUNDARY EVALUATION

A CSG representation provides a concise description of a
solid. However, it contains no explicit information for display
purposes. To obtain this explicit information, a boundary
evaluation process is required, which classifies all
edges and faces of all primitive solids (or half-spaces)
against the CSG representation of the solid. This is
performed with the edge-solid classification and the face-solid
classification algorithms.
In the edge-solid classification process, an edge L is
partitioned into segments by intersecting L with all
primitives of the solid. These segments are then classified
against the given solid. A simple approach for classifying
segments is to apply a point membership classification at
the midpoint of each segment. Segments that are classified
as lying on the solid are combined to give edges of the
solid (Fig. 13).
In the face-solid classification process, a face is intersected
with all primitives of the solid. The intersection
edges and the boundary edges of the face are then classified
against the given solid using the edge-solid classification
algorithm. Edges lying on the solid are then connected to
form the boundaries of the face (Fig. 14).
PROPERTIES OF CSG MODELS

The geometric coverage of a CSG modeler depends on the
types of primitive solids and half-spaces used. Provided all
primitives are r-sets, and regularized set operations are
adopted for the construction of more complex solids, CSG
models are always valid solids. CSG models are unambiguous
but not unique. A CSG model can be described
precisely using a binary tree; CSG representations are
thus relatively concise. Algorithms for manipulating
CSG models (e.g., boundary evaluation) may be computationally
expensive. With suitable algorithms, CSG models
can be converted to other types of solid representations.
BOUNDARY REPRESENTATION

A boundary representation (B-Rep) of a solid describes the
solid by its boundary information. The boundary information in
a B-Rep includes the geometric information and the topological
relationships between the entities of the boundary. Geometric
information refers to the vertex positions and the surface
and curve equations of the bounding faces and
edges. Topological relationships refer to the relationships
among the faces, edges, and vertices so that they conform
to a closed volume. For instance, the boundary of an object
can be modeled as a collection of faces that forms the
complete skin of the object. Each face is bounded by a set
of edge loops, and each edge is, in general, bounded by two
vertices. The boundary of an object can thus be described as
a hierarchy of faces and edges, as depicted in Fig. 15.
B-REP DATA STRUCTURE
Among the various data structures for implementing
B-Rep, the most popular one is the winged-edge structure
(5), which describes a manifold object with three tables of
vertices, faces, and edges. In the winged-edge structure,
each face is bounded by a set of disjoint edge loops or cycles.
Each vertex is shared by a circularly ordered set of edges.
For each edge, there are two vertices bounding the edge,
which also defines the direction of the edge. There are two
faces, the left and the right adjacent faces, sharing an edge.
There are two edge cycles sharing an edge. One edge cycle is
referred to as being in the clockwise direction, and the other is in
the counterclockwise direction. The edge cycle on a face is
always arranged in a clockwise direction as viewed from
outside of the solid. In each direction, there is a preceding
and a succeeding edge associated with the given edge.
Figure 16 illustrates the relations among vertices, edges,
and faces in a winged-edge structure. Table 2 gives an
example of a winged-edge representation of the tetrahedron
shown in Fig. 17.
Figure 13. Edge-solid classification.
Figure 14. Face-solid classification.
Figure 15. Boundary data: (a) A cube and (b) boundary data of a cube.
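A winged-edge record can be captured by a small set of fields per edge. The sketch below is illustrative only; the field names are ours, and only the edge table is shown, not the vertex and face tables. It mirrors the layout of Fig. 16 and Table 2.

# Minimal sketch of a winged-edge record (names are illustrative).
from dataclasses import dataclass

@dataclass
class WingedEdge:
    start: int        # index of the start vertex
    end: int          # index of the end vertex
    cw_face: int      # face whose clockwise cycle uses this edge
    ccw_face: int     # face whose counterclockwise cycle uses this edge
    cw_pred: int      # preceding edge in the clockwise cycle
    cw_succ: int      # succeeding edge in the clockwise cycle
    ccw_pred: int     # preceding edge in the counterclockwise cycle
    ccw_succ: int     # succeeding edge in the counterclockwise cycle

# Example: edge e1 of the tetrahedron in Table 2 (vertices v1-v2, faces f1/f2,
# wings e3, e2, e4, e5), using 1-based indices as in the table.
e1 = WingedEdge(start=1, end=2, cw_face=1, ccw_face=2,
                cw_pred=3, cw_succ=2, ccw_pred=4, ccw_succ=5)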
VALIDITY OF B-REP SOLID AND EULER OPERATORS

Given a boundary representation specifying a valid solid,
the set of faces of the boundary model forms the complete skin
of the solid with no missing parts. Faces of the model do not
intersect each other except at common vertices or edges.
Face boundaries are simple polygons or closed contours
that do not intersect themselves. To avoid the construction
of an invalid B-Rep solid, Euler operators (4,5) are usually
employed in the object construction process. Euler operators
keep track of the number of different topological entities
in a model such that it satisfies the Euler–Poincare
characteristic. Euler operators are functions defined as
strings of symbols from the character set {M, K, S, J, V, E,
F, S, H, R}. Each of the symbols M, K, S, and J denotes an
operation on the topological entities V, E, F, S, H, and R.
Descriptions of these symbols are listed below:

M = make, K = kill, S = split, J = join,
V = vertex, E = edge, F = face, S = solid, H = hole, R = ring.

For example, the operator MEF denotes a function to
Make an Edge and a Face. As shown in Fig. 18, inserting an
edge in the model also creates another face at the same
time. In Fig. 19, the operator KEMR removes an edge while
creating a ring in the object. The use of Euler operators
always produces topologically valid boundary data. However,
an object with a topologically valid boundary may be
geometrically invalid. As shown in Fig. 20, moving the
vertex p of a valid solid upward causes the interior of the
object to be exposed, and the object thus becomes invalid.
Figure 19. The KEMR operator.
Figure 16. The winged-edge data structure.
Table 2. Winged-edge data

Edge   Vertices      Faces         CW cycle         CCW cycle
       Start  End    CW    CCW     Pred    Succ     Pred    Succ
e1     v1     v2     f1    f2      e3      e2       e4      e5
e2     v2     v3     f1    f4      e1      e3       e6      e4
e3     v3     v1     f1    f3      e2      e1       e5      e6
e4     v1     v4     f2    f3      e5      e1       e2      e6
e5     v4     v2     f2    f4      e1      e4       e6      e3
e6     v4     v3     f4    f3      e3      e5       e4      e2

Figure 17. A tetrahedron.
Figure 18. The MEF operator.
PROPERTIES OF B-REP

The geometric coverage of the boundary representation is
determined by the types of surfaces being used. Early work
in boundary representation was mainly confined to the
manipulation of polyhedral solid models (6). Contemporary
B-Rep modelers adopt trimmed NURBS surfaces for
describing the geometry of an object boundary, which allows
free-form solids to be modeled. B-Rep is unambiguous but
not unique; that is, one representation always refers to one
solid, but there may be more than one representation for
a given object. For example, a cube can be represented by
6 squares, or it can be represented by 12 triangles. As
vertex, edge, and face information is readily available,
fast interactive display of B-Rep solids can be achieved.
However, a complex data structure is required for maintaining
a B-Rep. Nevertheless, B-Rep is attractive as it
allows local modifications of objects. A local modification
refers to a change in the geometry of an object. For instance,
a change in the position of a vertex of a cube causes a planar
face of the cube to become a bilinear surface (Fig. 21).
HYBRID REPRESENTATION

As there is no single representation scheme that is superior,
a hybrid representation is usually adopted. In the most popular
hybrid representation schemes, objects are represented
basically in B-Rep. Boolean operations are allowed for
constructing solids. In some cases, sweep operations (7)
are also adopted for constructing solids. A history of operations
is recorded, which is an extension of a CSG representation.
The history of operations allows the modeled object
to be modified by adjusting the locations or shape parameters
of the primitive solids. It also serves as the basis
for implementing the Undo operation. As a B-Rep exists
for each object, local modifications can be performed on the
solid, whereas set operations may also be applied to objects.
This provides a basis for the parametric or constraint-based
modelers (8), which are widely adopted in commercial CAD
systems.
OTHER REPRESENTATION SCHEMES

Besides the most popular CSG and B-Rep representation
schemes, other representation schemes help to widen the
geometric coverage or the scope of applications of solid
modelers. Schemes that widen the geometric coverage of a
CSG modeler include Sweep-CSG (7) and the Constructive
Shell Representation (9). The Sweep-CSG technique allows
objects created with sweep operations to be used as primitives
in a CSG modeler. An example is shown in Fig. 22,
where a bevel gear is created by sweeping a planar contour
as shown in Fig. 22a. The sweep operation is an extrusion of
the contour, during which the contour is rotated and
scaled. The Constructive Shell Representation adopts truncated
tetrahedra (trunctets) as basic primitives. A trunctet
is a tetrahedron truncated by a quadric surface. By
maintaining continuity between trunctets, free-form solids
can be modeled with a CSG representation (Fig. 23).
Figure 20. A topologically valid solid becomes geometrically invalid.
Figure 21. Local modification of a cube.
Figure 22. A bevel gear created with a sweep operation: (a) The
gear contour, the extrusion, and rotation axis for the sweep.
(b) Union of the swept solid, a cylinder, and a block.
Figure 23. Trunctet and the Constructive Shell Representation:
(a) A trunctet and (b) an object constructed with four smoothly
connected trunctets.
Applications such as stress and heat analyses
using the finite element method require an object to be
represented as a collection of connected solid elements. This
solid representation scheme is usually referred to as Cell
Decomposition (10). Spatial Enumeration and Octree (10)
are solid representation schemes widely used in applications
where only a coarse representation of an object is
required (e.g., collision detection). Although the most popular
B-Rep and CSG schemes may not be suitable for these
applications, algorithms exist for converting B-Rep and
CSG models into other representation schemes, and hence
they remain the primary representations in existing solid
modeling systems.

REMARKS

Solid modeling is a technique particularly suitable for
modeling physical objects and is widely adopted in most
commercial CAD systems. A product model built on top of a
solid model provides the necessary information for the
design analysis and production of the model. For instance,
the rapid prototyping process requires an object to have a
closed boundary for its processing. Manufacturing process
planning of a product requires identifying the manufacturing
features of the object, which can be obtained
from the CSG tree of the object or from the topological relationships
between the entities in a B-Rep model. The bills of
materials for a product require volumetric information
about the product, which is readily available in the solid model
of the product. A major limitation of solid modeling is its
complexity in terms of storage and the algorithms required
for its processing.
BIBLIOGRAPHY

1. A. G. Requicha, Representations of solid objects: theory,
methods, and systems, ACM Comput. Surv., 12(4): 437-464, 1980.
2. J. Mayer, Algebraic Topology, Englewood Cliffs, NJ: Prentice-Hall, 1972.
3. B. R. Tilove, Set membership classification: A unified
approach to geometric intersection problems, IEEE Trans.
Comput., 29(10): 874-883, 1980.
4. C. M. Hoffmann, Geometric and Solid Modeling: An Introduction,
San Francisco, CA: Morgan Kaufmann, 1989.
5. B. Baumgart, A polyhedron representation for computer
vision, Proc. National Computer Conference, 1975, pp. 589-596.
6. I. C. Braid, The synthesis of solids bounded by many faces,
Commun. ACM, 18(4): 209-216, 1975.
7. K. C. Hui and S. T. Tan, Display techniques and boundary
evaluation of a sweep-CSG modeller, Visual Comput., 8(1): 18-34, 1991.
8. J. J. Shah and M. Mantyla, Parametric and Feature Based
CAD/CAM, New York: John Wiley & Sons, Inc., 1995.
9. J. Menon and B. Guo, Free-form modeling in bilateral brep
and CSG representation schemes, Internat. J. Comput. Geom.
Appl., 8(5-6): 537-575, 1998.
10. M. Mantyla, An Introduction to Solid Modeling, Rockville,
MD: Computer Science Press, 1988.
K. C. HUI
The Chinese University
of Hong Kong
Shatin, Hong Kong
SURFACE DEFORMATION
Surface deformation is a fundamental tool for the construction
and animation of three-dimensional (3-D) models.
Deforming a surface means adjusting the shape of the surface, which
requires adjusting the parameters in the representation of
the surface. As the defining parameters of different types of
surfaces are different, different shape control techniques
are available for different surface types. Alternatively,
general deformation tools can be used for deforming surfaces.
However, the deformation effect depends on how
these tools are applied to the surfaces. In general, deformation
tools can be classified into geometry-based and physics-based
approaches. The geometry-based approach refers to
the deformation of surfaces by adjusting geometric handles
such as lattices and curves. The physics-based approach
associates material properties with a surface such that the
surface can be deformed in accordance with physical laws.
They are discussed in detail in the following sections.
REPRESENTATION AND DEFORMATION OF SURFACE

The early work on deformation techniques developed by
Alan Barr (1) uses a set of hierarchical transformations
(stretching, bending, twisting, and tapering) for deforming
an object. This technique determines the surface normal of
a deformed object by applying a transformation to the surface
normal vector of the undeformed surface. It requires a
representation of the surface for which the Jacobian matrix
can be determined. In fact, the representation of a surface
determines how the surface can be deformed. For surfaces
that are represented in terms of geometric entities such as
points and curves, deformation of the surface can be easily
achieved through adjusting the locations of the points or
modifying the shape of the curves. However, for surfaces
where there is no direct geometric interpretation of
the defining parameters, special techniques have to be
employed to associate geometric entities with the representation.
In order not to restrict the pattern of deformation of
a surface, free-form surfaces, or surfaces that can be used to
model arbitrary object shapes, are usually considered for
surface deformation.
Deforming Algebraic Surfaces

An algebraic surface f is defined by a mapping f: E³ → R.

w_0110 = (1/2)(v_3 − v_2) · n_2,
w_1010 = (1/2)(v_1 − v_3) · n_3,
w_1001 = (1/2)(v_4 − v_1) · n_1,   w_0101 = (1/2)(v_4 − v_2) · n_2,
w_0011 = (1/2)(v_4 − v_3) · n_3   (4)

The shape of an algebraic surface can thus be changed
by adjusting the locations of the control points on the
tetrahedron.
Deforming Tensor Product Surfaces

Parametric surfaces are usually represented with a set of
geometric entities. Consider a tensor product surface s(u, v)
expressed in terms of a set of control polygon vertices (e.g.,
Bezier surfaces and NURBS surfaces) (5):

s(u, v) = Σ_{i=0}^{n} Σ_{j=0}^{m} a_{i,k}(u) b_{j,l}(v) b_{i,j}   (5)
where a_{i,k}(u) and b_{j,l}(v) are blending functions, or basis functions,
associated with each control polygon vertex of the
surface, and k and l are the orders of the basis functions in the u
and v directions, respectively. The shape of the surface
s(u, v) can be easily modified by adjusting the locations of
the control polygon vertices b_{i,j}. For tensor product surfaces,
s(u, v) approximates the control points except at the
corners, where s(u, v) interpolates the corner vertices. By
subdividing s(u, v) into a set of C^{k−1} and C^{l−1} continuous
surface patches s_i(u, v) along the u and v directions, deformation
of s(u, v) can be controlled by adjusting the corner
vertices of the s_i(u, v). This allows points on the surface to be
used as handles for the deformation. However, special
consideration has to be taken to maintain the continuity
of the surface at these corner vertices (5). This is also true
for surfaces that are defined by their corner vertices and
tangents (e.g., parametric cubic surfaces); deformation of
such a surface can be achieved through adjusting the corner
vertices and tangents.
Trimming Tensor Product Surfaces

In general, adjusting the control polygon vertices or surface
points of a surface modifies the curvature of the surface.
Although the boundary of a surface can be adjusted by
modifying the control vertices on the boundary of the surface,
this may also affect the curvature of the surface. To adjust the
boundary of a surface without affecting the surface curvature,
the concept of a trimmed surface is usually adopted.
A trimmed surface is a surface with irregular boundaries.
Consider a region of a surface defined by a series of
trimming curves on a surface s(u, v), as shown in Fig. 2. A
popular approach for representing the trimming curves of a
surface is to specify the curves in the parametric space of
s(u, v). For example, in a trimmed NURBS surface, a
trimming curve is expressed as
c(t) = (u(t), v(t)) = Σ_{i=0}^{n} R_{i,k}(t) b_i   (6)

where b_i = (u_i, v_i) are points in the parametric space of s(u, v),

R_{i,k}(t) = w_i N_{i,k}(t) / Σ_{j=0}^{n} w_j N_{j,k}(t),

N_{i,k}(t) = [(t − t_i) N_{i,k−1}(t)] / (t_{i+k−1} − t_i) + [(t_{i+k} − t) N_{i+1,k−1}(t)] / (t_{i+k} − t_{i+1}),

N_{i,1}(t) = 1 if t_i ≤ t ≤ t_{i+1}, and 0 otherwise,
and w_i are the weights associated with the b_i. The trimmed
curve in 3-D space is thus s(u(t), v(t)). Trimming curves
are closed loops in the parametric space of s(u, v). To
identify the valid region of the trimmed surface, a popular
approach is to use the direction of a trimming curve loop
to indicate the region of the surface to be retained (6).
For instance, a counterclockwise loop denotes the outer
boundary of the surface, whereas a clockwise loop indicates
a hole in the surface.
Figure 1. An algebraic surface patch.
Figure 2. A trimmed surface and its trimming curves in the parametric space.
Deforming Subdivision and Mesh Surfaces

Subdivision surfaces and simple polygon meshes are also
popular choices for deformable surfaces, as they are defined
by a set of points b_{i,j} constituting the base mesh or polygon
mesh of the surface. Deformation can thus be achieved
through adjusting the locations of the b_{i,j}. However, manipulating
individual vertices of a polygon mesh is a tedious
process, especially for surfaces with a high mesh density.
Deformation tools are thus usually adopted for this purpose,
as discussed below.
GEOMETRY-BASED DEFORMATION TOOL

Popular tools for geometry-based deformation include
lattice-based deformation, curve-based deformation,
and constraint-based deformation.

Free-Form Deformation (FFD)

The free-form deformation technique introduced by Thomas
W. Sederberg and Scott R. Parry (7) deforms an object by
embedding the object in a parallelepiped region. Defining a
local coordinate system at a point p_0 in the parallelepiped
region, as shown in Fig. 3, a point p of the object can be
expressed as
p = p_0 + s l + t m + u n   (7a)

where l, m, n are the coordinate axes at p_0 and the
corresponding coordinates are

s = [(p − p_0) · l] / |l|²,   t = [(p − p_0) · m] / |m|²,   u = [(p − p_0) · n] / |n|²   (7b)
so that 0 ≤ s ≤ 1, 0 ≤ t ≤ 1, and 0 ≤ u ≤ 1 for p lying
in the parallelepiped. By imposing a grid of control points
q_{ijk} on the parallelepiped, where

q_{ijk} = p_0 + (i/l) l + (j/m) m + (k/n) n   (8)

and l, m, n are the dimensions of the grid in the l, m, n
directions, respectively, the point p can be expressed as
p = Σ_{i=0}^{l} Σ_{j=0}^{m} Σ_{k=0}^{n} C(l, i) C(m, j) C(n, k) (1 − s)^{l−i} s^i (1 − t)^{m−j} t^j (1 − u)^{n−k} u^k q_{ijk}   (9)

where C(l, i), C(m, j), and C(n, k) are binomial coefficients. A change
in the location of the control points q_{ijk} thus
causes the point p, and hence the object, to be deformed.
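The trivariate Bernstein form of Equation (9) can be evaluated directly. The sketch below is illustrative only; the function names and the nested-list layout of the control grid are our assumptions. It maps a local coordinate (s, t, u) to the deformed position.

# Minimal sketch of FFD evaluation, Equation (9); names are illustrative.
from math import comb

def bernstein(n, i, x):
    """Bernstein basis value C(n, i) * (1 - x)^(n - i) * x^i."""
    return comb(n, i) * (1.0 - x) ** (n - i) * x ** i

def ffd_point(q, s, t, u):
    """q[i][j][k] is control point (i, j, k) as an (x, y, z) tuple; returns the deformed p."""
    l, m, n = len(q) - 1, len(q[0]) - 1, len(q[0][0]) - 1
    p = [0.0, 0.0, 0.0]
    for i in range(l + 1):
        for j in range(m + 1):
            for k in range(n + 1):
                w = bernstein(l, i, s) * bernstein(m, j, t) * bernstein(n, k, u)
                for d in range(3):
                    p[d] += w * q[i][j][k][d]
    return tuple(p)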
Denote by F_1 the mapping F_1: E³ → R³ such that, for p ∈ E³ and p_L ∈ R³
with p_L = (s, t, u), p_L = F_1(p) is given by Equation (7b). Define F_2 as
the mapping F_2: R³ → E³ given by Equation (9). A free-form deformation is
then the mapping F = F_2 ∘ F_1, where F: E³ → E³ transforms every
point of an object S to a point of the deformed object F(S). As F
is independent of the representation of S, the deformation
can be applied to an object irrespective of its representation.
However, applying the deformation to surface points or to the
control points of the surface may result in different surfaces,
as shown in Fig. 4. An example illustrating the application
of FFD to the algebraic surface (8) discussed in Section 2.1
is shown in Fig. 5.
Extended Free-Form Deformation
The control lattice as constructed with Equation (8) is a parallelepiped,
which constrains the shape and size of the region
to be deformed by FFD. Coquillart (9) proposed an
Extended FFD (EFFD) that allows the use of a non-parallelepiped
or non-prismatic control lattice. Predefined lattices in
EFFD include parallelepiped and cylindrical lattices
(Fig. 6). These lattices can be edited and combined to
give more complex lattices. An EFFD lattice thus consists of
a set of standard FFD lattices that may include degenerate
vertices. A surface or a region of a surface is associated
with the EFFD lattice. Given a point p in the region of the
surface to be deformed, the corresponding FFD lattice
containing p is located. The lattice coordinate (s, t, u) of p
relative to the corresponding FFD lattice is then determined
using Newton approximation. Whenever there is
a change in the locations of the lattice vertices, the position
of p in E³ can be computed using Equation (9). This allows
the shape of the surface covered by the lattice to be modified
by adjusting the EFFD lattice control points.
Figure 3. An FFD lattice.
Figure 4. Deformation of a sphere using FFD. (a) Applying the
deformation to the polyhedral model of a sphere. (b) Applying the
deformation to the control polygon of a NURBS sphere.
Direct Free-Form Deformation

Free-form deformation allows an object to be deformed
by adjusting the lattice control points. A more intuitive
approach for controlling the deformation is to allow object
points to be adjusted directly. The corresponding positions
of the lattice control vertices are then determined (10),
which in turn is used to determine the deformation of the
other object points. According to Equation (9), a deformed
object point p can be expressed as p = BQ, where Q is the
column matrix of the lattice control points and B is a
function of s, t, and u. Assuming a deformation of the lattice
control points ΔQ causes a deformation of the object point
by Δp, then

Δp = B ΔQ   (10)

Given Δp, the deformation of the lattice control points
ΔQ is determined by solving Equation (10). As there
are (m+1)(n+1)(l+1) control points in the FFD lattice,
ΔQ is thus a (m+1)(n+1)(l+1) by 3 matrix and B is a row
matrix with (m+1)(n+1)(l+1) elements. As B cannot be
inverted, the pseudoinverse B⁺ of B can be used to obtain
ΔQ, i.e.,

ΔQ = B⁺ Δp   (11)

where B⁺ = Bᵀ(B Bᵀ)⁻¹.

Using the pseudoinverse for solving Equation (10) gives
the best solution, in a least-squares sense, for the required
deformation.
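With numpy, the minimal-norm update of Equation (11) is a one-liner. The sketch below is illustrative only; the function names are ours, and it assumes the row vector B of Bernstein weights has already been evaluated for the manipulated point (e.g., with the weights used in ffd_point above).

# Minimal sketch of direct FFD manipulation, Equations (10) and (11).
import numpy as np

def direct_ffd_update(B_row, delta_p):
    """B_row: (N,) Bernstein weights of the grabbed point; delta_p: (3,) displacement.
    Returns delta_Q with shape (N, 3), the least-norm control point update."""
    B = np.asarray(B_row, dtype=float).reshape(1, -1)    # 1 x N
    dp = np.asarray(delta_p, dtype=float).reshape(1, 3)  # 1 x 3
    B_pinv = np.linalg.pinv(B)                           # N x 1 pseudoinverse B+
    return B_pinv @ dp                                   # N x 3, satisfies B @ dQ = dp

# Usage: check that the manipulated point actually moves by delta_p.
B_row = np.array([0.1, 0.4, 0.4, 0.1])
dQ = direct_ffd_update(B_row, [0.0, 0.5, 0.0])
print(B_row @ dQ)    # approximately [0.0, 0.5, 0.0]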
Other Free-Form Deformation Techniques

There are several other variations of FFD, including the
use of non-uniform rational B-splines for the control lattice,
which provides more control over the deformation (11). The
use of lattices with arbitrary topology constructed with
subdivision surfaces (12) or triangles (13) allows irregularly
shaped lattices to be used for the deformation. Free-form
deformation is also applied in animation (14). A more
recent study of geometric warps and deformations can be
found in Ref. 15.
Figure 5. Applying FFD on an algebraic surface.
Figure 6. A cylindrical lattice.

Axial Deformation

Axial deformation (16) deforms an object by associating a
curve with the object; deformation of the object is then
achieved by manipulating the curve. Axial deformation is
a mapping a: E³ → E³ with a = f ∘ g, where g: E³ → R⁴
and f: R⁴ → E³; the function g converts an object
point p in E³ to the axial curve space defined by a given
curve c(t), and the function f maps a point in axial space back to E³.
Converting a point in E³ to the axial space of a curve is to
specify the given point relative to a local coordinate frame
on c(t). The local coordinate frame of a curve is a set of
normalized orthogonal vectors u(t), v(t), and w(t) defined at
the point c(t). A popular approach is to specify the local
coordinate frame using the Frenet frame, which is
expressed as

w = c'(t) / |c'(t)|   (12a)
u = [c'(t) × c''(t)] / |c'(t) × c''(t)|   (12b)
v = (w × u) / |w × u|   (12c)
The Frenet frame of a curve is defined by the tangent,
normal, and binormal of c(t). The local coordinate frame is
thus completely defined by the curve c(t), and there is no
user control over the orientation of the frame, which may not
be desirable. In addition, c''(t) may vanish, so that the
normal and hence the coordinate frame is not defined.
An alternative is to use the rotation minimizing frame (17):
$$w = \frac{c'(t)}{|c'(t)|} \qquad (13a)$$

$$u'(t) = -\frac{\left(c''(t) \cdot u(t)\right) c'(t)}{|c'(t)|^2} \qquad (13b)$$

$$v'(t) = -\frac{\left(c''(t) \cdot v(t)\right) c'(t)}{|c'(t)|^2} \qquad (13c)$$
Given an initial coordinate frame, the set of differential
equations (13a–13c) can be solved to obtain a rotation
minimized frame along the curve c(t). As only one local
coordinate frame is specified, this technique cannot be used
for obtaining a smooth transition between user-defined
local coordinate frames. That is, the twist of a curve and
hence the twist of the object cannot be fully controlled.
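The sketch below (Python with NumPy; the function names and the forward-Euler integrator are illustrative assumptions, as the text leaves the solver open) integrates Equations 13a–13c along a sampled curve to produce an approximately rotation minimizing frame.

```python
import numpy as np

def rotation_minimizing_frames(c1, c2, ts, u0):
    """Integrate Equations 13a-13c with forward Euler along parameter samples ts.

    c1, c2 are callables returning c'(t) and c''(t); u0 is an initial normal
    vector perpendicular to c'(ts[0]). Sketch only; a production implementation
    would use a higher-order integrator.
    """
    frames = []
    u = np.asarray(u0, dtype=float)
    for i, t in enumerate(ts):
        d1, d2 = c1(t), c2(t)
        w = d1 / np.linalg.norm(d1)          # Equation 13a: unit tangent
        u = u - np.dot(u, w) * w             # re-orthogonalize u against w
        u /= np.linalg.norm(u)
        v = np.cross(w, u)                   # complete the orthonormal frame
        frames.append((u.copy(), v, w))
        if i + 1 < len(ts):
            dt = ts[i + 1] - t
            du = -(np.dot(d2, u) * d1) / np.dot(d1, d1)  # Equation 13b
            u = u + dt * du                  # forward Euler step
    return frames

# Example: frames along a helix c(t) = (cos t, sin t, 0.2 t).
ts = np.linspace(0.0, 2 * np.pi, 50)
frames = rotation_minimizing_frames(
    lambda t: np.array([-np.sin(t), np.cos(t), 0.2]),
    lambda t: np.array([-np.cos(t), -np.sin(t), 0.0]),
    ts, u0=[1.0, 0.0, 0.0],
)
```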
The use of a curve-pair (18) for specifying the local
coordinate frame provides complete control over the twist
of a curve. A primary curve specifies the location of the local
coordinate frame, and an orientation curve is used to define
the rotation of the frame relative to the curve (Fig. 7). In this
case, a special technique is required for synchronizing the
change in shape of both the primary and the orientation
curves. With a well-defined local coordinate frame on an
axial curve c(t), and a given point p on an object, p can be
expressed relative to the curve c(t) as
$$p = c(t_p) + \lambda u(t_p) + \beta v(t_p) + \gamma w(t_p) \qquad (14)$$

where

$$\lambda = (p - c(t_p)) \cdot u(t_p) \qquad (15a)$$

$$\beta = (p - c(t_p)) \cdot v(t_p) \qquad (15b)$$

$$\gamma = (p - c(t_p)) \cdot w(t_p) \qquad (15c)$$

and $c(t_p)$ is the point on c(t) corresponding to p. Usually,
$c(t_p)$ is the point on c(t) closest to p. A point in the axial
space of c(t) corresponding to p is thus $(t_p, \lambda, \beta, \gamma)$. By
specifying object points in the axial space of c(t), the shape
of the object can be deformed by adjusting c(t). Figure 8 shows an
example of applying an axial deformation to an object.
Figure 9 shows the bending and twisting of an object using
the curve-pair approach. A zone of influence can be associated with the axial curve to constrain the region of
influence of the axial curve. The zone of influence is defined
as the region between two general cylinders around the
axial curve. Denoting $r_{min}$ and $r_{max}$ as the radii of the zone of
influence, an object point p will only be deformed if
$r_{min} \le |p - c(t_p)| \le r_{max}$. This allows an object to be
deformed locally, as all object points lying outside of the zone
of influence are not deformed.

Figure 7. Axial curve pair.

Figure 8. Axial deformation. (a) Object with an undeformed axial curve. (b) Object with the deformed axial curve.

Figure 9. Curve-pair based axial deformation. (a) Object with an undeformed axial curve-pair. (b) Effect of bending an axial curve-pair. (c) Effect of twisting an axial curve-pair.
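A minimal sketch of the axial-space mapping of Equations 14 and 15 is shown below (Python with NumPy; the polyline approximation of c(t) and the helper names are assumptions). The closest curve point is approximated by the nearest curve sample, in line with the usual choice of $c(t_p)$ described above.

```python
import numpy as np

def to_axial_space(p, curve_pts, frames):
    """Express point p in the axial space of a sampled curve (Equations 14-15).

    curve_pts: (N, 3) samples of c(t); frames: list of (u, v, w) at each sample.
    Returns (index of the closest sample, lambda, beta, gamma).
    """
    p = np.asarray(p, dtype=float)
    i = int(np.argmin(np.linalg.norm(curve_pts - p, axis=1)))  # t_p ~ nearest sample
    d = p - curve_pts[i]
    u, v, w = frames[i]
    return i, np.dot(d, u), np.dot(d, v), np.dot(d, w)         # Equation 15

def from_axial_space(coords, curve_pts, frames):
    """Recover the (possibly deformed) point from axial coordinates (Equation 14)."""
    i, lam, beta, gam = coords
    u, v, w = frames[i]
    return curve_pts[i] + lam * u + beta * v + gam * w
```

Deforming the curve (and recomputing curve_pts and frames) while keeping the axial coordinates fixed reproduces the deformation of the attached object point.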
The curve-based deformation techniques also include
the Wire deformation technique (19), in which a deformation is defined with a tuple (W, R, s, r, f), where W is the
wire curve, R is a reference curve, r specifies the zone of
influence (the distance from the curve), and s is the scale
factor. By adjusting the wire curve W, the shape of W
relative to the reference curve R determines the deformation of the associated object.
Constraint-Based Deformation
The constraint-based approach (20) deforms a surface by
defining a set of point displacement constraints. Deformations satisfying these constraints are determined by
solving a system of equations. Given a point
$u = [u_1, u_2, \ldots, u_n]^T$ and its displacement
$d(u) = [d_1(u), d_2(u), \ldots, d_n(u)]^T$, the deformation at a point
can be expressed as

$$d(u) = M \, f(u) \qquad (16)$$

where f is a mapping $f: R^n \to R^m$, $m \ge n$, and
M is the matrix of a linear transformation $T: R^m \to R^n$.
Assume there are $n_c$ constraint points; then
$d(u_i) = [d_1(u_i), d_2(u_i), \ldots, d_n(u_i)]^T$, $i = 1, \ldots, n_c$. With a given function f, the matrix M is obtained by solving the system of
$n \times n_c$ equations

$$[d(u_1) \;\ldots\; d(u_{n_c})]^T = M \, [f(u_1) \;\ldots\; f(u_{n_c})]^T \qquad (17)$$
In general, n may not be the same as $n_c$, and M is
obtained using the pseudoinverse. Once M is obtained, the
displacement of any point of the object can be computed
using Equation 16. Based on the same concept, the Simple
Constrained Deformation (21) associates a desired displacement and a radius of influence with each constraint point. A
local B-spline basis function centered at the constraint
point is determined that falls to zero for points beyond
the radius of influence. The displacement of an object point
is obtained by a linear combination of the basis functions
such that all constraints are satisfied.
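The following sketch (Python with NumPy; the simple affine basis f and the helper names are illustrative assumptions) fits the matrix M of Equation 17 in a least-squares sense and then evaluates Equation 16 at an arbitrary point.

```python
import numpy as np

def fit_constraint_deformation(constraint_pts, displacements, f):
    """Solve Equation 17 for M in a least-squares sense.

    constraint_pts: (nc, n) constraint point coordinates; displacements: (nc, n)
    desired displacements d(u_i); f: callable mapping an n-vector to an m-vector.
    The choice of basis f is an assumption here.
    """
    F = np.array([f(u) for u in constraint_pts])        # nc x m
    D = np.asarray(displacements, dtype=float)          # nc x n
    M_T, *_ = np.linalg.lstsq(F, D, rcond=None)         # pseudoinverse solve of F M^T = D
    return M_T.T                                        # M is n x m

def deform_point(u, M, f):
    """Displacement of an arbitrary point via Equation 16: d(u) = M f(u)."""
    return M @ f(u)

# Example with an affine basis f(u) = [1, x, y, z] mapping R^3 into R^4.
f = lambda u: np.array([1.0, u[0], u[1], u[2]])
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
disp = np.array([[0.0, 0.0, 0.1], [0.0, 0.0, 0.3], [0.0, 0.0, 0.2]])
M = fit_constraint_deformation(pts, disp, f)
print(deform_point(np.array([0.5, 0.5, 0.0]), M, f))
```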
Physics-Based Deformation
Physics-based deformation techniques are deformation
methods based on continuum mechanics, which accounts
for the effects of material properties, external forces, and
environmental constraints on the deformation of objects.
Mass-Spring Model
The mass-spring model is one of the most widely used
techniques for physics-based surface deformation (22). A
surface is modeled as a mesh of point masses connected
by springs as shown in Fig. 10. The springs are usually
assumed to be linearly elastic, although in some cases,
nonlinear springs are used to model inelastic behavior.
In general, a mass point at position $x_i$ in a mass-spring
system is governed by Newton's second law of motion:

$$m_i \ddot{x}_i = -\gamma_i \dot{x}_i + \sum_{j=0}^{n_g} g_{ij} + f_i \qquad (18)$$

where $m_i$ is the mass of the mass point, $\gamma_i \dot{x}_i$ is the velocity-dependent damping force, $g_{ij}$ are the spring forces acting at
$m_i$, and $f_i$ is the external force at $m_i$. Assuming there are n
mass points in the system, applying Equation 18 at each of
the mass points gives

$$M\ddot{x} + C\dot{x} + Kx = f \qquad (19)$$

where M, C, and K are the $3n \times 3n$ mass, damping, and
stiffness matrices, respectively, and f is a $3n \times 1$ column
vector denoting the total external forces on the mass points.
Equation 19 describes the dynamics of a mass-spring
system and is thus commonly used for modeling surfaces
that are deformed dynamically. Given the initial position
x and velocity v of the mass points, and expressing
Equation 19 as

$$\dot{v} = M^{-1}(f - Cv - Kx) \qquad (20a)$$

$$\dot{x} = v \qquad (20b)$$

the velocity v of the mass points at subsequent time
instances can be obtained. The corresponding locations of
the mass points can be obtained with various numerical
integration techniques. Mass-spring systems are easy to
construct and can be animated at interactive speed. However, the model is a significant simplification of the physics of a
continuous body. A more accurate physical model for surface deformation is to use the finite/boundary element
method.
Figure 10. A surface modeled with a mass-spring system.
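A minimal time-stepping sketch of Equations 20a and 20b is given below (Python with NumPy; the semi-implicit Euler update is a common choice but an assumption here, as the text leaves the integrator open).

```python
import numpy as np

def step_mass_spring(x, v, M_inv, C, K, f, dt):
    """One semi-implicit Euler step of Equations 20a-20b.

    x, v: 3n state vectors; M_inv, C, K: 3n x 3n matrices; f: 3n external forces.
    Updating the velocity first and then the position improves stability
    compared with plain forward Euler.
    """
    a = M_inv @ (f - C @ v - K @ x)   # Equation 20a: acceleration
    v_new = v + dt * a                # integrate velocity
    x_new = x + dt * v_new            # Equation 20b: integrate position
    return x_new, v_new
```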
Deformation using Finite/Boundary Element Method
Using the finite element method (FEM), an object is
approximated with a set of discrete elements such as tri-
angles, quadrilaterals, tetrahedrons, and cuboids. Solid
elements (e.g., tetrahedrons and cubes) are used for closed
surfaces, whereas plate or shell elements are used for
nonclosed surfaces. As an object is in equilibrium when
its total potential energy is at a minimum, an equilibrium
equation is established by considering the total strain
energy of a system and the work due to external forces.
The boundary element method (BEM) approximates the
closed surface of a volume with a set of elements such as
triangles and quadrilaterals. No solid element is thus
required for BEM. Assuming the deforming object is line-
arly elastic, both FEM and BEM result in a simple matrix
equation

$$F = KU \qquad (21)$$

where F denotes the external forces acting on element
nodes of the object, U is the displacement at the element
nodes, and K is the stiffness matrix. A detailed discussion of
the derivation of Equation 21 can be found in Refs. 22 and
23 and in standard texts on FEM/BEM. Given the material
properties of an object, the variables at an element node are
the external force acting on the node and the corresponding
displacement. In general, one of the variables is known,
whereas the other one is unknown. By partitioning F and U
into known and unknown elements, Equation 21 can be
rewritten as

$$\begin{bmatrix} F_k \\ F_u \end{bmatrix} = \begin{bmatrix} K_{00} & K_{01} \\ K_{10} & K_{11} \end{bmatrix} \begin{bmatrix} U_k \\ U_u \end{bmatrix} \qquad (22)$$
where $F_k$, $F_u$ denote the known and unknown nodal forces
and $U_k$, $U_u$ denote the known and unknown nodal displacements.
Hence,

$$F_k = K_{00} U_k + K_{01} U_u \qquad (23a)$$

$$F_u = K_{10} U_k + K_{11} U_u \qquad (23b)$$
In general, for graphics applications, the displacement at
certain nodes is specified or constrained, and the forces at
the free or unconstrained nodes are zero (i.e., $F_k = 0$).
Assume there are n nodes in the object with known displacements at k of the nodes. The free boundary is then
composed of $n-k$ nodes. The dimensions of $K_{00}$ and $K_{01}$ are
thus $(n-k) \times k$ and $(n-k) \times (n-k)$, respectively. Equation 23a then gives

$$U_u = -K_{01}^{-1} K_{00} U_k \qquad (24)$$

Given the deformation at certain nodes of a surface, the
deformation of the other nodes can be obtained using
Equation 24.
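As an illustration of Equations 22–24, the sketch below (Python with NumPy; one scalar degree of freedom per node is assumed for brevity, and the function names are hypothetical) partitions a stiffness matrix and solves for the unknown displacements at the free nodes, taking $F_k = 0$ as in the text.

```python
import numpy as np

def solve_free_displacements(K, known_idx, U_known):
    """Solve Equation 24 for the unknown nodal displacements.

    K: full n x n stiffness matrix; known_idx: indices of nodes with prescribed
    displacements U_known. Forces at the free nodes are assumed to be zero.
    """
    n = K.shape[0]
    free_idx = np.setdiff1d(np.arange(n), known_idx)
    K00 = K[np.ix_(free_idx, known_idx)]    # (n-k) x k block acting on U_k
    K01 = K[np.ix_(free_idx, free_idx)]     # (n-k) x (n-k) block acting on U_u
    # 0 = K00 U_k + K01 U_u  =>  U_u = -K01^{-1} K00 U_k  (Equation 24)
    U_free = np.linalg.solve(K01, -K00 @ np.asarray(U_known, dtype=float))
    U = np.zeros(n)
    U[known_idx] = U_known
    U[free_idx] = U_free
    return U
```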
In general, FEM and BEM techniques can only be applied
to solid objects or closed surfaces. Standard plate or shell
elements can be used for open surfaces by assuming a
constant thickness for the surface. By considering the
potential energy in the stretching and bending of a surface,
an energy functional can be established for a surface
(24,25). Minimizing the energy functional gives an equation in the form of Equation 21. Standard FEM techniques
can then be used for deforming the surface. A mass term and a
damping term can also be incorporated into Equation 21
to allow for the simulation of dynamic effects.
REMARKS
Geometry-based deformation tools are usually employed
for deforming surfaces in the object construction process in
which the deformation does not need to obey the laws of
physics. For the purposes of animation, the use of geometry-
based deformation for generating realistic motions relies
very much on the skill and experience of the animator in
using the deformation tool. In contrast, physics-based
deformation is capable of generating animated surface
deformations with little user interaction and thus plays an
important role in animation.
BIBLIOGRAPHY
1. A. H. Barr, Global and local deformations of solid primitives, ACM Comput. Graphics, 18(3): 21–30, 1984.
2. T. W. Sederberg, Piecewise algebraic surface patches, Comput.-Aided Geometric Design, 2: 53–59, 1985.
3. B. Guo, Representation of arbitrary shapes using implicit quadrics, Visual Comput., 9(5): 267–277, 1993.
4. C. L. Bajaj, Free-form modeling with implicit surface patches, in J. Bloomenthal and B. Wyvill (eds.), Implicit Surfaces, San Francisco, CA: Morgan Kaufmann Publishers, 1996.
5. G. Farin, Curves and Surfaces for CAGD, A Practical Guide, 4th ed., San Diego, CA: Academic Press, 1997.
6. OpenGL Architecture Review Board, D. Shreiner, M. Woo, and J. Neider, OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 2, 5th ed., 2005.
7. T. W. Sederberg and S. R. Parry, Free-form deformation of solid geometric models, ACM Comput. Graphics, 20(4): 151–160, 1986.
8. K. C. Hui, Free-form deformation of constructive solid model, Comput.-Aided Design, 35(13): 1221–1224, 2003.
9. S. Coquillart, Extended free-form deformation: A sculpturing tool for 3D geometric modeling, ACM Comput. Graphics, 24(4): 187–196, 1990.
10. W. M. Hsu, J. F. Hughes, and H. Kaufmann, Direct manipulation of free-form deformations, ACM Comput. Graphics, 26(2): 177–184, 1992.
11. H. L. Lamousin and W. N. Waggenspack, NURBS-based free-form deformations, IEEE Comput. Graphics Applicat., 14(6): 59–65, 1994.
12. R. MacCracken and K. I. Joy, Free-form deformations with lattices of arbitrary topology, Proc. SIGGRAPH, 1996, pp. 181–188.
13. K. G. Kobayashi and K. Ootsubo, t-FFD: Free-form deformation by using triangular mesh, Proc. 8th ACM Symposium on Solid Modeling and Applications, 2003, pp. 226–234.
14. S. Coquillart and P. Jancene, Animated free-form deformation: An interactive animation technique, ACM Comput. Graphics, 25(4): 23–26, 1991.
15. T. Milliron, R. J. Jensen, R. Barzel, and A. Finkelstein, A framework for geometric warps and deformations, ACM Trans. Graphics, 21(1): 20–51, 2002.
16. F. Lazarus, S. Coquillart, and P. Jancene, Axial deformations: An intuitive deformation technique, Comput.-Aided Design, 26(8): 607–613, 1994.
17. F. Klok, Two moving coordinate frames for sweeping along a 3D trajectory, Comput.-Aided Geometric Design, 3: 217–229, 1986.
18. K. C. Hui, Free-form design using axial curve-pairs, Comput.-Aided Design, 34(8): 583–595, 2002.
19. K. Singh and E. Fiume, Wires: A geometric deformation technique, Proc. SIGGRAPH, 1998, pp. 405–414.
20. P. Borrel and D. Bechmann, Deformation of N-dimensional objects, Internat. J. Computat. Geomet. Applicat., 1(4): 427–453, 1991.
21. P. Borrel and A. Rappoport, Simple constrained deformations for geometric modeling and interactive design, ACM Trans. Graphics, 13(2): 137–155, 1994.
22. S. F. F. Gibson and B. Mirtich, A survey of deformable modeling in computer graphics, TR-97-19, Mitsubishi Electric Research Laboratory (MERL). Available: http://www.merl.com.
23. D. L. James and D. K. Pai, ARTDEFO: Accurate real time deformable objects, Proc. SIGGRAPH, 1999, pp. 65–72.
24. G. Celniker and D. Gossard, Deformable curve and surface finite-elements for free-form shape design, ACM Comput. Graphics, 25(4): 257–266, 1991.
25. D. Terzopoulos and H. Qin, Dynamic NURBS with geometric constraints for interactive sculpting, ACM Trans. Graphics, 13(2): 103–136, 1994.
K. C. HUI
The Chinese University of
Hong Kong
Shatin, Hong Kong
SURFACE MODELING
INTRODUCTION
A surface is a fundamental element in describing the shape of
an object. The classic method for specifying a surface is to
use a set of orthogonal planar projection curves to describe
the surface. With the advance of computer graphics and
computer-aided design technologies, the wire-frame
model has evolved and is now capable of describing
three-dimensional (3-D) objects with lines and curves in
3-D space. However, the wire-frame model is ambiguous
and does not provide a complete 3-D description of a surface. P. Bézier and P. de Casteljau independently developed
the Bézier curves and surfaces in the late 1960s, which
revolutionized the methods for describing surfaces. In
general, surface modeling is the technique for describing
a 3-D surface. There are three popular ways for specifying a
surface, namely, implicit, paramet