
Data Compression

DEBRA A. LELEWER and DANIEL S. HIRSCHBERG


Department of Information and Computer Science, University of California, Irvine, California 92717

This paper surveys a variety of data compression methods spanning almost 40 years of
research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique
developed in 1986. The aim of data compression is to reduce redundancy in stored or
communicated data, thus increasing effective data density. Data compression has
important application in the areas of file storage and distributed systems. Concepts from
information theory as they relate to the goals and evaluation of data compression
methods are discussed briefly. A framework for evaluation and comparison of methods is
constructed and applied to the algorithms presented. Comparisons of both theoretical and
empirical natures are reported, and possibilities for future research are suggested.

Categories and Subject Descriptors: E.4 [Data]: Coding and Information Theory - data compaction and compression
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Adaptive coding, adaptive Huffman codes, coding, coding theory, file compression, Huffman codes, minimum-redundancy codes, optimal codes, prefix codes, text compression

CONTENTS

INTRODUCTION
1. FUNDAMENTAL CONCEPTS
   1.1 Definitions
   1.2 Classification of Methods
   1.3 A Data Compression Model
   1.4 Motivation
2. SEMANTIC-DEPENDENT METHODS
3. STATIC DEFINED-WORD SCHEMES
   3.1 Shannon-Fano Code
   3.2 Static Huffman Coding
   3.3 Universal Codes and Representations of the Integers
   3.4 Arithmetic Coding
4. ADAPTIVE HUFFMAN CODING
   4.1 Algorithm FGK
   4.2 Algorithm V
5. OTHER ADAPTIVE METHODS
   5.1 Lempel-Ziv Codes
   5.2 Algorithm BSTW
6. EMPIRICAL RESULTS
7. SUSCEPTIBILITY TO ERROR
   7.1 Static Codes
   7.2 Adaptive Codes
8. NEW DIRECTIONS
9. SUMMARY
REFERENCES

INTRODUCTION

Data compression is often referred to as coding, where coding is a general term encompassing any special representation of data that satisfies a given need. Information theory is defined as the study of efficient coding and its consequences in the form of speed of transmission and probability of error [Ingels 1971]. Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted. The purpose of this paper is to present and analyze a variety of data compression algorithms.

A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII) into a new string (e.g., of bits) that contains the same information but whose length is as small as possible. Data compression has important application in the areas of data transmission and data storage. Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher, thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.

Many of the methods discussed in this paper are implemented in production systems. The UNIX¹ utilities compact and compress are based on methods discussed in Sections 4 and 5, respectively [UNIX 1984]. Popular file archival systems such as ARC and PKARC use techniques presented in Sections 3 and 5 [ARC 1986; PKARC 1987]. The savings achieved by data compression can be dramatic; reduction as high as 80% is not uncommon [Reghbati 1981]. Typical values of compression provided by compact are text (38%), Pascal source (43%), C source (36%), and binary (19%). Compress generally achieves better compression (50-60% for text such as source code and English) and takes less time to compute [UNIX 1984]. Arithmetic coding (Section 3.4) has been reported to reduce a file to anywhere from 12.1 to 73.5% of its original size [Witten et al. 1987]. Cormack reports that data compression programs based on Huffman coding (Section 3.2) reduced the size of a large student-record database by 42.1% when only some of the information was compressed. As a consequence of this size reduction, the number of disk operations required to load the database was reduced by 32.7% [Cormack 1985]. Data compression routines developed with specific applications in mind have achieved compression factors as high as 98% [Severance 1983].

Although coding for purposes of data security (cryptography) and codes that guarantee a certain level of data integrity (error detection/correction) are topics worthy of attention, they do not fall under the umbrella of data compression. With the exception of a brief discussion of the susceptibility to error of the methods surveyed (Section 7), a discrete noiseless channel is assumed. That is, we assume a system in which a sequence of symbols chosen from a finite alphabet can be transmitted from one point to another without the possibility of error. Of course, the coding schemes described here may be combined with data security or error-correcting codes.

Much of the available literature on data compression approaches the topic from the point of view of data transmission. As noted earlier, data compression is of value in data storage as well. Although this discussion is framed in the terminology of data transmission, compression and decompression of data files are essentially the same tasks as sending and receiving data over a communication channel. The focus of this paper is on algorithms for data compression; it does not deal with hardware aspects of data transmission. The reader is referred to Cappellini [1985] for a discussion of techniques with natural hardware implementation.

Background concepts in the form of terminology and a model for the study of data compression are provided in Section 1. Applications of data compression are also discussed in Section 1 to provide motivation for the material that follows.

¹ UNIX is a trademark of AT&T Bell Laboratories.

Although the primary focus of this survey is data compression methods of general utility, Section 2 includes examples from the literature in which ingenuity applied to domain-specific problems has yielded interesting coding techniques. These techniques are referred to as semantic dependent since they are designed to exploit the context and semantics of the data to achieve redundancy reduction. Semantic-dependent techniques include the use of quadtrees, run-length encoding, or difference mapping for storage and transmission of image data [Gonzalez and Wintz 1977; Samet 1984].

General-purpose techniques, which assume no knowledge of the information content of the data, are described in Sections 3-5. These descriptions are sufficiently detailed to provide an understanding of the techniques. The reader will need to consult the references for implementation details. In most cases only worst-case analyses of the methods are feasible. To provide a more realistic picture of their effectiveness, empirical data are presented in Section 6. The susceptibility to error of the algorithms surveyed is discussed in Section 7, and possible directions for future research are considered in Section 8.

1. FUNDAMENTAL CONCEPTS

A brief introduction to information theory is provided in this section. The definitions and assumptions necessary to a comprehensive discussion and evaluation of data compression methods are discussed. The following string of characters is used to illustrate the concepts defined: EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg".

1.1 Definitions

A code is a mapping of source messages (words from the source alphabet α) into codewords (words of the code alphabet β). The source messages are the basic units into which the string to be represented is partitioned. These basic units may be single symbols from the source alphabet, or they may be strings of symbols. For string EXAMPLE, α = {a, b, c, d, e, f, g, space}. For purposes of explanation, β is taken to be {0, 1}. Codes can be categorized as block-block, block-variable, variable-block, or variable-variable, where block-block indicates that the source messages and codewords are of fixed length and variable-variable codes map variable-length source messages into variable-length codewords. A block-block code for EXAMPLE is shown in Figure 1, and a variable-variable code is given in Figure 2. If the string EXAMPLE were coded using the Figure 1 code, the length of the coded message would be 120; using Figure 2 the length would be 30.

Source message   Codeword
a                000
b                001
c                010
d                011
e                100
f                101
g                110
space            111

Figure 1. A block-block code for EXAMPLE.

Source message   Codeword
aa               0
bbb              1
cccc             10
ddddd            11
eeeeee           100
fffffff          101
gggggggg         110
space            111

Figure 2. A variable-variable code for EXAMPLE.

The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords. These are not discussed, since they do not provide compression. The codes featured in this survey are of the block-variable, variable-variable, and variable-block types.
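As a concrete check of the figures above, the short Python sketch below (an editorial illustration, not part of the original survey) encodes EXAMPLE with the Figure 1 and Figure 2 codes and reproduces the lengths 120 and 30 quoted in the text. The exact spacing of the EXAMPLE string is reconstructed from the frequency counts of Figure 3 (five spaces, 40 characters in all), so it is an assumption.

    # Encode EXAMPLE with the codes of Figures 1 and 2 and report the coded lengths.
    # EXAMPLE is reconstructed here from Figure 3's frequencies (2 a's, 3 b's, 4 c's,
    # 5 d's, 6 e's, 7 f's, 8 g's, 5 spaces); the spacing shown is an assumption.
    from itertools import groupby

    example = "aa bbb cccc ddddd eeeeee fffffffgggggggg"

    block_block = {'a': "000", 'b': "001", 'c': "010", 'd': "011",
                   'e': "100", 'f': "101", 'g': "110", ' ': "111"}    # Figure 1

    var_var = {"aa": "0", "bbb": "1", "cccc": "10", "ddddd": "11",
               "eeeeee": "100", "fffffff": "101", "gggggggg": "110",
               " ": "111"}                                            # Figure 2

    # Block-block coding: one fixed-length codeword per character.
    len_fig1 = sum(len(block_block[ch]) for ch in example)

    # Variable-variable coding: parse the ensemble into the Figure 2 source
    # messages (maximal runs of one letter, or a single space) before coding.
    tokens = ["".join(run) for _, run in groupby(example)]
    len_fig2 = sum(len(var_var[t]) for t in tokens)

    print(len_fig1, len_fig2)    # 120 30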

When source messages of variable length are allowed, the question of how a message ensemble (sequence of messages) is parsed into individual messages arises. Many of the algorithms described here are defined-word schemes. That is, the set of source messages is determined before the invocation of the coding scheme. For example, in text file processing, each character may constitute a message, or messages may be defined to consist of alphanumeric and nonalphanumeric strings. In Pascal source code, each token may represent a message. All codes involving fixed-length source messages are, by default, defined-word codes. In free-parse methods, the coding algorithm itself parses the ensemble into variable-length sequences of symbols. Most of the known data compression methods are defined-word schemes; the free-parse model differs in a fundamental way from the classical coding paradigm.

A code is distinct if each codeword is distinguishable from every other (i.e., the mapping from source messages to codewords is one to one). A distinct code is uniquely decodable if every codeword is identifiable when immersed in a sequence of codewords. Clearly, each of these features is desirable. The codes of Figures 1 and 2 are both distinct, but the code of Figure 2 is not uniquely decodable. For example, the coded message 11 could be decoded as either "ddddd" or "bbbbbb". A uniquely decodable code is a prefix code (or prefix-free code) if it has the prefix property, which requires that no codeword be a proper prefix of any other codeword. All uniquely decodable block-block and variable-block codes are prefix codes. The code with codewords {1, 100000, 00} is an example of a code that is uniquely decodable but that does not have the prefix property. Prefix codes are instantaneously decodable; that is, they have the desirable property that the coded message can be parsed into codewords without the need for lookahead. In order to decode a message encoded using the codeword set {1, 100000, 00}, lookahead is required. For example, the first codeword of the message 1000000001 is 1, but this cannot be determined until the last (tenth) symbol of the message is read (if the string of zeros had been of odd length, the first codeword would have been 100000).

A minimal prefix code is a prefix code such that, if x is a proper prefix of some codeword, then xu is either a codeword or a proper prefix of a codeword for each letter u in β. The set of codewords {00, 01, 10} is an example of a prefix code that is not minimal. The fact that 1 is a proper prefix of the codeword 10 requires that 11 be either a codeword or a proper prefix of a codeword, and it is neither. Intuitively, the minimality constraint prevents the use of codewords that are longer than necessary. In the above example the codeword 10 could be replaced by the codeword 1, yielding a minimal prefix code with shorter codewords. The codes discussed in this paper are all minimal prefix codes.

In this section a code has been defined to be a mapping from a source alphabet to a code alphabet; we now define related terms. The process of transforming a source ensemble into a coded message is coding or encoding. The encoded message may be referred to as an encoding of the source ensemble. The algorithm that constructs the mapping and uses it to transform the source ensemble is called the encoder. The decoder performs the inverse operation, restoring the coded message to its original form.
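The prefix property is easy to test mechanically. The sketch below (a minimal illustration added here, not from the survey) checks whether any codeword is a proper prefix of another; it confirms that the set {1, 100000, 00} discussed above is not a prefix code, while the Figure 1 codewords are.

    # Test the prefix property of a codeword set.
    def is_prefix_code(codewords):
        """Return True if no codeword is a proper prefix of another."""
        words = sorted(codewords)      # a prefix sorts immediately before its extensions
        return not any(words[i + 1].startswith(words[i])
                       for i in range(len(words) - 1))

    print(is_prefix_code(["1", "100000", "00"]))          # False: 1 is a prefix of 100000
    print(is_prefix_code(["000", "001", "010", "011",
                          "100", "101", "110", "111"]))   # True: the Figure 1 code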


1.2 Classification of Methods

Not only are data compression schemes categorized with respect to message and codeword lengths, but they are also classified as either static or dynamic. A static method is one in which the mapping from the set of messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble. The classic static defined-word scheme is Huffman coding [Huffman 1952]. In Huffman coding, the assignment of codewords to source messages is based on the probabilities with which the source messages appear in the message ensemble. Messages that appear frequently are represented by short codewords; messages with smaller probabilities map to longer codewords. These probabilities are determined before transmission begins. A Huffman code for the ensemble EXAMPLE is given in Figure 3. If EXAMPLE were coded using this Huffman mapping, the length of the coded message would be 117. Static Huffman coding is discussed in Section 3.2; other static schemes are discussed in Sections 2 and 3.

Source message   Probability   Codeword
a                2/40          1001
b                3/40          1000
c                4/40          011
d                5/40          010
e                6/40          111
f                7/40          110
g                8/40          00
space            5/40          101

Figure 3. A Huffman code for the message EXAMPLE (code length = 117).

A code is dynamic if the mapping from the set of messages to the set of codewords changes over time. For example, dynamic Huffman coding involves computing an approximation to the probabilities of occurrence "on the fly," as the ensemble is being transmitted. The assignment of codewords to messages is based on the values of the relative frequencies of occurrence at each point in time. A message x may be represented by a short codeword early in the transmission because it occurs frequently at the beginning of the ensemble, even though its probability of occurrence over the total ensemble is low. Later, when the more probable messages begin to occur with higher frequency, the short codeword will be mapped to one of the higher probability messages, and x will be mapped to a longer codeword. As an illustration, Figure 4 presents a dynamic Huffman code table corresponding to the prefix "aa bbb" of EXAMPLE. Although the frequency of space over the entire message is greater than that of b, at this point b has higher frequency and therefore is mapped to the shorter codeword.

Source message   Probability   Codeword
a                2/6           10
b                3/6           0
space            1/6           11

Figure 4. A dynamic Huffman code table for the prefix "aa bbb" of message EXAMPLE.

Dynamic codes are also referred to in the literature as adaptive, in that they adapt to changes in ensemble characteristics over time. The term adaptive is used for the remainder of this paper; the fact that these codes adapt to changing characteristics is the source of their appeal. Some adaptive methods adapt to changing patterns in the source [Welch 1984], whereas others exploit locality of reference [Bentley et al. 1986]. Locality of reference is the tendency, common in a wide variety of text types, for a particular word to occur frequently for short periods of time and then fall into disuse for long periods.

All of the adaptive methods are one-pass methods; only one scan of the ensemble is required. Static Huffman coding requires two passes: one pass to compute probabilities and determine the mapping, and a second pass for transmission. Thus, as long as the encoding and decoding times of an adaptive method are not substantially greater than those of a static method, the fact that an initial scan is not needed implies a speed improvement in the adaptive case. In addition, the mapping determined in the first pass of a static coding scheme must be transmitted by the encoder to the decoder. The mapping may preface each transmission (i.e., each file sent), or a single mapping may be agreed upon and used for multiple transmissions. In one-pass methods the encoder defines and redefines the mapping dynamically during transmission. The decoder must define and redefine the mapping in sympathy, in essence "learning" the mapping as codewords are received. Adaptive methods are discussed in Sections 4 and 5.

An algorithm may also be a hybrid, neither completely static nor completely dynamic. In a simple hybrid scheme, sender and receiver maintain identical codebooks containing k static codes. For each transmission, the sender must choose one of the k previously agreed upon codes and inform the receiver of the choice (by transmitting first the "name" or number of the chosen code). Hybrid methods are discussed further in Sections 2 and 3.2.
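As a small, hedged illustration of the adaptive idea (not an implementation of any particular adaptive Huffman algorithm, which is the subject of Section 4), the sketch below tallies the running frequency counts that an encoder and decoder would both maintain after processing the prefix "aa bbb", reproducing the ranking behind Figure 4.

    # The shared bookkeeping behind Figure 4: counts after the prefix "aa bbb".
    # (Only the frequency tally is shown; the codeword assignment used by an
    # adaptive Huffman coder is described in Section 4.)
    from collections import Counter

    prefix = "aa bbb"
    counts = Counter(prefix)                     # {'b': 3, 'a': 2, ' ': 1}

    # Rank messages by current frequency; at this point b outranks space,
    # so b receives the shortest codeword, as in Figure 4.
    ranking = [symbol for symbol, _ in counts.most_common()]
    print(counts, ranking)                       # ['b', 'a', ' ']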

1.3 A Data Compression Model

In order to discuss the relative merits of data compression techniques, a framework for comparison must be established. There are two dimensions along which each of the schemes discussed here may be measured: algorithm complexity and amount of compression. When data compression is used in a data transmission application, the goal is speed. Speed of transmission depends on the number of bits sent, the time required for the encoder to generate the coded message, and the time required for the decoder to recover the original ensemble. In a data storage application, although the degree of compression is the primary concern, it is nonetheless necessary that the algorithm be efficient in order for the scheme to be practical. For a static scheme, there are three algorithms to analyze: the map construction algorithm, the encoding algorithm, and the decoding algorithm. For a dynamic scheme, there are just two algorithms: the encoding algorithm and the decoding algorithm.

Several common measures of compression have been suggested: redundancy [Shannon and Weaver 1949], average message length [Huffman 1952], and compression ratio [Rubin 1976; Ruth and Kreutzer 1972]. These measures are defined below. Related to each of these measures are assumptions about the characteristics of the source. It is generally assumed in information theory that all statistical parameters of a message source are known with perfect accuracy [Gilbert 1971]. The most common model is that of a discrete memoryless source: a source whose output is a sequence of letters (or messages), each letter being a selection from some fixed alphabet a_1, ..., a_n. The letters are taken to be random, statistically independent selections from the alphabet, the selection being made according to some fixed probability assignment p(a_1), ..., p(a_n) [Gallager 1968]. Without loss of generality, the code alphabet is assumed to be {0, 1} throughout this paper. The modifications necessary for larger code alphabets are straightforward.

We assume that any cost associated with the code letters is uniform. This is a reasonable assumption, although it omits applications like telegraphy, where the code symbols are of different durations. The assumption is also important, since the problem of constructing optimal codes over unequal code letter costs is a significantly different and more difficult problem. Perl et al. [1975] and Varn [1971] have developed algorithms for minimum-redundancy prefix coding in the case of arbitrary symbol cost and equal codeword probability. The assumption of equal probabilities mitigates the difficulty presented by the variable symbol cost. For the more general unequal letter costs and unequal probabilities model, Karp [1961] has proposed an integer linear programming approach. There have been several approximation algorithms proposed for this more difficult problem [Krause 1962; Cot 1977; Mehlhorn 1980].

When data are compressed, the goal is to reduce redundancy, leaving only the informational content. The measure of information of a source message a_i (in bits) is -log p(a_i), where log denotes the base-2 logarithm and p(a_i) denotes the probability of occurrence of the message a_i.² This definition has intuitive appeal; in the case in which p(a_i) = 1, it is clear that a_i is not at all informative since it had to occur. Similarly, the smaller the value of p(a_i), the more unlikely a_i is to appear, and hence the larger its information content. The reader is referred to Abramson [1963, pp. 6-13] for a longer, more elegant discussion of the legitimacy of this technical definition of the concept of information. The average information content over the source alphabet can be computed by weighting the information content of each source letter by its probability of occurrence, yielding the expression Σ_{i=1}^{n} [-p(a_i) log p(a_i)]. This quantity is referred to as the entropy of a source letter or the entropy of the source and is denoted by H. Since the length of a codeword for message a_i must be sufficient to carry the information content of a_i, entropy imposes a lower bound on the number of bits required for the coded message. The total number of bits must be at least as large as the product of H and the length of the source ensemble. Since the value of H is generally not an integer, variable-length codewords must be used if the lower bound is to be achieved. Given that message EXAMPLE is to be encoded one letter at a time, the entropy of its source can be calculated using the probabilities given in Figure 3: H = 2.894, so that the minimum number of bits contained in an encoding of EXAMPLE is 116. The Huffman code given in Section 1.2 does not quite achieve the theoretical minimum in this case.

² Note that throughout this paper all logarithms are to the base 2.
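The entropy figure quoted above (H = 2.894 for the EXAMPLE distribution) can be checked directly. The following sketch, added for illustration, computes H from the Figure 3 probabilities and the resulting lower bound of 116 bits for the 40-character ensemble.

    # Entropy of the EXAMPLE source (probabilities from Figure 3) and the
    # resulting lower bound on the size of the coded ensemble.
    from math import log2, ceil

    probabilities = [2/40, 3/40, 4/40, 5/40, 6/40, 7/40, 8/40, 5/40]   # a..g, space

    H = sum(-p * log2(p) for p in probabilities)
    print(round(H, 3))        # 2.894 bits per source letter
    print(ceil(40 * H))       # 116 bits: lower bound for the 40-letter ensemble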

Both of these definitions of information content are due to Shannon [1949]. A derivation of the concept of entropy as it relates to information theory is presented by Shannon [1949]. A simpler, more intuitive explanation of entropy is offered by Ash [1965].

The most common notion of a "good" code is one that is optimal in the sense of having minimum redundancy. Redundancy can be defined as Σ p(a_i) l_i - Σ [-p(a_i) log p(a_i)], where l_i is the length of the codeword representing message a_i. The expression Σ p(a_i) l_i represents the lengths of the codewords weighted by their probabilities of occurrence, that is, the average codeword length. The expression Σ [-p(a_i) log p(a_i)] is entropy H. Thus, redundancy is a measure of the difference between average codeword length and average information content. If a code has minimum average codeword length for a given discrete probability distribution, it is said to be a minimum redundancy code.

We define the term local redundancy to capture the notion of redundancy caused by local properties of a message ensemble, rather than its global characteristics. Although the model used for analyzing general-purpose coding techniques assumes a random distribution of the source messages, this may not actually be the case. In particular applications the tendency for messages to cluster in predictable patterns may be known. The existence of predictable patterns may be exploited to minimize local redundancy. Examples of applications in which local redundancy is common and methods for dealing with local redundancy are discussed in Sections 2 and 6.2.

Huffman uses average message length, Σ p(a_i) l_i, as a measure of the efficiency of a code. Clearly, the meaning of this term is the average length of a coded message. We use the term average codeword length to represent this quantity. Since redundancy is defined to be average codeword length minus entropy and entropy is constant for a given probability distribution, minimizing average codeword length minimizes redundancy.

A code is asymptotically optimal if it has the property that, for a given probability distribution, the ratio of average codeword length to entropy approaches 1 as entropy tends to infinity. That is, asymptotic optimality guarantees that average codeword length approaches the theoretical minimum (entropy represents information content, which imposes a lower bound on codeword length).

The amount of compression yielded by a coding scheme can be measured by a compression ratio. Compression ratio has been defined in several ways. The definition C = (average message length)/(average codeword length) captures the common meaning, which is a comparison of the length of the coded message to the length of the original ensemble [Cappellini 1985]. If we think of the characters of the ensemble EXAMPLE as 6-bit ASCII characters, then the average message length is 6 bits. The Huffman code of Section 1.2 represents EXAMPLE in 117 bits, or 2.9 bits per character. This yields a compression ratio of 6/2.9, representing compression by a factor of more than 2. Alternatively, we may say that Huffman encoding produces a file whose size is 49% of the original ASCII file, or that 49% compression has been achieved.

A somewhat different definition of compression ratio, by Rubin [1976], C = (S - O - OR)/S, includes the representation of the code itself in the transmission cost. In this definition, S represents the length of the source ensemble, O the length of the output (coded message), and OR the size of the "output representation" (e.g., the number of bits required for the encoder to transmit the code mapping to the decoder). The quantity OR constitutes a "charge" to an algorithm for transmission of information about the coding scheme. The intention is to measure the total size of the transmission (or file to be stored).
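To make the two definitions concrete, the sketch below (an added illustration, not from the survey) evaluates both compression ratios for the Huffman encoding of EXAMPLE: the simple ratio 6/2.9 and Rubin's formula, with the overhead term OR set to zero here purely as a simplifying assumption.

    # The two compression-ratio definitions applied to EXAMPLE.
    source_bits = 40 * 6          # 40 characters at 6 bits each (ASCII model)
    coded_bits = 117              # the Huffman encoding of Figure 3

    # C = (average message length) / (average codeword length)
    C_simple = 6 / (coded_bits / 40)
    print(round(C_simple, 2))                        # about 2.05: a factor of more than 2
    print(round(100 * coded_bits / source_bits))     # 49: coded file is about 49% of the original

    # Rubin's definition charges for transmitting the code itself (OR).
    # OR = 0 is assumed only to keep the example self-contained.
    S, O, OR = source_bits, coded_bits, 0
    C_rubin = (S - O - OR) / S
    print(round(C_rubin, 2))                         # 0.51: fraction of the original saved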

1.4 Motivation

As discussed in the Introduction, data compression has wide application in terms of information storage, including representation of the abstract data type string [Standish 1980] and file compression. Huffman coding is used for compression in several file archival systems [ARC 1986; PKARC 1987], as is Lempel-Ziv coding, one of the adaptive schemes to be discussed in Section 5. An adaptive Huffman coding technique is the basis for the compact command of the UNIX operating system, and the UNIX compress utility employs the Lempel-Ziv approach [UNIX 1984].

In the area of data transmission, Huffman coding has been passed over for years in favor of block-block codes, notably ASCII. The advantage of Huffman coding is in the average number of bits per character transmitted, which may be much smaller than the log n bits per character (where n is the source alphabet size) of a block-block system. The primary difficulty associated with variable-length codewords is that the rate at which bits are presented to the transmission channel will fluctuate, depending on the relative frequencies of the source messages. This fluctuation requires buffering between the source and the channel. Advances in technology have both overcome this difficulty and contributed to the appeal of variable-length codes. Current data networks allocate communication resources to sources on the basis of need and provide buffering as part of the system. These systems require significant amounts of protocol, and fixed-length codes are quite inefficient for applications such as packet headers. In addition, communication costs are beginning to dominate storage and processing costs, so that variable-length coding schemes that reduce communication costs are attractive even if they are more complex. For these reasons, one could expect to see even greater use of variable-length coding in the future.

It is interesting to note that the Huffman coding algorithm, originally developed for the efficient transmission of data, also has a wide variety of applications outside the sphere of data compression. These include construction of optimal search trees [Zimmerman 1959; Hu and Tucker 1971; Itai 1976], list merging [Brent and Kung 1978], and generating optimal evaluation trees in the compilation of expressions [Parker 1980]. Additional applications involve search for jumps in a monotone function of a single variable, sources of pollution along a river, and leaks in a pipeline [Glassey and Karp 1976]. The fact that this elegant combinatorial algorithm has influenced so many diverse areas underscores its importance.

2. SEMANTIC-DEPENDENT METHODS

Semantic-dependent data compression techniques are designed to respond to specific types of local redundancy occurring in certain applications. One area in which data compression is of great importance is image representation and processing. There are two major reasons for this. The first is that digitized images contain a large amount of local redundancy. An image is usually captured in the form of an array of pixels, whereas methods that exploit the tendency for pixels of like color or intensity to cluster together may be more efficient. The second reason for the abundance of research in this area is volume. Digital images usually require a very large number of bits, and many uses of digital images involve large collections of images.

One technique used for compression of image data is run-length encoding. In a common version of run-length encoding, the sequence of image elements along a scan line (row) x_1, x_2, ..., x_n is mapped into a sequence of pairs (c_1, l_1), (c_2, l_2), ..., (c_k, l_k), where c_i represents an intensity or color and l_i the length of the ith run (sequence of pixels of equal intensity). For pictures such as weather maps, run-length encoding can save a significant number of bits over the image element sequence [Gonzalez and Wintz 1977].
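A minimal run-length encoder for a scan line might look like the sketch below (an illustrative addition; the pair format follows the (c_i, l_i) description above).

    # Run-length encoding of a scan line into (intensity, run length) pairs.
    from itertools import groupby

    def run_length_encode(scan_line):
        """Map x1, x2, ..., xn to (c1, l1), (c2, l2), ... as described above."""
        return [(value, len(list(run))) for value, run in groupby(scan_line)]

    line = [0, 0, 0, 0, 7, 7, 7, 3, 3, 3, 3, 3]
    print(run_length_encode(line))    # [(0, 4), (7, 3), (3, 5)]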

Another data compression technique specific to the area of image data is difference mapping, in which the image is represented as an array of differences in brightness (or color) between adjacent pixels rather than the brightness values themselves. Difference mapping was used to encode the pictures of Uranus transmitted by Voyager 2. The 8 bits per pixel needed to represent 256 brightness levels was reduced to an average of 3 bits per pixel when difference values were transmitted [Laeser et al. 1986]. In spacecraft applications, image fidelity is a major concern because of the effect of the distance from the spacecraft to Earth on transmission reliability. Difference mapping was combined with error-correcting codes to provide both compression and data integrity in the Voyager 2 project. Another method that takes advantage of the tendency for images to contain large areas of constant intensity is the use of the quadtree data structure [Samet 1984]. Additional examples of coding techniques used in image processing can be found in Wilkins and Wintz [1971] and in Cappellini [1985].

Data compression is of interest in business data processing, because of both the cost savings it offers and the large volume of data manipulated in many business applications. The types of local redundancy present in business data files include runs of zeros in numeric fields, sequences of blanks in alphanumeric fields, and fields that are present in some records and null in others. Run-length encoding can be used to compress sequences of zeros or blanks. Null suppression can be accomplished through the use of presence bits [Ruth and Kreutzer 1972]. Another class of methods exploits cases in which only a limited set of attribute values exist. Dictionary substitution entails replacing alphanumeric representations of information such as bank account type, insurance policy type, sex, and month by the few bits necessary to represent the limited number of possible attribute values [Reghbati 1981].

Cormack [1985] describes a data compression system that is designed for use with database files. The method, which is part of IBM's Information Management System (IMS), compresses individual records and is invoked each time a record is stored in the database file; expansion is performed each time a record is retrieved. Since records may be retrieved in any order, context information used by the compression routine is limited to a single record. In order for the routine to be applicable to any database, it must be able to adapt to the format of the record. The fact that database records are usually heterogeneous collections of small fields indicates that the local properties of the data are more important than their global characteristics. The compression routine in IMS is a hybrid method that attacks this local redundancy by using different coding schemes for different types of fields. The identified field types in IMS are letters of the alphabet, numeric digits, packed decimal digit pairs, blank, and other. When compression begins, a default code is used to encode the first character of the record. For each subsequent character, the type of the previous character determines the code to be used. For example, if the record "01870bABCDbbLMN" (with b denoting a blank) were encoded with the letter code as default, the leading zero would be coded using the letter code; the 1, 8, 7, 0 and the first blank (b) would be coded by the numeric code. The A would be coded by the blank code; B, C, D, and the next blank by the letter code; the next blank and the L by the blank code; and the M and N by the letter code. Clearly, each code must define a codeword for every character; the letter code would assign the shortest codewords to letters, the numeric code would favor the digits, and so on. In the system Cormack describes, the types of the characters are stored in the encode/decode data structures. When a character c is received, the decoder checks type(c) to detect which code table will be used in transmitting the next character. The compression algorithm might be more efficient if a special bit string were used to alert the receiver to a change in code table. Particularly if fields were reasonably long, decoding would be more rapid and the extra bits in the transmission would not be excessive.
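The following sketch is a hedged illustration of the switching idea described above, not of IBM's actual IMS routine: the type of the previous character selects the code table used for the next one. The type names, tables, and record are invented here for illustration; in a real system each table would assign short codewords to its own character class.

    # Illustrative only: previous character's type selects the next code table.
    def char_type(c):
        if c.isalpha():
            return "letter"
        if c.isdigit():
            return "numeric"
        if c == " ":
            return "blank"
        return "other"

    def encode_record(record, tables, default_type="letter"):
        """Encode each character with the table chosen by the previous character's type."""
        out = []
        current = default_type
        for c in record:
            out.append(tables[current][c])   # code table keyed by the previous type
            current = char_type(c)           # the type of c picks the next table
        return "".join(out)

    # Dummy 8-bit tables, so the example runs; a real routine would use a short
    # code biased toward each table's own character class.
    alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ "
    dummy_table = {c: format(ord(c), "08b") for c in alphabet}
    tables = {t: dummy_table for t in ("letter", "numeric", "blank", "other")}

    print(len(encode_record("01870 ABCD  LMN", tables)))   # 15 characters * 8 bits = 120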

Cormack reports that the performance of the IMS compression routines is very good; at least 50 sites are currently using the system. He cites a case of a database containing student records whose size was reduced by 42.1%, and as a side effect the number of disk operations required to load the database was reduced by 32.7% [Cormack 1985].

A variety of approaches to data compression designed with text files in mind includes use of a dictionary representing either all of the words in the file, so that the file itself is coded as a list of pointers to the dictionary [Hahn 1974], or common words and word endings, so that the file consists of pointers to the dictionary and encodings of the less common words [Tropper 1982]. Hand selection of common phrases [Wagner 1973], programmed selection of prefixes and suffixes [Fraenkel et al. 1983], and programmed selection of common character pairs [Cortesi 1982; Snyderman and Hunt 1970] have also been investigated.

This discussion of semantic-dependent data compression techniques represents a limited sample of a very large body of research. These methods and others of a like nature are interesting and of great value in their intended domains. Their obvious drawback lies in their limited utility. It should be noted, however, that much of the efficiency gained through the use of semantic-dependent techniques can be achieved through more general methods, albeit to a lesser degree. For example, the dictionary approaches can be implemented through either Huffman coding (Sections 3.2 and 4) or Lempel-Ziv codes (Section 5.1). Cormack's database scheme is a special case of the codebook approach (Section 3.2), and run-length encoding is one of the effects of Lempel-Ziv codes.

3. STATIC DEFINED-WORD SCHEMES

The classic defined-word scheme was developed over 30 years ago in Huffman's well-known paper on minimum-redundancy coding [Huffman 1952]. Huffman's algorithm provided the first solution to the problem of constructing minimum-redundancy codes. Many people believe that Huffman coding cannot be improved upon; that is, that it is guaranteed to achieve the best possible compression ratio. This is only true, however, under the constraints that each source message is mapped to a unique codeword and that the compressed text is the concatenation of the codewords for the source messages. An earlier algorithm, due independently to Shannon and Fano [Shannon and Weaver 1949; Fano 1949], is not guaranteed to provide optimal codes but approaches optimal behavior as the number of messages approaches infinity. The Huffman algorithm is also of importance because it has provided a foundation upon which other data compression techniques have been built and a benchmark to which they may be compared. We classify the codes generated by the Huffman and Shannon-Fano algorithms as variable-variable and note that they include block-variable codes as a special case, depending on how the source messages are defined. Shannon-Fano coding is discussed in Section 3.1; Huffman coding in Section 3.2.

In Section 3.3 codes that map the integers onto binary codewords are discussed. Since any finite alphabet may be enumerated, this type of code has general-purpose utility. However, a more common use of these codes (called universal codes) is in conjunction with an adaptive scheme. This connection is discussed in Section 5.2.

Arithmetic coding, presented in Section 3.4, takes a significantly different approach to data compression from that of the other static methods. It does not construct a code, in the sense of a mapping from source messages to codewords. Instead, arithmetic coding replaces the source ensemble by a code string that, unlike all the other codes discussed here, is not the concatenation of codewords corresponding to individual source messages. Arithmetic coding is capable of achieving compression results that are arbitrarily close to the entropy of the source.

3.1 Shannon-Fano Coding

The Shannon-Fano technique has as an advantage its simplicity. The code is constructed as follows: The source messages a_i and their probabilities p(a_i) are listed in order of nonincreasing probability. This list is then divided in such a way as to form two groups of as nearly equal total probabilities as possible. Each message in the first group receives 0 as the first digit of its codeword; the messages in the second half have codewords beginning with 1. Each of these groups is then divided according to the same criterion, and additional code digits are appended. The process is continued until each subset contains only one message. Clearly, the Shannon-Fano algorithm yields a minimal prefix code.

Figure 5 shows the application of the method to a particularly simple probability distribution. The length of each codeword is equal to -log p(a_i). This is true as long as it is possible to divide the list into subgroups of exactly equal probability. When this is not possible, some codewords may be of length -log p(a_i) + 1. The Shannon-Fano algorithm yields an average codeword length S that satisfies H ≤ S ≤ H + 1. In Figure 6, the Shannon-Fano code for ensemble EXAMPLE is given. As is often the case, the average codeword length is the same as that achieved by the Huffman code (see Figure 3). That the Shannon-Fano algorithm is not guaranteed to produce an optimal code is demonstrated by the following set of probabilities: (.35, .17, .17, .16, .15). The Shannon-Fano code for this distribution is compared with the Huffman code in Section 3.2.

Source message   Probability   Codeword
a1               1/2           0
a2               1/4           10
a3               1/8           110
a4               1/16          1110
a5               1/32          11110
a6               1/32          11111

Figure 5. A Shannon-Fano code.

Source message   Probability   Codeword
g                8/40          00
f                7/40          010
e                6/40          011
d                5/40          100
space            5/40          101
c                4/40          110
b                3/40          1110
a                2/40          1111

Figure 6. A Shannon-Fano code for EXAMPLE (code length = 117).

3.2 Static Huffman Coding

Huffman's algorithm, expressed graphically, takes as input a list of nonnegative weights {w_1, ..., w_n} and constructs a full binary tree³ whose leaves are labeled with the weights. When the Huffman algorithm is used to construct a code, the weights represent the probabilities associated with the source letters. Initially, there is a set of singleton trees, one for each weight in the list. At each step in the algorithm the trees corresponding to the two smallest weights, w_i and w_j, are merged into a new tree whose weight is w_i + w_j and whose root has two children that are the subtrees represented by w_i and w_j. The weights w_i and w_j are removed from the list, and w_i + w_j is inserted into the list. This process continues until the weight list contains a single value. If, at any time, there is more than one way to choose a smallest pair of weights, any such pair may be chosen. In Huffman's paper the process begins with a nonincreasing list of weights. This detail is not important to the correctness of the algorithm, but it does provide a more efficient implementation [Huffman 1952]. The Huffman algorithm is demonstrated in Figure 7.

³ A binary tree is full if every node has either zero or two children.

a1   .25   .25   .25   .33   .42   .58   1.0
a2   .20   .20   .22   .25   .33   .42
a3   .15   .18   .20   .22   .25
a4   .12   .15   .18   .20
a5   .10   .12   .15
a6   .10   .10
a7   .08

Figure 7. The Huffman process: (a) the list; (b) the tree.

The Huffman algorithm determines the lengths of the codewords to be mapped to each of the source letters a_i. There are many alternatives for specifying the actual digits; it is necessary only that the code have the prefix property. The usual assignment entails labeling the edge from each parent to its left child with the digit 0 and the edge to the right child with 1. The codeword for each source letter is the sequence of labels along the path from the root to the leaf node representing that letter. The codewords for the source of Figure 7, in order of decreasing probability, are {01, 11, 001, 100, 101, 0000, 0001}. Clearly, this process yields a minimal prefix code. Further, the algorithm is guaranteed to produce an optimal (minimum redundancy) code [Huffman 1952]. Gallager [1978] has proved an upper bound on the redundancy of a Huffman code of p_n + log[(2 log e)/e] ≈ p_n + 0.086, where p_n is the probability of the least likely source message. In a recent paper, Capocelli et al. [1986] have provided new bounds that are tighter than those of Gallager for some probability distributions. Figure 8 shows a distribution for which the Huffman code is optimal while the Shannon-Fano code is not.

                          S-F    Huffman
a1     0.35               00     1
a2     0.17               01     011
a3     0.17               10     010
a4     0.16               110    001
a5     0.15               111    000
Average codeword length   2.31   2.30

Figure 8. Comparison of Shannon-Fano and Huffman codes.

In addition to the fact that there are many ways of forming codewords of appropriate lengths, there are cases in which the Huffman algorithm does not uniquely determine these lengths owing to the arbitrary choice among equal minimum weights. As an example, codes with codeword lengths of {1, 2, 3, 4, 4} and {2, 2, 2, 3, 3} both yield the same average codeword length for a source with probabilities (.4, .2, .2, .1, .1). Schwartz [1964] defines a variation of the Huffman algorithm that performs "bottom merging", that is, that orders a new parent node above existing nodes of the same weight and always merges the last two weights in the list. The code constructed is the Huffman code with minimum values of maximum codeword length (max{l_i}) and total codeword length (Σ l_i). Schwartz and Kallick [1964] describe an implementation of Huffman's algorithm with bottom merging. The Schwartz-Kallick algorithm and a later algorithm by Connell [1973] use Huffman's procedure to determine the lengths of the codewords, and actual digits are assigned so that the code has the numerical sequence property; that is, codewords of equal length form a consecutive sequence of binary numbers. Shannon-Fano codes also have the numerical sequence property. This property can be exploited to achieve a compact representation of the code and rapid encoding and decoding.

Both the Huffman and the Shannon-Fano mappings can be generated in O(n) time, where n is the number of messages in the source ensemble (assuming that the weights have been presorted). Each of these algorithms maps a source message a_i with probability p to a codeword of length l (-log p ≤ l ≤ -log p + 1). Encoding and decoding times depend on the representation of the mapping. If the mapping is stored as a binary tree, then decoding the codeword for a_i involves following a path of length l in the tree. A table indexed by the source messages could be used for encoding; the code for a_i would be stored in position i of the table, and encoding time would be O(l). Connell's algorithm makes use of the index of the Huffman code, a representation of the distribution of codeword lengths, to encode and decode in O(c) time, where c is the number of different codeword lengths. Tanaka [1987] presents an implementation of Huffman coding based on finite-state machines that can be realized efficiently in either hardware or software.

As noted earlier, the redundancy bound for Shannon-Fano codes is 1 and the bound for the Huffman method is p_n + 0.086, where p_n is the probability of the least likely source message (so p_n is less than or equal to 1/2, and generally much less). It is important to note that in defining redundancy to be average codeword length minus entropy, the cost of transmitting the code mapping computed by these algorithms is ignored. The overhead cost for any method in which the source alphabet has not been established before transmission includes n log n bits for sending the n source letters. For a Shannon-Fano code, a list of codewords ordered so as to correspond to the source letters could be transmitted. The additional time required is then Σ l_i, where the l_i are the lengths of the codewords. For Huffman coding, an encoding of the shape of the code tree might be transmitted. Since any full binary tree may be a legal Huffman code tree, encoding tree shape may require as many as log 4^n = 2n bits. In most cases the message ensemble is very large, so that the number of bits of overhead is minute by comparison to the total length of the encoded transmission. However, it is imprudent to ignore this cost.

If a less-than-optimal code is acceptable, the overhead costs can be avoided through a prior agreement by sender and receiver as to the code mapping. Instead of a Huffman code based on the characteristics of the current message ensemble, the code used could be based on statistics for a class of transmissions to which the current ensemble is assumed to belong. That is, both sender and receiver could have access to a codebook with k mappings in it: one for Pascal source, one for English text, and so on. The sender would then simply alert the receiver as to which of the common codes he or she is using. This requires only log k bits of overhead. Assuming that classes of transmission with relatively stable characteristics could be identified, this hybrid approach would greatly reduce the redundancy due to overhead without significantly increasing expected codeword length. In addition, the cost of computing the mapping would be amortized over all files of a given class. That is, the mapping would be computed once on a statistically significant sample and then used on a great number of files for which the sample is representative. There is clearly a substantial risk associated with assumptions about file characteristics, and great care would be necessary in choosing both the sample from which the mapping is to be derived and the categories into which to partition transmissions. An extreme example of the risk associated with the codebook approach is provided by Ernest V. Wright, who wrote the novel Gadsby (1939) containing no occurrences of the letter E. Since E is the most commonly used letter in the English language, an encoding based on a sample from Gadsby would be disastrous if used with "normal" examples of English text. Similarly, the "normal" encoding would provide poor compression of Gadsby.

McIntyre and Pechura [1985] describe an experiment in which the codebook approach is compared to static Huffman coding. The sample used for comparison is a collection of 530 source programs in four languages. The codebook contains a Pascal code tree, a FORTRAN code tree, a COBOL code tree, a PL/I code tree, and an ALL code tree. The Pascal code tree is the result of applying the static Huffman algorithm to the combined character frequencies of all of the Pascal programs in the sample. The ALL code tree is based on the combined character frequencies for all of the programs. The experiment involves encoding each of the programs using the five codes in the codebook and the static Huffman algorithm. The data reported for each of the 530 programs consist of the size of the coded program for each of the five predetermined codes and the size of the coded program plus the size of the mapping (in table form) for the static Huffman method. In every case the code tree for the language class to which the program belongs generates the most compact encoding. Although using the Huffman algorithm on the program itself yields an optimal mapping, the overhead cost is greater than the added redundancy incurred by the less-than-optimal code. In many cases, the ALL code tree also generates a more compact encoding than the static Huffman algorithm. In the worst case an encoding constructed from the codebook is only 6.6% larger than that constructed by the Huffman algorithm. These results suggest that, for files of source code, the codebook approach may be appropriate.

Gilbert [1971] discusses the construction of Huffman codes based on inaccurate source probabilities. A simple solution to the problem of incomplete knowledge of the source is to avoid long codewords, thereby minimizing the error of badly underestimating the probability of a message. The problem becomes one of constructing the optimal binary tree subject to a height restriction (see [Knuth 1971; Hu and Tan 1972; Garey 1974]). Another approach involves collecting statistics for several sources and then constructing a code based on some combined criterion. This approach could be applied to the problem of designing a single code for use with English, French, German, and so on, sources. To accomplish this, Huffman's algorithm could be used to minimize either the average codeword length for the combined source probabilities or the average codeword length for English, subject to constraints on average codeword lengths for the other sources.

3.3 Universal Codes and Representations of the Integers

A code is universal if it maps source messages to codewords so that the resulting average codeword length is bounded by c_1 H + c_2. That is, given an arbitrary source with nonzero entropy, a universal code achieves average codeword length that is at most a constant times the optimal possible for that source. The potential compression offered by a universal code clearly depends on the magnitudes of the constants c_1 and c_2. We recall the definition of an asymptotically optimal code as one for which average codeword length approaches entropy and remark that a universal code with c_1 = 1 is asymptotically optimal.

An advantage of universal codes over Huffman codes is that it is not necessary to know the exact probabilities with which the source messages appear. Whereas Huffman coding is not applicable unless the probabilities are known, with universal coding it is sufficient to know the probability distribution only to the extent that the source messages can be ranked in probability order. By mapping messages in order of decreasing probability to codewords in order of increasing length, universality can be achieved. Another advantage to universal codes is that the codeword sets are fixed. It is not necessary to compute a codeword set based on the statistics of an ensemble; any universal codeword set will suffice as long as the source messages are ranked. The encoding and decoding processes are thus simplified. Although universal codes can be used instead of Huffman codes as general-purpose static schemes, the more common application is as an adjunct to a dynamic scheme. This type of application is demonstrated in Section 5.

Since the ranking of source messages is the essential parameter in universal coding, we may think of a universal code as representing an enumeration of the source messages or as representing the integers, which provide an enumeration. Elias [1975] defines a sequence of universal coding schemes that maps the set of positive integers onto the set of binary codewords.

The first Elias code is one that is simple but not optimal. This code, γ, maps an integer x onto the binary value of x prefaced by ⌊log x⌋ zeros. The binary value of x is expressed in as few bits as possible and therefore begins with a 1, which serves to delimit the prefix. The result is an instantaneously decodable code since the total length of a codeword is exactly one greater than twice the number of zeros in the prefix; therefore, as soon as the first 1 of a codeword is encountered, its length is known. The code is not a minimum redundancy code since the ratio of expected codeword length to entropy goes to 2 as entropy approaches infinity. The second code, δ, maps an integer x to a codeword consisting of γ(⌊log x⌋ + 1) followed by the binary value of x with the leading 1 deleted. The resulting codeword has length ⌊log x⌋ + 2⌊log(1 + ⌊log x⌋)⌋ + 1. This concept can be applied recursively to shorten the codeword lengths, but the benefits decrease rapidly. The code δ is asymptotically optimal since the limit of the ratio of expected codeword length to entropy is 1. Figure 9 lists the values of γ and δ for a sampling of the integers. Figure 10 shows an Elias code for string EXAMPLE. The number of bits transmitted using this mapping would be 161, which does not compare well with the 117 bits transmitted by the Huffman code of Figure 3. Huffman coding is optimal under the static mapping model. Even an asymptotically optimal universal code cannot compare with static Huffman coding on a source for which the probabilities of the messages are known.

N    γ(N)           δ(N)
1    1              1
2    010            0100
3    011            0101
4    00100          01100
5    00101          01101
6    00110          01110
7    00111          01111
8    0001000        00100000
16   000010000      001010000
17   000010001      001010001
32   00000100000    0011000000

Figure 9. Elias codes.

Source message   Frequency   Rank   Codeword
g                8           1      δ(1) = 1
f                7           2      δ(2) = 0100
e                6           3      δ(3) = 0101
d                5           4      δ(4) = 01100
space            5           5      δ(5) = 01101
c                4           6      δ(6) = 01110
b                3           7      δ(7) = 01111
a                2           8      δ(8) = 00100000

Figure 10. An Elias code for EXAMPLE (code length = 161).
Source Frequency Rank Codeword
1 1 message
2 010 a(1) = 1
3 011 S(2) = 0100
4 00100 e 6(3) = 0101
5 00101 d 6(4) = 01100
6 00110 space 6(5) = 01101
7 00111 a(6) = 01110
8 0001000 s 6(7) = 01111
16 000010000 a s(8) = 00100000
17 000010001
32 00000100000 Figure 10. An Elias code for EXAMPLE (code
length = 161).
Figure 9. Elias codes.

redundancy code since the ratio of expected on the Fibonacci numbers of order m ≥ 2,
codeword length to entropy goes to 2 as where the Fibonacci numbers of order 2 are
entropy approaches infinity. The second the standard Fibonacci numbers: 1, 1, 2, 3,
code, δ, maps an integer x to a codeword 5, 8, 13, . . . . In general, the Fibonacci num-
consisting of γ(⌊log x⌋ + 1) followed by bers of order m are defined by the recur-
the binary value of x with the leading 1 de- rence: Fibonacci numbers F_{-m+1} through F_0
leted. The resulting codeword has length are equal to 1; the kth number for k ≥ 1 is
⌊log x⌋ + 2⌊log(1 + ⌊log x⌋)⌋ + 1. This con- the sum of the preceding m numbers. We
cept can be applied recursively to shorten describe only the order-2 Fibonacci code;
the codeword lengths, but the benefits the extension to higher orders is straight-
decrease rapidly. The code δ is asymptoti- forward.
cally optimal since the limit of the ratio of Every nonnegative integer N has pre-
expected codeword length to entropy is 1. cisely one binary representation of the form
Figure 9 lists the values of γ and δ for a R(N) = Σ_{i=0}^{k} d_i F_i (where d_i ∈ {0, 1}, k ≤ N,
sampling of the integers. Figure 10 shows and the F_i are the order-2 Fibonacci num-
an Elias code for string EXAMPLE. The bers as defined above) such that there are
number of bits transmitted using this map- no adjacent ones in the representation. The
ping would be 161, which does not compare Fibonacci representations for a small
well with the 117 bits transmitted by the sampling of the integers are shown in
Huffman code of Figure 3. Huffman coding Figure 11, using the standard bit sequence
is optimal under the static mapping model. from high order to low. The bottom row of
Even an asymptotically optimal universal the figure gives the values of the bit posi-
code cannot compare with static Huffman tions. It is immediately obvious that this
coding on a source for which the probabil- Fibonacci representation does not consti-
ities of the messages are known. tute a prefix code. The order-2 Fibonacci
A second sequence of universal coding code for N is defined as F(N) = D 1, where
schemes, based on the Fibonacci numbers, D = d,,d,dz . -. dk (the di defined above).
is defined by Apostolico and Fraenkel That is, the Fibonacci representation is
[1985]. Although the Fibonacci codes are reversed and 1 is appended. The Fibonacci
not asymptotically optimal, they compare code values for a small subset of the inte-
well to the Elias codes as long as the num- gers are given in Figure 11. These binary
ber of source messages is not too large. codewords form a prefix code since every
Fibonacci codes have the additional attrib- codeword now terminates in two consecu-
ute of robustness, which manifests itself by tive ones, which cannot appear anywhere
the local containment of errors. This aspect else in a codeword.
of Fibonacci codes is discussed further in Fraenkel and Klein [1985] prove that the
Section 7. Fibonacci code of order 2 is universal, with
The sequence of Fibonacci codes de- c1 = 2 and c2 = 3. It is not asymptotically
scribed by Apostolico and Fraenkel is based optimal since c1 > 1. Fraenkel and Klein
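A minimal sketch of the order-2 Fibonacci code, written by us from the definitions above (it is not Apostolico and Fraenkel's code): the representation R(N) is found greedily, then reversed and terminated with a 1.

  #include <stdio.h>

  /* Order-2 Fibonacci code F(N) for N >= 1.  Compute the representation
     R(N) = sum d_i * F_i with no two adjacent ones, then emit the digits
     low-order first and append a final 1, producing the "11" suffix. */
  static void fib_encode(unsigned n, char *out) {
      unsigned long long fib[64];   /* distinct weights 1, 2, 3, 5, 8, ...
                                       (cf. the bottom row of Figure 11)  */
      char digits[64];
      int k = 0, i, pos = 0;

      fib[0] = 1; fib[1] = 2;
      while (fib[k + 1] <= n) { fib[k + 2] = fib[k + 1] + fib[k]; k++; }
      /* k now indexes the largest Fibonacci weight not exceeding n */
      for (i = k; i >= 0; i--) {    /* greedy choice gives no adjacent ones */
          if (fib[i] <= n) { digits[i] = 1; n -= (unsigned)fib[i]; }
          else digits[i] = 0;
      }
      for (i = 0; i <= k; i++) out[pos++] = digits[i] ? '1' : '0';
      out[pos++] = '1';
      out[pos] = '\0';
  }

  int main(void) {
      unsigned samples[] = {1, 2, 3, 4, 5, 8, 16, 32};
      char buf[80];
      size_t i;
      for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
          fib_encode(samples[i], buf);
          printf("F(%u) = %s\n", samples[i], buf);  /* F(8) = 000011, F(32) = 00101011 */
      }
      return 0;
  }
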

N     R(N)        F(N)
1     1           11
2     10          011
3     100         0011
4     101         1011
5     1000        00011
6     1001        10011
7     1010        01011
8     10000       000011
16    100100      0010011
32    1010100     00101011
(bit position values)  21 13 8 5 3 2 1

Figure 11. Fibonacci representations and Fibonacci codes.

Source message   Frequency   Rank   Codeword
g                8           1      F(1) = 11
f                7           2      F(2) = 011
e                6           3      F(3) = 0011
d                5           4      F(4) = 1011
space            5           5      F(5) = 00011
c                4           6      F(6) = 10011
b                3           7      F(7) = 01011
a                2           8      F(8) = 000011

Figure 12. A Fibonacci code for EXAMPLE (code length = 153).

Source message   Probability   Cumulative probability   Range
A                .2            .2                        [0, .2)
B                .4            .6                        [.2, .6)
C                .1            .7                        [.6, .7)
D                .2            .9                        [.7, .9)
#                .1            1.0                       [.9, 1.0)

Figure 13. The arithmetic coding model.

3.4 Arithmetic Coding

The method of arithmetic coding was suggested by Elias and presented by Abramson [1963] in his text on information theory. Implementations of Elias' technique were developed by Rissanen [1976], Pasco [1976], Rubin [1979], and, most recently, Witten et al. [1987]. We present the concept of arithmetic coding first and follow with a discussion of implementation details and performance.
In arithmetic coding a source ensemble is represented by an interval between 0 and 1 on the real number line. Each symbol of the ensemble narrows this interval. As the interval becomes smaller, the number of bits needed to specify it grows. Arithmetic coding assumes an explicit probabilistic model of the source. It is a defined-word scheme that uses the probabilities of the source messages to
also show that Fibonacci codes of higher successively narrow the interval used to
order compress better than the order-2 code represent the ensemble. A high-probability
if the source language is large enough message narrows the interval less than a
(i.e., the number of distinct source mes- low-probability message, so that high-
sages is large) and the probability distri- probability messages contribute fewer bits
bution is nearly uniform. However, no to the coded message. The method begins
Fibonacci code is asymptotically optimal. with an unordered list of source messages
The Elias codeword δ(N) is asymptotically and their probabilities. The number line is
shorter than any Fibonacci codeword for partitioned into subintervals on the basis
N, but the integers in a very large initial of cumulative probabilities.
range have shorter Fibonacci codewords. A small example is used to illustrate the
For m = 2, for example, the transition point idea of arithmetic coding. Given source
is N = 514,228 [Apostolico and Fraenkel messages (A, B, C, D, #) with probabilities
1985]. Thus, a Fibonacci code provides bet- {.2, .4, .1, .2, .1}, Figure 13 demonstrates the
ter compression than the Elias code until initial partitioning of the number line. The
the size of the source language becomes symbol A corresponds to the first 1/5 of the
very large. Figure 12 shows a Fibonacci interval [0, 1); B the next 2/5; D the subin-
code for string EXAMPLE. The number of terval of size 1/5, which begins 70% of the
bits transmitted using this mapping would way from the left endpoint to the right.
be 153, which is an improvement over the When encoding begins, the source ensem-
Elias code of Figure 10, but still compares ble is represented by the entire interval
poorly with the Huffman code of Figure 3. [0, 1). For the ensemble AADB#, the first

A reduces the interval to [0, .2) and the In order to recover the original ensemble,
second A to [0, .04) (the first 1/5 of the the decoder must know the model of the
previous interval). The D further narrows source used by the encoder (e.g., the source
the interval to [.028, .036) (1/5 of the previous messages and associated ranges) and a sin-
size, beginning 70% of the distance from gle number within the interval determined
left to right). The B narrows the interval by the encoder. Decoding consists of a se-
to [.0296, .0328), and the # yields a final ries of comparisons of the number i to the
interval of [.03248, .0328). The interval, or ranges representing the source messages.
alternatively any number i within the in- For this example, i might be .0325 (.03248,
terval, may now be used to represent the .0326, or .0327 would all do just as well).
source ensemble. The decoder uses i to simulate the actions
Two equations may be used to define the of the encoder. Since i lies between 0 and
narrowing process described above: .2, the decoder deduces that the first letter
was A (since the range [0, .2) corresponds
newleft = prevleft + msgleft * prevsize (1) to source message A). This narrows the
newsize = prevsize * msgsize (2) interval to [0, .2). The decoder can now
deduce that the next message will further
Equation (1) states that the left endpoint narrow the interval in one of the following
of the new interval is calculated from the ways: to [0, .04) for A, to [.04, .12) for 23, to
previous interval and the current source [.l2, .l4) for C, to [.14, .18) for D, or to
message. The left endpoint of the range [.18, .2) for #. Since i falls into the interval
associated with the current message speci- [0, .04), the decoder knows that the second
fies what percent of the previous interval message is again A. This process continues
to remove from the left in order to form the until the entire ensemble has been re-
new interval. For D in the above example, covered.
the new left endpoint is moved over by Several difliculties become evident when
.7 X .04 (70% of the size of the previous implementation of arithmetic coding is
interval). Equation (2) computes the size of attempted. The first is that the decoder
the new interval from the previous interval needs some way of knowing when to stop.
size and the probability of the current mes- As evidence of this, the number 0 could
sage (which is equivalent to the size of its represent any of the source ensembles A,
associated range). Thus, the size of the AA, AAA, and so forth. Two solutions to
interval determined by D is .04 x .2, and this problem have been suggested. One is
the right endpoint is .028 + .008 = .036 that the encoder transmit the size of the
(left endpoint + size). ensemble as part of the description of
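The narrowing process of eqs. (1) and (2) can be traced with a few lines of C. The sketch below is ours, uses double-precision arithmetic purely for illustration (ignoring the precision issues discussed later), and reproduces the AADB# example over the model of Figure 13.

  #include <stdio.h>
  #include <string.h>

  /* The model of Figure 13: left endpoint and size of each message's range. */
  static const char   *msgs[]  = { "A",  "B",  "C",  "D",  "#" };
  static const double  lefts[] = { 0.0,  0.2,  0.6,  0.7,  0.9 };
  static const double  sizes[] = { 0.2,  0.4,  0.1,  0.2,  0.1 };

  int main(void) {
      const char *ensemble[] = { "A", "A", "D", "B", "#" };
      double left = 0.0, size = 1.0;      /* start with the whole interval [0, 1) */
      size_t i, j;

      for (i = 0; i < 5; i++) {
          for (j = 0; j < 5; j++)
              if (strcmp(ensemble[i], msgs[j]) == 0) break;
          left = left + lefts[j] * size;  /* eq. (1): newleft = prevleft + msgleft * prevsize */
          size = size * sizes[j];         /* eq. (2): newsize = prevsize * msgsize           */
          printf("%s -> [%.5f, %.5f)\n", ensemble[i], left, left + size);
      }
      /* The final line printed is [0.03248, 0.03280), matching the text. */
      return 0;
  }
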
The size of the final subinterval deter- the model. Another is that a special symbol
mines the number of bits needed to specify be included in the model for the purpose of
a number in that range. The number of bits signaling end of message. The # in the
needed to specify a subinterval of [0, 1) of above example serves this purpose. The
size s is −log s. Since the size of the final second alternative is preferable for several
subinterval is the product of the probabili- reasons. First, sending the size of the en-
ties of the source messages in the ensemble semble requires a two-pass process and
(i.e., s = Π_{i=1}^{N} p(source message i), where precludes the use of arithmetic coding as
N is the length of the ensemble), we have part of a hybrid codebook scheme (see
−log s = −Σ_{i=1}^{N} log p(source message i) = Sections 1.2 and 3.2). Second, adaptive
−Σ_{i=1}^{n} p(a_i) log p(a_i), where n is the number methods of arithmetic coding are easily
of unique source messages a_1, a_2, . . . , a_n. developed, and a first pass to determine
Thus, the number of bits generated by the ensemble size is inappropriate in an on-
arithmetic coding technique is exactly line adaptive scheme.
equal to entropy H. This demonstrates A second issue left unresolved by the
the fact that arithmetic coding achieves fundamental concept of arithmetic coding
compression that is almost exactly that is that of incremental transmission and
predicted by the entropy of the source. reception. It appears from the above

Source message   Probability   Cumulative probability   Range
a                .05           .05                       [0, .05)
b                .075          .125                      [.05, .125)
c                .1            .225                      [.125, .225)
d                .125          .35                       [.225, .35)
e                .15           .5                        [.35, .5)
f                .175          .675                      [.5, .675)
g                .2            .875                      [.675, .875)
space            .125          1.0                       [.875, 1.0)

Figure 14. The arithmetic coding model of EXAMPLE.

discussion that the encoding algorithm transmits nothing until the final interval is determined. However, this delay is not necessary. As the interval narrows, the leading bits of the left and right endpoints become the same. Any leading bits that are the same may be transmitted immediately, since they will not be affected by further narrowing. A third issue is that of precision. From the description of arithmetic coding it appears that the precision required grows without bound as the length of the ensem-
ble grows. Witten et al. [1987] and Rubin
[1979] address this issue. Fixed precision any static or adaptive method for comput-
registers may be used as long as underflow ing the probabilities (or frequencies) of the
and overflow are detected and managed. source messages. Witten et al. [1987] im-
The degree of compression achieved by an plement an adaptive model similar to the
implementation of arithmetic coding is not techniques described in Section 4. The
exactly H, as implied by the concept of performance of this implementation is
arithmetic coding. Both the use of a mes- discussed in Section 6.
sage terminator and the use of fixed-length
arithmetic reduce coding effectiveness. 4. ADAPTIVE HUFFMAN CODING
However, it is clear that an end-of-message
symbol will not have a significant effect on Adaptive Huffman coding was first con-
a large source ensemble. Witten et al. ceived independently by Faller [1973] and
[1987] approximate the overhead due to Gallager [ 19781. Knuth [ 19851 contributed
the use of fixed precision at lo-* bits per improvements to the original algorithm,
source message, which is also negligible. and the resulting algorithm is referred to
The arithmetic coding model for ensem- as algorithm FGK. A more recent version
ble EXAMPLE is given in Figure 14. The of adaptive Huffman coding is described by
final interval size is p(a)^2 × p(b)^3 × p(c)^4 × techniques described in Section 4. The
p(d)^5 × p(e)^6 × p(f)^7 × p(g)^8 × p(space)^5. performance of this implementation is
The number of bits needed to specify discussed in Section 6.
a value in the interval is −log(1.44 ×
10^−35) = 115.7. So excluding overhead, ar-
ithmetic coding transmits EXAMPLE in
116 bits, one less bit than static Huffman mal for the current estimates. In this way,
coding. the adaptive Huffman codes respond to
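The 115.7-bit figure can be checked directly; the small sketch below (ours) multiplies out the final interval size for EXAMPLE under the model of Figure 14 and takes −log2 of the result.

  #include <stdio.h>
  #include <math.h>

  int main(void) {
      /* Probabilities and occurrence counts of the EXAMPLE messages (Figure 14):
         a, b, c, d, e, f, g, space. */
      double p[] = { .05, .075, .1, .125, .15, .175, .2, .125 };
      int    c[] = {   2,    3,   4,    5,    6,    7,   8,    5 };
      double s = 1.0;
      int i;
      for (i = 0; i < 8; i++) s *= pow(p[i], c[i]);
      printf("final interval size s = %.3g\n", s);   /* about 1.44e-35 */
      printf("-log2(s) = %.1f bits\n", -log2(s));    /* about 115.7    */
      return 0;
  }
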
Witten et al. [1987] provide an imple- locality. In essence, the encoder is “learn-
mentation of arithmetic coding, written in ing” the characteristics of the source. The
C, which separates the model of the source decoder must learn along with the encoder
from the coding process (where the coding by continually updating the Huffman tree
process is defined by eqs. (1) and (2)). The so as to stay in synchronization with the
model is in a separate program module and encoder.
is consulted by the encoder and the decoder Another advantage of these systems is
at every step in the processing. The fact that they require only one pass over the
that the model can be separated so easily data. Of course, one-pass methods are not
renders the classification static/adaptive very interesting if the number of bits they
irrelevant for this technique. Indeed, the transmit is significantly greater than that
fact that the coding method provides of the two-pass scheme. Interestingly, the
compression efficiency nearly equal to the performance of these methods, in terms of
entropy of the source under any model al- number of bits transmitted, can be better
lows arithmetic coding to be coupled with than that of static Huffman coding. This

does not contradict the optimality of the that is the new O-node. Again, the tree is
static method, since the static method is recomputed. In this case, the code for the
optimal only over all methods that assume O-node is sent; in addition, the receiver
a time-invariant mapping. The perform- must be told which of the n - k unused
ance of the adaptive methods can also be messageshas appeared. In Figure 15, a sim-
worse than that of the static method. Upper ple example is given. At each node a count
bounds on the redundancy of these meth- of occurrences of the corresponding mes-
ods are presented in this section. As dis- sage is stored. Nodes are numbered indi-
cussed in the Introduction, the adaptive cating their position in the sibling property
method of Faller [1973], Gallager [1978], ordering. The updating of the tree can be
and Knuth [ 19851is the basis for the UNIX done in a single traversal from the a,+, node
utility compact. The performance of com- to the root. This traversal must increment
pact is quite good, providing typical the count for the a,+l node and for each of
compression factors of 30-40%. its ancestors. Nodes may be exchanged to
maintain the sibling property, but all of
4.1 Algorithm FGK
these exchanges involve a node on the path
from a,+l to the root. Figure 16 shows the
The basis for algorithm FGK is the sibling final code tree formed by this process on
property, defined by Gallager [1978]: A the ensemble EXAMPLE.
binary code tree has the sibling property Disregarding overhead, the number of
if each node (except the root) has a sibling bits transmitted by algorithm FGK for the
and if the nodes can be listed in order of EXAMPLE is 129. The static Huffman
nonincreasing weight with each node adja- algorithm would transmit 117 bits in pro-
cent to its sibling. Gallager proves that a cessing the same data. The overhead asso-
binary prefix code is a Huffman code if and ciated with the adaptive method is actually
only if the code tree has the sibling prop- less than that of the static algorithm. In
erty. In algorithm FGK, both sender and the adaptive case the only overhead is the
receiver maintain dynamically changing n log n bits needed to represent each of the
Huffman code trees. The leaves of the n different source messageswhen they ap-
code tree represent the source messages, pear for the first time. (This is in fact
and the weights of the leaves represent conservative; instead of transmitting a
frequency counts for the messages.At any unique code for each of the n source mes-
particular time, k of the n possible source pages, the sender could transmit the
messages have occurred in the message message’s position in the list of remaining
ensemble. messagesand save a few bits in the average
Initially, the code tree consists of a single case.) In the static case, the source mes-
leaf node, called the O-node. The O-node is sages need to be sent as does the shape of
a special node used to represent the n - k the code tree. As discussed in Section 3.2,
unused messages.For each message trans- an efficient representation of the tree shape
mitted, both parties must increment the requires 2n bits. Algorithm FGK com-
corresponding weight and recompute the pares well with static Huffman coding on
code tree to maintain the sibling property. this ensemble when overhead is taken
At the point in time when t messageshave into account. Figure 17 illustrates an
been transmitted, k of them distinct, and example on which algorithm FGK per-
k < n, the tree is a legal Huffman code forms better than static Huffman coding
tree with k + 1 leaves, one for each of even without taking overhead into account.
the k messages and one for the O-node. If Algorithm FGK transmits 47 bits for this
the (t + 1)st messageis one of the k already ensemble, whereas the static Huffman code
seen, the algorithm transmits at+l’s current requires 53.
code, increments the appropriate counter, Vitter [1987] has proved that the total
and recomputes the tree. If an unused mes- number of bits transmitted by algorithm
sage occurs, the O-node is split to create a FGK for a message ensemble of length t
pair of leaves, one for a,+l, and a sibling containing n distinct messages is bounded


space space
(4 (b)

(4
Figure 15. Algorithm FGK processing the ensemble EXAMPLE--(a) Tree
after processing LLaa bb”; 11 will be transmitted for the next b. (b) After
encoding the third b; 101 will be transmitted for the next spuce; the form of
the tree will not change-only the frequency counts will be updated; 100 will
be transmitted for the first c. (c) Tree after update following first c.
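Gallager's sibling property can be checked mechanically. The sketch below is ours and is not part of any cited FGK implementation; it assumes one convenient node numbering (nondecreasing weight, root last, with nodes 2k-1 and 2k forming each sibling pair) and verifies the property for a small Huffman tree.

  #include <stdio.h>

  /* A code tree with t = 2n - 1 nodes numbered 1..t.  weight[i-1] is the
     weight of node i and parent[i-1] is the node number of its parent
     (0 for the root).  The sibling property holds if the weights never
     decrease in the numbering and each pair (2k-1, 2k) shares a parent. */
  static int has_sibling_property(const int weight[], const int parent[], int t) {
      int i;
      for (i = 1; i < t; i++)
          if (weight[i] < weight[i - 1]) return 0;       /* weights must not decrease */
      for (i = 0; i + 1 < t; i += 2)
          if (parent[i] != parent[i + 1]) return 0;      /* nodes 2k-1 and 2k are siblings */
      return 1;
  }

  int main(void) {
      /* Leaves a(1) and b(1) under an internal node of weight 2, plus leaf
         c(2); the root has weight 4.  Node order: a, b, c, internal, root. */
      int weight[] = { 1, 1, 2, 2, 4 };
      int parent[] = { 4, 4, 5, 5, 0 };
      printf("sibling property: %s\n",
             has_sibling_property(weight, parent, 5) ? "yes" : "no");
      return 0;
  }
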

below by S - n + 1, where S is the perform- 4.2 Algorithm V


ance of the static method, and bounded
above by 2S + t − 4n + 2. So the perform- The adaptive Huffman algorithm of Vitter
ance of algorithm FGK is never much worse (algorithm V) incorporates two improve-
than twice optimal. Knuth [1985] provides ments over algorithm FGK. First, the num-
a complete implementation of algorithm ber of interchanges in which a node is
FGK and a proof that the time required for moved upward in the tree during a recom-
each encoding or decoding operation is putation is limited to one. This number is
O(l), where l is the current length of the bounded in algorithm FGK only by l/2,
codeword. It should be noted that since the where l is the length of the codeword for
mapping is defined dynamically, (during a,+l when the recomputation begins. Sec-
transmission) the encoding and decoding ond, Vitter’s method minimizes the values
algorithms stand alone; there is no addi- of z li and max(li] subject to the require-
tional algorithm to determine the mapping of Σ l_i and max{l_i} subject to the require-
as in static methods. ment of minimizing Σ w_i l_i. The intuitive


Figure 16. Tree formed by algorithm FGK for ensemble EXAMPLE.

Figure 17. Tree formed by algorithm FGK for ensemble "e eae de eabe eae dcf".

over algorithm FGK is as follows: As in rately represent the symbol probabilities


algorithm FGK, the code tree constructed over the entire message. Therefore, the fact
by algorithm V is the Huffman code tree that algorithm V guarantees a tree of min-
for the prefix of the ensemble seen so far. imum height (height = max{l_i}) and mini-
The adaptive methods do not assume that mum external path length (Σ l_i) implies
the relative frequencies of a prefix accu- that it is better suited for coding the next


Figure 18. FGK tree with nonlevel order numbering.

message of the ensemble, given that any of


the leaves of the tree may represent that
next message.
These improvements are accomplished
through the use of a new system for num-
bering nodes. The numbering, called an
implicit numbering, corresponds to a level
ordering of the nodes (from bottom to top
and left to right). Figure 18 illustrates that
the numbering of algorithm FGK is not
always a level ordering. The following in-
variant is maintained in Vitter's algorithm:
For each weight w, all leaves of weight w
precede (in the implicit numbering) all in- Figure 19. Algorithm V processing the ensemble "aa
ternal nodes of weight w. Vitter [1987] bbb c".
proves that this invariant enforces the de-
sired bound on node promotions. The in-
variant also implements bottom merging, by increasing weight with the convention
as discussed in Section 3.2, to minimize that a leaf block always precedes an inter-
Σ l_i and max{l_i}. The difference between that a leaf block always precedes an inter-
Vitter’s method and algorithm FGK is in exchange of nodes is required to maintain
the way the tree is updated between the sibling property, algorithm V requires
transmissions. In order to understand the that the node being promoted be moved to
revised update operation, the following the position currently occupied by the high-
definition of a block of nodes is necessary: est numbered node in the target block.
Blocks are equivalence classes of nodes de- In Figure 19, the Vitter tree correspond-
fined by u = v iff weight(u) = weight(v), ing to Figure 15~ is shown. This is the
and u and v are either both leaves or both first point in EXAMPLE at which algo-
internal nodes. The leader of a block is the rithm FGK and algorithm V differ signifi-
highest numbered (in the implicit number- cantly. At this point, the Vitter tree has
ing) node in the block. Blocks are ordered height 3 and external path length 12,


Figure 20. Tree formed by algorithm V for the ensemble of Figure 17.

whereas the FGK tree has height 4 and nodes with equal weights could be deter-
external path length 14. Algorithm V trans- mined on the basis of recency. Another
mits codeword 001 for the second c; FGK reasonable assumption about adaptive cod-
transmits 1101. This demonstrates the in- ing is that the weights in the current tree
tuition given earlier that algorithm V is correspond closely to the probabilities as-
better suited for coding the next message. sociated with the source. This assumption
The Vitter tree corresponding to Figure 16, becomes more reasonable as the length of
representing the final tree produced in the ensemble increases. Under this assump-
processing EXAMPLE, is only different tion, the expected cost of transmitting the
from Figure 16 in that the internal node of next letter is Σ p_i l_i ≈ Σ w_i l_i, so that neither
weight 5 is to the right of both leaf nodes algorithm FGK nor algorithm V has any
of weight 5. Algorithm V transmits 124 bits advantage.
in processing EXAMPLE, as compared Vitter [1987] proves that the perform-
with the 129 bits of algorithm FGK and ance of his algorithm is bounded by S - n
117 bits of static Huffman coding. It should + 1 from below and S + t - 2n + 1 from
be noted that these figures do not include above. At worst then, Vitter’s adaptive
overhead and, as a result, disadvantage the method may transmit one more bit per
adaptive methods. codeword than the static Huffman method.
Figure 20 illustrates the tree built by The improvements made by Vitter do not
Vitter’s method for the ensemble of Fig- change the complexity of the algorithm;
ure 17. Both Σ l_i and max{l_i} are smaller in algorithm V encodes and decodes in O(l)
the tree of Figure 20. The number of bits time, as does algorithm FGK.
transmitted during the processing of the
sequence is 47, the same used by algorithm 5. OTHER ADAPTIVE METHODS
FGK. However, if the transmission contin-
ues with d, b, c, f, or an unused letter, the Two more adaptive data compression
cost of algorithm V will be less than that methods, algorithm BSTW and Lempel-
of algorithm FGK. This again illustrates Ziv coding, are discussed in this section.
the benefit of minimizing the external path Like the adaptive Huffman coding tech-
length (Σ l_i) and the height (max{l_i}). codeword than the static Huffman method.
It should be noted again that the strategy pass to analyze the characteristics of the
of minimizing external path length and source. Thus they provide coding and
height is optimal under the assumption transmission in real time. However, these
that any source letter is equally likely to schemes diverge from the fundamental
occur next. Other reasonable strategies in- Huffman coding approach to a greater
clude one that assumes locality. To take degree than the methods discussed in Sec-
advantage of locality, the ordering of tree tion 4. Algorithm BSTW is a defined-word

scheme that attempts to exploit locality. Symbol string Code
Lempel-Ziv coding is a free-parse method; A 1
that is, the words of the source alphabet T 2
are defined dynamically as the encoding is AN 3
performed. Lempel-Ziv coding is the basis TH 4
for the UNIX utility compress. Algorithm THE 5
BSTW is a variable-variable scheme, AND 6
whereas Lempel-Ziv coding is variable- AD 7
b 8
block.
bb 9
bbb 10
5.1 Lempel-Ziv Codes 0 11
00 12
Lempel-Ziv coding represents a departure 000 13
from the classic view of a code as a mapping 0000 14
from a fixed set of source messages (letters, Z 15
symbols, or words) to a fixed set of code-
words. We coin the term free-parse to char- ### 4095
acterize this type of code, in which the set Figure 21. A Lempel-Ziv code table.
of source messages and the codewords to
which they are mapped are defined as the
algorithm executes. Whereas all adaptive in long strings. Effective compression is
methods create a set of codewords dynam- achieved when a long string is replaced by
ically, defined-word schemes have a fixed a single 12-bit code.
set of source messages, defined by context The Lempel-Ziv method is an incremen-
(e.g., in text file processing the source mes- tal parsing strategy in which the coding
sages might be single letters; in Pascal process is interlaced with a learning process
source file processing the source messages for varying source characteristics [Ziv and
might be tokens). Lempel-Ziv coding de- Lempel19771. In Figure 21, run-length en-
fines the set of source messages as it parses coding of zeros and blanks is being learned.
the ensemble. The Lempel-Ziv algorithm parses the
The Lempel-Ziv algorithm consists of a source ensemble into a collection of seg-
rule for parsing strings of symbols from a ments of gradually increasing length. At
finite alphabet into substrings or words each encoding step, the longest prefix of
whose lengths do not exceed a prescribed the remaining source ensemble that
integer L1, and a coding scheme that maps matches an existing table entry (α) is
these substrings sequentially into uniquely parsed off, along with the character (c)
decipherable codewords of fixed length L2 following this prefix in the ensemble. The
[Ziv and Lempel 1977]. The strings are new source message, αc, is added to the
selected so that they have very nearly equal code table. The new table entry is coded as
probability of occurrence. As a result, fre- (i, c), where i is the codeword for the exist-
quently occurring symbols are grouped ing table entry and c is the appended
into longer strings, whereas infrequent character. For example, the ensemble
symbols appear in short strings. This strat- 010100010 is parsed into (0, 1, 01, 00, 010)
egy is effective at exploiting redundancy and is coded as ((0, 0), (0, 1), (1, 1), (1, 0),
due to symbol frequency, character repeti- (3, 0)). The table built for the message
tion, and high-usage patterns. Figure 21 ensemble EXAMPLE is shown in Fig-
shows a small Lempel-Ziv code table. Low- ure 22. The coded ensemble has the form
frequency letters such as Z are assigned ((0, a), (1, space), (0, b), (3, b), to, space),
individually to fixed-length codewords (in (0, 4, (6, 4, (6, space), (0, d), (9, d),
this case, 12-bit binary numbers repre- (10, w=e), to, 4, (12, 4, (13, 4, (5, f),
sented in base 10 for readability). (0, f), (1% f), (17, f), (0, g), (1% g), (20, g),
Frequently occurring symbols, such as (20)). The string table is represented in a
blank (represented by b) and zero, appear more efficient manner than in Figure 21;

Message    Codeword
a          1
1space     2
b          3
3b         4
space      5
c          6
6c         7
6space     8
d          9
9d         10
10space    11
e          12
12e        13
13e        14
5f         15
f          16
16f        17
17f        18
g          19
19g        20
20g        21

Figure 22. Lempel-Ziv table for the message ensemble EXAMPLE (code length = 173).

good performance briefly, and fail to make any gains once the table is full and messages can no longer be added. If the ensemble's characteristics vary over time, the method may be "stuck with" the behavior it has learned and may be unable to continue to adapt.
Lempel-Ziv coding is asymptotically optimal, meaning that the redundancy approaches zero as the length of the source ensemble tends to infinity. However, for particular finite sequences, the compression achieved may be far from optimal [Storer and Szymanski 1982]. When the method begins, each source symbol is coded individually. In the case of 6- or 8-bit source symbols and 12-bit codewords, the method yields as much as 50% expansion during initial encoding. This initial inefficiency can be mitigated somewhat by initializing the string table to contain all of the source characters. Implementation issues are particularly important in Lempel-Ziv meth-
ods. A straightforward implementation
takes O(n^2) time to process a string of n
the string is represented by its prefix code- symbols; for each encoding operation, the
word followed by the extension character, existing table must be scanned for the long-
so that the table entries have fixed length. est message occurring as a prefix of the
The Lempel-Ziv strategy is simple, but remaining ensemble. Rodeh et al. [1981]
greedy. It simply parses off the longest rec- address the issue of computational com-
ognized string each time instead of search- plexity by defining a linear implementation
ing for the best way to parse the ensemble. of Lempel-Ziv coding based on suffix trees.
The Lempel-Ziv method specifies fixed- The Rodeh et al. scheme is asymptotically
length codewords. The size of the table and optimal, but an input must be very long in
the maximum source message length are order to allow efficient compression, and
determined by the length of the codewords. the memory requirements of the scheme
It should be clear from the definition of the are large, O(n) where n is the length of the
algorithm that Lempel-Ziv codes tend to source ensemble. It should also be men-
be quite inefficient during the initial por- tioned that the method of Rodeh et al.
tion of the message ensemble. For example, constructs a variable-variable code; the
even if we assume 3-bit codewords for char- pair (i, c) is coded using a representation
acters a-g and space and 5-bit codewords plexity by defining a linear implementation
for table indexes, the Lempel-Ziv algo- i and for c (a letter c can always be coded
rithm transmits 173 bits for ensemble as the kth member of the source alphabet
EXAMPLE. This compares poorly with for some k).
the other methods discussed in this survey. The other major implementation consid-
The ensemble must be sufficiently long for eration involves the way in which the string
the procedure to build up enough symbol table is stored and accessed. Welch [1984]
frequency experience to achieve good suggests that the table be indexed by the
compression over the full ensemble. codewords (integers 1 . . . 2L, where L is the
If the codeword length is not sufficiently maximum codeword length) and that the
large, Lempel-Ziv codes may also rise table entries be fixed-length, (codeword-
slowly to reasonable efficiency, maintain extension) character pairs. Hashing is

proposed to assist in encoding. Decoding and Wei [1986]. This method, algorithm
becomes a recursive operation in which the BSTW, possessesthe advantage that it re-
codeword yields the final character of the quires only one pass over the data to be
substring and another codeword. The de- transmitted and yet has performance that
coder must continue to consult the table compares well to that of the static two-pass
until the retrieved codeword is 0. Unfortu- method along the dimension of number of
nately, this strategy peels off extension bits per word transmitted. This number of
characters in reverse order, and some type bits is never much larger than the number
of stack operation must be used to reorder of bits transmitted by static Huffman cod-
the source. ing (in fact, it is usually quite close) and
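The reverse-order recovery and the stack it requires can be illustrated in a few lines; the sketch below is ours, with a tiny hand-built table that is not taken from any figure in the text.

  #include <stdio.h>

  /* Table entries as (prefix codeword, extension character); entry 0 is the
     empty string.  The entries below spell out "a", "ab", "abc" purely for
     illustration. */
  static const int  prefix_of[] = { 0, 0, 1, 2 };
  static const char ext_of[]    = { 0, 'a', 'b', 'c' };

  /* Recover the substring for a codeword: each lookup yields the final
     character and another codeword, so the characters come out in reverse
     order and are reordered with a small stack. */
  static void emit_string(int code) {
      char stack[64];
      int top = 0;
      while (code != 0) {               /* consult the table until codeword 0 */
          stack[top++] = ext_of[code];
          code = prefix_of[code];
      }
      while (top > 0) putchar(stack[--top]);
  }

  int main(void) {
      emit_string(3);                   /* prints "abc" */
      putchar('\n');
      return 0;
  }
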
Storer and Szymanski [1982] present a can be significantly better. Algorithm
general model for data compression that BSTW incorporates the additional benefit
encompasses Lempel-Ziv coding. Their of taking advantage of locality of reference,
broad theoretical work compares classes of the tendency for words to occur frequently
macro schemes, where macro schemes in- for short periods of time and then fall into
clude all methods that factor out duplicate long periods of disuse. The algorithm uses
occurrences of data and replace them by a self-organizing list as an auxiliary data
references either to the source ensemble structure and shorter encodings for words
or to a code table. They also contribute near the front of this list. There are many
a linear-time Lempel-Ziv-like algorithm strategies for maintaining self-organizing
with better performance than the standard lists [Hester and Hirschberg 19851; algo-
Lempel-Ziv method. rithm BSTW uses move to front.
Rissanen [ 19831 extends the Lempel-Ziv A simple example serves to outline the
incremental parsing approach. Abandoning method of algorithm BSTW. As in other
the requirement that the substrings parti- adaptive schemes, sender and receiver
tion the ensemble, the Rissanen method maintain identical representations of the
gathers “contexts” in which each symbol of code, in this case message lists that are
the string occurs. The contexts are sub- updated at each transmission using the
strings of the previously encoded string (as move-to-front heuristic. These lists are in-
in Lempel-Ziv), have varying size, and are itially empty. When messagea, is transmit-
in general overlapping. The Rissanen ted, if a, is on the sender’s list, the sender
method hinges on the identification of a transmits its current position. The sender
design parameter capturing the concept of then updates his or her list by moving at to
“relevant” contexts. The problem of finding position 1 and shifting each of the other
the best parameter is undecidable, and Ris- messages down one position. The receiver
sanen suggests estimating the parameter similarly alters his or her word list. If a, is
experimentally, being transmitted for the first time, then k
As mentioned earlier, Lempel-Ziv coding + 1 is the “position” transmitted, where k
is the basis for the UNIX utility compress is the number of distinct messages trans-
and is one of the methods commonly used mitted so far. Some representation of the
in file archival programs. The archival sys- message itself must be transmitted as well,
tem PKARC uses Welch’s implementation, but just this first time. Again, a, is moved
as does compress. The compression pro- to position 1 by both sender and receiver
vided by compress is generally much better subsequent to its transmission. For the en-
than that achieved by compact (the UNIX semble “abcadeabfd “, the transmission
utility based on algorithm FGK) and takes would be 1 a 2 b 3 c 3 4 d 5 e 3 5 6 f 5 (for
less time to compute [UNIX 1984]. Typical ease of presentation, list positions are rep-
compression values attained by compress resented in base ten).
are in the range of 50-60%. As the example shows, algorithm BSTW
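The move-to-front bookkeeping is compact enough to show in full. The C sketch below is ours (an illustration of the scheme just described, not the Bentley et al. implementation); it reproduces the transmission for the ensemble "abcadeabfd", printing list positions in base ten and new messages as "k+1 letter".

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *ensemble = "abcadeabfd";
      char list[32];                       /* list[0] is the front of the list */
      int  k = 0;                          /* number of distinct messages so far */
      size_t i;
      int j;

      for (i = 0; i < strlen(ensemble); i++) {
          char c = ensemble[i];
          for (j = 0; j < k && list[j] != c; j++)
              ;
          if (j == k) {                    /* first occurrence: send k+1 and the message */
              printf("%d %c ", k + 1, c);
              k++;
          } else {
              printf("%d ", j + 1);        /* send the current list position */
          }
          memmove(&list[1], &list[0], (size_t)j);   /* move-to-front update */
          list[0] = c;
      }
      printf("\n");       /* prints: 1 a 2 b 3 c 3 4 d 5 e 3 5 6 f 5 */
      return 0;
  }
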
transmits each source message once; the
5.2 Algorithm BSTW
rest of its transmission consists of encod-
ings of list positions. Therefore, an
The most recent of the algorithms surveyed essential feature of algorithm BSTW is a
here is due to Bentley, Sleator, Tarjan, reasonable scheme for representation of
the integers. The methods discussed by An implementation of algorithm BSTW
Bentley et al. are the Elias codes presented is described in great detail in Bentley et al.
in Section 3.3. The simple scheme, code γ, [1986]. In this implementation, encoding
involves prefixing the binary representa- an integer consists of a table lookup; the
tion of the integer i with ⌊log i⌋ zeros. This codewords for the integers from 1 to n + 1
yields a prefix code with the length of the are stored in an array indexed from 1 to
codeword for i equal to 2⌊log i⌋ + 1. Greater n + 1. A binary trie is used to store the
compression can be gained through use of inverse mapping from codewords to inte-
the more sophisticated scheme δ, which gers. Decoding an Elias codeword to find
encodes an integer i in 1 + ⌊log i⌋ + 2⌊log(1 the corresponding integer involves follow-
+ ⌊log i⌋)⌋ bits. ing a path in the trie. Two interlinked data
A message ensemble on which algorithm structures, a binary trie and a binary tree,
BSTW is particularly efficient, described are used to maintain the word list. The trie
by Bentley et al., is formed by repeating source words. Mapping a source message a_i
each of n messages n times; for example, to its list position p involves following a
1^n 2^n 3^n . . . n^n. Disregarding overhead, a path in the trie, following a link to the tree,
static Huffman code uses n^2 log n bits (or and then computing the symmetric order
log n bits per message), whereas algorithm position of the tree node. Finding the
BSTW uses n^2 + 2 Σ_{i=1}^{n} ⌊log i⌋ (which is source message a_i in position p is accom-
less than or equal to n^2 + 2n log n, or O(1) plished by finding the symmetric order po-
bits per message). The overhead for algo- sition p in the tree and returning the word
rithm BSTW consists of just the n log n sition p in the tree and returning the word
bits needed to transmit each source letter stored there. Using this implementation,
once. As discussed in Section 3.2, the over- the work done by sender and receiver is
head for static Huffman coding includes an O(kngth(ai) + length(w)), where oi is the
additional 2n bits. The locality present in messagebeing transmitted and w the code-
ensemble EXAMPLE is similar to that in word representing oi’s position in the list.
the above example. The transmission ef- If the source alphabet consists of single
fected by algorithm BSTW is 1 a 1 2 space characters, then the complexity of algo-
3b1124c11125d111126ellll rithm BSTW is just O(length(w)).
127f1111118glllllll.Using3 The “move-to-front” scheme of Bentley
bits for each source letter (a-g and space) et al. was independently developed by Elias
and the Elias code 6 for list positions, the in his paper on interval encoding and re-
number of bits used is 81, which is a great cency rank encoding [Elias 19871. Recency
improvement over all of the other methods rank encoding is equivalent to algorithm
discussed (only 69% of the length used by BSTW. The name emphasizes the fact,
static Huffman coding). This could be im- mentioned above, that the codeword for a
proved further by the use of Fibonacci source message represents the number of
codes for list positions. distinct messages that have occurred since
In Bentley et al. [1986] a proof is given its most recent occurrence. Interval encod-
that with the simple scheme for encoding ing represents a source messageby the total
integers, the performance of algorithm number of messages that have occurred
BSTW is bounded above by 2S + 1, where since its last occurrence (equivalently, the
S is the cost of the static Huffman coding length of the interval since the last previous
scheme. Using the more sophisticated occurrence of the current message). It is
integer encoding scheme, the bound is obvious that the length of the interval since
1 + S + 2 log( 1 + S). A key idea in the the last occurrence of a message a, is at
proofs given by Bentley et al. is the fact least as great as its recency rank, so that
that, using the move-to-front heuristic, the recency rank encoding never uses more,
integer transmitted for a messagea, will be and generally uses fewer, symbols per mes-
one more than the number of different sagethan interval encoding. The advantage
words transmitted since the last occurrence to interval encoding is that it has a very
of a,. Bentley et al. also prove that algo- simple implementation and can encode and
rithm BSTW is asymptotically optimal. decode selections from a very large alphabet

(a million letters, for example) at a micro- Knuth [ 19851describes algorithm FGK’s


second rate [Elias 19871. The use of inter- performance on three types of data: a file
val encoding might be justified in a data containing the text of Grimm’s first 10 fairy
transmission setting, where speed is the tales, text of a technical book, and a file of
essential factor. graphical data. For the first two files, the
Ryabko [1987] also comments that the source messages are individual characters
work of Bentley et al. coincides with many and the alphabet size is 128. The same data
of the results in a paper in which he con- are coded using pairs of characters so that
siders data compression by means of a the alphabet size is 1968. For the graphical
“book stack” (the books represent the data, the number of source messagesis 343.
source messages, and as a “book” occurs it In the case of the fairy tales the perform-
is taken from the stack and placed on top). ance of FGK is very close to optimum,
Horspool and Cormack [1987] have consid- although performance degrades with in-
ered “move-to-front,” as well as several creasing file size. Performance on the tech-
other list organization heuristics, in con- nical book is not as good, but it is still
nection with data compression. respectable. The graphical data prove
harder yet to compress, but again FGK
6. EMPIRICAL RESULTS
performs reasonably well. In the latter two
casesthe trend of performance degradation
Empirical tests of the efficiencies of the with file size continues. Defining source
algorithms presented here are reported messages to consist of character pairs re-
in Bentley et al. [1986], Knuth [1985], sults in slightly better compression, but the
Schwartz and Kallick [ 19641,Vitter [ 19871, difference would not appear to justify the
and Welch [ 19841.These experiments com- increased memory requirement imposed by
pare the number of bits per word required; the larger alphabet.
processing time is not reported. Although Vitter [1987] tests the performance of
theoretical considerations bound the per- algorithms V and FGK against that of
formance of the various algorithms, exper- static Huffman coding. Each method is run
imental data are invaluable in providing on data that include Pascal source code, the
additional insight. It is clear that the per- TEX source of the author’s thesis, and elec-
formance of each of these methods is de- tronic mail files. Figure 23 summarizes the
pendent on the characteristics of the source results of the experiment for a small file of
ensemble. text. The performance of each algorithm is
Schwartz and Kallick [1964] test an im- measured by the number of bits in the
plementation of static Huffman coding in coded ensemble, and overhead costs are not
which bottom merging is used to determine included. Compression achieved by each
codeword lengths and all codewords of a algorithm is represented by the size of the
given length are sequential binary numbers. file it creates, given as a percentage of the
The source alphabet in the experiment original file size. Figure 24 presents data
consists of 5114 frequently used English for Pascal source code. For the TEX source,
words, 27 geographical names, 10 numerals, the alphabet consists of 128 individual
14 symbols, and 43 suffixes. The entropy of characters; for the other two file types, no
the document is 8.884 binary digits per more than 97 characters appear. For each
message, and the average codeword con- experiment, when the overhead costs are
structed has length 8.920. The same docu- taken into account, algorithm V outper-
ment is also coded one character at a time. forms static Huffman coding as long as the
In this case the entropy of the source is size of the message ensemble (number of
4.03, and the coded ensemble contains an characters) is no more than 104.Algorithm
average of 4.09 bits per letter. The redun- FGK displays slightly higher costs, but
dancy is low in both cases. However, the never more than 100.4% of the static algo-
relative redundancy (i.e., redundancy/ rithm.
entropy) is lower when the document is Witten et al. [1987] compare adaptive
encoded by words. arithmetic coding with adaptive Huffman


n k Static Algorithm V Algorithm FGK for experiments that compare the perform-
100 96 83.0 71.1 82.4 ance of algorithm BSTW to static Huffman
500 96 83.0 80.8 83.5 coding. Here the defined words consist of
961 97 83.5 82.3 83.7 two disjoint classes, sequence of alpha-
numeric characters and sequences of non-
Figure23. Simulation results for a small text file alphanumeric characters. The performance
[Vitter 19871: n, file size in 8-bit bytes; k, number of
distinct messages. of algorithm BSTW is very close to that of
static Huffman coding in all cases. The
experiments reported by Bentley et al. are
n k Static Algorithm V Algorithm FGK of particular interest in that they incorpo-
100 32 57.4 56.2 58.9 rate another dimension, the possibility that
500 49 61.5 62.2 63.0 in the move-to-front scheme one might
1000 57 61.3 61.8 62.4 want to limit the size of the data structure
10000 73 59.8 59.9 60.0 containing the codes to include only the m
12067 78 59.6 59.8 59.9 most recent words, for some m. The tests
Figure 24. Simulation results for Pascal source code
consider cache sizes of 8, 16, 32, 64, 128,
[Vitter 19871: n, file size in bytes; k, number of distinct and 256. Although performance tends to
messages. increase with cache size, the increase is
erratic, with some documents exhibiting
nonmonotonicity (performance that in-
coding. The version of arithmetic coding creases with cache size to a point and then
tested employs single-character adaptive decreases when cache size is further in-
frequencies and is a mildly optimized C creased).
implementation. Witten et al. compare the Welch [ 19841 reports simulation results
results provided by this version of arith- for Lempel-Ziv codes in terms of compres-
metic coding with the results achieved by sion ratios. His definition of compression
the UNIX compact program (compact is ratio is the one given in Section 1.3, C =
based on algorithm FGK). On three large (average message length)/(average code-
files that typify data compression applica- word length). The ratios reported are 1.8
tions, compression achieved by arithmetic for English text, 2-6 for COBOL data files,
coding is better than that provided by com- 1.0 for floating-point arrays, 2.1 for for-
pact, but only slightly better (average file matted scientific data, 2.6 for system log
size is 98% of the compacted size). A file data, 2.3 for source code, and 1.5 for object
over a three-character alphabet, with very code. The tests involving English text files
skewed symbol probabilities, is encoded by showed that long individual documents did
arithmetic coding in less than 1 bit per not compress better than groups of short
character; the resulting file size is 74% of documents. This observation is somewhat
the size of the file generated by compact. surprising in that it seems to refute the
Witten et al. also report encoding and de- intuition that redundancy is due at least in
coding times. The encoding time of arith- part to correlation in content. For purposes
metic coding is generally half the time of comparison, Welch cites results of
required by the adaptive Huffman coding Pechura and Rubin. Pechura [1982]
method. Decode time averages 65% of the achieved a 1.5 compression ratio using
time required by compact. Only in the static Huffman coding on files of English
case of the skewed file are the time statis- text. Rubin reports a 2.4 ratio for Eng-
tics quite different. Arithmetic coding again lish text when using a complex technique
achieves faster encoding, 67% of the time for choosing the source messages to which
required by compact. However, compact de- Huffman coding is applied [Rubin 19761.
codes more quickly, using only 78% of the These results provide only a very weak
time of the arithmetic method. basis for comparison, since the character-
Bentley et al. [1986] use C and Pascal istics of the files used by the three authors
source files, TROFF source files, and a are unknown. It is very likely that a single
terminal session transcript of several hours algorithm may produce compression ratios

ranging from 1.5 to 2.4, depending on the
The effect of amplitude errors is dem-
source to which it is applied. onstrated in Figure 26. The format of
the illustration is the same as that in Fig-
7. SUSCEPTIBILITY TO ERROR ure 25. This time bits 1, 2, and 4 are
inverted rather than lost. Again synchro-
The discrete noiseless channel is, unfortu- nization is regained almost immediately.
nately, not a very realistic model of a When bit 1 or bit 2 is changed, only the
communication system. Actual data trans- first three bits (the first character of the
mission systems are prone to two types of ensemble) are disturbed. Inversion of bit 4
error: phase error, in which a code symbol causes loss of synchronization through the
is lost or gained, and amplitude error, in ninth bit. A very simple explanation of the
which a code symbol is corrupted [Neu- self-synchronization present in these ex-
mann 19621. The degree to which channel amples can be given. Since many of the
errors degrade transmission is an impor- codewords end in the same sequence of
tant parameter in the choice of a data digits, the decoder is likely to reach a leaf
compression method. The susceptibility to of the Huffman code tree at one of the
error of a coding algorithm depends heavily codeword boundaries of the original coded
on whether the method is static or adaptive. ensemble. When this happens, the decoder
is back in synchronization with the en-
7.1 Static Codes
coder.
So that self-synchronization may be dis-
It is generally known that Huffman codes cussed more carefully, the following defi-
tend to be self-correcting [Standish 19801; nitions are presented. (It should be noted
that is, a transmission error tends not to that these definitions hold for arbitrary
propagate too far. The codeword in which prefix codes so that the discussion includes
the error occurs is incorrectly received, and all of the codes described in Section 3.) Ifs
it is likely that several subsequent code- is a suffix of some codeword and there exist
words are misinterpreted, but before too sequences of codewords r and A such that
long the receiver is back in synchronization s I? = A, then I’ is said to be a synchronizing
with the sender. In a static code, synchro- sequence for s. For example, in the Huffman
nization means simply that both sender and code used above, 1 is a synchronizing se-
receiver identify the beginnings of the code- quence for the suffix 01, whereas both
words in the same way. In Figure 25 an 000001 and 011 are synchronizing se-
example is used to illustrate the ability of quences for the suffix 10. If every suffix (of
a Huffman code to recover from phase er- every codeword) has a synchronizing se-
rors. The message ensemble “BCDAEB ” is quence, then the code is completely self-
encoded using the Huffman code of Figure synchronizing. If some or none of the proper
8 where the source letters al . . . a5 repre- suffixes have synchronizing sequences,
sent A . . . E, respectively, yielding the then the code is, respectively, partially or
coded ensemble “0110100011000011”. Fig- never self-synchronizing. Finally, if there
ure 25 demonstrates the impact of loss of exists a sequence l? that is a synchronizing
the first bit, the second bit, or the fourth sequence for every suffix, r is defined to be
bit. The dots show the way in which each a universal synchronizing sequence. The
line is parsed into codewords. The loss of code used in the examples above is com-
the first bit results in resynchronization pletely self-synchronizing and has univer-
after the third bit so that only the first sal synchronizing sequence 00000011000.
source message (B) is lost (replaced by AA). Gilbert and Moore [1959] prove that the
When the second bit is lost, the first eight existence of a universal synchronizing se-
bits of the coded ensemble are misinter- quence is a necessary as well as a sufficient
preted and synchronization is regained by condition for a code to be completely self-
bit 9. Dropping the fourth bit causes the synchronizing. They also state that any
same degree of disturbance as dropping the prefix code that is completely self-
second. synchronizing will synchronize itself with

011.010.001.1.000.011. Coded ensemble BCDAEB
1.1.010.001.1.000.011. Bit 1 is lost, interpreted as AACDAEB
010.1.000.1.1.000.011. Bit 2 is lost, interpreted as CAEAAEB
011.1.000.1.1.000.011. Bit 4 is lost, interpreted as BAEAAEB

Figure 25. Recovery from phase errors.
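The resynchronization shown in Figure 25 can be reproduced mechanically. The sketch below is ours; it decodes a bit string with the five-codeword Huffman code used in the figure (A = 1, B = 011, C = 010, D = 001, E = 000) and shows that dropping the first bit yields AACDAEB.

  #include <stdio.h>
  #include <string.h>

  /* The Huffman code of Figure 25. */
  static const char *codes[]   = { "1", "011", "010", "001", "000" };
  static const char  letters[] = "ABCDE";

  /* Parse a bit string into codewords exactly as a decoder would. */
  static void decode(const char *bits) {
      size_t pos = 0, len = strlen(bits);
      while (pos < len) {
          int i, matched = 0;
          for (i = 0; i < 5; i++) {
              size_t cl = strlen(codes[i]);
              if (pos + cl <= len && strncmp(bits + pos, codes[i], cl) == 0) {
                  putchar(letters[i]);
                  pos += cl;
                  matched = 1;
                  break;
              }
          }
          if (!matched) break;          /* trailing fragment shorter than a codeword */
      }
      putchar('\n');
  }

  int main(void) {
      const char *coded = "0110100011000011";   /* the ensemble BCDAEB */
      decode(coded);                             /* prints BCDAEB           */
      decode(coded + 1);                         /* bit 1 lost: prints AACDAEB */
      return 0;
  }
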

0 1 1.0 1 0.00 1.1.000.011 Coded ensemble BCDAEB


1.1.1.0 10.00 1.1.000.011 Bit 1 is inverted, interpreted as DCDAEB
0 0 1.0 10.00 1.1.000.011 Bit 2 is inverted, interpreted as AAACDAEB
0 11.1.1.0 00.1.1.000.011 Bit 4 is inverted, interpreted as BAAEAAEB

Figure 26. Recovery from amplitude errors.

It is important to realize that the fact that a completely self-synchronizing code will resynchronize with probability 1 does not guarantee recovery from error with bounded delay. In fact, for every completely self-synchronizing prefix code with more than two codewords, there are errors within one codeword that cause unbounded error propagation [Neumann 1962]. In addition, prefix codes are not always completely self-synchronizing. Bobrow and Hakimi [1969] state a necessary condition for a prefix code with codeword lengths l1, ..., ln to be completely self-synchronizing: the greatest common divisor of the li must be equal to 1. The Huffman code {00, 01, 10, 1100, 1101, 1110, 1111} is not completely self-synchronizing but is partially self-synchronizing since suffixes 00, 01, and 10 are synchronized by any codeword. The Huffman code {000, 0010, 0011, 01, 100, 1010, 1011, 110, 111} is never self-synchronizing. Examples of never self-synchronizing Huffman codes are difficult to construct, and the example above is the only one with fewer than 16 source messages. Stiffler [1971] proves that a code is never self-synchronizing if and only if none of the proper suffixes of the codewords are themselves codewords.

The conclusions that may be drawn from the above discussion are as follows: Although it is common for Huffman codes to self-synchronize, this is not guaranteed, and when self-synchronization is assured, there is no bound on the propagation of the error. An additional difficulty is that self-synchronization provides no indication that an error has occurred.

The problem of error detection and correction in connection with Huffman codes has not received a great deal of attention. Several ideas on the subject are reported here. Rudner [1971] states that synchronizing sequences should be as short as possible to minimize resynchronization delay. In addition, if a synchronizing sequence is used as the codeword for a high-probability message, then resynchronization will be more frequent. A method for constructing a minimum-redundancy code having the shortest possible synchronizing sequence is described by Rudner [1971]. Neumann [1962] suggests purposely adding some redundancy to Huffman codes in order to permit detection of certain types of errors. Clearly this has to be done carefully so as not to negate the redundancy reduction provided by Huffman coding. McIntyre and Pechura [1985] cite data integrity as an advantage of the codebook approach discussed in Section 3.2. When the code is stored separately from the coded data, the code may be backed up to protect it from perturbation. However, when the code is stored or transmitted with the data, it is susceptible to errors. An error in the code representation constitutes a drastic loss, and therefore extreme measures for protecting this part of the transmission are justified.
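The two structural tests mentioned above are simple to apply mechanically. The sketch below (an illustration written for this discussion, not taken from Stiffler or Bobrow and Hakimi) checks Stiffler's proper-suffix criterion and the Bobrow-Hakimi gcd condition against the two example codes; the nine-codeword code is the reconstruction given in the text.

    from functools import reduce
    from math import gcd

    def proper_suffixes(word):
        return {word[i:] for i in range(1, len(word))}

    def never_self_synchronizing(code):
        # Stiffler [1971]: never self-synchronizing iff no proper suffix of any
        # codeword is itself a codeword.
        codewords = set(code)
        return all(s not in codewords for w in code for s in proper_suffixes(w))

    def gcd_condition(code):
        # Bobrow and Hakimi [1969]: gcd of codeword lengths equal to 1 is a
        # necessary (not sufficient) condition for complete self-synchronization.
        return reduce(gcd, (len(w) for w in code)) == 1

    partially = ["00", "01", "10", "1100", "1101", "1110", "1111"]
    never = ["000", "0010", "0011", "01", "100", "1010", "1011", "110", "111"]

    print(gcd_condition(partially), never_self_synchronizing(partially))   # False False
    print(gcd_condition(never), never_self_synchronizing(never))           # True True

Note that the never self-synchronizing code passes the gcd test, underscoring that the Bobrow-Hakimi condition is necessary but not sufficient.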

00110.00100.0001000.    Coded integers 6, 4, 8
011.0001000.00100.0     Bit 2 is lost, interpreted as 3, 8, 4, etc.
011.1.0001000.00100.0   Bit 2 is inverted, interpreted as 3, 1, 8, 4, etc.
0001000.1.000001000     Bit 3 is inverted, interpreted as 8, 1, etc.

Figure 27. Effects of errors in Elias codes.

000011.000011.00011.01011.01011.    Coded ensemble “aa bb”
00011.000011.00011.01011.01011.     Bit 3 is lost, interpreted as “ a bb”
001011.000011.00011.01011.01011.    Bit 3 is inverted, interpreted as “?a bb”
00001000011.00011.01011.01011.      Bit 6 is lost, interpreted as “? bb”
000010000011.00011.01011.01011.     Bit 6 is inverted, interpreted as “? bb”
000011.000011.00011.011.11.01011.   Bit 20 is inverted, interpreted as “aa fgb”

Figure 28. Effects of errors in Fibonacci codes.

The Elias codes of Section 3.3 are not at all robust. Each of the codes γ and δ can be thought of as generating codewords that consist of a number of substrings such that each substring encodes the length of the subsequent substring. For code γ we may think of each codeword γ(x) as the concatenation of z, a string of n zeros, and b, a string of length n + 1 (n = ⌊log x⌋). If one of the zeros in substring z is lost, synchronization will be lost since the last symbol of b will be pushed into the next codeword.

Since the 1 at the front of substring b delimits the end of z, if a zero in z is changed to a 1, synchronization will be lost as symbols from b are pushed into the following codeword. Similarly, if ones at the front of b are inverted to zeros, synchronization will be lost since the codeword γ(x) consumes symbols from the following codeword. Once synchronization is lost, it cannot normally be recovered.

In Figure 27, codewords γ(6), γ(4), and γ(8) are used to illustrate the above ideas. In each case, synchronization is lost and never recovered.

The Elias code δ may be thought of as a three-part ramp, where δ(x) = zmb with z a string of n zeros, m a string of length n + 1 with binary value u, and b a string of length u - 1. For example, in δ(16) = 00.101.0000, n = 2, u = 5, and the final substring is the binary value of 16 with the leading 1 removed, so that it has length u - 1 = 4. Again the fact that each substring determines the length of the subsequent substring means that an error in one of the first two substrings is disastrous, changing the way in which the rest of the codeword is to be interpreted. And, like code γ, code δ has no properties that aid in regaining synchronization once it has been lost.
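This fragility is easy to demonstrate. The sketch below (written for this discussion; it implements only the γ construction just described, ⌊log x⌋ leading zeros followed by the binary form of x) shows that deleting a single bit from the stream of Figure 27 derails every subsequent codeword.

    def gamma_encode(x):
        # Elias gamma: n = floor(log2 x) zeros, then the (n + 1)-bit binary form of x.
        n = x.bit_length() - 1
        return "0" * n + bin(x)[2:]

    def gamma_decode(bits):
        values, i = [], 0
        while i < len(bits):
            n = 0
            while i + n < len(bits) and bits[i + n] == "0":
                n += 1
            if i + 2 * n + 1 > len(bits):          # incomplete codeword at the end of the stream
                return values, bits[i:]
            values.append(int(bits[i + n:i + 2 * n + 1], 2))
            i += 2 * n + 1
        return values, ""

    stream = "".join(gamma_encode(x) for x in (6, 4, 8))   # 00110 00100 0001000
    print(gamma_decode(stream))                            # ([6, 4, 8], '')
    print(gamma_decode(stream[:1] + stream[2:]))           # bit 2 lost: ([3, 8, 4], '0'), as in Figure 27

Code δ fails in the same way, since its substrings likewise determine one another's lengths.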
The Fibonacci codes of Section 3.3, on the other hand, are quite robust. This robustness is due to the fact that every codeword ends in the substring 11 and that substring can appear nowhere else in a codeword. If an error occurs anywhere other than in the 11 substring, the error is contained within that one codeword. It is possible that one codeword will become two (see the sixth line of Figure 28), but no other codewords will be disturbed. If the last symbol of a codeword is lost or changed, the current codeword will be fused with its successor, so that two codewords are lost. When the penultimate bit is disturbed, up to three codewords can be lost. For example, the coded message 011.11.011 becomes 0011.1011 if bit 2 is inverted. The maximum disturbance resulting from either an amplitude error or a phase error is the disturbance of three codewords.

In Figure 28, some illustrations based on the Fibonacci coding of ensemble EXAMPLE as shown in Figure 12 are given. When bit 3 (which is not part of an 11 substring) is lost or changed, only a single codeword is degraded. When bit 6 (the final bit of the first codeword) is lost or changed, the first two codewords are incorrectly decoded. When bit 20 is changed, the first b is incorrectly decoded as fg.
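The containment property is just as easy to see in code. The sketch below is an illustration for this discussion only: it builds generic Fibonacci integer codewords, and the integer values 8, 5, and 7 are chosen simply because they reproduce the codewords that Figure 28 shows for a, the blank, and b; the actual rank assignment of Figure 12 is not repeated here.

    def fib_encode(n):
        # Fibonacci code: Zeckendorf digits (least significant first) followed by an
        # extra 1, so every codeword ends in 11 and 11 appears nowhere else.
        fibs = [1, 2]
        while fibs[-1] <= n:
            fibs.append(fibs[-1] + fibs[-2])
        fibs.pop()                                  # fibs[-1] is now the largest Fibonacci <= n
        bits = []
        for f in reversed(fibs):
            if f <= n:
                bits.append("1")
                n -= f
            else:
                bits.append("0")
        return "".join(reversed(bits)) + "1"

    def fib_decode(bits):
        values, word, fibs = [], "", [1, 2]
        for b in bits:
            word += b
            if word.endswith("11"):                 # a codeword boundary
                while len(fibs) < len(word) - 1:
                    fibs.append(fibs[-1] + fibs[-2])
                values.append(sum(f for f, d in zip(fibs, word[:-1]) if d == "1"))
                word = ""
        return values, word                         # word holds any undecodable tail

    stream = "".join(fib_encode(n) for n in (8, 8, 5, 7, 7))   # the coded ensemble of Figure 28
    print(fib_decode(stream))                                  # ([8, 8, 5, 7, 7], '')
    print(fib_decode(stream[:2] + "1" + stream[3:]))           # bit 3 inverted: only the first value changes
    print(fib_decode(stream[:5] + stream[6:]))                 # bit 6 lost: the first two codewords fuse into one

A single flipped bit outside an 11 pair corrupts exactly one decoded value; a lost terminating bit fuses two codewords, exactly as described above.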


7.2 Adaptive Codes

Adaptive codes are far more adversely affected by transmission errors than static codes. For example, in the case of an adaptive Huffman code, even though the receiver may resynchronize with the sender in terms of correctly locating the beginning of a codeword, the information lost represents more than a few bits or a few characters of the source ensemble. The fact that sender and receiver are dynamically redefining the code indicates that by the time synchronization is regained, they may have radically different representations of the code. Synchronization as defined in Section 7.1 refers to synchronization of the bit stream, which is not sufficient for adaptive methods. What is needed here is code synchronization, that is, synchronization of both the bit stream and the dynamic data structure representing the current code mapping.

There is no evidence that adaptive methods are self-synchronizing. Bentley et al. [1986] note that, in algorithm BSTW, loss of synchronization can be catastrophic, whereas this is not true with static Huffman coding. Ziv and Lempel [1977] recognize that the major drawback of their algorithm is its susceptibility to error propagation. Welch [1984] also considers the problem of error tolerance of Lempel-Ziv codes and suggests that the entire ensemble be embedded in an error-detecting code. Neither static nor adaptive arithmetic coding has the ability to tolerate errors.

8. NEW DIRECTIONS

Data compression is still very much an active research area. This section suggests possibilities for further study.

The discussion of Section 7 illustrates the susceptibility to error of the codes presented in this survey. Strategies for increasing the reliability of these codes while incurring only a moderate loss of efficiency would be of great value. This area appears to be largely unexplored. Possible approaches include embedding the entire ensemble in an error-correcting code or reserving one or more codewords to act as error flags. For adaptive methods it may be necessary for receiver and sender to verify the current code mapping periodically.

For adaptive Huffman coding, Gallager [1978] suggests an “aging” scheme, whereby recent occurrences of a character contribute more to its frequency count than do earlier occurrences. This strategy introduces the notion of locality into the adaptive Huffman scheme. Cormack and Horspool [1984] describe an algorithm for approximating exponential aging. However, the effectiveness of this algorithm has not been established.
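One way to get the flavor of such an aging scheme is to rescale every frequency count at regular intervals so that recent occurrences dominate; the sketch below is a hypothetical illustration in that spirit, not a reconstruction of the Gallager or Cormack-Horspool algorithms, and its decay factor and period are arbitrary.

    from collections import defaultdict

    class AgingCounter:
        """Frequency counts in which older occurrences are progressively discounted.
        Illustrative only; the parameters are not values from the literature."""

        def __init__(self, decay=0.5, period=256):
            self.decay, self.period = decay, period
            self.counts = defaultdict(float)
            self.seen = 0

        def update(self, symbol):
            self.counts[symbol] += 1.0
            self.seen += 1
            if self.seen % self.period == 0:        # periodically age every count
                for s in self.counts:
                    self.counts[s] *= self.decay

    c = AgingCounter(decay=0.5, period=4)
    for ch in "aaaabbbb":
        c.update(ch)
    print(c.counts["a"], c.counts["b"])             # 1.0 2.0: the recent b's now outweigh the older a's

An adaptive Huffman coder driven by such weights tracks local statistics rather than the whole message history.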
algorithm is its susceptibility to error prop- that there are many other self-organizing
agation. Welch [1984] also considers the methods yet to be considered. Horspool and
problem of error tolerance of Lempel-Ziv Cormack [ 19871 describe experimental re-
codes and suggests that the entire ensemble sults that imply that the transpose heuris-
be embedded in an error-detecting code. tic performs as well as move-to-front and
Neither static nor adaptive arithmetic cod- suggest that it is also easier to implement.
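The self-organizing heuristics mentioned here are easy to compare in a toy setting. The harness below is an illustration only: it counts list positions as a stand-in for the length of the transmitted index and does not reproduce algorithm BSTW or the experiments of Horspool and Cormack.

    def move_to_front(lst, i):
        lst.insert(0, lst.pop(i))                   # accessed item jumps to the head

    def transpose(lst, i):
        if i > 0:
            lst[i - 1], lst[i] = lst[i], lst[i - 1] # accessed item moves up one place

    def total_access_cost(heuristic, alphabet, text):
        lst, cost = list(alphabet), 0
        for ch in text:
            i = lst.index(ch)
            cost += i + 1                           # proxy for the cost of coding position i
            heuristic(lst, i)
        return cost

    text = "aaabbbcccaaabbbccc"
    for h in (move_to_front, transpose):
        print(h.__name__, total_access_cost(h, "abcdef", text))

Whether transpose really matches move-to-front in practice is, as noted above, an empirical question.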
Several aspects of free-parse methods merit further attention. Lempel-Ziv codes appear to be promising, although the absence of a worst-case bound on the redundancy of an individual finite source ensemble is a drawback. The variable-block type Lempel-Ziv codes have been implemented with some success [ARC 1986], and the construction of a variable-variable Lempel-Ziv code has been sketched [Ziv and Lempel 1978]. The efficiency of the variable-variable model should be investigated. In addition, an implementation of Lempel-Ziv coding that combines the time efficiency of the Rodeh et al. [1981] method with more efficient use of space is worthy of consideration.

Another important research topic is the development of theoretical models for data compression that address the problem of local redundancy. Models based on Markov chains can be exploited to take advantage of interaction between groups of symbols. Entropy tends to be overestimated when symbol interaction is not considered. Models that exploit relationships between source messages may achieve better compression than predicted by an entropy calculation based only on symbol probabilities. The use of Markov modeling is considered by Llewellyn [1987] and by Langdon and Rissanen [1983].
Langdon and Rissanen [1983]. BENTLEY, J. L., SLEATOR, D. D., TARJAN, R. E., AND
WEI, V. K. 1986. A locally adaptive data com-
pression scheme. Commun. ACM 29, 4 (Apr.),
320-330.
9. SUMMARY
BOBROW, L. S., AND HAKIMI, S. L. 1969. Graph
Data compression is a topic of much im- theoretic prefix codes and their synchronizing
properties. Znf. Contr. 15,l (July), 70-94.
portance and many applications. Methods
BRENT, R. P., AND KUNG, H. T. 1978. Fast algo-
of data compression have been studied for rithms for manipulating formal power series.
almost four decades. This paper has pro- J. ACM 25, 4 (Oct.), 581-595.
vided an overview of data compression CAPOCELLI. R. M., GIANCARLO, R., AND TANEJA, I. J.
methods of general utility. The algorithms 1986. Bounds on the redundancy of Huffman
have been evaluated in terms of the amount codes. IEEE Trans. Inf. Theory 32, 6 (Nov.),
854-857.
of compression they provide, algorithm
CAPPELLINI, V., ED. 1985. Data Compression and
efficiency, and susceptibility to error. Al- Error Control Techmimes with Applications. Ac-
though algorithm efficiency and suscepti- ademic Press, London.
bility to error are relatively independent of CONNELL, J. B. 1973. A Huffman-Shannon-Fan0
the characteristics of the source ensemble, code. Proc. IEEE 61, 7 (July), 1046-1047.
the amount of compression achieved de- CORMACK, G. V. 1985. Data compression on a data-
pends on the characteristics of the source base system. Commun. ACM 28, 12 (Dec.), 1336-
to a great extent. 1342.
Semantic dependent data compression CORMACK, G. V., AND HORSPOOL, R. N. 1984.
Algorithms for adaptive Huffman codes. Znf.
techniques, as discussed in Section 2, are Process. L&t. 18, 3 (Mar.), 159-165.
special-purpose methods designed to ex- CORTESI, D. 1982. An effective text-compression al-
ploit local redundancy or context informa- gorithm. BYTE 7,l (Jan.), 397-403.
tion. A semantic dependent scheme can COT, N. 1977. Characterization and design of opti-
usually be viewed as a special case of one mal prefix codes. Ph.D. dissertation, Computer
or more general-purpose algorithms. It Science Dept., Stanford Univ., Stanford, Cahf.
should also be noted that algorithm BSTW ELIAS, P. 1975. Universal codeword sets and rep-
resentations of the integers. IEEE Trans. Znf.
is a general-purpose technique that exploits Theory 21, 2 (Mar.), 1941203.
locality of reference, a type of local redun- ELIAS, P. 1987. Interval and recency rank source
dancy. coding: Two on-line adaptive variable-length
Susceptibility to error is the main draw- schemes. IEEE Trans. Znf. Theory 33, 1 (Jan.),
back of each of the algorithms presented 3-10.
here. Although channel errors are more FALLER, N. 1973. An adaptive system for data
devastating to adaptive algorithms than to compression. In Record of the 7th As&mar Con-
ference on Circuits, Systems and Computers
static ones, it is possible for an error (Pacific Grove, Calif., Nov.). Naval Postgraduate
to propagate without limit even in the School, Monterey, Calif., pp. 593-597.
static case. Methods of limiting the effect FANO, R. M. 1949. Transmission of Information.
of an error on the effectiveness of a M.I.T. Press, Cambridge, Mass.


FRAENKEL, A. S., AND KLEIN, S. T. 1985. Robust universal complete codes as alternatives to Huffman codes. Tech. Rep. CS85-16, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.
FRAENKEL, A. S., MOR, M., AND PERL, Y. 1983. Is text compression by prefixes and suffixes practical? Acta Inf. 20, 4 (Dec.), 371-375.
GALLAGER, R. G. 1968. Information Theory and Reliable Communication. Wiley, New York.
GALLAGER, R. G. 1978. Variations on a theme by Huffman. IEEE Trans. Inf. Theory 24, 6 (Nov.), 668-674.
GAREY, M. R. 1974. Optimal binary search trees with restricted maximal depth. SIAM J. Comput. 3, 2 (June), 101-110.
GILBERT, E. N. 1971. Codes based on inaccurate source probabilities. IEEE Trans. Inf. Theory 17, 3 (May), 304-314.
GILBERT, E. N., AND MOORE, E. F. 1959. Variable-length binary encodings. Bell Syst. Tech. J. 38, 4 (July), 933-967.
GLASSEY, C. R., AND KARP, R. M. 1976. On the optimality of Huffman trees. SIAM J. Appl. Math. 31, 2 (Sept.), 368-378.
GONZALEZ, R. C., AND WINTZ, P. 1977. Digital Image Processing. Addison-Wesley, Reading, Mass.
HAHN, B. 1974. A new technique for compression and storage of data. Commun. ACM 17, 8 (Aug.), 434-436.
HESTER, J. H., AND HIRSCHBERG, D. S. 1985. Self-organizing linear search. ACM Comput. Surv. 17, 3 (Sept.), 295-311.
HORSPOOL, R. N., AND CORMACK, G. V. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792-794.
HU, T. C., AND TAN, K. C. 1972. Path length of binary search trees. SIAM J. Appl. Math. 22, 2 (Mar.), 225-234.
HU, T. C., AND TUCKER, A. C. 1971. Optimal computer search trees and variable-length alphabetic codes. SIAM J. Appl. Math. 21, 4 (Dec.), 514-532.
HUFFMAN, D. A. 1952. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 9 (Sept.), 1098-1101.
INGELS, F. M. 1971. Information and Coding Theory. Intext, Scranton, Penn.
ITAI, A. 1976. Optimal alphabetic trees. SIAM J. Comput. 5, 1 (Mar.), 9-18.
KARP, R. M. 1961. Minimum redundancy coding for the discrete noiseless channel. IRE Trans. Inf. Theory 7, 1 (Jan.), 27-38.
KNUTH, D. E. 1971. Optimum binary search trees. Acta Inf. 1, 1 (Jan.), 14-25.
KNUTH, D. E. 1985. Dynamic Huffman coding. J. Algorithms 6, 2 (June), 163-180.
KRAUSE, R. M. 1962. Channels which transmit letters of unequal duration. Inf. Control 5, 1 (Mar.), 13-24.
LAESER, R. P., MCLAUGHLIN, W. I., AND WOLFF, D. M. 1986. Engineering Voyager 2's encounter with Uranus. Sci. Am. 255, 5 (Nov.), 36-45.
LANGDON, G. G., AND RISSANEN, J. J. 1983. A double-adaptive file compression algorithm. IEEE Trans. Comm. 31, 11 (Nov.), 1253-1255.
LLEWELLYN, J. A. 1987. Data compression for a source with Markov characteristics. Comput. J. 30, 2, 149-156.
MCINTYRE, D. R., AND PECHURA, M. A. 1985. Data compression using static Huffman code-decode tables. Commun. ACM 28, 6 (June), 612-616.
MEHLHORN, K. 1980. An efficient algorithm for constructing nearly optimal prefix codes. IEEE Trans. Inf. Theory 26, 5 (Sept.), 513-517.
NEUMANN, P. G. 1962. Efficient error-limiting variable-length codes. IRE Trans. Inf. Theory 8, 4 (July), 292-304.
PARKER, D. S. 1980. Conditions for the optimality of the Huffman algorithm. SIAM J. Comput. 9, 3 (Aug.), 470-489.
PASCO, R. 1976. Source coding algorithms for fast data compression. Ph.D. dissertation, Dept. of Electrical Engineering, Stanford Univ., Stanford, Calif.
PECHURA, M. 1982. File archival techniques using data compression. Commun. ACM 25, 9 (Sept.), 605-609.
PERL, Y., GAREY, M. R., AND EVEN, S. 1975. Efficient generation of optimal prefix code: Equiprobable words using unequal cost letters. J. ACM 22, 2 (Apr.), 202-214.
PKARC FAST! 1987. PKARC FAST! File Archival Utility, Version 3.5. PKWARE, Inc., Glendale, Wis.
REGHBATI, H. K. 1981. An overview of data compression techniques. Computer 14, 4 (Apr.), 71-75.
RISSANEN, J. J. 1976. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20 (May), 198-203.
RISSANEN, J. J. 1983. A universal data compression system. IEEE Trans. Inf. Theory 29, 5 (Sept.), 656-664.
RODEH, M., PRATT, V. R., AND EVEN, S. 1981. Linear algorithm for data compression via string matching. J. ACM 28, 1 (Jan.), 16-24.
RUBIN, F. 1976. Experiments in text file compression. Commun. ACM 19, 11 (Nov.), 617-623.
RUBIN, F. 1979. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inf. Theory 25, 6 (Nov.), 672-675.
RUDNER, B. 1971. Construction of minimum-redundancy codes with optimal synchronizing property. IEEE Trans. Inf. Theory 17, 4 (July), 478-487.
RUTH, S. S., AND KREUTZER, P. J. 1972. Data compression for large business files. Datamation 18, 9 (Sept.), 62-66.

RYABKO, B. Y. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792.
SAMET, H. 1984. The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2 (June), 187-260.
SCHWARTZ, E. S. 1964. An optimum encoding with minimum longest code and total number of digits. Inf. Control 7, 1 (Mar.), 37-44.
SCHWARTZ, E. S., AND KALLICK, B. 1964. Generating a canonical prefix encoding. Commun. ACM 7, 3 (Mar.), 166-169.
SEVERANCE, D. G. 1983. A practitioner's guide to data base compression. Inf. Syst. 8, 1, 51-62.
SHANNON, C. E., AND WEAVER, W. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill.
SNYDERMAN, M., AND HUNT, B. 1970. The myriad virtues of text compaction. Datamation 16, 12 (Dec.), 36-40.
STANDISH, T. A. 1980. Data Structure Techniques. Addison-Wesley, Reading, Mass.
STIFFLER, J. J. 1971. Theory of Synchronous Communications. Prentice-Hall, Englewood Cliffs, N.J.
STORER, J. A., AND SZYMANSKI, T. G. 1982. Data compression via textual substitution. J. ACM 29, 4 (Oct.), 928-951.
TANAKA, H. 1987. Data structure of Huffman codes and its application to efficient encoding and decoding. IEEE Trans. Inf. Theory 33, 1 (Jan.), 154-156.
TROPPER, R. 1982. Binary-coded text, a text-compression method. BYTE 7, 4 (Apr.), 398-413.
UNIX 1984. UNIX User's Manual, Version 4.2. Berkeley Software Distribution, Virtual VAX-11 Version, Univ. of California, Berkeley, Calif.
VARN, B. 1971. Optimal variable length codes (arbitrary symbol cost and equal code word probability). Inf. Control 19, 4 (Nov.), 289-301.
VITTER, J. S. 1987. Design and analysis of dynamic Huffman codes. J. ACM 34, 4 (Oct.), 825-845.
WAGNER, R. A. 1973. Common phrases and minimum-space text storage. Commun. ACM 16, 3 (Mar.), 148-152.
WELCH, T. A. 1984. A technique for high-performance data compression. Computer 17, 6 (June), 8-19.
WILKINS, L. C., AND WINTZ, P. A. 1971. Bibliography on data compression, picture properties and picture coding. IEEE Trans. Inf. Theory 17, 2, 180-197.
WITTEN, I. H., NEAL, R. M., AND CLEARY, J. G. 1987. Arithmetic coding for data compression. Commun. ACM 30, 6 (June), 520-540.
ZIMMERMAN, S. 1959. An optimal search procedure. Am. Math. Monthly 66 (Oct.), 690-693.
ZIV, J., AND LEMPEL, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May), 337-343.
ZIV, J., AND LEMPEL, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5 (Sept.), 530-536.

Received May 1987; final revision accepted January 1988.

