This paper surveys a variety of data compression methods spanning almost 40 years of
research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique
developed in 1986. The aim of data compression is to reduce redundancy in stored or
communicated data, thus increasing effective data density. Data compression has
important application in the areas of file storage and distributed systems. Concepts from
information theory as they relate to the goals and evaluation of data compression
methods are discussed briefly. A framework for evaluation and comparison of methods is
constructed and applied to the algorithms presented. Comparisons of both theoretical and
empirical natures are reported, and possibilities for future research are suggested.
Categories and Subject Descriptors: E.4 [Data]: Coding and Information Theory - data compaction and compression
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Adaptive coding, adaptive Huffman codes, coding, coding theory, file compression, Huffman codes, minimum-redundancy codes, optimal codes, prefix codes, text compression
... of the table, and encoding time would be O(l). Connell's algorithm makes use of the index of the Huffman code, a representation of the distribution of codeword lengths, to encode and decode in O(c) time, where c is the number of different codeword lengths. Tanaka [1987] presents an implementation of Huffman coding based on finite-state machines that can be realized efficiently in either hardware or software.

As noted earlier, the redundancy bound for Shannon-Fano codes is 1 and the bound for the Huffman method is p_n + 0.086, where p_n is the probability of the least likely source message (so p_n is less than or equal to .5, and generally much less). It is important to note that in defining redundancy to be average codeword length minus entropy, the cost of transmitting the code mapping computed by these algorithms is ignored. The overhead cost for any method in which the source alphabet has not been established before transmission includes n log n bits for sending the n source letters. For a Shannon-Fano code, a list of codewords ordered so as to correspond to the source letters could be transmitted. The additional time required is then Σ l_i, where the l_i are the lengths of the codewords. For Huffman coding, an encoding of the shape of the code tree might be transmitted. Since any full binary tree may be a legal Huffman code tree, encoding tree shape may require as many as log 4^n = 2n bits. In most cases the message ensemble is very large, so that the number of bits of overhead is minute by comparison to the total length of the encoded transmission. However, it is imprudent to ignore this cost.
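The tree-shape overhead is easy to make concrete. As a minimal sketch (our illustration of one standard convention, not a construction given in the survey), a preorder traversal that emits 0 for each internal node and 1 for each leaf describes any full binary tree on n leaves in 2n - 1 bits, within the 2n-bit bound above; the n source letters would still have to be listed separately, say in the order in which their leaves are visited.

#include <stdio.h>

/* A node of a full code tree: every node has either two children
   (internal) or none (leaf). */
typedef struct Node {
    struct Node *left, *right;   /* both NULL for a leaf */
} Node;

/* Emit the shape of the tree in preorder: '0' for an internal node,
   '1' for a leaf.  A full binary tree with n leaves has 2n - 1 nodes,
   so this costs 2n - 1 bits, matching the log 4^n = 2n bound cited
   above.  The leaf letters themselves are assumed to be transmitted
   separately (the n log n bits of overhead noted in the text). */
void emit_shape(const Node *t) {
    if (t->left == NULL) {
        putchar('1');            /* leaf */
    } else {
        putchar('0');            /* internal node */
        emit_shape(t->left);
        emit_shape(t->right);
    }
}

Because the emitted bit stream is a preorder walk of a full tree, a receiver can rebuild the shape unambiguously by reading one bit per node and recursing on each 0.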
If a less-than-optimal code is acceptable, the overhead costs can be avoided through a prior agreement by sender and receiver as to the code mapping. Instead of a Huffman code based on the characteristics of the current message ensemble, the code used could be based on statistics for a class of transmissions to which the current ensemble is assumed to belong. That is, both sender and receiver could have access to a codebook with k mappings in it: one for Pascal source, one for English text, and so on. The sender would then simply alert the receiver as to which of the common codes he or she is using. This requires only log k bits of overhead. Assuming that classes of transmission with relatively stable characteristics could be identified, this hybrid approach would greatly reduce the redundancy due to overhead without significantly increasing expected codeword length. In addition, the cost of computing the mapping would be amortized over all files of a given class. That is, the mapping would be computed once on a statistically significant sample and then used on a great number of files for which the sample is representative. There is clearly a substantial risk associated with assumptions about file characteristics, and great care would be necessary in choosing both the sample from which the mapping is to be derived and the categories into which to partition transmissions. An extreme example of the risk associated with the codebook approach is provided by Ernest V. Wright, who wrote the novel Gadsby (1939) containing no occurrences of the letter E. Since E is the most commonly used letter in the English language, an encoding based on a sample from Gadsby would be disastrous if used with "normal" examples of English text. Similarly, the "normal" encoding would provide poor compression of Gadsby.

McIntyre and Pechura [1985] describe an experiment in which the codebook approach is compared to static Huffman coding. The sample used for comparison is a collection of 530 source programs in four languages. The codebook contains a Pascal code tree, a FORTRAN code tree, a COBOL code tree, a PL/I code tree, and an ALL code tree. The Pascal code tree is the result of applying the static Huffman algorithm to the combined character frequencies of all of the Pascal programs in the sample. The ALL code tree is based on the combined character frequencies for all of the programs. The experiment involves encoding each of the programs using the five codes in the codebook and the static Huffman algorithm. The data reported for each of the 530 programs consist of the size of the coded program for each of the five predetermined codes and the size of the coded program plus the size of the mapping (in table form) for the static Huffman method. In every case the code tree for the language class to which the program belongs generates the most compact encoding. Although using the Huffman algorithm on the program itself yields an optimal mapping, the overhead cost is greater than the added redundancy incurred by the less-than-optimal code. In many cases, the ALL code tree also generates a more compact encoding than the static Huffman algorithm. In the worst case an encoding constructed from the codebook is only 6.6% larger than that constructed by the Huffman algorithm. These results suggest that, for files of source code, the codebook approach may be appropriate.

Gilbert [1971] discusses the construction of Huffman codes based on inaccurate source probabilities. A simple solution to the problem of incomplete knowledge of the source is to avoid long codewords, thereby minimizing the error of badly underestimating the probability of a message. The problem becomes one of constructing the optimal binary tree subject to a height restriction (see [Knuth 1971; Hu and Tan 1972; Garey 1974]). Another approach involves collecting statistics for several sources and then constructing a code based on some combined criterion. This approach could be applied to the problem of designing a single code for use with English, French, German, and so on, sources. To accomplish this, Huffman's algorithm could be used to minimize either the average codeword length for the combined source probabilities or the average codeword length for English, subject to constraints on average codeword lengths for the other sources.

3.3 Universal Codes and Representations of the Integers

A code is universal if it maps source messages to codewords so that the resulting average codeword length is bounded by c1·H + c2. That is, given an arbitrary source with nonzero entropy, a universal code achieves average codeword length that is at most a constant times the optimal possible for that source. The potential compression offered by a universal code clearly depends on the magnitudes of the constants c1 and c2. We recall the definition of an asymptotically optimal code as one for which average codeword length approaches entropy, and remark that a universal code with c1 = 1 is asymptotically optimal.

An advantage of universal codes over Huffman codes is that it is not necessary to know the exact probabilities with which the source messages appear. Whereas Huffman coding is not applicable unless the probabilities are known, with universal coding it is sufficient to know the probability distribution only to the extent that the source messages can be ranked in probability order. By mapping messages in order of decreasing probability to codewords in order of increasing length, universality can be achieved. Another advantage of universal codes is that the codeword sets are fixed. It is not necessary to compute a codeword set based on the statistics of an ensemble; any universal codeword set will suffice as long as the source messages are ranked. The encoding and decoding processes are thus simplified. Although universal codes can be used instead of Huffman codes as general-purpose static schemes, the more common application is as an adjunct to a dynamic scheme. This type of application is demonstrated in Section 5.

Since the ranking of source messages is the essential parameter in universal coding, we may think of a universal code as representing an enumeration of the source messages or as representing the integers, which provide an enumeration. Elias [1975] defines a sequence of universal coding schemes that maps the set of positive integers onto the set of binary codewords. The first Elias code is one that is simple but not optimal. This code, γ, maps an integer x onto the binary value of x prefaced by ⌊log x⌋ zeros. The binary value of x is expressed in as few bits as possible and therefore begins with a 1, which serves to delimit the prefix. The result is an instantaneously decodable code, since the total length of a codeword is exactly one greater than twice the number of zeros in the prefix; therefore, as soon as the first 1 of a codeword is encountered, its length is known. The code is not a minimum-redundancy code, since the ratio of expected codeword length to entropy goes to 2 as entropy approaches infinity. The second code, δ, maps an integer x to a codeword consisting of γ(⌊log x⌋ + 1) followed by the binary value of x with the leading 1 deleted. The resulting codeword has length ⌊log x⌋ + 2⌊log(1 + ⌊log x⌋)⌋ + 1. This concept can be applied recursively to shorten the codeword lengths, but the benefits decrease rapidly. The code δ is asymptotically optimal, since the limit of the ratio of expected codeword length to entropy is 1.
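These definitions translate directly into code. The following sketch is our hypothetical C rendering of γ and δ, not code from Elias; codewords are built as strings of '0' and '1' characters for readability, where a real coder would pack bits.

#include <stdio.h>
#include <string.h>

/* Number of bits in the binary representation of x >= 1,
   i.e., floor(log2 x) + 1. */
static int bit_length(unsigned x) {
    int n = 0;
    while (x > 0) { n++; x >>= 1; }
    return n;
}

/* Elias gamma: floor(log2 x) zeros, then the binary value of x
   (which begins with a 1 and so delimits the prefix). */
void elias_gamma(unsigned x, char *out) {
    int len = bit_length(x), pos = 0, i;
    for (i = 0; i < len - 1; i++)
        out[pos++] = '0';                         /* the zero prefix */
    for (i = len - 1; i >= 0; i--)
        out[pos++] = ((x >> i) & 1) ? '1' : '0';  /* binary value of x */
    out[pos] = '\0';
}

/* Elias delta: gamma(floor(log2 x) + 1), then the binary value of x
   with its leading 1 deleted. */
void elias_delta(unsigned x, char *out) {
    int len = bit_length(x), i;
    char *p;
    elias_gamma((unsigned)len, out);              /* gamma code of the length */
    p = out + strlen(out);
    for (i = len - 2; i >= 0; i--)
        *p++ = ((x >> i) & 1) ? '1' : '0';        /* x minus its leading 1 */
    *p = '\0';
}

For example, γ(9) = 0001001 (three zeros, then 1001), and δ(9) = 00100001 (γ(4) = 00100, then 001), whose length of 8 agrees with ⌊log 9⌋ + 2⌊log(1 + ⌊log 9⌋)⌋ + 1.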
Figure 9 lists the values of γ and δ for a sampling of the integers. Figure 10 shows an Elias code for string EXAMPLE. The number of bits transmitted using this mapping would be 161, which does not compare well with the 117 bits transmitted by the Huffman code of Figure 3. Huffman coding is optimal under the static mapping model. Even an asymptotically optimal universal code cannot compare with static Huffman coding on a source for which the probabilities of the messages are known.

A second sequence of universal coding schemes, based on the Fibonacci numbers, is defined by Apostolico and Fraenkel [1985]. Although the Fibonacci codes are not asymptotically optimal, they compare well to the Elias codes as long as the number of source messages is not too large. Fibonacci codes have the additional attribute of robustness, which manifests itself by the local containment of errors. This aspect of Fibonacci codes is discussed further in Section 7.

The sequence of Fibonacci codes described by Apostolico and Fraenkel is based on the Fibonacci numbers of order m ≥ 2, where the Fibonacci numbers of order 2 are the standard Fibonacci numbers: 1, 1, 2, 3, 5, 8, 13, .... In general, the Fibonacci numbers of order m are defined by the recurrence: Fibonacci numbers F_{-m+1} through F_0 are equal to 1, and the kth number, for k ≥ 1, is the sum of the preceding m numbers. We describe only the order-2 Fibonacci code; the extension to higher orders is straightforward.

Every nonnegative integer N has precisely one binary representation of the form R(N) = Σ_{i=0..k} d_i F_i (where d_i ∈ {0, 1}, k ≤ N, and the F_i are the order-2 Fibonacci numbers as defined above) such that there are no adjacent ones in the representation. The Fibonacci representations for a small sampling of the integers are shown in Figure 11, using the standard bit sequence from high order to low. The bottom row of the figure gives the values of the bit positions. It is immediately obvious that this Fibonacci representation does not constitute a prefix code. The order-2 Fibonacci code for N is defined as F(N) = D1, where D = d_0 d_1 d_2 ... d_k (the d_i defined above). That is, the Fibonacci representation is reversed and 1 is appended. The Fibonacci code values for a small subset of the integers are given in Figure 11. These binary codewords form a prefix code, since every codeword now terminates in two consecutive ones, which cannot appear anywhere else in a codeword.
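The greedy construction implicit in this representation (take the largest Fibonacci number not exceeding N, then recurse on the remainder) makes the code easy to produce. The sketch below is our illustrative C version for N ≥ 1, not Apostolico and Fraenkel's implementation, again emitting characters rather than packed bits.

#include <stdio.h>

/* Order-2 Fibonacci code F(N) for N >= 1.  The bit-position values are
   F_0 = 1, F_1 = 2, F_2 = 3, F_3 = 5, ...  Greedily choosing the
   largest Fibonacci number <= n yields the unique representation with
   no two adjacent ones; emitting it low-order position first and
   appending a 1 produces a codeword ending in "11".  The caller's
   buffer must hold k + 2 characters plus a terminator. */
void fib2_encode(unsigned n, char *out) {
    unsigned long long fib[64];          /* wide enough for any 32-bit n */
    int used[64] = {0};
    int k = 1, i, pos = 0;

    fib[0] = 1;
    fib[1] = 2;
    while (fib[k] <= n) {                /* extend the table just past n */
        fib[k + 1] = fib[k] + fib[k - 1];
        k++;
    }
    k--;                                 /* fib[k] is the largest value <= n */

    for (i = k; i >= 0; i--)             /* greedy selection of positions */
        if (fib[i] <= n) {
            used[i] = 1;
            n -= (unsigned)fib[i];
        }

    for (i = 0; i <= k; i++)             /* reversed representation ...   */
        out[pos++] = used[i] ? '1' : '0';
    out[pos++] = '1';                    /* ... plus the appended 1       */
    out[pos] = '\0';
}

This yields F(1) = 11, F(2) = 011, F(3) = 0011, and F(4) = 1011; since "11" appears in a codeword only as its terminator, the set is a prefix code, as argued above.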
Fraenkel and Klein [1985] prove that the Fibonacci code of order 2 is universal, with c1 = 2 and c2 = 3. It is not asymptotically optimal, since c1 > 1. Fraenkel and Klein ...
Figure 15. Algorithm FGK processing the ensemble EXAMPLE. (a) Tree after processing "aa bb"; 11 will be transmitted for the next b. (b) After encoding the third b; 101 will be transmitted for the next space; the form of the tree will not change, only the frequency counts will be updated; 100 will be transmitted for the first c. (c) Tree after update following first c.
... whereas the FGK tree has height 4 and external path length 14. Algorithm V transmits codeword 001 for the second c; FGK transmits 1101. This demonstrates the intuition given earlier that algorithm V is better suited for coding the next message. The Vitter tree corresponding to Figure 16, representing the final tree produced in processing EXAMPLE, is different from Figure 16 only in that the internal node of weight 5 is to the right of both leaf nodes of weight 5. Algorithm V transmits 124 bits in processing EXAMPLE, as compared with the 129 bits of algorithm FGK and 117 bits of static Huffman coding. It should be noted that these figures do not include overhead and, as a result, disadvantage the adaptive methods.

Figure 20 illustrates the tree built by Vitter's method for the ensemble of Figure 17. Both Σ l_i and max{l_i} are smaller in the tree of Figure 20. The number of bits transmitted during the processing of the sequence is 47, the same used by algorithm FGK. However, if the transmission continues with d, b, c, f, or an unused letter, the cost of algorithm V will be less than that of algorithm FGK. This again illustrates the benefit of minimizing the external path length (Σ l_i) and the height (max{l_i}).

It should be noted again that the strategy of minimizing external path length and height is optimal under the assumption that any source letter is equally likely to occur next. Other reasonable strategies include one that assumes locality. To take advantage of locality, the ordering of tree nodes with equal weights could be determined on the basis of recency. Another reasonable assumption about adaptive coding is that the weights in the current tree correspond closely to the probabilities associated with the source. This assumption becomes more reasonable as the length of the ensemble increases. Under this assumption, the expected cost of transmitting the next letter is Σ p_i l_i ≈ Σ w_i l_i, so that neither algorithm FGK nor algorithm V has any advantage.

Vitter [1987] proves that the performance of his algorithm is bounded by S - n + 1 from below and S + t - 2n + 1 from above. At worst then, Vitter's adaptive method may transmit one more bit per codeword than the static Huffman method. The improvements made by Vitter do not change the complexity of the algorithm; algorithm V encodes and decodes in O(l) time, as does algorithm FGK.
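The two tree statistics discussed above are simple to compute on any code tree. A small illustrative helper (ours, not part of either algorithm) accumulates both the external path length Σ l_i and the height max{l_i}:

#include <stddef.h>

/* Node of a code tree: internal nodes have two children, leaves none. */
typedef struct Node {
    struct Node *left, *right;   /* both NULL for a leaf */
} Node;

/* Accumulate the sum of leaf depths (external path length, the sum of
   the l_i) and the maximum leaf depth (height, max l_i): the two
   quantities that algorithm V minimizes among all Huffman trees. */
static void leaf_depths(const Node *t, int depth, int *sum, int *max) {
    if (t->left == NULL) {       /* leaf at this depth */
        *sum += depth;
        if (depth > *max)
            *max = depth;
    } else {
        leaf_depths(t->left, depth + 1, sum, max);
        leaf_depths(t->right, depth + 1, sum, max);
    }
}

Called as leaf_depths(root, 0, &sum, &max) with both counters initialized to 0, it would report, for instance, the external path length 14 and height 4 quoted above for the FGK tree.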
5. OTHER ADAPTIVE METHODS

Two more adaptive data compression methods, algorithm BSTW and Lempel-Ziv coding, are discussed in this section. Like the adaptive Huffman coding techniques, these methods do not require a first pass to analyze the characteristics of the source. Thus they provide coding and transmission in real time. However, these schemes diverge from the fundamental Huffman coding approach to a greater degree than the methods discussed in Section 4. Algorithm BSTW is a defined-word ...
n        k     Static    Algorithm V    Algorithm FGK
100      96    83.0      71.1           82.4
500      96    83.0      80.8           83.5
961      97    83.5      82.3           83.7

Figure 23. Simulation results for a small text file [Vitter 1987]: n, file size in 8-bit bytes; k, number of distinct messages.

n        k     Static    Algorithm V    Algorithm FGK
100      32    57.4      56.2           58.9
500      49    61.5      62.2           63.0
1000     57    61.3      61.8           62.4
10000    73    59.8      59.9           60.0
12067    78    59.6      59.8           59.9

Figure 24. Simulation results for Pascal source code [Vitter 1987]: n, file size in bytes; k, number of distinct messages.

... coding. The version of arithmetic coding tested employs single-character adaptive frequencies and is a mildly optimized C implementation. Witten et al. compare the results provided by this version of arithmetic coding with the results achieved by the UNIX compact program (compact is based on algorithm FGK). On three large files that typify data compression applications, compression achieved by arithmetic coding is better than that provided by compact, but only slightly better (average file size is 98% of the compacted size). A file over a three-character alphabet, with very skewed symbol probabilities, is encoded by arithmetic coding in less than 1 bit per character; the resulting file size is 74% of the size of the file generated by compact. Witten et al. also report encoding and decoding times. The encoding time of arithmetic coding is generally half the time required by the adaptive Huffman coding method. Decode time averages 65% of the time required by compact. Only in the case of the skewed file are the time statistics quite different. Arithmetic coding again achieves faster encoding, 67% of the time required by compact. However, compact decodes more quickly, using only 78% of the time of the arithmetic method.

Bentley et al. [1986] use C and Pascal source files, TROFF source files, and a terminal session transcript of several hours for experiments that compare the performance of algorithm BSTW to static Huffman coding. Here the defined words consist of two disjoint classes, sequences of alphanumeric characters and sequences of nonalphanumeric characters. The performance of algorithm BSTW is very close to that of static Huffman coding in all cases. The experiments reported by Bentley et al. are of particular interest in that they incorporate another dimension, the possibility that in the move-to-front scheme one might want to limit the size of the data structure containing the codes to include only the m most recent words, for some m. The tests consider cache sizes of 8, 16, 32, 64, 128, and 256. Although performance tends to increase with cache size, the increase is erratic, with some documents exhibiting nonmonotonicity (performance that increases with cache size to a point and then decreases when cache size is further increased).
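The size-limited move-to-front list that Bentley et al. vary is easy to sketch. The following is our illustrative C version, not their implementation: a hit is reported as a 1-based list position, which a universal code for the integers such as γ or δ could then encode compactly; a miss means the word itself must be transmitted; in both cases the word moves to the front, and when the cache of m words overflows, the least recently seen word is evicted.

#include <string.h>

#define CACHE_SIZE 8            /* m: Bentley et al. test 8 through 256 */
#define MAX_WORD   64

static char cache[CACHE_SIZE][MAX_WORD];
static int  cached = 0;         /* number of words currently in the list */

/* Move-to-front lookup.  Returns the 1-based position on a hit (the
   value a universal integer code would transmit) or 0 on a miss (the
   new word itself must be sent).  Either way the word moves to the
   front; on overflow the last entry, the least recent word, is lost. */
int mtf_encode(const char *word) {
    int i, found = 0;

    for (i = 0; i < cached; i++)
        if (strcmp(cache[i], word) == 0) {
            found = i + 1;
            break;
        }

    if (!found && cached < CACHE_SIZE)
        cached++;               /* room to grow: the list gains a slot */

    /* Slide entries down one slot: everything above the hit, or the
       whole list on a miss (dropping the evicted last word if full). */
    for (i = found ? found - 1 : cached - 1; i > 0; i--)
        strcpy(cache[i], cache[i - 1]);
    strncpy(cache[0], word, MAX_WORD - 1);
    cache[0][MAX_WORD - 1] = '\0';

    return found;
}

Setting CACHE_SIZE to each of 8, 16, 32, 64, 128, and 256 reproduces the dimension those experiments vary.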
Welch [1984] reports simulation results for Lempel-Ziv codes in terms of compression ratios. His definition of compression ratio is the one given in Section 1.3, C = (average message length)/(average codeword length). The ratios reported are 1.8 for English text, 2-6 for COBOL data files, 1.0 for floating-point arrays, 2.1 for formatted scientific data, 2.6 for system log data, 2.3 for source code, and 1.5 for object code. The tests involving English text files showed that long individual documents did not compress better than groups of short documents. This observation is somewhat surprising in that it seems to refute the intuition that redundancy is due at least in part to correlation in content. For purposes of comparison, Welch cites results of Pechura and Rubin. Pechura [1982] achieved a 1.5 compression ratio using static Huffman coding on files of English text. Rubin reports a 2.4 ratio for English text when using a complex technique for choosing the source messages to which Huffman coding is applied [Rubin 1976]. These results provide only a very weak basis for comparison, since the characteristics of the files used by the three authors are unknown. It is very likely that a single algorithm may produce compression ratios ...
FRAENKEL, A. S., AND KLEIN, S. T. 1985. Robust universal complete codes as alternatives to Huffman codes. Tech. Rep. CS85-16, Dept. of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel.

FRAENKEL, A. S., MOR, M., AND PERL, Y. 1983. Is text compression by prefixes and suffixes practical? Acta Inf. 20, 4 (Dec.), 371-375.

GALLAGER, R. G. 1968. Information Theory and Reliable Communication. Wiley, New York.

GALLAGER, R. G. 1978. Variations on a theme by Huffman. IEEE Trans. Inf. Theory 24, 6 (Nov.), 668-674.

GAREY, M. R. 1974. Optimal binary search trees with restricted maximal depth. SIAM J. Comput. 3, 2 (June), 101-110.

GILBERT, E. N. 1971. Codes based on inaccurate source probabilities. IEEE Trans. Inf. Theory 17, 3 (May), 304-314.

GILBERT, E. N., AND MOORE, E. F. 1959. Variable-length binary encodings. Bell Syst. Tech. J. 38, 4 (July), 933-967.

GLASSEY, C. R., AND KARP, R. M. 1976. On the optimality of Huffman trees. SIAM J. Appl. Math. 31, 2 (Sept.), 368-378.

GONZALEZ, R. C., AND WINTZ, P. 1977. Digital Image Processing. Addison-Wesley, Reading, Mass.

HAHN, B. 1974. A new technique for compression and storage of data. Commun. ACM 17, 8 (Aug.), 434-436.

HESTER, J. H., AND HIRSCHBERG, D. S. 1985. Self-organizing linear search. ACM Comput. Surv. 17, 3 (Sept.), 295-311.

HORSPOOL, R. N., AND CORMACK, G. V. 1987. A locally adaptive data compression scheme. Commun. ACM 30, 9 (Sept.), 792-794.

HU, T. C., AND TAN, K. C. 1972. Path length of binary search trees. SIAM J. Appl. Math. 22, 2 (Mar.), 225-234.

HU, T. C., AND TUCKER, A. C. 1971. Optimal computer search trees and variable-length alphabetic codes. SIAM J. Appl. Math. 21, 4 (Dec.), 514-532.

HUFFMAN, D. A. 1952. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 9 (Sept.), 1098-1101.

INGELS, F. M. 1971. Information and Coding Theory. Intext, Scranton, Penn.

ITAI, A. 1976. Optimal alphabetic trees. SIAM J. Comput. 5, 1 (Mar.), 9-18.

KARP, R. M. 1961. Minimum redundancy coding for the discrete noiseless channel. IRE Trans. Inf. Theory 7, 1 (Jan.), 27-38.

KNUTH, D. E. 1971. Optimum binary search trees. Acta Inf. 1, 1 (Jan.), 14-25.

KNUTH, D. E. 1985. Dynamic Huffman coding. J. Algorithms 6, 2 (June), 163-180.

KRAUSE, R. M. 1962. Channels which transmit letters of unequal duration. Inf. Control 5, 1 (Mar.), 13-24.

LAESER, R. P., MCLAUGHLIN, W. I., AND WOLFF, D. M. 1986. Engineering Voyager 2's encounter with Uranus. Sci. Am. 255, 5 (Nov.), 36-45.

LANGDON, G. G., AND RISSANEN, J. J. 1983. A double-adaptive file compression algorithm. IEEE Trans. Commun. 31, 11 (Nov.), 1253-1255.

LLEWELLYN, J. A. 1987. Data compression for a source with Markov characteristics. Comput. J. 30, 2, 149-156.

MCINTYRE, D. R., AND PECHURA, M. A. 1985. Data compression using static Huffman code-decode tables. Commun. ACM 28, 6 (June), 612-616.

MEHLHORN, K. 1980. An efficient algorithm for constructing nearly optimal prefix codes. IEEE Trans. Inf. Theory 26, 5 (Sept.), 513-517.

NEUMANN, P. G. 1962. Efficient error-limiting variable-length codes. IRE Trans. Inf. Theory 8, 4 (July), 292-304.

PARKER, D. S. 1980. Conditions for the optimality of the Huffman algorithm. SIAM J. Comput. 9, 3 (Aug.), 470-489.

PASCO, R. 1976. Source coding algorithms for fast data compression. Ph.D. dissertation, Dept. of Electrical Engineering, Stanford Univ., Stanford, Calif.

PECHURA, M. 1982. File archival techniques using data compression. Commun. ACM 25, 9 (Sept.), 605-609.

PERL, Y., GAREY, M. R., AND EVEN, S. 1975. Efficient generation of optimal prefix codes: Equiprobable words using unequal cost letters. J. ACM 22, 2 (Apr.), 202-214.

PKARC FAST! 1987. PKARC FAST! File Archival Utility, Version 3.5. PKWARE, Inc., Glendale, Wis.

REGHBATI, H. K. 1981. An overview of data compression techniques. Computer 14, 4 (Apr.), 71-75.

RISSANEN, J. J. 1976. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20 (May), 198-203.

RISSANEN, J. J. 1983. A universal data compression system. IEEE Trans. Inf. Theory 29, 5 (Sept.), 656-664.

RODEH, M., PRATT, V. R., AND EVEN, S. 1981. Linear algorithm for data compression via string matching. J. ACM 28, 1 (Jan.), 16-24.

RUBIN, F. 1976. Experiments in text file compression. Commun. ACM 19, 11 (Nov.), 617-623.

RUBIN, F. 1979. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inf. Theory 25, 6 (Nov.), 672-675.

RUDNER, B. 1971. Construction of minimum-redundancy codes with optimal synchronizing property. IEEE Trans. Inf. Theory 17, 4 (July), 478-487.

RUTH, S. S., AND KREUTZER, P. J. 1972. Data compression for large business files. Datamation 18, 9 (Sept.), 62-66.