
University of Saskatchewan

CME 392 Computer Engineering Laboratory

Data Compression Algorithms


Objectives: This lab introduces the basic concepts of, and algorithms for, data
compression. In this lab we examine two data compression algorithms for use on ASCII
text files, together with their combination:
1. Run-length encoding: a simple and straightforward compression technique,
2. Huffman encoding: a moderately sophisticated compression algorithm, and
3. Combinations of the above two techniques.
In addition, the lab will ensure that the student has an introductory familiarity with
several basic computer engineering tools and techniques:
1. a generic Unix working environment,
2. Unix shell scripts, and
3. the examination of files at the bit level.
Resources: In order to carry out this lab, you will require the following resources:
1. A computer capable of compiling C code and running generic Unix shells. Both
Linux on a PC and MacOS (with the Unix shell available through the Terminal
application) have been used to run the lab’s software. The computers provided in
the Computer Engineering Lab in 2C61 are ideal for carrying out the work
associated with this lab.
2. The source of the compression algorithm to be used in the lab: “compress.c”, which
can be found on the webpage associated with this lab.
3. The source of the decompression algorithm to be used in the lab: “decompress.c”,
which can be found on the webpage associated with this lab.
4. The collection of sample text files to be used in the lab: “text1” through “text6”,
which can be found on the webpage associated with this lab.
5. The source of the shell script to be used in the lab: “test”, which can be found on the
webpage associated with this lab.
Preparation:
1. Review the theory of data compression described in the Appendix.
2. Review Unix shell and commands.
A Compression/Decompression Implementation: Happily, you are not expected to
produce an implementation of a compression/decompression algorithm pair for this lab –
instead, one is being provided for you. You will be asked to experiment with this
implementation, and make or describe minor alterations to it.
The implementation consists of two modules of C code:
1. compress.c, which implements run-length compression and Huffman encoding
compression.
2. decompress.c, which implements corresponding decompression algorithms for run-
length and Huffman.
In addition, you are provided with a Unix shell script:
3. test, which drives compress and decompress to provide run-length, Huffman,
combined Huffman/run-length, and “best” compression algorithms.
Compress and decompress make use of the Unix features standard input, standard output,
and diagnostic output (standard error). The diagnostic output may be unfamiliar. It is a separate output
stream that is normally directed to the user’s terminal. When the standard output is
redirected (e.g. “process > output”), the diagnostic output is not normally redirected, but
continues to be output on the user’s terminal. This implementation uses the diagnostic
output to record log messages which give information about the various compression
processes.
Compress:
1. reads simple instructions from its standard input,
2. obtains the name of the file to be compressed from its standard input commands,
then opens and reads the file directly,
3. produces its compressed output on its standard output, and
4. produces informative/statistical messages on its diagnostic output.
Decompress:
1. reads the file to be decompressed from its standard input, and
2. produces its decompressed output on its standard output.
The standard input instructions to compress control all the options in the
compress/decompress pair. These instructions form a simple byte stream
composed of the following elements, which are interpreted and obeyed left-to-right:
1. “F(<file>)”: specifies the source file to be compressed,
2. “R”: commands the compressor to read <file> and produce a run-length compressed
version, stored in an internal buffer,
3. “L”: commands the compressor to emit the run-length compressed text to its
standard output,
4. “H”: commands the compressor to analyze the source text found in <file> and
produce a Huffman coding tree and table,
5. “E”: commands the compressor to use the Huffman encoding table to compress the
text found in <file>,
6. “C”: commands the compressor to emit both the compressed Huffman encoding
scheme and the Huffman-compressed text found in <file>,
7. “T”: commands the compressor to emit the original text with a prepended length
indicator (for test purposes), and
8. “X”: commands the compressor to terminate execution.
The following example of standard input text commands to compress causes the
implementation to read file “junk”, derive the Huffman encoding data structure for the text
found in “junk”, produce the Huffman compressed encoding of the text found in “junk”, emit
the coding scheme and text to the standard output, and finally exit. The command text is:
“F(junk)HECX”. During the various processes, compress will write log messages to the
diagnostic output for each action (other than “X”). You will find other examples in the
provided shell script “test”. Notice that the text file can come from outside the current
working directory by specifying a full path to the file
“F(/users/joe/compression/textfile)RLX”, or by specifying a path relative to the
current directory “F(../elsewhere/textfile)HECX”.
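To make the command language concrete, here is a minimal sketch in C of how such a
left-to-right interpreter could be structured. This is an illustration only (the actual
compress.c will differ in details); the action routines here are hypothetical stubs that
merely write log messages to the diagnostic output:

    #include <stdio.h>

    int main(void)
    {
        char fname[256] = "";
        int c;
        while ((c = getchar()) != EOF) {
            switch (c) {
            case 'F':                        /* F(<file>): record the source file name */
                if (getchar() != '(' || scanf("%255[^)]", fname) != 1)
                    return 1;
                getchar();                   /* consume the closing ')' */
                fprintf(stderr, "log: source file is %s\n", fname);
                break;
            case 'R': fprintf(stderr, "log: run-length compress %s\n", fname);  break;
            case 'L': fprintf(stderr, "log: emit run-length output\n");         break;
            case 'H': fprintf(stderr, "log: build Huffman tree and table\n");   break;
            case 'E': fprintf(stderr, "log: Huffman-encode %s\n", fname);       break;
            case 'C': fprintf(stderr, "log: emit coding scheme and text\n");    break;
            case 'T': fprintf(stderr, "log: emit raw text with length\n");      break;
            case 'X': return 0;              /* terminate execution */
            default:  break;                 /* ignore newlines and other bytes */
            }
        }
        return 0;
    }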
Compress and decompress are directly connected only by the standard output from
compress which becomes the standard input to decompress. This connection is most
naturally made by directly piping the compress standard output to the decompress
standard input in a Unix shell command. Because this piping connection is their only
communications path, it must contain both the compressed file and an indication of how the
file was compressed (run-length, Huffman, or Huffman/run-length). The implementations
University of Saskatchewan 11–3
CME 392 Computer Engineering Laboratory

share a simple convention: a leading character code that identifies the compression type,
plus a length indication. The coding for each compression type is:
1. Run-length: “L(<length>)<compressed text>”: The format is indicated by the text
between the quotation marks. “L” indicates run-length compression; <length> is an
integer in ASCII characters which specifies the length in characters of the compressed
text; <compressed text>, of course, is the run-length encoded text.
2. Huffman: “C<length><coding><encoded text>”: “C” indicates Huffman coding, <length>
is an integer in ASCII which specifies the length of <coding> in bits; <coding> is the
compacted version of the Huffman encoding scheme (rounded up to a whole number of
bytes); and <encoded text> is the Huffman encoding as a bit string (rounded up to a
whole number of bytes). <encoded text> does not require a length field because of the
self-terminating TM (termination mark) feature, described in the Appendix.
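Correspondingly, the decompressor can dispatch on that leading character code. The
following is a hypothetical sketch of the dispatch logic only, not the provided
decompress.c:

    #include <stdio.h>

    int main(void)
    {
        int type = getchar();                /* first byte selects the algorithm */
        switch (type) {
        case 'L':                            /* "L(<length>)<compressed text>" */
            fprintf(stderr, "log: run-length decompression\n");
            /* read "(<length>)", then <length> bytes of run-length codes */
            break;
        case 'C':                            /* "C<length><coding><encoded text>" */
            fprintf(stderr, "log: Huffman decompression\n");
            /* read <length>, rebuild the decoding tree, decode bits until TM */
            break;
        default:
            fprintf(stderr, "unknown compression type\n");
            return 1;
        }
        return 0;
    }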
Test script
The “test” shell script is capable of running any of the supplied compression/decompression
algorithms on any text file. The first argument of “test” is the file to be compressed (e.g.,
“text1”). The second argument of “test” indicates which compression algorithm is to be
used, as described in the following table:
Argument        Description
raw             Null (no compression)
runlen          Run-length
huff            Huffman
huff-runlen     Combined Huffman/Run-length
best            Best of the above
For example, to run “test” on “text1” with the Huffman algorithm, type “./test text1
huff”. The leading “./” is required to indicate that the shell script “test” is located in your
current (working) directory.
Procedure:
This section describes the specific steps you are to take in carrying out this lab. All steps
require a corresponding entry in your logbook. Be complete in your logbook entries:
include the output produced by all of the run steps (either by hand-written entries or by
printing and including the Unix shell outputs); clearly write all required analyses and code
modifications. Your mark will be based on the contents of your logbook and the
evaluations of your in-lab work by the person(s) running the lab session.
Part I: Prepare your working environment
1. Obtain a suitable computer, preferably one in 2C61. If you use some computer other
than those provided in 2C61, you may have to alter the shell script test. The ‘awk’
command which selects two fields out of the results of ‘ls’ may have to be changed to suit
your Unix implementation. In particular, the line “| awk '{print $5, $9}' \” may
have to be changed to suit different subfields for the size and the name of each file.
2. Create a working directory within which to carry out the lab procedures (e.g., “mkdir
dcLabWork”).
3. Download copies of the software, the text files, and the test shell script described in
Resources above.
4. Compile the C sources. Using the Gnu C Compiler (gcc), compile both the compression
and the decompression implementations. Use the “-o <filename>” option to specify the
appropriate name for the resultant executable module. There should be no errors or
warnings.
gcc compress.c -o compress
gcc decompress.c -o decompress

Part II: Compression/Decompression


Our compression/decompression experiments will begin with simple command line
instructions to the Unix shell.
1. Run the compressor with the null algorithm on the simplest sample file to get your feet
on the ground:
echo "F(text1)TX" | ./compress
Type the characters into your shell while you are in your working directory. Notice
that “echo” directs its argument to its standard output, which is then piped (“|”) to the
compress process. Compress is referred to as “./compress” in order that the Unix shell can
find it in the current directory (“./”) without that directory being included in the Unix “path”
parameter. Identify the standard output results and the diagnostic output results.
Describe what you find in your lab report. If you do not understand the Unix command
given above, look it up online, and break it into its component parts for separate testing.
2. Next, try a real compression algorithm. Use the run-length algorithm on the same
input file:
echo "F(text1)RLX" | ./compress
Examine the various outputs. Does the standard output (compressed text) make sense?
No, it doesn’t: the repeated-sequence codes within the compressed text contain
unprintable binary bytes (the mark and count bytes), so they will not print properly.
3. Re-run the above experiment, but this time capture the standard output from compress
in a Unix file:
echo "F(text1)RLX" | ./compress > result
This creates the Unix file “result” in your working directory (and the standard output
previously written to your Unix shell terminal disappears, as it has been captured by
redirection (“>”) to the file “result”). You now need to examine this file with a tool that can
show you the exact bit patterns in the file (including unprintable characters). Use the Unix
command “od”. First, look it up within Unix (“man od”), or via Google. Once you
understand its (simple) semantics, use it with the following command:
od -c result
which will show the file as printable characters and octal codes for unprintable characters.
Finally, combine the compress run and the “od” display of results with the following Unix
command:
echo "F(text1)RLX" | ./compress | od -c
Explain the output of step (3).
4. Test the decompressor in two ways, with the Unix shell commands:
./decompress < result
and
echo "F(text1)RLX" | ./compress | ./decompress
Verify that the compress/decompress pair has worked correctly. Explain why the two Unix
commands are or should be equivalent in result.
5. Now, experiment with this implementation by trying the Huffman encoding algorithm
on the text file “text1”.
6. Examine the resultant logged information produced in steps (4) and (5). Identify the
sizes of all the compressed files. Be sure you have the correct file lengths. Which algorithm
is best for this file? Which is worst?
7. Use any method you prefer (e.g. editor, cat, or printout) to examine the contents of all of
the sample text files (text1 through text6). For each file, predict which algorithm you
expect to perform the best (and the worst). Give reasons for your predictions.
8. Using the shell script “test” run all compression algorithms (runlen, huff, huff-runlen,
and best) on all text files provided.
9. Enter the file size results of the previous step into a spreadsheet. Have the spreadsheet
compute the compression quality for each run.
10. Compare your results gathered in step (9) to your predictions made in step (7). Explain
any discrepancies. Other than “best”, which compression algorithm do you consider to be
the best overall compressor of text files? Justify your answer.
11. Notice that compress emits an additional file named “huffmanLog” whenever the
Huffman encoding algorithm is run. This log file contains the assigned bit strings for each
character in the source file. Run the Huffman algorithm on “text6”; print or view the
resultant “huffmanLog”. Now use that log to encode your last name. Clearly indicate
which of the resultant bits correspond to each character of your last name. How many bits
are required to encode your name (ignore the encoding of the Huffman table itself)? How
many bits per character?

Part III: Unix shell script


We turn our attention to the “test” shell script.
1. Using any method you prefer, read the “test” shell script.
2. Explain the overall structure of “test”. This need not be in great detail, but feel free to
use Unix man pages or Google to obtain more detail and more understanding.
3. The trickiest part of “test” is the selection of the best result for the “best” option. Study
and explain the operation of the following fragment of “test”, which is used to identify the
name of the smallest of the various compressed versions of the original file. It will probably
help to break this fragment into yet smaller fragments and test them independently in the
Unix shell.
best=`ls -l $1.*.tmp \
| awk '{print $5, $9}' \
| sort -n \
| awk 'NR == 1 {split($0,f); print(f[2])}'`;
4. Modify the “test” script to include Gnu zip (gzip) as one of the compression methods.
5. A “data” text file (available on the class website) contains a series of data values. Each
value is represented by a 5-digit binary number. Run all compression algorithms (runlen,
huff, huff-runlen, gzip, and best) on the “data” text file. Enter the file size results of the
previous step into a spreadsheet. Have the spreadsheet compute the compression quality
for each run.
6. What pre-processing can be done to further improve the compression ratio? Write a
Unix script to implement a pre-processing method.
7. Run all compression algorithms (runlen, huff, huff-runlen, gzip, and best) on the pre-
processed “data” text file. Enter the file size results of the previous step into the
spreadsheet.
8. Compare the compression results with and without the pre-processing.
Appendix
Data Compression:
Data compression algorithms reduce the space (i.e., the number of bits) required to
represent some original (uncompressed, or raw) data structure. There are four concepts of
‘data’ involved in data compression:
1. An ideal concept of the information (e.g., a list of characters). This concept of ideal data
is independent of any particular data structure.
2. Some original explicit representation of the ideal information (e.g., an array of ASCII
characters, representing an ideal list of characters).
3. Some compressed explicit representation of the ideal information (e.g., the data
structures of Huffman encoding, which will be presented later in this document).
4. A reconstructed explicit representation, produced by decompressing some compressed
representation.
Data compression algorithms always come in pairs; the two members of each pair are
inverse functions:
1. A compression algorithm, compress, which accepts some original form, D, of the data
and produces a compressed representation, C, of the data.
2. A decompression algorithm, decompress, which accepts the compressed form, C, and
produces the reconstructed form of the data.
Data compression algorithms can be divided into two categories:
1. Lossless data compression algorithms, which always reconstruct the original form
exactly when the decompression algorithm is applied to a properly compressed form;
that is, the decompression of the compressed file must always return exactly the
original file.
2. Lossy data compression algorithms violate the key property of lossless compression.
With lossy compression, the original form is not necessarily returned. Note that lossy
compression may sometimes return the original form, but will not do so in general.
Let’s assume we can always measure the size of any data representation in bits (or bytes,
for some applications). Size is thus a function from any representation to a number of bits.
The quality of a compression/decompression algorithm can be measured by the ratio of the
size of the compressed form to the size of the original form: quality = size(C) / size(D),
where smaller values indicate better compression. In
general, we find that lossy algorithms have better compression quality than lossless
algorithms. This should not be surprising, as more bits are generally required to represent
a data structure sufficiently accurately such that the original form can be exactly
reconstructed.
This definition of compression quality is rather weak: it is defined separately for every
particular original file, D. It turns out that nearly every compression/decompression
algorithm pair will produce a different quality ratio for different original files. This makes
the evaluation of the quality of a compression/decompression algorithm pair imprecise and
difficult. In general, quality is evaluated for some range of possible file types (e.g., how well
does a compression/decompression pair perform on the text files which encode the chapters
of this lab manual?).
Here’s a bit of bad news: data compression may or may not actually reduce the number of
bits required to represent the ideal information. Sometimes, the nature of the original
representation leads the compression algorithm to produce a larger ‘compressed’
representation. This lab will show some examples of this unfortunate behaviour.
Much research and development effort has been applied to data compression. As a result,
there are many algorithms available for use. This introductory lab will restrict its focus to
a few lossless compression algorithms which are appropriate for text files. However, the
student should be aware of at least the following important areas of data compression:
1. Lossless data compression techniques are used to reduce the disk space occupied by files
on computer disks. These techniques are, in principle, similar to the techniques we will
examine in this lab. “StuffIt” is a well-known commercial product which is used for
computer file compression.
2. Lossy compression techniques are used to reduce the size of audio/music files. MP3 is
one well-known audio compression algorithm.
3. Lossy compression techniques are used to reduce the size of digital photos. These
algorithms generally take advantage of the spatial redundancy found in most images –
e.g., the fact that the neighbouring pixels representing a human face are generally
nearly the same colour. JPEG is the dominant digital photo compression algorithm.
4. Lossy compression techniques are used to reduce the size of video sequences. These
algorithms take advantage of the temporal redundancy found in most video sequences –
e.g., the fact that individual pixels generally remain the same as the frames of a video
go by at some tens of frames per second. The various MPEG algorithms dominate this
area.
5. Lossless compression algorithms are used to communicate with deep space spacecraft.
In this application, the signal-to-noise ratio is very low, the radio power budget is low,
and the resultant channel is very slow. Efficient data compression is required to
maximize the use of these channels.
The next sections introduce the data compression techniques used in this lab.
Run-length Encoding
It is common for text files to contain long sequences of repeated characters. Examples
might include white space represented by sequences of blanks or tabs, or formatting
patterns such as sequences of dashes used to set off sections of a document. When such
sequences are sufficiently long, it is advantageous to replace such sequences with special
codes which represent the repeated character and its number of occurrences.
In its simplest form, a run-length compressed file uses the original ASCII codes for non-
repeated characters, but repeated characters are replaced by some special coding. The
special coding requires some means of distinguishing the special coding from the individual
ASCII characters. The simplest way to do this is to reserve one of the 256 (2^8) possible byte
codes as a special repeated sequence marker. When one of these marker characters is
found in the compressed text stream, the decompression algorithm knows that the two
following characters have special meaning: one is the character to be repeated, the other
encodes the number of times the repeated character is to be repeated.
In illustrating this concept below, we are limited in that we can only use ASCII characters
to represent non-ASCII characters and concepts. Thus, we choose to denote repeated
character codings as a conventional mathematical tuple: <mark, character, length>. Let it be
understood that this notation represents the three bytes of a repeated sequence coding: the
mark character, the repeated character, and the representation of the length in one byte.
Using this simple notation, suppose the original data is:
-------------------------------- Footnotes --------------------------------
(Notice the single blanks before and after “Footnotes”.) A simple run-length encoding would
produce:
<mark, -, 32> F<mark, o, 2>tnotes <mark, -, 32>
The original data requires 75 ASCII bytes. The compressed data requires only 18 bytes.
Thus we have a compression quality of 18/75 = 0.240, which is very attractive.
Notice, however, that the compression of “oo” in “Footnotes” actually costs more than the
original. The repeated sequence code costs us three bytes, which is more than the two bytes
of this short repeated sequence. It is thus sensible to not use the repeated sequence code
for sequences of three or fewer characters. With this modified rule, the compressed form
becomes:
<mark, -, 32> Footnotes <mark, -, 32>
which slightly improves our compression quality to 17/75 ≈ 0.227.
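To make the scheme concrete, the following is a compact C sketch of a run-length encoder
using this modified rule (only runs of four or more are encoded). The choice of 0xFF as
the mark byte and the one-byte run limit of 255 are illustrative assumptions; the lab’s
compress.c may make different choices, and a complete encoder would also have to
escape any literal mark bytes appearing in the input:

    #include <stdio.h>

    #define MARK 0xFF                        /* assumed reserved mark byte */

    int main(void)
    {
        int c = getchar();
        while (c != EOF) {
            int run = 1, next;
            while ((next = getchar()) == c && run < 255)
                run++;                       /* count the run, up to one byte's worth */
            if (run >= 4) {                  /* runs of <= 3 cost less left alone */
                putchar(MARK);               /* emit <mark, character, length> */
                putchar(c);
                putchar(run);
            } else {
                for (int i = 0; i < run; i++)
                    putchar(c);              /* short runs are copied verbatim */
            }
            c = next;
        }
        return 0;
    }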
Run-length encoding is a simple technique which may or may not produce useful results,
depending on the nature of the file to be compressed. When the original text contains no
repeated sequences longer than three, of course run-length compression will fail to provide
any gain whatsoever. The next technique we shall examine – Huffman coding – is much
more sophisticated, interesting, and generally useful.
Huffman Encoding
Consider a file of ASCII characters, for instance the file that generates this document.
Each ASCII character, c, occurs some definite number of times, n_c, in that file, where the
total number of characters is

    N = Σ n_i   (the sum taken over all ASCII characters i).

When the file is represented in eight-bit ASCII, N×8 bits are required.
However, what if we vary the number of bits required to represent each character? Some
characters can be assigned fewer than eight bits; other characters may be forced to have
representations of more than eight bits. If we consistently assign the shorter representations
to the more common characters, and the longer representations to the less common
characters, we may end up with an average (over all represented characters) representation
of less than eight bits. This would allow the file to be represented in fewer than N×8 bits.
Let’s examine in more detail how this might work.
First of all, we must know the frequency of occurrence for each character. This is simple:
f_c = n_c / N. Next, we sort the characters by ascending f_c. At this point, let’s introduce a simple
example to help elucidate the concepts. Assume our original text is “Boo!”. In tabular form
we have the following table with the counts, the frequencies, and the proper sorted order:
c    n_c    f_c
B     1     0.25
!     1     0.25
o     2     0.50
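Counting the n_c and computing the f_c is simple. As a sketch in C (hypothetical, and
omitting the subsequent sort), reading the text from standard input:

    #include <stdio.h>

    int main(void)
    {
        long count[256] = {0}, total = 0;
        int c;
        while ((c = getchar()) != EOF) {     /* tally n_c for every byte read */
            count[c]++;
            total++;
        }
        for (int i = 0; i < 256; i++)
            if (count[i] > 0)                /* report n_c and f_c = n_c / N */
                printf("%3d: n=%ld f=%.4f\n", i, count[i],
                       (double)count[i] / total);
        return 0;
    }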
We would like to assign the shortest representation to “o”, and any longer representations
to “B” and “!” in order to achieve our goal of a minimal representation cost. But how do we
go about determining minimal length codes which allow lossless reconstruction by the
decompression algorithm? Binary trees provide the answers. Consider a family of binary
trees such that each branch node has two children (i.e., it is a ‘binary’ tree), each branch is
labeled by a zero or a one, and each leaf contains only a single ASCII character.

One such tree for our example (redrawn here in text form, in place of the original figure),
and the codes it generates, are:

          .
        0/ \1
        .   o
      0/ \1
      B   !

c    code
B    00
!    01
o    1

Each path from the tree’s root to a leaf traverses a sequence of branches. The zero/one
labels of these sequences of branches are taken as the code for the character found at the
end of the sequence. If we arrange such a tree to contain leaves for all relevant characters,
then we have a set of (possibly) varying length binary codes for all of the characters. If we
traverse every path from the root of the above tree to the leaves, we can generate the above
table of bit sequences used to represent each of the coded characters.
The above table is used to encode our original text (“Boo!”) to a binary bit string. The “B”
produces the code 00 (according to the first row of the coding table), which starts the overall
code for the text; the first “o” is coded as 1, which is appended to the overall code to produce
001; the second “o” adds another 1 to produce 0011; the final “!” is coded as 01, which adds
to the accumulated code to produce 001101, which is the complete code for “Boo!”. Thus we
require six bits to represent “Boo!”, given this particular coding scheme.
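Given the coding table, encoding is a lookup-and-append operation per character. In the
sketch below (hypothetical, not taken from compress.c), the codes are kept as ASCII
strings of '0' and '1' for readability; a real encoder would pack the bits into bytes:

    #include <stdio.h>

    int main(void)
    {
        const char *code[256] = {0};
        code['B'] = "00"; code['!'] = "01"; code['o'] = "1";   /* from the table */
        for (const char *p = "Boo!"; *p; p++)
            fputs(code[(unsigned char)*p], stdout);    /* assumes every input
                                                          character has a code;
                                                          prints 001101 */
        putchar('\n');
        return 0;
    }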
Decoding is equally simple. In this case we start with the binary string 001101 and use the
binary tree above to produce the original text, “Boo!”. We begin with a pointer to the first
code bit (the leftmost bit of 001101). We also start with another pointer to the root of
the tree. We then execute the following algorithm until the code string is exhausted:
a. If the tree pointer points to a leaf node, output the associated character. Reset the
tree pointer to the root of the tree and continue with step (a).
b. Otherwise, move the tree pointer to either the zero or the one sub-tree of the current
branch node, depending on the value of the current code bit, which is consumed.
Following this algorithm, the first two 0’s get us to “B”, the next 1 gets us to “o”, the next 1
gets us to another “o”, and the final 01 sequence gets us to “!”. The accumulated output
characters constitute our original text “Boo!”.
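In code, the decoding loop can be expressed as a walk down the tree, one code bit at a
time. The node type and the hard-wired “Boo!” tree below are hypothetical illustrations,
not the data structures of decompress.c:

    #include <stdio.h>

    struct node { int ch; struct node *kid[2]; };   /* kid[0]/kid[1]: 0/1 sub-trees */

    static void decode(const char *bits, const struct node *root)
    {
        const struct node *p = root;
        for (; *bits; bits++) {
            p = p->kid[*bits - '0'];        /* consume one code bit, move down */
            if (p->kid[0] == NULL) {        /* reached a leaf: output, restart */
                putchar(p->ch);
                p = root;
            }
        }
        putchar('\n');
    }

    int main(void)
    {
        /* hard-wired tree giving B=00, !=01, o=1 */
        struct node B = {'B'}, bang = {'!'}, o = {'o'};
        struct node inner = {0, {&B, &bang}}, root = {0, {&inner, &o}};
        decode("001101", &root);            /* prints Boo! */
        return 0;
    }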
This is nice, but how do we build the decoding tree, which gives us the encoding table, and
these lovely compression results? The key idea is to place the less frequently occurring
characters (“B” and “!” in this text) further down the tree, and the more frequently
occurring characters nearer the root of the tree.
Each tree is as described above, except each branch and each leaf includes a place to record
the frequency with which the contents of that sub-tree occur in the text. Notice that the frequency
associated with each branch is the sum of the frequencies associated with its two sub-trees.
The formal definition of binary trees goes something like this: “a tree is either a single leaf,
or a branch which connects two (sub-)trees.” From this definition, we see that even a single
leaf is considered a tree. A set of trees is called a “forest”. We begin our Huffman encoding
process with a forest of single-leaf trees which represent all the characters found in the
original text document. The tree-building process combines the trees in the forest two at a
time to form a new tree, thereby reducing the number of the trees in the forest by one each
step. This process continues until the forest consists of one remaining tree; this one tree
contains all of the leaves originally found in the forest, with enough branches to form that
many leaves into a tree. (Where there are X leaves, X-1 branches are required. Why?) For
our simple example (“Boo!”), the following forest of three leaves is our starting point:

    B (0.25)     ! (0.25)     o (0.50)

The algorithm has only one step, which is repeated until the forest has been reduced to a
single tree. Sort the trees by their frequencies. Take the two smallest-frequency trees
from the forest, combine them with a new branch (which records their summed
frequencies), and place this new tree in the forest. In our example, the trees are already
sorted and the smallest two leaves are the leaves containing “B” and “!”. Combine them to
form a new tree, resulting in the following forest (left). In this simple example, the two
remaining trees are necessarily the smallest, and must be combined to produce the final
single tree in the forest (right).

    Forest after one combination (left); final single tree (right):

      (0.50)      o (0.50)          (1.00)
     0/    \1                      0/    \1
     B      !                   (0.50)    o (0.50)
   (0.25) (0.25)               0/    \1
                               B      !
                             (0.25) (0.25)

Notice that we did not specify the order in which the two sub-trees should be attached to a
newly constructed branch. It does not matter: either order yields codes of the same lengths.
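A sketch of this forest-combining loop in C follows. It uses a simple linear scan to find
the two smallest-frequency trees at each step; this is a hypothetical illustration, not the
code of compress.c:

    #include <stdlib.h>

    /* A leaf has kid[0] == kid[1] == NULL; a branch records the summed frequency. */
    struct tree { double freq; int ch; struct tree *kid[2]; };

    static struct tree *build(struct tree **forest, int n)
    {
        while (n > 1) {
            int a = 0, b = 1;                /* indices of the two smallest trees */
            if (forest[b]->freq < forest[a]->freq) { a = 1; b = 0; }
            for (int i = 2; i < n; i++) {
                if (forest[i]->freq < forest[a]->freq)      { b = a; a = i; }
                else if (forest[i]->freq < forest[b]->freq) { b = i; }
            }
            struct tree *br = malloc(sizeof *br);        /* new branch node */
            br->freq = forest[a]->freq + forest[b]->freq;
            br->ch = 0;
            br->kid[0] = forest[a];
            br->kid[1] = forest[b];
            int lo = a < b ? a : b, hi = a < b ? b : a;
            forest[lo] = br;                 /* put the new tree back */
            forest[hi] = forest[--n];        /* forest shrinks by one */
        }
        return forest[0];
    }

For the “Boo!” forest, calling build on the three leaves {B: 0.25, !: 0.25, o: 0.50} first
joins B and !, then joins that branch with o, yielding the single tree shown above.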
Any properly constructed tree will have the following properties:
1. all characters in the document will exist as leaves in the tree,
2. if f_c > f_c', then the leaf for c will be no lower in the tree than the leaf for c', and
3. the frequency of the tree will be 1.0.
Because the structure is a 0/1-labeled binary tree, there will be a uniquely labeled path to
each leaf, which allows a unique left-to-right interpretation of any bit stream as a set of
compressed characters.
Given such a tree and coding scheme, we can always determine the number of bits required
to encode the original text. Where code(c) is the binary code sequence found from the tree
or table for each character c, and len(code(c)) is the bit length of that code, and freq(c) is the
frequency associated with each character, we know that the average number of bits
required per character is:
    Σ len(code(c)) × freq(c),

where the sum is taken over all N_chars distinct characters c in the text.

In our running example, this evaluates to 2×0.25 + 2×0.25 + 1×0.5 = 1.5 bits/character.
Given four characters, we thus expect to require 4×1.5 = 6 bits for the encoding, which is
what we found above. Notice that the compression quality of this example is
6/(4×8)=0.1875.
Multiple Compressions
Run-length encoding and Huffman encoding are quite different, and exploit quite different
properties of the text files to be compressed. Run-length encoding is very efficient where
there are long runs; Huffman encoding is fairly efficient, but less spectacularly so, on files
with less structure. But suppose a file has some nicely encode-able runs, but also a great
amount of nearly random text. Should it not be possible to use both algorithms to obtain
better compression results for such a file? Happily, the answer is yes, and it is quite
straightforward to arrange.
Suppose we have some text file. We proceed in the following steps: (1) apply run-length
encoding, and (2) apply Huffman encoding to this result. To decompress, simply (1) apply
Huffman decoding, and (2) apply run-length decoding. As both schemes are lossless, our
original text is exactly reconstructed.
“Best” Compression
We have now seen three compression schemes: Run-length compression, Huffman
compression, and Combined Huffman/Run-length compression. For any given file, these
schemes will perform with varying success. Instead of sticking with one of these schemes, why
don’t we always pick the best performer for any particular file, and use that encoding? If
we include a field in the coded file to indicate which compression algorithm was used to
construct it, then a multi-algorithm decompressor can always use the appropriate
algorithm to reconstruct the original text. Thus we can always have the most useful of our
compression algorithms at the cost of running all of them and adding a single byte to the
compressed file.
More Issues Regarding Huffman Compression
We saw the construction of a Huffman coding scheme for the original text string “Boo!”.
The constructed coding scheme clearly depends on the source text. If we are to use this
coding scheme, it is necessary to store (or transmit) the scheme with the encoded text. The
representation of this coding scheme will require more bits, which must be accounted for in
computing the quality of the scheme, for the coded bits are useless without the coding
scheme.
An alternative approach is to generate one Huffman-based compression scheme to be used
for a wide variety of text files. The scheme must include all characters used in all text files.
Also, the scheme cannot be ideal for all of the possible text files, but can only be ideal for
one particular assumed set of character frequencies. This approach is a reasonable
solution for some applications, but will not be pursued further in this introductory lab note.
Instead, we will consider the costs of having the compression process produce a
representation of the decoding tree, which is then stored or transmitted with the encoded
text. The decompressing process must first reconstruct the decoding tree, and then use it
to reconstruct the compressed text.
There are a number of ways to encode the Huffman decoding/decompression tree. In the
interest of brevity in this lab note, we describe only one such approach. Given a possible
256 characters to be reconstructed, we produce a list of bit codings for each such character.
If the character is missing (e.g., the letter “Q” is never used in our “Boo!” example), the
coding for that character is a single binary 0. If the character is present, we produce a
string of 1+5+L characters, where the first bit is a 1 to indicate the character is present, the
next five bits encode the length of the character’s bit sequence (L, in the range 0..31); and
the final L bits are the character’s Huffman code. As a special optimization, any string of
trailing 0’s is clipped from the representation; the clipped bits are assumed to be all zeros during the
reconstruction process. The decompressing process runs through all 256 possible
characters while consuming this bit string, and reconstructs the codes for all used
characters. From this information, standard techniques allow the Huffman decoding tree
to be reconstructed. All uses of Huffman encoding in this lab use this method of encoding
the Huffman codes. The costs of representing these codes must be accounted for in the
quality measures.
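A sketch of emitting this representation follows, using a hypothetical
most-significant-bit-first bit-writer (the actual routines in compress.c may differ, and
this sketch omits the trailing-zero clipping optimization). Each character's code is held
as an ASCII '0'/'1' string, or NULL if the character never occurs:

    #include <stdio.h>
    #include <string.h>

    static unsigned buf = 0, nbits = 0;      /* bit accumulator, MSB first */
    static void putbit(int b)
    {
        buf = (buf << 1) | (b & 1);
        if (++nbits == 8) { putchar(buf); buf = 0; nbits = 0; }
    }

    static void emit_scheme(const char *codes[256])
    {
        for (int c = 0; c < 256; c++) {
            if (codes[c] == NULL) {
                putbit(0);                   /* absent character: a single 0 */
            } else {
                size_t L = strlen(codes[c]); /* code length, in the range 0..31 */
                putbit(1);                   /* present marker */
                for (int i = 4; i >= 0; i--)
                    putbit((L >> i) & 1);    /* five bits of length */
                for (size_t i = 0; i < L; i++)
                    putbit(codes[c][i] - '0');   /* the L code bits themselves */
            }
        }
        while (nbits != 0)
            putbit(0);                       /* pad the final byte with zeros */
    }

    int main(void)
    {
        const char *codes[256] = {0};
        codes['B'] = "00"; codes['!'] = "01"; codes['o'] = "1";  /* "Boo!" codes */
        emit_scheme(codes);
        return 0;
    }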
A last issue regarding Huffman encoding based compression is how we know when the
decoding process is at the end of the coded string. Since every possible bit past the end of
the coded file has meaning as the start of a possible further character, it is necessary to
know precisely where the end of the bit string occurs. It would be possible to precede the
encoded string with its length in bits. For the purposes of this lab, however, we have used a
different and more interesting technique. We add a special end-of-string character (called
TM, for termination mark) to the Huffman encoding problem. This special character has
only one occurrence at the end of the original text; it’s added to the Huffman tree and is
assigned its own bit string code (which is one of the longer codes, due to the single
occurrence). When this special code is encountered in the decoding/decompression process,
the decoder knows to stop.
