
Data Compression Seminar
John Kieffer

1 Introduction
In data compression, one wishes to give a compact representation of data generated by a
data source. Depending upon the source of the data, the data could be of various types,
such as text data, image data, speech data, audio data, video data, etc. Data compression is
performed in order to more easily store the data or to more easily transmit the data. It is the
job of the data compression practitioner to design a data compression system for compressing
the data. Here is a block diagram illustrating a general data compression system:

source → source data → encoder → compressed data → decoder → reconstructed data

In this diagram, we have termed the data generated by the data source as the source data,
and we have termed the compact representation of the source data the compressed data. The
data compression system consists of encoder and decoder. The encoder converts the source
data into the compressed data, and the decoder attempts to reconstruct the source data
from the compressed data. The reconstructed data generated by the decoder either coincides
with the source data or is perceptually indistinguishable from it.
The data compression practitioner would need to evaluate how good a potential data
compression system is. The compression ratio is a figure of merit via which this can be done.
By a compression ratio of r to 1, the data compression practitioner means that

$$\frac{\text{size of source data in bits}}{\text{size of compressed data in bits}} = r$$
Thus a compression ratio of 2 to 1 means that the compressed data is half the size of the
source data. The higher the compression ratio, the better the compression system is.
EXAMPLE. This example illustrates how data compression can assist one in storing
data or in transmitting data. Suppose the data source generates an arbitrary 512 × 512
digital image consisting of 256 colors. Each color is represented by an intensity from the
set {0, 1, 2, . . . , 255}. Mathematically, this image is a 512 × 512 matrix, each of whose
elements comes from the set {0, 1, 2, . . . , 255}. (These elements are called pixels.) The
intensity of each pixel can be represented using 8 bits. Thus, the size of the source data
in bits is 8 × 512 × 512 = 2^21, which is about 2.1 megabits. A hard disk holding one
gigabit (10^9 bits) could thus store only about 476 of these images without compression.
Suppose, however, that one can compress each such image at a compression ratio of 8 to
1. Then one can store about 3800 images on this hard disk! Suppose now that one wants
to transmit an uncompressed image over a telephone channel that can transmit 30,000 bits
per second. Computing 2^21/30000, one sees that it would take about 70 seconds to do this.
With the 8 to 1 compression ratio, the compressed image could be transmitted in under 9
seconds!
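
These figures are easy to verify numerically. Here is a short MATLAB sketch of the
arithmetic in this example (the variable names are ours, chosen for illustration):

%arithmetic for the 512 x 512 image example
image_bits = 8*512*512;              %2^21 = 2097152 bits per image
disk_bits = 1e9;                     %disk capacity, in bits
images_plain = disk_bits/image_bits  %about 476 images, uncompressed
images_comp = 8*images_plain         %about 3800 images at 8 to 1
t_plain = image_bits/30000           %about 70 seconds at 30,000 bits/second
t_comp = t_plain/8                   %under 9 seconds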
There are two varieties of data compression systems:
• lossless compression systems
• lossy compression systems
1.1 Lossless compression
In a lossless data compression system, the decoder is able to perfectly reconstruct the source
data. Thus, the block diagram of a lossless compression system looks like:

X → encoder → B(X) → decoder → X

We have represented the source data as a sequence X = (x1, x2, . . . , xn). The samples xi
comprising this sequence come from a finite data alphabet A. We suppose that the length n
of the source data sequence X is fixed, but that it can be arbitrarily large. The data source
could conceivably generate as output X any data sequence of length n over the alphabet A.
We have represented the encoder output (the compressed data) corresponding to the source
data X as a binary sequence B(X) = (b1, b2, . . . , bk), in which the length k of B(X) can vary
with X. We call B(X) the codeword assigned to X by the encoder (or the codeword into
which X is encoded by the encoder). Since this system is lossless, the encoder must assign
distinct codewords to any two distinct encoder inputs X.
The compression ratio for our lossless system is r to 1, where

$$r = \frac{n \log_2 |A|}{k}$$

and |A| is the size of the alphabet A. Equivalently, we can use the compression rate as our
figure of merit. The compression rate R is defined by

$$R = \frac{k}{n}$$

The units of the compression rate R are “codebits per data sample”. Finding a compression
system with large compression ratio is the same as finding a compression system with small
compression rate.
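In fact, the two figures of merit determine one another. Dividing the numerator and
denominator of the ratio formula by n gives

$$r = \frac{n \log_2 |A|}{k} = \frac{\log_2 |A|}{k/n} = \frac{\log_2 |A|}{R}$$

so, for a fixed alphabet A, minimizing the rate R is the same as maximizing the ratio r.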
EXAMPLE. We take as our data to be compressed the following 4 × 4 image, in the
four colors R = “red”, O = “orange”, Y = “yellow”, G = “green”:

R R O Y
R O O Y
O O Y G
Y Y Y G

If we horizontally scan this image, and assign intensities 3, 2, 1, 0 to R, O, Y, G, respectively,
then we can represent this image as the data sequence X = (3, 3, 2, 1, 3, 2, 2, 1, 2, 2, 1, 0, 1, 1, 1, 0).
We shall compress X using the encoder given by the following encoder table:

encoder table
symbol   codeword
3        001
2        01
1        1
0        000
To use the encoder table, replace each symbol in X with the binary codeword for that symbol
given in the table. (No codeword in the table is a prefix of another, so the decoder can
unambiguously split B(X) back into its constituent codewords and recover X.) The
compressed data is then

B(X) = (0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0)

The compression ratio is

$$r = \frac{n \log_2 |A|}{k} = \frac{16 \times 2}{31} = \frac{32}{31}$$
or 1.0323 to 1. The compression rate is R = 31/16 ≈ 1.94 codebits per data sample. Notice that we
could instead have trivially encoded the individual samples of X using 2 codebits per sample.
(This is because the size of the alphabet is 4 and log2 4 = 2.) Our compression system given
by the above encoder table yields a lower compression rate than does the trivial compression
system just described, and is therefore a better system.
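
The table lookup performed by this encoder is easy to mechanize. Below is a minimal
MATLAB sketch of ours (the function name encode_with_table and the cell-array
representation of the table are our own choices, not part of the seminar m-files):

%name this m-file encode_with_table.m

%A sketch of the table-lookup encoder from the example above.
%The symbols 0, 1, 2, 3 index into a cell array of codewords.
function y=encode_with_table(x)
codewords={[0 0 0],[1],[0 1],[0 0 1]}; %codewords for symbols 0, 1, 2, 3
y=[];
for i=1:length(x)
y=[y codewords{x(i)+1}]; %append the codeword for symbol x(i)
end

Executing encode_with_table([3 3 2 1 3 2 2 1 2 2 1 0 1 1 1 0]) returns the 31-bit
codeword B(X) displayed above.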

1.2 Lossy compression


In a lossy compression system, the decoder is not able to perfectly reconstruct the source
data. The block diagram of a lossy compression system looks like:

X → quantizer → X̂ → encoder → B(X̂) → decoder → X̂

As in the lossless case, we let X = (x1, x2, . . . , xn) denote the source data. The reconstructed
data is denoted X̂ = (x̂1, x̂2, . . . , x̂n). Notice the “quantizer” present in the lossy system
that was not present in the lossless system. The quantizer, in response to the data input
X, produces the output X̂ close enough to X (with respect to Euclidean distance or some
other distance function) that there will be no perceptual difference between X and X̂. The
quantizer works by alphabet reduction. (A simple quantizer is the “rounding off” function
— this quantizer would produce X̂ = (1, 3, 4, 2, 2) in response to X = (1.1, 2.6, 4.4, 2.3, 1.7),
thereby reducing the alphabet from {1.1, 2.6, 4.4, 2.3, 1.7} to {1, 2, 3, 4}.) Notice that the
encoder acts losslessly on X̂, the vector into which the vector X has been “quantized”
— the vector X̂ is assigned a unique binary codeword B(X̂) = (b1 , b2 , . . . , bk ), and then
the decoder reconstructs X̂ from B(X̂). However, the overall system is lossy because two
distinct X can be quantized into the same X̂.
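
The rounding quantizer just mentioned takes one line of MATLAB (our illustration):

X = [1.1 2.6 4.4 2.3 1.7];
Xhat = round(X)   %returns [1 3 4 2 2], reducing the alphabet to {1,2,3,4}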
In addition to the compression rate R = k/n, there is another figure of merit via which
the lossy compression system should be judged, namely, the distortion D induced by the
system, defined by
$$D = \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i)$$

where d is some distance function. If you have a lossy compression system with rate and
distortion R1, D1, and a second compression system with rate and distortion R2, D2, then
the first system is the better one if R1 < R2 and D1 < D2. Unfortunately, however, it is not
possible to design a lossy compression system for which R and D are simultaneously small,
since these two parameters are inversely related. Instead, one’s design goal should be to find
a compression system that yields the smallest R for a fixed D, or the smallest D for a fixed
R. The theory detailing the R, D trade-offs that are possible in compression system design
is called rate distortion theory.
EXAMPLE. We lossily compress the same 4 × 4 image which was losslessly compressed
above. We quantize X = (3, 3, 2, 1, 3, 2, 2, 1, 2, 2, 1, 0, 1, 1, 1, 0) into
X̂ = (3, 3, 2, 1, 3, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1).
(The quantizer operates sample-by-sample, leaving 3, 2, 1 unchanged and converting 0 into
1, thereby reducing the alphabet from {0, 1, 2, 3} to {1, 2, 3}.) Use the encoder given by the
encoding table:

encoder table
symbol   codeword
3        00
2        01
1        1

The following compressed data B(X̂) is obtained:

B(X̂) = (0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1)
The compression rate is R = 24/16 = 1.5 codebits per data sample, and the distortion is

$$D = \frac{1}{16} \sum_{i=1}^{16} |x_i - \hat{x}_i| = \frac{2}{16} = 0.125$$
The reconstructed image is

R R O Y
R O O Y
O O Y Y
Y Y Y Y
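
As a check of ours, the rate and distortion for this example can be computed directly in
MATLAB (max implements the quantizer here, since it converts each 0 into 1 and leaves
1, 2, 3 unchanged):

X    = [3 3 2 1 3 2 2 1 2 2 1 0 1 1 1 0];
Xhat = max(X,1);          %the quantizer: convert 0 into 1
R = 24/length(X)          %24 = length of B(Xhat); returns 1.5
D = mean(abs(X - Xhat))   %returns 0.125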

Here is a question for the reader: Is there a lossy compression system for which R = 1.5 and
D < 0.125?
Whether one performs lossless or lossy compression depends on the type of data one has.
In text compression, for example, one would want to do lossless compression because one
would want perfect reconstruction of the text from its compressed version. In image compression,
lossy compression is frequently used because the reconstructed image can appear to
be perceptually the same as the original image, without coinciding with the original image.
Lossy compression, where feasible, yields a gain in compression over lossless compression.
For example, in the compression of images a compression ratio of 2 to 1 may be the best one
can do with lossless compression, whereas a compression ratio of 8 to 1 may be feasible for
lossy compression.
1.3 MATLAB m-files
We present some MATLAB functions that will be useful to us later on. MATLAB (unlike
LISP and Mathematica, which are list processing languages) cannot handle binary sequences
(bit strings) of varying lengths very well. The MATLAB programs presented here allow
MATLAB to deal with bitstrings by converting them to integer indices.

1.3.1 bitstring_to_index


The MATLAB function “bitstring_to_index” is given by the m-file below:

%name this m-file bitstring_to_index.m


%The function bitstring_to_index assigns to a binary string its index in
%the natural ordering of all binary strings
function y=bitstring_to_index(x)
N=length(x);
S=1;           %start from the implicit leading 1 of the string 1x
for i=1:N
S=2*S+x(i);    %accumulate the integer value of the string 1x
end
y=S-1;         %the index is that integer value minus 1

There is a natural ordering of all bitstrings:

0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, 0000, . . .
To find the position of a bitstring x in this list, convert the bitstring 1x (that is, x with a
1 prepended) to integer form and then subtract 1. For example, the bitstring 0001 is the
16-th string in the list: converting 10001 to integer form gives 17, and 17 − 1 = 16.
If you execute the MATLAB line

bitstring_to_index([0 0 0 1])

you will see that MATLAB returns the index 16.

1.3.2 index_to_bitstring


The MATLAB function “index_to_bitstring” is given by the m-file below:

%Suppose a bitstring is the i-th bitstring in the list of all


%bitstrings 0, 1, 00, 01, 10, 11, 000, ... . Then i is called
%the index of the given bitstring. The function
%index_to_bitstring(i) computes a bitstring from its index i.
function y=index_to_bitstring(x)
S=x+1;         %integer form of the string 1x
i=1;
while S>1      %peel off bits until only the leading 1 remains
z(i)=rem(S,2);
S=fix(S/2);
i=i+1;
end
N=length(z);   %z holds the bits of the string in reverse order
j=1:N;
k=N+1-j;
y=z(k);        %reverse z to obtain the bitstring

The function “index_to_bitstring” is the inverse of the function “bitstring_to_index”. Thus
if you execute the MATLAB line

index_to_bitstring(16)

you will see that MATLAB will give you the vector [0 0 0 1].

1.3.3 print_bitstrings


The MATLAB function “print_bitstrings” is given by the following MATLAB m-file:

%Let b(i) denote the binary string with index i.


%If x=(i1,i2,...,ik), executing print_bitstrings(x)
%yields a display of the strings b(i1), b(i2), ..., b(ik)
%on the screen
function y=print_bitstrings(x)
N=length(x);
for i=1:N
z=index_to_bitstring(x(i));
M=length(z);
for j=1:M
fprintf(1,'%1.0f',z(j))
end
fprintf(1,'\n')
end

This MATLAB function is used for printing bitstrings stored in MATLAB memory to the
screen. For example, if you execute the MATLAB line

print_bitstrings([16 17 18 19])
you will see the following bitstrings printed on the screen:

0001
0010
0011
0100
These are the 16-th, 17-th, 18-th, and 19-th bitstrings in the list of all bitstrings.

1.3.4 archive_bitstrings


Here is the m-file for the MATLAB function “archive_bitstrings”:

%Let x be the vector of indices of a set of bitstrings. The


%command archive_bitstrings(x) prints this set of bitstrings to a
%file named ’bitstring.txt’.
function y=archive_bitstrings(x)
fid = fopen('bitstring.txt','a');   %open the archive file in append mode
N=length(x);
for i=1:N
z=index_to_bitstring(x(i));
M=length(z);
for j=1:M
fprintf(fid,'%1.0f',z(j));
end
fprintf(fid,'\n');
end
fclose(fid);                        %close the file when done

The MATLAB function “archive_bitstrings” is very much like “print_bitstrings”, except that
the bitstrings are printed to a file named “bitstring.txt”. For example, executing the MATLAB
line

archive_bitstrings([16 17 18 19])

will put the binary strings 0001, 0010, 0011, 0100 in the file “bitstring.txt”. Try it!

1.3.5 input_bitstrings


The m-file “input_bitstrings” is given below:
%Suppose, for example, you want to enter the binary strings
%01, 101, 1100 in memory. Create the vector
%x=[0, 1, 2, 1, 0, 1, 2, 1, 1, 0, 0, 2], where the symbol "2"
%is used as a marker indicating where a binary string ends.
%Then, input_bitstrings(x) is a vector whose components are
%the indices of the strings 01, 101, 1100.
function y = input_bitstrings(x)
N=length(x);
j=1;
while N > 0
S=1;
i=1;
while S < 2       %scan forward to the next "2" marker
i=i+1;
S=x(i);
end
z=x(1:i-1);       %the bits of the next bitstring
y(j)=bitstring_to_index(z);
x=x(i+1:N);       %drop the processed bitstring and its marker
N=length(x);
j=j+1;
end

As we have seen, the function “bitstring_to_index” can be used to enter one bitstring into
MATLAB memory. To enter two or more bitstrings into MATLAB memory, you use the
function “input_bitstrings”. You form one big vector whose components are the components
of the bitstrings to go in memory, with the components of each bitstring separated from the
components of the next bitstring by a component of “2”. For example, suppose you want to
store the bitstrings 110, 000, 10, 1 in MATLAB memory. Form the vector

x=[1 1 0 2 0 0 0 2 1 0 2 1 2]

and then execute the MATLAB line

input_bitstrings(x)

You will then see that MATLAB returns the vector [13 7 5 2], indicating that 110, 000, 10, 1
are the 13-th, 7-th, 5-th, and 2-nd bitstrings in the list of all bitstrings. To check this result,
execute the MATLAB line
execute the MATLAB line

print_bitstrings([13 7 5 2])

and you will see the bitstrings 110, 000, 10, 1 printed out on the screen.
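
The functions above combine into a simple round-trip check (our own illustration, not one
of the seminar m-files):

x = input_bitstrings([1 1 0 2 0 0 0 2 1 0 2 1 2]);  %returns the indices [13 7 5 2]
print_bitstrings(x)        %prints 110, 000, 10, 1 to the screen
archive_bitstrings(x)      %appends the same four bitstrings to bitstring.txt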
