
Data Compression

Unit I

Topics Covered:
Compression Techniques: Lossless compression, Lossy compression, Measures of
performance, Modeling and coding; Mathematical Preliminaries for Lossless Compression: A
brief introduction to information theory; Models: Physical models, Probability models,
Markov models, Composite source model; Coding: Uniquely decodable codes, Prefix codes.
Data Compression

Compression is used just about everywhere. All the images you get on the web are
compressed, typically in the JPEG or GIF formats; most modems use compression; HDTV is
compressed using MPEG-2; and several file systems automatically compress files when they
are stored, while the rest of us do it by hand. The neat thing about compression, as with the
other topics we will cover in this course, is that the algorithms used in the real world make
heavy use of a wide set of algorithmic tools, including sorting, hash tables, tries, and FFTs.
Furthermore, algorithms with strong theoretical foundations play a critical role in real-world
applications.

Compression Techniques

Why do we need compression?

Compression reduces the size of a file:


• To save space when storing it.
• To save time when transmitting it.
Compression works because most files contain a lot of redundancy.
Data compression implies sending or storing a smaller number of bits. Although many
methods are used for this purpose, in general these methods can be divided into two broad
categories: lossless and lossy methods.

Lossless Methods

Lossless compression techniques, as their name implies, involve no loss of information. If
data have been losslessly compressed, the original data can be recovered exactly from the
compressed data. Lossless compression is generally used for applications that cannot
tolerate any difference between the original and reconstructed data.
Text compression is an important area for lossless compression. It is very important that the
reconstruction is identical to the original text, as very small differences can result in
statements with very different meanings. Consider the sentences “Do not send money” and
“Do now send money.” A similar argument holds for computer files and for certain types of
data such as bank records.

Examples of lossless methods:

A) Run-length coding
B) Huffman coding
C) Lempel-Ziv coding

Lossy Methods

Lossy compression techniques involve some loss of information, and data that have been
compressed using lossy techniques generally cannot be recovered or reconstructed exactly.
In return for accepting this distortion in the reconstruction, we can generally obtain much
higher compression ratios than is possible with lossless compression.
In many applications, this lack of exact reconstruction is not a problem. For example,
when storing or transmitting speech, the exact value of each sample of speech is not
necessary. Depending on the quality required of the reconstructed speech, varying amounts
of loss of information about the value of each sample can be tolerated. If the quality of the
reconstructed speech is to be similar to that heard on the telephone, a significant loss of
information can be tolerated. However, if the reconstructed speech needs to be of the
quality heard on a compact disc, the amount of information loss that can be tolerated is
much lower. Similarly, when viewing a reconstruction of a video sequence, the fact that the
reconstruction is different from the original is generally not important as long as the
differences do not result in annoying artifacts. Thus, video is generally compressed using
lossy compression.

Examples of lossy methods:

A) JPEG
B) MPEG
C) MP3

Entropy

Entropy measures the amount of uncertainty, or information, in a source. A system is
assumed to have a set of possible states it can be in, and at a given time there is a
probability distribution over those states. Entropy is then defined as

$H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}$

where S is the set of possible states and p(s) is the probability of state s ∈ S. This definition
indicates that the more even the probabilities, the higher the entropy (disorder), and the
more biased the probabilities, the lower the entropy; e.g., if we know exactly what state the
system is in, then H(S) = 0. One might remember that the second law of thermodynamics
basically says that the entropy of a closed system can only increase.

In the context of information theory, Shannon simply replaced “state” with “message”, so S
is a set of possible messages, and p(s) is the probability of message s ∈ S. Shannon also
defined the notion of the self-information of a message as

$i(s) = \log_2 \frac{1}{p(s)}$
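
To make these two formulas concrete, here is a minimal Python sketch (the function names and the toy distributions are my own, not from the text) that evaluates H(S) and i(s) for a small set of messages:

```python
import math

def self_information(p):
    """Self-information i(s) = log2(1/p) of a message that occurs with probability p."""
    return math.log2(1.0 / p)

def entropy(probabilities):
    """Entropy H(S) = sum over s of p(s) * log2(1/p(s))."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# A uniform distribution over four messages gives the maximum entropy of 2 bits;
# a heavily skewed distribution gives much less.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
print(entropy([0.97, 0.01, 0.01, 0.01]))   # roughly 0.24
print(self_information(0.5))               # 1.0 (a 50% likely message carries 1 bit)
```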

Measures of Performance

A compression algorithm can be evaluated in a number of different ways. We could measure
the relative complexity of the algorithm, the memory required to implement the algorithm,
how fast the algorithm performs on a given machine, the amount of compression, and how
closely the reconstruction resembles the original. In these notes we will mainly be concerned
with the last two criteria. Let us take each one in turn.
A very logical way of measuring how well a compression algorithm compresses a given
set of data is to look at the ratio of the number of bits required to represent the data before
compression to the number of bits required to represent the data after compression. This
ratio is called the compression ratio. Suppose storing an image made up of a square array of
256×256 pixels requires 65,536 bytes. The image is compressed and the compressed version
requires 16,384 bytes. We would say that the compression ratio is 4:1. We can also
represent the compression ratio by expressing the reduction in the amount of data required
as a percentage of the size of the original data. In this particular example the compression
ratio calculated in this manner would be 75%.
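
As a quick check of this arithmetic, a short Python sketch (the variable names are my own) reproduces the 4:1 ratio, the 75% reduction, and, assuming the original image uses 8 bits per pixel, the 2 bits/pixel rate discussed later:

```python
original_bytes = 65_536        # 256 x 256 pixels at 1 byte (8 bits) per pixel
compressed_bytes = 16_384

ratio = original_bytes / compressed_bytes                   # 4.0, i.e. a 4:1 compression ratio
reduction = 100 * (1 - compressed_bytes / original_bytes)   # 75.0 percent reduction
bits_per_pixel = compressed_bytes * 8 / (256 * 256)         # 2.0 bits/pixel compression rate

print(f"{ratio:.0f}:1, {reduction:.0f}% reduction, {bits_per_pixel:.0f} bits/pixel")
```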

In lossy compression, the reconstruction differs from the original data. Therefore, in
order to determine the efficiency of a compression algorithm, we have to have some way of
quantifying the difference. The difference between the original and the reconstruction is
often called the distortion.

Lossy techniques are generally used for the compression of data that originate as analog
signals, such as speech and video. In compression of speech and video, the final arbiter of
quality is human. Because human responses are difficult to model mathematically, many
approximate measures of distortion are used to determine the quality of the reconstructed
waveforms.
Other terms that are also used when talking about differences between the reconstruction
and the original are fidelity and quality. When we say that the fidelity or quality of a
reconstruction is high, we mean that the difference between the reconstruction and the
original is small. Whether this difference is a mathematical difference or a perceptual
difference should be evident from the context.

Modelling & Coding

The development of data compression algorithms for a variety of data can be divided
into two phases.

- The first phase is usually referred to as modelling. In this phase we try to extract
information about any redundancy that exists in the data and describe that redundancy in
the form of a model.

- The second phase is called coding. A description of the model, and of how the data differ
from the model, is encoded, generally using a binary alphabet.

Run-Length Coding
Run-length encoding is probably the simplest method of compression. It can be used to
compress data made of any combination of symbols. It does not need to know the frequency
of occurrence of symbols and can be very efficient if the data are represented as 0s and 1s.

The general idea behind this method is to replace a run of consecutive occurrences of a
symbol with the symbol followed by the number of occurrences.
The method can be even more efficient if the data use only two symbols (for example 0
and 1) in their bit pattern and one symbol is more frequent than the other.
Example:
000000000000001000011000000000000

We count the number of 0s before each 1 (and the trailing run of 0s): 14, 4, 0, 12.
Each count is then written as a 4-bit binary number:
14 = 1110
4 = 0100
0 = 0000
12 = 1100
The compressed data is: 1110010000001100
Encoding Algorithm for Run-Length Coding
1- Count the number of 0s between two successive 1s.
2- If the number is less than 15, write it down as a 4-bit binary number.
3- If it is greater than or equal to 15, write down 1111, followed by a binary number
indicating the rest of the 0s. If the run is longer than 30, repeat this process.
4- If the data start with a 1, write down 0000 at the beginning.
5- If the data end with a 1, write down 0000 at the end.
6- Send the binary string.
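
The steps above can be sketched in Python as follows (a minimal illustration, not a reference implementation; the function name rle_encode is my own). Splitting on '1' naturally handles steps 4 and 5, because a leading or trailing 1 produces an empty run of length 0:

```python
def rle_encode(bits: str) -> str:
    """Encode a string of 0s and 1s as 4-bit counts of the zero-runs between 1s."""
    out = []
    for run in bits.split('1'):           # zero-runs before each 1, plus the trailing run
        count = len(run)
        while count >= 15:                # step 3: escape long runs with 1111
            out.append('1111')
            count -= 15
        out.append(format(count, '04b'))  # steps 2, 4, 5: write the (remaining) count in 4 bits
    return ''.join(out)

print(rle_encode('000000000000001000011000000000000'))
# -> '1110010000001100'  (the counts 14, 4, 0, 12 as 4-bit groups)
```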

Huffman Code
Huffman codes are optimal prefix codes generated from a set of probabilities by a
particular algorithm, the Huffman Coding Algorithm. David Huffman developed the
algorithm as a student in a class on information theory at MIT in 1950. The algorithm
is now probably the most prevalently used component of compression algorithms,
used as the back end of GZIP, JPEG and many other utilities.
The Huffman algorithm is very simple and is most easily described in terms of how it
generates the prefix-code tree.

Algorithm for Huffman Code

1- Start with a forest of trees, one for each message. Each tree contains a single
vertex with weight wi = pi.
2- Repeat until only a single tree remains:
   a. Select the two trees with the lowest-weight roots (w1 and w2).
   b. Combine them into a single tree by adding a new root with weight w1 + w2 and
   making the two trees its children. It does not matter which is the left or right child,
   but our convention will be to put the lower-weight root on the left if w1 ≠ w2.

For a code of size n this algorithm will require n − 1 steps, since every complete binary
tree with n leaves has n − 1 internal nodes, and each step creates one internal node. If
we use a priority queue with O(log n) time insertions and find-mins (e.g., a heap), the
algorithm will run in O(n log n) time. The key property of Huffman codes is that they
generate optimal prefix codes; this optimality result was originally shown by Huffman.
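
As a hedged illustration of the algorithm (not the exact implementation used in GZIP or JPEG), the following Python sketch builds the prefix-code tree with a heap-based priority queue; the function name and the example probabilities are my own:

```python
import heapq
from itertools import count

def build_huffman_codes(probabilities):
    """Build a Huffman prefix code from a {message: probability} dictionary."""
    tiebreak = count()   # a counter so the heap never has to compare two trees directly
    heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                      # n - 1 combining steps
        w1, _, left = heapq.heappop(heap)     # the two lowest-weight roots
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):           # internal node: 0 for the left child, 1 for the right
            assign(node[0], prefix + '0')
            assign(node[1], prefix + '1')
        else:
            codes[node] = prefix or '0'       # single-message edge case
    assign(heap[0][2], '')
    return codes

print(build_huffman_codes({'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}))
# codeword lengths come out as 1, 2, 3, 3 bits, e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```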

Important Terms

1. Compression ratio: A very logical way of measuring how well a compression
algorithm compresses a given set of data is to look at the ratio of the number of bits
required to represent the data before compression to the number of bits required to
represent the data after compression. This ratio is called the 'compression ratio'. Ex.
Suppose storing an image requires 65,536 bytes; the image is compressed and the
compressed version requires 16,384 bytes. So the compression ratio is 4:1. It can also
be represented in terms of the reduction in the amount of data required as a
percentage of the original size, i.e. 75%.
2. Distortion: In order to determine the efficiency of a compression algorithm, we have
to have some way of quantifying the difference. The difference between the original
and the reconstruction is called 'distortion'. Lossy techniques are generally used
for the compression of data that originate as analog signals, such as speech and
video. In compression of speech and video, the final arbiter of quality is human.
Because human responses are difficult to model mathematically, many approximate
measures of distortion are used to determine the quality of the reconstructed
waveforms.
3. Compression rate: It is the average number of bits required to represent a single
sample. Ex. In the case of the compressed image above, if we assume 8 bits per pixel
(one byte per pixel), the average number of bits per pixel in the compressed
representation is 2. Thus we would say that the compression rate is 2 bits/pixel.
4. Fidelity and Quality: Fidelity and quality are terms used to describe the difference
between the reconstruction and the original. When we say that the fidelity or quality
of a reconstruction is high, we mean that the difference between the reconstruction
and the original is small. Whether the difference is a mathematical or a perceptual
difference should be evident from the context.
5. Self-information: Shannon defined a quantity called self-information. Suppose we
have an event A, which is a set of outcomes of some random experiment. If P(A) is the
probability that event A will occur, then the self-information associated with A is
given by:
$i(A) = \log_b \frac{1}{P(A)} = -\log_b P(A)$ (1)
If the probability of an event is low, the amount of self-information associated with it
is high. If the probability of an event is high, the information associated with it is low.
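As a quick worked example with b = 2: an event with P(A) = 1/2 carries i(A) = −log2(1/2) = 1 bit of self-information, while a much rarer event with P(A) = 1/8 carries i(A) = −log2(1/8) = 3 bits.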
6. Binary Code: Binary code is the simplest form of computer code or programming
data. It is represented entirely by a binary system of digits consisting of a string of
consecutive zeros and ones. Binary code is often associated with machine code in
that binary sets can be combined to form raw code, which is interpreted by a
computer or other piece of hardware. A binary code represents text, computer
processor instructions, or other data using any two-symbol system, but often
the binary number system's 0 and 1. The binary code assigns a pattern of binary
digits (bits) to each character, instruction, etc. For example, a binary string of eight
bits can represent any of 256 possible values and can therefore represent a variety
of different items.
In computing and telecommunications, binary codes are used for various methods
of encoding data, such as character strings, into bit strings. Those methods may use
fixed-width or variable-width strings. In a fixed-width binary code, each letter, digit,
or other character is represented by a bit string of the same length; that bit string,
interpreted as a binary number, is usually displayed in code tables
in octal, decimal or hexadecimal notation. There are many character sets and
many character encodings for them.
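As a small illustration of a fixed-width binary code, the following Python sketch (the sample string is my own, reusing the earlier example sentence) writes each character of a string as an 8-bit pattern:

```python
text = "Do not send money"
fixed_width = ' '.join(format(ord(ch), '08b') for ch in text)   # 8 bits per character
print(fixed_width)
# 'D' -> 01000100, 'o' -> 01101111, ' ' -> 00100000, and so on
```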
7. BMP: The BMP file format, also known as the bitmap image file, device independent
bitmap (DIB) file format, or simply a bitmap, is a raster graphics image file
format used to store bitmap digital images independently of the display
device (such as a graphics adapter). BMP is a palette-based graphics file format for
images with 1, 2, 4, 8, 16, 24, or 32 bit-planes. It uses a simple form of RLE to
compress images with 4 or 8 bit-planes. The BMP image file format is native to the
Microsoft Windows operating system. The format of a BMP file is simple. It starts
with a file header that contains the two bytes BM and the file size. This is followed
by an image header with the width, height, and number of bit-planes (there are two
different formats for this header). Following the two headers is the color palette
(that can be in one of three formats) which is followed by the image pixels, either in
raw format or compressed by RLE.
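As a rough illustration only, the following Python sketch reads the fields mentioned above, assuming the common variant with a 14-byte file header followed by a 40-byte image header (BITMAPINFOHEADER); the file name is hypothetical:

```python
import struct

with open("example.bmp", "rb") as f:                      # hypothetical file name
    magic, file_size = struct.unpack("<2sI", f.read(6))   # the two bytes 'BM' and the file size
    f.read(8)                                             # skip reserved fields and the pixel-data offset
    (header_size, width, height,
     planes, bit_count) = struct.unpack("<IiiHH", f.read(16))

assert magic == b"BM"
print(file_size, width, height, bit_count)                # dimensions and bits (bit-planes) per pixel
```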
Models
1. Physical Models: If we know something about the physics of the data generation
process, we can use that information to construct a model.
For example, in speech-related applications, knowledge about the physics of speech
production can be used to construct a mathematical model for the sampled speech
process. Sampled speech can then be encoded using this model.
Real-life application: residential electrical meter readings
2. Probability Models: The simplest statistical model for the source is to assume
that each letter that is generated by the source is independent of every other letter,
and each occurs with the same probability. We could call this the ignorance model,
as it would generally be useful only when we know nothing about the source. The
next step up in complexity is to keep the independence assumption but remove the
equal probability assumption and assign a probability of occurrence to each letter in
the alphabet.
For a source that generates letters from an alphabet A = {a1, a2, ..., aM} we can have
a probability model P = {P(a1), P(a2), ..., P(aM)}.
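Such a probability model can be estimated simply by counting relative letter frequencies; a minimal Python sketch (the sample text is my own):

```python
from collections import Counter

sample = "do not send money"                 # a tiny sample of source output
counts = Counter(sample)
total = len(sample)
prob_model = {letter: n / total for letter, n in counts.items()}   # P(a_i) = relative frequency
print(prob_model)
```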
3. Markov Models: Markov models are particularly useful in text compression,
where the probability of the next letter is heavily influenced by the preceding letters.
In current text compression, kth-order Markov models are more widely known as
finite context models, with the word context being used for what we have earlier
defined as state. Consider the word 'preceding'. Suppose we have already
processed 'precedin' and we are going to encode the next letter. If we take no
account of the context and treat each letter as a surprise, the probability of the letter
'g' occurring is relatively low. If we use a first-order Markov model, or single-letter
context, we can see that the probability of 'g' would increase substantially. As we
increase the context size (go from n to in to din and so on), the probability of the
alphabet becomes more and more skewed, which results in lower entropy.
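A first-order (single-letter context) model can be estimated by counting, for each letter, which letters follow it; a minimal Python sketch under that assumption (the sample text and names are my own):

```python
from collections import Counter, defaultdict

sample = "the preceding letter heavily influences the following letter"
context_counts = defaultdict(Counter)
for prev, nxt in zip(sample, sample[1:]):
    context_counts[prev][nxt] += 1           # count which letters follow each single-letter context

def p_next(context, letter):
    """Estimated P(letter | context); 0 if the context never occurred in the sample."""
    total = sum(context_counts[context].values())
    return context_counts[context][letter] / total if total else 0.0

print(p_next('n', 'g'))   # probability of 'g' given that the previous letter is 'n'
```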
4. Composite Source Model: In many applications it is not easy to use a single
model to describe the source. In such cases, we can define a composite source,
which can be viewed as a combination or composition of several sources, with only
one source being active at any given time. A composite source can be represented
as a number of individual sources Si , each with its own model Mi and a switch that
selects a source Si with probability Pi. This is an exceptionally rich model and can be
used to describe some very complicated processes.
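A composite source can be sketched as a random switch that activates source Si with probability Pi and lets it emit a symbol; a toy Python illustration (the component sources and probabilities are invented for the example):

```python
import random

# Two toy component sources S1 and S2, each with its own simple model M_i (a letter pool),
# and the switch probabilities P1 = 0.7, P2 = 0.3.
sources = [
    {"letters": "aaab", "prob": 0.7},
    {"letters": "xyz",  "prob": 0.3},
]

def emit():
    """The switch selects source S_i with probability P_i; that source then emits one symbol."""
    r = random.random()
    cumulative = 0.0
    for s in sources:
        cumulative += s["prob"]
        if r < cumulative:
            return random.choice(s["letters"])
    return random.choice(sources[-1]["letters"])   # guard against floating-point round-off

print(''.join(emit() for _ in range(20)))
```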
