
CHAPTER 6

Compression Techniques

Objectives:

Be able to perform data compression

Be able to use different compression techniques

Introduction

What is Compression?
Data compression requires the identification and extraction of source redundancy. In other words, data compression seeks to reduce the number of bits used to store or transmit information.

There is a wide range of compression methods, which can be so unlike one another that they have little in common except that they compress data.

The Need For Compression

In terms of storage, the capacity of a storage device can be effectively increased with methods that compress a body of data on its way to the storage device and decompress it when it is retrieved.

In terms of communications, the bandwidth of a digital communication link can be effectively increased by compressing data at the sending end and decompressing it at the receiving end.

A Brief History of Data Compression

The late 1940s were the early years of Information Theory, when the idea of developing efficient new coding methods was just starting to be fleshed out. Ideas of entropy, information content and redundancy were explored.

One popular notion held that if the probabilities of the symbols in a message were known, there ought to be a way to code the symbols so that the message would take up less space.

The first well-known method for compressing digital signals is now known as Shannon-Fano coding. Shannon and Fano [~1948] simultaneously developed this algorithm, which assigns binary codewords to the unique symbols that appear within a given data file.

While Shannon-Fano coding was a great leap forward, it had the unfortunate luck to be quickly superseded by an even more efficient coding system: Huffman coding.

Huffman coding [1952] shares most characteristics of Shannon-Fano coding. Huffman coding achieves effective data compression by reducing the amount of redundancy in the coding of symbols.

It has been proven to be the most efficient coding method available under the constraint that each individual symbol is mapped to its own whole-bit codeword.
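A minimal sketch of the greedy tree construction in Python (the input string is illustrative, not one of the worked examples later):

import heapq
from collections import Counter

def huffman_codes(text):
    # Build a Huffman code: repeatedly merge the two least-frequent
    # subtrees, prefixing '0' to one side and '1' to the other.
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman_codes("ABRACADABRA"))  # frequent symbols get shorter codes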

In the last fifteen years, Huffman coding has largely been replaced by arithmetic coding.

Arithmetic coding bypasses the idea of replacing each input symbol with a specific code. Instead, it replaces a stream of input symbols with a single floating-point output number. More bits are needed in the output number for longer, more complex messages.
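As a minimal sketch of that idea in Python, assuming the illustrative probability model P(A)=0.2, P(B)=0.1, P(C)=0.2, P(D)=0.05, P(E)=0.3, P(F)=0.05, P($)=0.1 (chosen because it reproduces the CAEE$ range used in the worked example later):

# Cumulative interval [s_low, s_high) for each symbol under the model above.
INTERVALS = {
    "A": (0.00, 0.20), "B": (0.20, 0.30), "C": (0.30, 0.50),
    "D": (0.50, 0.55), "E": (0.55, 0.85), "F": (0.85, 0.90),
    "$": (0.90, 1.00),
}

def arith_encode(message):
    # Narrow [low, high) once per symbol; any number in the final
    # interval identifies the whole message.
    low, high = 0.0, 1.0
    for sym in message:
        s_low, s_high = INTERVALS[sym]
        span = high - low
        low, high = low + span * s_low, low + span * s_high
    return low, high

print(arith_encode("CAEE$"))  # approximately (0.33184, 0.3322)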

Terminology

Compressor: software (or hardware) device that compresses data
Decompressor: software (or hardware) device that decompresses data
Codec: software (or hardware) device that compresses and decompresses data
Algorithm: the logic that governs the compression/decompression process

Compression can be categorized in two broad ways:

Lossless compression
recovers the exact original data after compression.
is mainly used for compressing database records, spreadsheets or word-processing files, where exact replication of the original is essential.

Lossy compression
results in a certain loss of accuracy in exchange for a substantial increase in compression.
is more effective when used to compress graphic images and digitised voice, where losses outside visual or aural perception can be tolerated.
Most lossy compression techniques can be adjusted to different quality levels, gaining higher accuracy in exchange for less effective compression.

Lossless Compression Algorithms:

Repetitive Sequence Suppression
Run-length Encoding*
Pattern Substitution
Entropy Encoding*
  The Shannon-Fano Algorithm
  Huffman Coding*
  Arithmetic Coding*
Dictionary-based compression algorithms

Dictionary-based compression algorithms

Dictionary-based compression algorithms use a completely different method to compress data. They encode variable-length strings of symbols as single tokens. Each token forms an index into a phrase dictionary. If the tokens are smaller than the phrases, they replace the phrases and compression occurs.

Suppose we want to encode the Oxford Concise English Dictionary, which contains about 159,000 entries. Why not just transmit each word as an 18-bit number (2^17 = 131,072 < 159,000 <= 262,144 = 2^18, so 18 bits per word suffice)?

Problems:
Too many bits,
everyone needs a dictionary,
only works for English text.

Solution: find a way to build the dictionary adaptively.

Two dictionary-based compression techniques, called LZ77 and LZ78, have been developed.

LZ77 is a "sliding window" technique in which the dictionary consists of a set of fixed-length phrases found in a "window" into the previously seen text.

LZ78 takes a completely different approach to building a dictionary. Instead of using fixed-length phrases from a window into the text, LZ78 builds phrases up one symbol at a time, adding a new symbol to an existing phrase when a match occurs.
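A minimal sketch of the LZ77 idea in Python; the (offset, length, next-symbol) triple format and the window and match limits are illustrative choices, not any particular standard:

def lz77_encode(data, window=4096, max_len=15):
    # Emit (offset, length, next_char) triples, where offset/length
    # point back into the already-encoded sliding window.
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while (k < max_len and i + k < len(data) - 1
                   and data[j + k] == data[i + k]):
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("abcabcabcabc"))
# [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 8, 'c')]

The decompressor rebuilds the text by copying length symbols starting offset positions back (symbol by symbol, so a match may overlap the text it is producing) and then appending the literal next symbol.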

The LZW compression algorithm can be summarised as follows: initialise the dictionary with all single symbols; then repeatedly find the longest dictionary string w that matches the remaining input, output the code for w, add w plus the next input symbol to the dictionary as a new entry, and continue from that next symbol.

Example
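A minimal sketch in Python, assuming single-byte symbols form the initial dictionary; practical implementations additionally pack the codes into variable-width bit fields:

def lzw_encode(text):
    # Grow a phrase dictionary adaptively; emit one integer code
    # per longest-match phrase.
    dictionary = {chr(i): i for i in range(256)}  # all single symbols
    next_code, w, out = 256, "", []
    for ch in text:
        if w + ch in dictionary:
            w += ch                          # keep extending the match
        else:
            out.append(dictionary[w])        # emit code for longest match
            dictionary[w + ch] = next_code   # learn the new phrase
            next_code += 1
            w = ch
    if w:
        out.append(dictionary[w])            # flush the final match
    return out

print(lzw_encode("WABBAWABBA"))  # [87, 65, 66, 66, 65, 256, 258, 65]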

The LZW decompression algorithm can be summarised as follows: rebuild the same dictionary on the fly; for each incoming code, output its phrase and add the previous phrase plus the first symbol of the current phrase as a new dictionary entry.

Example:
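A matching sketch of the decoder, fed the codes produced for "WABBAWABBA" above; note the special case for a code that refers to the dictionary entry still being built:

def lzw_decode(codes):
    # Reconstruct the dictionary in lockstep with the encoder,
    # so no dictionary ever needs to be transmitted.
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                       # code not known yet: it must be
            entry = prev + prev[0]  # the entry being built right now
        out.append(entry)
        dictionary[next_code] = prev + entry[0]  # mirror the encoder
        next_code += 1
        prev = entry
    return "".join(out)

print(lzw_decode([87, 65, 66, 66, 65, 256, 258, 65]))  # "WABBAWABBA"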

Dictionary-based compression algorithms: Problem

What if we run out of dictionary space?

Solution 1: keep track of unused entries and use LRU (least-recently-used) replacement.
Solution 2: monitor compression performance and flush the dictionary when performance is poor.

Repetitive Sequence Suppression

Fairly straightforward to understand and implement.

Simplicity is its downfall: it does NOT give the best compression ratios.

Some methods have their applications, e.g. as a component of JPEG, or silence suppression in audio.

If a series of n successive tokens appears:
Replace the series with one token and a count of the number of occurrences.
Usually a special flag is needed to denote when the repeated token appears.

Example
89400000000000000000000000000000000 can be replaced with 894f32, where f is the flag for zero.
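A minimal sketch of this zero-suppression scheme in Python, using "f" as the flag as in the example; a real format would also need a way to escape literal "f" characters in the data:

def suppress_zeros(s, flag="f"):
    # Replace each run of '0' characters with flag + run length.
    out, i = [], 0
    while i < len(s):
        if s[i] == "0":
            j = i
            while j < len(s) and s[j] == "0":
                j += 1
            out.append(flag + str(j - i))  # e.g. 32 zeros -> "f32"
            i = j
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(suppress_zeros("894" + "0" * 32))  # -> "894f32"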

How much compression? The savings depend on the content of the data.

Applications of this simple compression technique include:
Suppression of zeros in a file (Zero Length Suppression)
Silence in audio data, pauses in conversation, etc.
Bitmaps
Blanks in text or program source files
Backgrounds in images
Other regular image or data tokens

Run-length Encoding

Uncompressed:
Blue White White White White White White Blue
White Blue White White White White White Blue etc.

Compressed:
1xBlue 6xWhite 1xBlue
1xWhite 1xBlue 5xWhite 1xBlue etc.
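A minimal sketch of run-length encoding over a row of named pixels, matching the example above:

from itertools import groupby

def rle_encode(pixels):
    # Collapse each run of equal values into a (count, value) pair.
    return [(len(list(run)), value) for value, run in groupby(pixels)]

row = ["Blue"] + ["White"] * 6 + ["Blue"]
print(rle_encode(row))  # [(1, 'Blue'), (6, 'White'), (1, 'Blue')]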

Pattern Substitution

A simple form of statistical encoding: substitute a frequently repeating pattern with a shorter code.

Entropy Encoding

The Shannon-Fano Coding

To create a code tree according to Shannon and Fano, an ordered table is required, giving the frequency of each symbol. Each part of the table is then divided into two segments. The algorithm has to ensure that the upper and the lower part of each segment have nearly the same sum of frequencies. This procedure is repeated until only single symbols are left.
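A minimal sketch of this procedure in Python (the frequency table in the demo call is illustrative):

def shannon_fano(freqs):
    # Sort symbols by descending frequency, then recursively split the
    # table into two parts with nearly equal frequency sums, appending
    # '0' to codes in the upper part and '1' to codes in the lower part.
    codes = {}

    def split(items, prefix):
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"
            return
        total = sum(f for _, f in items)
        running, cut, best_diff = 0, 1, float("inf")
        for i in range(1, len(items)):       # try every cut point
            running += items[i - 1][1]
            diff = abs(2 * running - total)  # distance from an even split
            if diff < best_diff:
                best_diff, cut = diff, i
        split(items[:cut], prefix + "0")
        split(items[cut:], prefix + "1")

    split(sorted(freqs.items(), key=lambda kv: -kv[1]), "")
    return codes

print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}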

Example Shannon-Fano Coding

[Worked example: a SYMBOL / FREQ / SUM / CODE table, refined over three steps as the frequency table is recursively split.]

Huffman Coding

[Worked example: building the Huffman tree bottom-up and reading the codewords off the tree.]

Arithmetic Coding

[Worked example: narrowing the [low, high) interval symbol by symbol.]

How to translate a range to bits

Examples:
1. BACA: low = 0.59375, high = 0.60937
2. CAEE$: low = 0.33184, high = 0.3322

Decimal
0.12345 = 1×10⁻¹ + 2×10⁻² + 3×10⁻³ + 4×10⁻⁴ + 5×10⁻⁵

Binary
0.01010101₂ = 0×2⁻¹ + 1×2⁻² + 0×2⁻³ + 1×2⁻⁴ + 0×2⁻⁵ + 1×2⁻⁶ + 0×2⁻⁷ + 1×2⁻⁸
            = 1×2⁻² + 1×2⁻⁴ + 1×2⁻⁶ + 1×2⁻⁸

Binary to decimal
0.1₂ = 0.5₁₀
0.01₂ = 0.25₁₀
0.001₂ = 0.125₁₀
0.0001₂ = 0.0625₁₀
0.00001₂ = 0.03125₁₀

What is the value of 0.01010101₂ in decimal?
0.01010101₂ = 1×2⁻² + 1×2⁻⁴ + 1×2⁻⁶ + 1×2⁻⁸ = 0.33203125₁₀
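The conversion is mechanical; a short check in Python:

def bin_frac_to_decimal(bits):
    # Value of the binary fraction "0." + bits.
    return sum(int(b) * 2 ** -(k + 1) for k, b in enumerate(bits))

print(bin_frac_to_decimal("01010101"))  # -> 0.33203125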

Generating the codeword for the encoder
Range [0.33184, 0.33220]

BEGIN
  code = 0;
  k = 1;
  while ( value(code) < low )
  {
    assign 1 to the k-th binary fraction bit;
    if ( value(code) > high )
      replace the k-th bit by 0;
    k = k + 1;
  }
END
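A runnable Python version of the same loop, tracking the weight of the k-th bit directly instead of re-evaluating value(code):

def range_to_bits(low, high):
    # Shortest binary fraction 0.b1b2... that lies in [low, high]:
    # try a 1 in each position, drop it back to 0 if that overshoots
    # high, and stop as soon as the value reaches low.
    value, bit, bits = 0.0, 0.5, ""
    while value < low:
        if value + bit <= high:
            value += bit
            bits += "1"
        else:
            bits += "0"
        bit /= 2
    return bits

print(range_to_bits(0.33184, 0.33220))  # -> '01010101'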

Example 1
Range (0.33184, 0.33220)

Run the loop above with low = 0.33184 and high = 0.33220, tracking the codeword in binary and its value in decimal:

1. Assign 1 to the first fraction bit (codeword = 0.1₂) and compare:
value(0.1₂) = 0.5₁₀ > high (0.33220₁₀) → out of range, so we assign 0 to the first bit.
value(0.0₂) = 0₁₀ < low (0.33184₁₀) → the while loop continues.

2. Assign 1 to the second fraction bit: 0.01₂ = 0.25₁₀, which is less than high (0.33220), so the bit stays 1. Since 0.25₁₀ < low, the loop continues.

3. Assign 1 to the third fraction bit: 0.011₂ = 0.25₁₀ + 0.125₁₀ = 0.375₁₀, which is bigger than high (0.33220), so replace the k-th bit by 0. Now the codeword = 0.010₂.

4. Assign 1 to the fourth fraction bit: 0.0101₂ = 0.25₁₀ + 0.0625₁₀ = 0.3125₁₀, which is less than high (0.33220). Now the codeword = 0.0101₂.

5. Continue in the same way for the remaining bits.

Eventually, the binary codeword generated is 0.01010101₂ = 0.33203125₁₀, which lies inside the range. This 8-bit binary codeword represents CAEE$.