You are on page 1of 53

Huffman Coding: An Application of Binary Trees and Priority Queues (probability based) for Encoding and

Compression of Data

Purpose of Huffman Coding


Proposed by Dr. David A. Huffman in 1952
A Method for the Construction of Minimum Redundancy Codes

Applicable to many forms of data transmission


Our example: text files

The Basic Algorithm


Huffman coding is a form of statistical coding Not all characters occur with the same frequency! Yet all characters are allocated the same amount of space
1 char = 1 byte, be it

or

The Basic Algorithm


Any savings in tailoring codes to frequency of character? Code word lengths are no longer fixed like ASCII. Code word lengths vary and will be shorter for the more frequently used characters.

The Basic Algorithm


1. 2. 3. 4. 5. Scan text to be compressed and tally occurrence of all characters. Sort or prioritize characters based on number of occurrences in text. Build Huffman code tree based on prioritized list. Perform a traversal of tree to determine all code words. Scan text again and create new file using the Huffman codes.

Building a Tree
Scan the original text

Consider the following short text:


Eerie eyes seen near lake.

Count up the occurrences of all characters in the text

Building a Tree
Scan the original text

Eerie eyes seen near lake. What characters are present?


E e r i space y s n a r l k .

Building a Tree
Scan the original text

Eerie eyes seen near lake.


What is the frequency of each character in the text?

Char Freq. E 1 e 8 r 2 i 1 space 4

Char Freq. y 1 s 2 n 2 a 2 l 1

Char Freq. k 1 . 1

Building a Tree
Prioritize characters

Create binary tree nodes with character and frequency of each character Place nodes in a priority queue
The lower the occurrence, the higher the priority in the queue

Building a Tree
The queue after inserting all nodes

E
1

i
1

y
1

l
1

k
1

.
1

r
2

s
2

n
2

a
2

sp
4

e
8

Null Pointers are not shown

Building a Tree
While priority queue contains two or more nodes
Create new node Dequeue node and make Dequeue next node and subtree Frequency of new node frequency of left and Enqueue new node back it left subtree make it right equals sum of right children into queue

Building a Tree

E 1

i 1

y 1

l 1

k 1

. 1

r 2

s 2

n 2

a 2

sp 4

e 8

Building a Tree

y 1

l 1

k 1

. 1

r 2

s 2

n 2

a 2

sp 4

e 8

2 E 1 i 1

Building a Tree

y 1

l 1

k 1

. 1

r 2

s 2

n 2

a 2 E 1

2
i 1

sp 4

e 8

Building a Tree

k 1

. 1

r 2

s 2

n 2

a 2

sp 4

e 8

E 1

i 1

2 y 1

l 1

Building a Tree

k 1

. 1

r 2

s 2

n 2

a 2

2 y 1

2 l 1

sp 4

e 8

E 1

i 1

Building a Tree

2 y 1

2 l 1

sp

2
E 1 i 1

2 k 1 . 1

Building a Tree

2 i 1

sp

2
E 1

4 y 1 l 1
k 1 . 1

Building a Tree

2
E 1 i 1

sp 4 . 1

e 8

y 1

l 1
4 r 2 s 2

k 1

Building a Tree

2 E 1 i 1
y 1

2
l 1 k 1

2
. 1

sp 4 r 2

4 s 2

Building a Tree

2 y 1

2 l 1 k 1

2 . 1

sp 4 r 2

4 s 2

e 8

E 1

i 1

4 n 2 a 2

Building a Tree

2 y 1

2 l 1 k 1

2 . 1

sp 4 r 2

4 s 2 n 2

4 a 2

e 8

E 1

i 1

Building a Tree

2 k 1 . 1

sp 4 r 2

4 s 2 n 2

4 a 2

4 2 E 1 i 1 2

y 1

l 1

Building a Tree

2 k 1 . 1

sp 4 r 2

4 s 2 n 2

4 a 2 2

4 2 y 1

e 8

E 1

i 1

l 1

Building a Tree

4 r 2 s 2 n 2

4 a 2 E 1 6 2 k 1 . 1 sp 4 2 i 1

4 2 y 1 l 1

e 8

Building a Tree

4 r 2 s 2 n 2

4 a 2

e sp 4 8

2 E 1 i 1
y 1

2 l 1 k 1

. 1

What is happening to the characters with a low number of occurrences?

Building a Tree

4 2 E 1 i 1 2 2 l 1

6 sp 4

e 8

y 1

k 1

. 1

8 4 r 2 s 2 n 2 4 a 2

Building a Tree

4 2 E 1 i 1 2 2 l 1

6 sp 4

e 8 4 r 2 s 2

8 4 n 2 a 2

y 1

k 1

. 1

Building a Tree

e 8

4
r 2 s 2 n 2

4 10 a 2 2 4 2 y 1 6

2 l 1
k 1 . 1

sp 4

E 1

i 1

Building a Tree

e 8

10 4 4 a 2 E 1 2 i 1 2 2 l 1 6 sp 4

4
r 2 s 2 n 2

y 1

k 1

. 1

Building a Tree

10 4 2 E 1 i 1 2 2 l 1 16 6 sp 4 e 8 4 r 2 s 2 n 2 8 4 a 2

y 1

k 1

. 1

Building a Tree

10 4 2 E 1 i 1 2 2 l 1 6 sp 4

16 e 8 4 r 2 s 2 n 2

8 4
a 2

y 1

k 1

. 1

Building a Tree
26

10
4 2 E 1 i 1 y 1 2 l 1 k 1 2 . 1

16 e 8 sp 4 r 2 4 s 2 n 2

8 4
a 2

Building a Tree
After enqueueing this node there is only one node left in priority queue.
4 s 2 n 2 a 2

26 10 4 2 E 1 i 1 y 1 2 l 1 k 1 6 e 8 sp 4 . 1 r 2 4 16 8

Building a Tree
Dequeue the single node left in the queue. This tree contains the new code words for each character.
2 10 4 2 2 6 sp 4 e 8 4 26 16 8 4

Frequency of root node should equal number of characters in text.

E i y l k . 1 1 1 1 1 1

r s n a 2 2 2 2

Eerie eyes seen near lake.

26 characters

Encoding the File


Traverse Tree for Codes
Perform a traversal of the tree to obtain new code words Going left is a 0 going right is a 1 code word is only completed when a leaf node is reached

26 10 16 e 8 sp 4 4 8 4

4 2 2 2

E i y l k . 1 1 1 1 1 1

r s n a 2 2 2 2

Encoding the File


Traverse Tree for Codes
Char E i y l k . space e r s n a Code 0000 0001 0010 0011 0100 0101 011 10 1100 1101 1110 1111

26 10 16 e 8 sp 4 4 8 4

4 2 2 2

E i y l k . 1 1 1 1 1 1

r s n a 2 2 2 2

Encoding the File


Rescan text and encode file using new code words
Eerie eyes seen near lake.

0000101100000110011 1000101011011010011 1110101111110001100 1111110100100101


Why is there no need for a separator character?
.

Char E i y l k . space e r s n a

Code 0000 0001 0010 0011 0100 0101 011 10 1100 1101 1110 1111

Encoding the File


Results
Have we made 0000101100000110011 things any 1000101011011010011 better? 1110101111110001100 73 bits to encode 1111110100100101 the text ASCII would take 8 * 26 = 208 bits If modified code used and if 4 bits per character are needed for fixed length coding. Total bits 4 * 26 = 104. Savings not as great.

Decoding the File


How does receiver know what the codes are? There are so many constraints and ways found from the survey. Tree constructed for each text file and sent.
Look up Tree directly sent Frequency of each character

Big hit on compression, especially for smaller files because Final Data transmission is bit based instead of byte based

Tree predetermined based on the applications.


based on statistical analysis of text files or file types or the previously stored record.

Decoding the File


Once receiver has tree, it scans incoming bit stream 0 go left 1 go right 0000101100000110011 1000101011011010011 1110101111110001100 1111110100100101
26 10 4 6 e 8 sp 4 4 16 8 4

E i y l k . 1 1 1 1 1 1

r s n a 2 2 2 2

Huffman coding
Binary Non-binary Adaptive

Another is Shannon-Fano coding

Arithmetic Coding

The Model a way of calculating, in any given context, the distribution of probabilities for the next input symbol. The decoder must have access to the same model & be able to regenerate the same input string from the encoded string. Uses blended message, frame or set of symbols together.

Model Based Approach

The Basic Algorithm


1. We begin with a current interval" [L; H) initialized to [0; 1). 2. For each symbol, we perform two steps : (a) We subdivide the current interval into subintervals, one for each possible alphabet symbol. The size of a symbol's subinterval is proportional to the estimated probability that the symbol will be the next symbol in the file, according to the model of the input. (b) We select the subinterval corresponding to the symbol that actually occurs next in the file, and make it the new current interval. 3. We output enough bits to distinguish the final current interval from all other possible final intervals.

Encoding
The message is represented by an interval of real numbers between 0 & 1. As the message becomes longer, its interval becomes smaller the no. of bits needed to specify that interval grows. The reduction of the intervals size is according to the symbols probability (generated by the model). A more common symbol will reduce the range by adding fewer bits to the message..

A simple example : encoding the message eaii!


The mode l

Symbol
a

Probability
.2

Range
[0,0.2)

e
i o u

.3
.1 .2 .1

[0.2, 0.5)
[0.5, 0.6) [0.6, 0.8) [0.8, 0.9)

.1

[0.9, 1)

1 ! 0 .9 u 0
.8 0 .6 0 .5

0.5

! u o i e

0.26

! u o i

0.236

! u o i e a

0.2336

! u o i

0.2336

! u

o
i e

i
e

e
a

0 .2

0.2

0.2

0.23

0.233

a a 0.23354

The size of the final range is


0.2336 -0.23354 = 0.00006,

that is also exactly the multiplication of the probabilities of the five symbols in the message eaii! :
0.00006. (0.3)* (0.2)* (0.1)* (0.1)* (0.1) =

it takes 5 decimal digits to encode the message.


The best compression code is the output length contains a contribution of -logp bits from the encoding of each symbol whose probability of occurrence is p.

According to Shannon :

The entropy of eaii!

is

-log 0.3 -log 0.2 -log 0.1 -log 0.1 -log 0.1 =-log 0.00006 = 4.22

A few remarks..
5 decimal digits seems a lot to encode a message comprising 4 vowels! Our example ended up by expanding rather than compressing. Different models will give different entropies

Decoding
The decoder receives the final range or a value from that range & finds symbol. The decoder deduces the message from the first symbol to the last, according to the symbols probability of which the interval belongs to.

Each message ends with EOF to avoid ambiguity: i.e. a single no. 0.0 could represent any of a,aa,aaa

Back to the example


The decoder gets the final range : [0.23354,0.2336). The range lies entirely within the space the model allocate for e first character was e !!! Initially [0,1) [0.2, 0.5)

Now it can simulate the operation of the encoder:

e a !

[0.23354,0.2336).

Two major difficulties


The shrinking current interval requires use of high precision arithmetic. No output is produced until the entire message has been encoded.