
Huffman Coding

An Application of Binary Trees and Priority Queues
Greedy Approach

1
Huffman code
• Very often used for text compression
• Do you know how gzip or WinZip works?
  – compression methods
• ASCII uses codes of equal length for all letters
• Idea behind Huffman coding: use shorter codes for letters that are more frequent

2
• Build a list of letters and frequencies
“have a great day today”

• Build a Huffman tree bottom up, by grouping letters with smaller occurrence frequencies

3
Huffman coding is an algorithm for lossless data compression developed by David A. Huffman.

Huffman codes are widely used in applications that involve the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television (HDTV).

4
Motivation

The motivations for data compression are obvious:

 reducing the space required to store files on disk or tape
 reducing the time to transmit large files.

Huffman savings are between 20% and 90%.


5
Basic Idea :

It uses a variable-length code table for encoding a source symbol (such as a character in a file), where the variable-length code table has been derived in a particular way based on the frequency of occurrence of each possible value of the source symbol.

6
Prefix code
• No prefix of a codeword is a codeword
• Uniquely decodable

  Symbol  Code 1  Code 2  Code 3
  A       00      1       00
  B       010     01      10
  C       011     001     11
  D       100     0001    0001
  E       11      00001   11000
  F       101     000001  101

7
Example:
Suppose you have a file with 100K characters.

For simplicity assume that there are only 6 distinct characters in the file, from a through f, with frequencies as indicated below.
We represent the file using a unique binary string for each character.

                        a    b    c    d    e    f
Frequency (in 1000s)    45   13   12   16   9    5
Fixed-length codeword   000  001  010  011  100  101

Space = (45*3 + 13*3 + 12*3 + 16*3 + 9*3 + 5*3) * 1000 = 300K bits

8
By using variable-length codes instead of fixed-length codes.
Idea: give frequent characters short codewords, and infrequent characters long codewords.
i.e. the length of the encoded character is inversely proportional to that character's frequency.

                           a    b    c    d    e     f
Frequency (in 1000s)       45   13   12   16   9     5
Fixed-length codeword      000  001  010  011  100   101
Variable-length codeword   0    101  100  111  1101  1100

Space = (45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4) * 1000 = 224K bits (savings ≈ 25%)

9
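The arithmetic above can be checked with a short script; the frequencies and codeword lengths below are exactly those from the table, nothing else is assumed:

```python
# Frequencies (in thousands) and codeword lengths from the table above.
freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
fixed_len = {c: 3 for c in freq}                            # every codeword is 3 bits
var_len = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}  # Huffman codeword lengths

fixed_bits = sum(freq[c] * fixed_len[c] for c in freq) * 1000
var_bits = sum(freq[c] * var_len[c] for c in freq) * 1000
print(fixed_bits)                 # 300000
print(var_bits)                   # 224000
print(1 - var_bits / fixed_bits)  # ~0.253 savings
```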
Huffman codes are also known as prefix codes
– no individual code is a prefix of any other
code

a  0      d  111
b  101    e  1101
c  100    f  1100

– this makes decompression unambiguous: 1010111110001001101

– note: since the code is specific to a particular file, it must be stored along with the compressed file in order to allow for eventual decompression

10
Prefix Codes:
a b c d e f

Variable-length 0 101 100 111 1101 1100


codeword

F A C E
Encoded as 1100 0 100 1101 = 110001001101
To decode, we have to decide where each code begins and ends, since they are no longer all the same length. But this is easy, since no codes share a prefix: we need only scan the input string from left to right, and as soon as we recognize a code, we can print the corresponding character and start looking for the next code. In the case above, the only code that begins with "1100…" is "f", so we can print "f" and start decoding "0100…", get "a", etc.

11
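The left-to-right scan described above can be sketched as follows, using the variable-length codewords from these slides (a=0, b=101, c=100, d=111, e=1101, f=1100):

```python
# Map each codeword to its character; the prefix property guarantees
# that the first match found while scanning is the correct one.
codes = {'0': 'a', '101': 'b', '100': 'c', '111': 'd', '1101': 'e', '1100': 'f'}

def decode(bits):
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in codes:          # unambiguous: no code is a prefix of another
            out.append(codes[buf])
            buf = ''
    return ''.join(out)

print(decode('110001001101'))         # 'face'
print(decode('1010111110001001101'))  # 'badface'
```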
Representation:
The Huffman code is represented as a binary tree:
• each edge represents either 0 or 1
• 0 means "go to the left child"
• 1 means "go to the right child"
• each leaf corresponds to the sequence of 0s and 1s traversed from the root to reach it, i.e. a particular code word.
Since no prefix is shared, all legal codes are at the leaves, and decoding a string means following edges, according to the sequence of 0s and 1s in the string, until a leaf is reached.

12
The (Real) Basic Algorithm
1. Scan text to be compressed and tally
occurrence of all characters.
2. Sort or prioritize characters based on
number of occurrences in text.
3. Build Huffman code tree based on
prioritized list.
4. Perform a traversal of tree to determine
all code words.
5. Scan text again and create new file
using the Huffman codes.
13
Building a Tree
Scan the original text

• Consider the following short text:

 Eerie eyes seen near lake.

• Count up the occurrences of all characters in the text

14
Building a Tree
Scan the original text

Eerie eyes seen near lake.


• What characters are present?

E  e  r  i  space
y  s  n  a  l  k  .

15
Building a Tree
Scan the original text
Eerie eyes seen near lake.
• What is the frequency of each
character in the text?
Char Freq. Char Freq. Char Freq.
E 1 y 1 k 1
e 8 s 2 . 1
r 2 n 2
i 1 a 2
space 4 l 1

16
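The frequency tally above is one line with Python's standard library:

```python
from collections import Counter

# Tally occurrences of every character in the text.
counts = Counter("Eerie eyes seen near lake.")
print(counts['e'], counts['E'], counts[' '], counts['.'])  # 8 1 4 1
```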
Building a Tree
Prioritize characters

• Create binary tree nodes with character and frequency of each character
• Place nodes in a priority queue
  – The lower the occurrence, the higher the priority in the queue

17
Building a Tree
• The queue after inserting all nodes

E i y l k . r s n a sp e
1 1 1 1 1 1 2 2 2 2 4 8

• Null Pointers are not shown

18
Building a Tree
• While priority queue contains two or
more nodes
– Create new node
– Dequeue node and make it left subtree
– Dequeue next node and make it right subtree
– Frequency of new node equals sum of
frequency of left and right children
– Enqueue new node back into queue

19
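A minimal sketch of this loop, using Python's `heapq` as the priority queue; the tuple node layout and the tie-breaking counter are choices of this sketch, not part of the slides:

```python
import heapq
from collections import Counter

def build_tree(text):
    # One (freq, tiebreak, char, left, right) tuple per distinct character.
    counts = Counter(text)
    heap = [(f, i, c, None, None) for i, (c, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique key so equal frequencies never compare subtrees
    while len(heap) > 1:                   # while two or more nodes remain
        left = heapq.heappop(heap)         # lowest frequency -> left subtree
        right = heapq.heappop(heap)        # next lowest -> right subtree
        merged = (left[0] + right[0], tiebreak, None, left, right)
        tiebreak += 1
        heapq.heappush(heap, merged)       # enqueue the combined node
    return heap[0]

root = build_tree("Eerie eyes seen near lake.")
print(root[0])  # 26 -- the root frequency equals the character count
```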
Building a Tree

E i y l k . r s n a sp e
1 1 1 1 1 1 2 2 2 2 4 8

20
Building a Tree

[Tree figure: dequeue E:1 and i:1 and create a new parent node of frequency 2, with E as left child and i as right child. Remaining queue: y:1 l:1 k:1 .:1 r:2 s:2 n:2 a:2 sp:4 e:8]

21
Building a Tree

[Tree figure: the new node (E i):2 is enqueued. Queue: y:1 l:1 k:1 .:1 (E i):2 r:2 s:2 n:2 a:2 sp:4 e:8]

22
Building a Tree

[Tree figure: dequeue y:1 and l:1 and create a new parent node (y l):2. Remaining queue: k:1 .:1 (E i):2 r:2 s:2 n:2 a:2 sp:4 e:8]

23
Building a Tree

[Tree figure: (y l):2 enqueued. Queue: k:1 .:1 (E i):2 (y l):2 r:2 s:2 n:2 a:2 sp:4 e:8]

24
Building a Tree

[Tree figure: dequeue k:1 and .:1 and create a new parent node (k .):2. Remaining queue: (E i):2 (y l):2 r:2 s:2 n:2 a:2 sp:4 e:8]

25
Building a Tree

[Tree figure: (k .):2 enqueued. Queue: r:2 s:2 n:2 a:2 (E i):2 (y l):2 (k .):2 sp:4 e:8]

26
Building a Tree

[Tree figure: dequeue r:2 and s:2 and create a new parent node (r s):4. Remaining queue: n:2 a:2 (E i):2 (y l):2 (k .):2 sp:4 e:8]

27
Building a Tree

[Tree figure: (r s):4 enqueued. Queue: n:2 a:2 (E i):2 (y l):2 (k .):2 (r s):4 sp:4 e:8]

28
Building a Tree

[Tree figure: dequeue n:2 and a:2 and create a new parent node (n a):4. Remaining queue: (E i):2 (y l):2 (k .):2 (r s):4 sp:4 e:8]

29
Building a Tree

[Tree figure: (n a):4 enqueued. Queue: (E i):2 (y l):2 (k .):2 (r s):4 (n a):4 sp:4 e:8]

30
Building a Tree

[Tree figure: dequeue (E i):2 and (y l):2 and create a new parent node (E i y l):4. Remaining queue: (k .):2 (r s):4 (n a):4 sp:4 e:8]
31
Building a Tree

[Tree figure: (E i y l):4 enqueued. Queue: (k .):2 (E i y l):4 (r s):4 (n a):4 sp:4 e:8]

32
Building a Tree

[Tree figure: dequeue (k .):2 and sp:4 and create a new parent node ((k .) sp):6. Remaining queue: (E i y l):4 (r s):4 (n a):4 e:8]
33
Building a Tree

[Tree figure: ((k .) sp):6 enqueued. Queue: (E i y l):4 (r s):4 (n a):4 ((k .) sp):6 e:8]

What is happening to the characters with a low number of occurrences?

34
Building a Tree

[Tree figure: dequeue (r s):4 and (n a):4 and create a new parent node ((r s)(n a)):8. Remaining queue: (E i y l):4 ((k .) sp):6 e:8]
35
Building a Tree

[Tree figure: ((r s)(n a)):8 enqueued. Queue: (E i y l):4 ((k .) sp):6 e:8 ((r s)(n a)):8]

36
Building a Tree

[Tree figure: dequeue (E i y l):4 and ((k .) sp):6 and create a new parent node of frequency 10. Remaining queue: e:8 ((r s)(n a)):8]
37
Building a Tree

[Tree figure: the node of frequency 10 is enqueued. Queue: e:8 ((r s)(n a)):8 10]

38
Building a Tree

[Tree figure: dequeue e:8 and ((r s)(n a)):8 and create a new parent node of frequency 16. Remaining queue: 10]

39
Building a Tree

[Tree figure: the node of frequency 16 is enqueued. Queue: 10 16]

40
Building a Tree
[Tree figure: dequeue 10 and 16 and create a new parent node of frequency 26, the root of the final tree]

41
Building a Tree
• After enqueueing this node there is only one node left in the priority queue.

[Tree figure: the complete tree of frequency 26]

42
Building a Tree
Dequeue the single node left in the queue.

This tree contains the new code words for each character.

The frequency of the root node should equal the number of characters in the text:
"Eerie eyes seen near lake." has 26 characters.

[Final tree: 26 splits into 10 and 16; 10 splits into (E i y l):4 and ((k .) sp):6; 6 splits into (k .):2 and sp:4; 16 splits into e:8 and ((r s)(n a)):8; 8 splits into (r s):4 and (n a):4]
43
Encoding the File
Traverse Tree for Codes
• Perform a traversal of the tree to obtain new code words
• Going left is a 0, going right is a 1
• A code word is completed only when a leaf node is reached

[Tree figure: the final Huffman tree from the previous slide]

44
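The traversal can be sketched recursively; the three-character tree below is a made-up illustration (nodes are `(char, left, right)` tuples), not the full tree from the slides:

```python
# Assign 0 to a left edge and 1 to a right edge; the code word for a
# character is the accumulated path from the root to its leaf.
def assign_codes(node, prefix='', table=None):
    if table is None:
        table = {}
    char, left, right = node
    if char is not None:          # leaf reached: the path so far is the code
        table[char] = prefix
    else:
        assign_codes(left, prefix + '0', table)
        assign_codes(right, prefix + '1', table)
    return table

# Hypothetical tree: root -> (a, internal); internal -> (b, c).
tree = (None, ('a', None, None), (None, ('b', None, None), ('c', None, None)))
print(assign_codes(tree))  # {'a': '0', 'b': '10', 'c': '11'}
```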
Encoding the File
Traverse Tree for Codes
Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111

[Tree figure: the final Huffman tree]
45
Encoding the File
• Rescan text and encode the file using the new code words

Eerie eyes seen near lake.

0000101100000110011
100010101101011
110110101110011
11101011111100011
001111110100100101

• Why is there no need for a separator character?
46
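Once the table exists, encoding is just a dictionary lookup per character; the table below is the one derived on these slides:

```python
# Code table from the tree built for "Eerie eyes seen near lake."
table = {'E': '0000', 'i': '0001', 'y': '0010', 'l': '0011',
         'k': '0100', '.': '0101', ' ': '011', 'e': '10',
         'r': '1100', 's': '1101', 'n': '1110', 'a': '1111'}

text = "Eerie eyes seen near lake."
encoded = ''.join(table[c] for c in text)  # no separators: it is a prefix code
print(len(encoded))  # 84 bits, versus 8 * 26 = 208 for ASCII
```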
Encoding the File
Results
• Have we made things any better?
• 84 bits to encode the text
• ASCII would take 8 * 26 = 208 bits

If a modified fixed-length code of 4 bits per character were used instead, the total would be 4 * 26 = 104 bits. The savings are not as great.
47
Decoding the File
• How does receiver know what the codes are?
• Tree constructed for each text file.
– Considers frequency for each file
– Big hit on compression, especially for smaller files
• Tree predetermined
– based on statistical analysis of text files or file types
• Data transmission is bit based versus byte based

48
Decoding the File
• Once the receiver has the tree, it scans the incoming bit stream
• 0 → go left
• 1 → go right

10100011011110111101111110000110101

[Tree figure: the final Huffman tree]
49
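The bit-by-bit walk can be sketched with a nested-dict tree; rebuilding the tree from the code table is a convenience of this sketch (the receiver would normally be sent the tree itself):

```python
# Code table from the earlier slides; ' ' stands for space.
table = {'E': '0000', 'i': '0001', 'y': '0010', 'l': '0011',
         'k': '0100', '.': '0101', ' ': '011', 'e': '10',
         'r': '1100', 's': '1101', 'n': '1110', 'a': '1111'}

def insert(tree, code, char):
    # Walk/create internal nodes for all but the last bit, then place the leaf.
    for bit in code[:-1]:
        tree = tree.setdefault(bit, {})
    tree[code[-1]] = char

root = {}
for char, code in table.items():
    insert(root, code, char)

def decode(bits):
    out, node = [], root
    for bit in bits:
        node = node[bit]             # 0 -> left branch, 1 -> right branch
        if isinstance(node, str):    # reached a leaf: emit and restart at root
            out.append(node)
            node = root
    return ''.join(out)

print(decode('10100011011110111101111110000110101'))  # 'eel snarl.'
```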
Complexity Analysis
• The time complexity of the Huffman algorithm is O(n log n).
• Using a heap to store the weight of each tree, each iteration requires O(log n) time to determine the cheapest weight and to insert the new weight. There are O(n) iterations, one for each item.

50
Drawbacks
The main disadvantage of Huffman’s method is that it makes
two passes over the data:
• one pass to collect frequency counts of the letters in the
message, followed by the construction of a Huffman tree
and transmission of the tree to the receiver; and
• a second pass to encode and transmit the letters
themselves, based on the static tree structure.
This causes delay when used for network communication,
and in file compression applications the extra disk accesses
can slow down the algorithm.

51
Summary
• Huffman coding is a technique used
to compress files for transmission
• Uses statistical coding
– more frequently used symbols have
shorter code words
• Works well for text and fax
transmissions
• An application that uses several data
structures
52
