An Application of Binary Trees and Priority Queues
Greedy Approach
Huffman code
• Very often used for text compression
• Do you know how gzip or WinZip work?
• One of the classic compression methods
• First step: build a list of letters and their frequencies, e.g. for the message
“have a great day today”
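As a quick illustration, counting letter frequencies takes a single pass over the text. A minimal Python sketch using the slide's example message (the variable name `freqs` is mine):

```python
from collections import Counter

# Tally how often each character occurs in the message.
freqs = Counter("have a great day today")
```

`Counter` already behaves like the frequency list the slide describes: `freqs["a"]` is 5, `freqs[" "]` is 4, and so on.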
Huffman coding is an algorithm for lossless data compression, developed by David A. Huffman.
Motivation
Prefix code
• No prefix of a A 00 1 00
codeword is a B 010 01 10
codeword
C 011 001 11
• Uniquely decodable
D 100 0001 0001
E 11 00001 11000
F 101 000001 101
7
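The prefix-free property can be checked mechanically. A small Python sketch (the names `code_1`/`code_3` and the column reading are my interpretation of the table above):

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of a different codeword."""
    for c1 in codes:
        for c2 in codes:
            if c1 != c2 and c2.startswith(c1):
                return False
    return True

code_1 = ["00", "010", "011", "100", "11", "101"]       # first column: prefix-free
code_3 = ["00", "10", "11", "0001", "11000", "101"]     # third column: "11" prefixes "11000"
```

`is_prefix_free(code_1)` returns `True`, while `is_prefix_free(code_3)` returns `False`.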
Example:
Suppose you have a file with 100K characters.

                        a    b    c    d    e    f
Frequency (in 1000s)    45   13   12   16   9    5
Fixed-length codeword   000  001  010  011  100  101

With a fixed-length code: 3 bits per character × 100K characters = 300K bits.
Can we do better? Yes, by using variable-length codes instead of fixed-length codes.

Idea: give frequent characters short codewords and infrequent characters long codewords, i.e. the more frequent a character is, the shorter its codeword.

                          a    b    c    d    e     f
Frequency (in 1000s)      45   13   12   16   9     5
Fixed-length codeword     000  001  010  011  100   101
Variable-length codeword  0    101  100  111  1101  1100

This variable-length code uses 45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4 = 224K bits instead of 300K.
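The two totals can be checked with a few lines of Python (a sketch; the variable names are mine, and the frequencies and codeword lengths come from the table above):

```python
freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}   # in thousands
fixed_len = {c: 3 for c in freqs}                               # every codeword is 3 bits
var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}      # variable codeword lengths

fixed_bits = sum(freqs[c] * fixed_len[c] for c in freqs)        # 300 (thousand bits)
variable_bits = sum(freqs[c] * var_len[c] for c in freqs)       # 224 (thousand bits)
```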
Prefix Codes:

Using the variable-length code above, “face” is encoded as 1100 0 100 1101 = 110001001101.

To decode, we have to decide where each codeword begins and ends, since they are no longer all the same length. But this is easy, since no codeword is a prefix of another. We need only scan the input from left to right, and as soon as we recognize a codeword, we can print the corresponding character and start looking for the next one. In the case above, the only codeword that begins “1100…” is “f”, so we can print “f” and start decoding “0100…”, get “a”, and so on.
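The left-to-right scan can be sketched in Python using the variable-length code above (the `decode` helper is illustrative, not from the slides):

```python
# Codeword -> character, from the variable-length code above.
codes = {"0": "a", "101": "b", "100": "c", "111": "d", "1101": "e", "1100": "f"}

def decode(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in codes:           # a full codeword recognized: emit and restart
            out.append(codes[buf])
            buf = ""
    return "".join(out)
```

Because the code is prefix-free, the first match is always the right one: `decode("110001001101")` yields `"face"`.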
Representation:
A Huffman code is represented as a binary tree:
• each edge represents either 0 or 1
• 0 means “go to the left child”
• 1 means “go to the right child”
• each leaf corresponds to a character; its codeword is the sequence of 0s and 1s traversed from the root to reach it.
Since no prefix is shared, all legal codewords end at leaves, and decoding a string means following edges, according to the sequence of 0s and 1s in the string, until a leaf is reached.
The (Real) Basic Algorithm
1. Scan the text to be compressed and tally the occurrences of all characters.
2. Sort or prioritize the characters based on their number of occurrences in the text.
3. Build the Huffman code tree based on the prioritized list.
4. Perform a traversal of the tree to determine all code words.
5. Scan the text again and create the new file using the Huffman codes.
Building a Tree
Scan the original text:
Eerie eyes seen near lake.
• What is the frequency of each character in the text?

Char    Freq.    Char    Freq.    Char    Freq.
E       1        y       1        k       1
e       8        s       2        .       1
r       2        n       2
i       1        a       2
space   4        l       1
Building a Tree
Prioritize characters.
• The queue after inserting all nodes:

E  i  y  l  k  .  r  s  n  a  sp  e
1  1  1  1  1  1  2  2  2  2  4   8
Building a Tree
• While the priority queue contains two or more nodes:
– Create a new node
– Dequeue a node and make it the left subtree
– Dequeue the next node and make it the right subtree
– The frequency of the new node equals the sum of the frequencies of its left and right children
– Enqueue the new node back into the queue
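The loop above maps directly onto a binary heap. A minimal Python sketch using `heapq` (`build_huffman_tree` is a hypothetical helper name; leaves are characters, internal nodes are `(left, right)` pairs, and the counter only breaks ties so the heap never compares tree payloads):

```python
import heapq
from itertools import count

def build_huffman_tree(freqs):
    """freqs: dict mapping character -> frequency. Returns (total_freq, _, root)."""
    tiebreak = count()
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) >= 2:
        f1, _, left = heapq.heappop(heap)    # dequeue the two cheapest subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0]

root = build_huffman_tree({"E": 1, "e": 8, "r": 2, "i": 1, " ": 4, "y": 1,
                           "s": 2, "n": 2, "a": 2, "l": 1, "k": 1, ".": 1})
```

For the example text, the frequency stored at the root (`root[0]`) is 26, the number of characters in "Eerie eyes seen near lake."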
Building a Tree
[Figure sequence, slides 20–41: the tree grows by repeatedly dequeuing the two cheapest nodes, merging them, and enqueueing the result.]
• E(1) + i(1) → 2
• y(1) + l(1) → 2
• k(1) + .(1) → 2
• r(2) + s(2) → 4
• n(2) + a(2) → 4
• (E i)(2) + (y l)(2) → 4
• (k .)(2) + sp(4) → 6
• (r s)(4) + (n a)(4) → 8
• (E i y l)(4) + (k . sp)(6) → 10
• e(8) + (r s n a)(8) → 16
• (10) + (16) → 26
Building a Tree
• After enqueueing this node there is only one node left in the priority queue.
Building a Tree
Dequeue the single node left in the queue.
This tree contains the new code words for each character.
The frequency of the root node should equal the number of characters in the text:
“Eerie eyes seen near lake.” has 26 characters, and the root has frequency 26.
Encoding the File
Traverse Tree for Codes
• Perform a traversal of the tree to obtain the new code words
• Going left is a 0, going right is a 1
• A code word is only completed when a leaf node is reached
Encoding the File
Traverse Tree for Codes

Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111
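The same table can be produced programmatically. A Python sketch, with the final tree transcribed as nested `(left, right)` pairs (a leaf is just the character itself) and `assign_codes` as an illustrative helper:

```python
# The final tree from the slides, as nested (left, right) pairs.
tree = (((("E", "i"), ("y", "l")),
         (("k", "."), " ")),
        ("e", (("r", "s"), ("n", "a"))))

def assign_codes(node, prefix="", table=None):
    """Walk the tree; going left appends '0', going right appends '1'."""
    if table is None:
        table = {}
    if isinstance(node, tuple):
        assign_codes(node[0], prefix + "0", table)
        assign_codes(node[1], prefix + "1", table)
    else:
        table[node] = prefix   # leaf reached: the path from the root is the codeword
    return table

codes = assign_codes(tree)
```

The resulting `codes` dict reproduces the table: `codes["E"]` is `"0000"`, `codes[" "]` is `"011"`, `codes["e"]` is `"10"`, and so on.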
Encoding the File
• Rescan the text and encode the file using the new code words.
“Eerie eyes seen near lake.” becomes
0000101100000110011
100010101101011
110110101110011
11101011111100011
001111110100100101
• Why is there no need for a separator character?
Encoding the File
Results
• Have we made things any better?
• The Huffman code takes 84 bits to encode the text (six characters with 4-bit codes and frequency 1, four with 4-bit codes and frequency 2, space at 3 bits × 4, and e at 2 bits × 8: 24 + 32 + 12 + 16 = 84).
• ASCII would take 8 × 26 = 208 bits.
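The comparison can be recomputed directly from the code table and the message (a Python sketch; the variable names are mine):

```python
codes = {"E": "0000", "i": "0001", "y": "0010", "l": "0011",
         "k": "0100", ".": "0101", " ": "011", "e": "10",
         "r": "1100", "s": "1101", "n": "1110", "a": "1111"}

text = "Eerie eyes seen near lake."
encoded = "".join(codes[ch] for ch in text)   # concatenate the codewords

huffman_bits = len(encoded)   # total bits in the Huffman encoding
ascii_bits = 8 * len(text)    # 8 bits per character in ASCII
```

Running this gives 84 Huffman bits versus 208 ASCII bits, a saving of roughly 60%.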
Decoding the File
• Once receiver has tree
it scans incoming bit 26
stream 10
16
• 0 go left 4 e
6 8
• 1 go right 2 2
8
2 sp 4 4
4
101000110111101111 E i y l k .
r s n a
1 1 1 1 1 1
01111110000110101 2 2 2 2
49
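Decoding by walking the tree can be sketched as follows (the tree is transcribed from the slides as nested `(left, right)` pairs; `decode` is an illustrative helper):

```python
# The final tree from the slides; a leaf is just the character itself.
tree = (((("E", "i"), ("y", "l")),
         (("k", "."), " ")),
        ("e", (("r", "s"), ("n", "a"))))

def decode(bits, tree):
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]   # 0: go left, 1: go right
        if not isinstance(node, tuple):             # leaf: emit character, restart at root
            out.append(node)
            node = tree
    return "".join(out)
```

For example, `decode("0000101100000110011", tree)` recovers `"Eerie "` from the first 19 bits of the encoded file.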
Complexity Analysis
• The time complexity of the Huffman
algorithm is O(nlogn). Using a heap to
store the weight of each tree, each
iteration.
• Using a heap to store the weight of each
tree, each iteration
requires O(logn) time to determine the
cheapest weight and insert the new
weight. There are O(n) iterations, one
for each item.
50
Drawbacks
The main disadvantage of Huffman’s method is that it makes
two passes over the data:
• one pass to collect frequency counts of the letters in the
message, followed by the construction of a Huffman tree
and transmission of the tree to the receiver; and
• a second pass to encode and transmit the letters
themselves, based on the static tree structure.
This causes delay when used for network communication,
and in file compression applications the extra disk accesses
can slow down the algorithm.
Summary
• Huffman coding is a technique used
to compress files for transmission
• Uses statistical coding
– more frequently used symbols have
shorter code words
• Works well for text and fax
transmissions
• An application that uses several data
structures