Huffman Code


Greedy Algorithms

Huffman Tree
Huffman Coding
Huffman coding is a compression technique that reduces the size of data or a message.
(It is a lossless data-compression technique.)
Computer Data Encoding:
How do we represent data in binary?

Fixed length codes:

Encode every symbol by a unique binary string of a fixed length.
Example: ASCII (8-bit code), the American Standard Code for Information Interchange.
ASCII Example:

ABCA

A B C A
01000001 01000010 01000011 01000001
ASCII Example:
Suppose we have a message

BCCABBDDAECCBBAEDDCC
The message is sent using ASCII codes. ASCII codes are 8-bit codes:
A = 65
B = 66
C = 67
D = 68
E = 69
Total space usage in bits:

Assume an l-bit fixed-length code.

For a file of n characters, we need nl bits.
ASCII Example:
Suppose we have a message

BCCABBDDAECCBBAEDDCC

A = 01000001, B = 01000010, C = 01000011, D = 01000100, E = 01000101

So total bits = 8 x 20 = 160 bits (8 bits for each character).
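The arithmetic above can be sketched as a small helper (the `fixed_length_size` function is illustrative, not part of the original slides):

```python
# Size of a message under a fixed-length code: n characters x l bits each.
def fixed_length_size(message, bits_per_char):
    """Total bits needed when every character uses the same code length."""
    return len(message) * bits_per_char

message = "BCCABBDDAECCBBAEDDCC"
print(fixed_length_size(message, 8))  # ASCII: 20 x 8 = 160 bits
print(fixed_length_size(message, 3))  # 3-bit code used later: 20 x 3 = 60 bits
```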


Fixed Length codes
Idea: In order to save space, use less bits.
Character Frequency Code
A 3 000
B 5 001
C 6 010
D 4 011
E 2 100

There are 20 characters, so the message takes 20 x 3 = 60 bits.

The code table must also be sent so the receiver can decode the message. So the total cost of the message is:
Fixed Length codes
Idea: In order to save space, use less bits.
Character Frequency Code
A 3 000
B 5 001
C 6 010
D 4 011
E 2 100

Message size = 20 x 3 = 60 bits
Original characters in the table = 5 x 8 = 40 bits
New codes for the 5 characters = 5 x 3 = 15 bits
Total = 115 bits (message plus table)
Variable Length codes
Idea: In order to save space, use less bits
for frequent characters and more bits
for rare characters.

The variable-length codes assigned to input characters are prefix codes: the codes (bit sequences) are assigned in such a way that the code of one character is not a prefix of the code of any other character.
Variable Length codes
Idea: In order to save space, use less bits
for frequent characters and more bits
for rare characters.

Example: suppose an alphabet of 3 symbols, { A, B, C }, and suppose the file contains 1,000,000 characters.

A fixed-length code needs 2 bits per symbol, for a total of 2,000,000 bits.
Variable Length codes - example
Suppose the frequency distribution of the
characters is:
A B C
999,000 500 500

Encode: A B C
0 10 11

Note that the code of A is of length 1, and the codes for B and C are of length 2.
Total space usage in bits:

Fixed code: 1,000,000 x 2 = 2,000,000 bits

Variable code: 999,000 x 1 + 500 x 2 + 500 x 2 = 1,001,000 bits

A savings of almost 50%
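As a quick check, the comparison can be computed directly (the variable names here are illustrative):

```python
# Fixed- vs variable-length encoding for the frequencies in the example.
freqs = {"A": 999_000, "B": 500, "C": 500}
code_lengths = {"A": 1, "B": 2, "C": 2}  # A = 0, B = 10, C = 11

fixed_bits = sum(freqs.values()) * 2  # 2 bits per symbol, 1,000,000 symbols
variable_bits = sum(freqs[c] * code_lengths[c] for c in freqs)

print(fixed_bits)     # 2000000
print(variable_bits)  # 1001000
```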


How do we decode?
In the fixed length, we know where every
character starts, since they all have the
same number of bits.

Example: A = 00
B = 01
C = 10

00 00 00 01 01 10 10 10 01 10 01 00 00 10 10
A  A  A  B  B  C  C  C  B  C  B  A  A  C  C
How do we decode
In the variable length code, we use an
idea called Prefix code, where no code is a
prefix of another.

Example: A = 0
B = 10
C = 11

None of the above codes is a prefix of another.
Prefix Code
Let us understand prefix codes with a counterexample. Let there be four characters a, b, c, and d, with corresponding variable-length codes 00, 01, 0, and 1.
This coding leads to ambiguity because the code assigned to c is a prefix of the codes assigned to a and b: if the compressed bit stream is 0001, the decompressed output may be cccd or ccb or acd or ab.
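A small sketch of an ambiguity check: a code is prefix-free exactly when no codeword is a prefix of another, and for sorted codewords it suffices to compare adjacent pairs. The function name is illustrative.

```python
def is_prefix_free(codes):
    """True if no codeword is a prefix of another codeword.

    After sorting, any codeword that is a prefix of some other codeword
    is immediately followed by one of its extensions, so checking
    adjacent pairs is enough.
    """
    words = sorted(codes.values())
    return all(not nxt.startswith(prev) for prev, nxt in zip(words, words[1:]))

# The counterexample from the text: c's code (0) is a prefix of a's and b's.
print(is_prefix_free({"a": "00", "b": "01", "c": "0", "d": "1"}))  # False
print(is_prefix_free({"A": "0", "B": "10", "C": "11"}))            # True
```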
How do we decode
Example: A = 0
B = 10
C = 11

So, for the string AAABBCCCBCBAACC, the encoding is:

0001010111111101110001111
Prefix Code
Example: A = 0
B = 10
C = 11

Decode the string

0001010111111101110001111

0|0|0|10|10|11|11|11|10|11|10|0|0|11|11 -> AAABBCCCBCBAACC
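Decoding with a prefix code can be sketched by scanning the bits and emitting a symbol as soon as the accumulated bits match a codeword; this is correct precisely because no codeword is a prefix of another:

```python
def decode(bits, codes):
    """Decode a bit string using a prefix code, scanning left to right."""
    inverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix property: the first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)

codes = {"A": "0", "B": "10", "C": "11"}
print(decode("0001010111111101110001111", codes))  # AAABBCCCBCBAACC
```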
Requirement:

Construct a variable-length code for a given file with the following properties:

1. Prefix code.
2. Using shortest possible codes.
3. Efficient.
Huffman Tree

There are two major parts in Huffman coding:

1. Build a Huffman tree from the input characters.
2. Traverse the Huffman tree and assign codes to the characters.
Steps to Build Huffman Tree:
1. Create a leaf node for each unique character and build a min-heap of all leaf nodes.

2. Extract the two nodes with the minimum frequency from the min-heap.

3. Create a new internal node with frequency equal to the sum of the two nodes' frequencies. Make the first extracted node its left child and the other extracted node its right child, and add this node to the min-heap.

4. Repeat steps 2 and 3 until the heap contains only one node.

5. Once the tree is complete, assign 0 to each left edge and 1 to each right edge.
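The steps above can be sketched with Python's standard heapq module. Tie-breaking among equal frequencies can produce different (but equally good) code assignments, so only the code lengths and the total cost are guaranteed to match the worked examples.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman tree with a min-heap and return the code table."""
    # Heap entries: (frequency, tiebreak, tree). A leaf is a character;
    # an internal node is a (left, right) pair of subtrees.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two minimum-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(node, path):
        if isinstance(node, str):           # leaf: record the accumulated bits
            codes[node] = path or "0"
        else:
            walk(node[0], path + "0")       # left edge = 0
            walk(node[1], path + "1")       # right edge = 1
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}))
```

On the example that follows, this yields code lengths 4, 4, 3, 2, 2, 2 for A through F and a total cost of 510 bits, matching the hand-built tree.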
Idea
Consider a binary tree, with:
0 meaning a left turn
1 meaning a right turn.

For example, in a tree where A hangs off the root's left edge, B off the next left edge, and C and D are the two deepest leaves, the codes are A = 0, B = 10, C = 110, D = 111.
Huffman Tree Example:

Alphabet: A, B, C, D, E, F

Frequency table:
A B C D E F
10 20 30 40 50 60

Total File Length: 210


Algorithm Run:
A 10   B 20   C 30   D 40   E 50   F 60

Merge A(10) and B(20) into X(30):
X 30   C 30   D 40   E 50   F 60

Merge X(30) and C(30) into Y(60):
D 40   E 50   Y 60   F 60

Merge D(40) and E(50) into Z(90):
Y 60   F 60   Z 90

Merge Y(60) and F(60) into W(120):
Z 90   W 120

Merge Z(90) and W(120) into the root V(210).

The final tree (0 = left edge, 1 = right edge):
V(210): 0 -> Z(90),  1 -> W(120)
Z(90):  0 -> D(40),  1 -> E(50)
W(120): 0 -> Y(60),  1 -> F(60)
Y(60):  0 -> X(30),  1 -> C(30)
X(30):  0 -> A(10),  1 -> B(20)
The Huffman encoding (read off the tree above):
A: 1000
B: 1001
C: 101
D: 00
E: 01
F: 11

File Size: 10x4 + 20x4 + 30x3 + 40x2 + 50x2 + 60x2 = 40 + 80 + 90 + 80 + 100 + 120 = 510 bits
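The file size can be verified directly from the frequency table and the code lengths:

```python
# Weighted cost of the Huffman code above: sum of frequency x code length.
freqs = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
codes = {"A": "1000", "B": "1001", "C": "101", "D": "00", "E": "01", "F": "11"}

cost = sum(freqs[ch] * len(codes[ch]) for ch in freqs)
print(cost)  # 510
```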
Note the savings:

The Huffman code required 510 bits for the file.

A fixed-length code needs 3 bits for 6 characters; the file has 210 characters, for a total of 630 bits.


Example: Construct a Huffman code for the following data and also calculate the cost of the tree.

Character: A  B  C  D  E
Frequency: 12 04 45 16 23
The Huffman encoding:
A: 0001
B: 0000
C: 1
D: 001
E: 01

The tree (0 = left edge, 1 = right edge):
merge B(04) + A(12) -> W(16)
merge W(16) + D(16) -> X(32)
merge X(32) + E(23) -> Y(55)
merge Y(55) + C(45) -> Z(100), the root
Z(100): 0 -> Y(55), 1 -> C(45)
Y(55):  0 -> X(32), 1 -> E(23)
X(32):  0 -> W(16), 1 -> D(16)
W(16):  0 -> B(04), 1 -> A(12)

File Size: 4x4 + 12x4 + 45x1 + 16x3 + 23x2 = 16 + 48 + 45 + 48 + 46 = 203 bits
Example: Construct a Huffman code for the following data, calculate the cost of the tree, and decode the string 1101000010001.

Character:   A    B    C    D    E    F
Probability: 0.35 0.12 0.21 0.05 0.18 0.09
Example: Construct a Huffman code for the following message, whose character frequencies are given below, and decode the string 001110001010000010, which was encoded using the Huffman code.

Character: A  B  C  D  E  F  G
Frequency: 23 10 03 21 20 06 17
How do we decode the string 100010111001010?

Example: Construct a Huffman code for the following message and decode the string 100010111001010, which was encoded using the Huffman code.

Character:   A   B   C   D    E
Probability: 0.4 0.1 0.2 0.15 0.15
The Huffman encoding:
A: 1
B: 000
C: 011
D: 001
E: 010

The tree (0 = left edge, 1 = right edge):
merge B(0.1) + D(0.15) -> X(0.25)
merge E(0.15) + C(0.2) -> Y(0.35)
merge X(0.25) + Y(0.35) -> W(0.6)
merge W(0.6) + A(0.4) -> Z(1.0), the root
Z(1.0):  0 -> W(0.6),  1 -> A(0.4)
W(0.6):  0 -> X(0.25), 1 -> Y(0.35)
X(0.25): 0 -> B(0.1),  1 -> D(0.15)
Y(0.35): 0 -> E(0.15), 1 -> C(0.2)

Decoding 100010111001010: 1|000|1|011|1|001|010 -> ABACADE
Huffman Tree:

As extractMin() calls minHeapify(), each extraction takes O(log n) time.

In each iteration there is one less subtree; initially there are n subtrees.

Total: O(n log n) time.


Advantages of Huffman Encoding-

1) This encoding scheme results in saving a lot of storage space, since the binary codes generated are variable in length.
2)It generates shorter binary codes for encoding
symbols/characters that appear more frequently
in the input string.
3)The binary codes generated are prefix-free.
Disadvantages of Huffman Encoding-
1) Lossless techniques like Huffman encoding are
suitable only for encoding text and program files and
are unsuitable for encoding digital images.
2) Huffman encoding is a relatively slower process since it uses two passes: one for building the statistical model and another for encoding. Thus, lossless techniques that use Huffman encoding are considerably slower than others.
Real-life applications of Huffman
Encoding
1) Huffman encoding is widely used in compression formats like GZIP, PKZIP (WinZip), and BZIP2.
Thank You
