
Huffman Coding

Lawrence M. Brown

Huffman Coding†

• Huffman Coding
• Variable Length Encoding
• Building a Tree
• Decoding
• Encoding

†Adapted from: [Link]

25 September, 1999

Huffman Coding
• Huffman Coding is a variable-length prefix encoding algorithm for
compression of character streams.

• Codes are assigned to characters such that the length of the code depends
  on the relative frequency of the corresponding character.
Examples:
• File compression: JPEG images, MPEG movies.
• Transmission of data over band-limited channels: modem data compression.

Letter  Frequency    Letter  Frequency
A        77          N        67
B        17          O        67
C        32          P        20
D        42          Q         5
E       120          R        59
F        24          S        67
G        17          T        85
H        50          U        37
I        76          V        12
J         4          W        22
K         7          X         4
L        42          Y        22
M        24          Z         2

Frequency of occurrence per 1000 letters.¹

¹ Shaffer, Clifford A., A Practical Introduction to Data Structures and Algorithm Analysis, Java Edition, Prentice Hall (1998).

Data Representation
Bits and Bytes

• Digital computers store data in binary, or base-2, format.

• A binary digit (bit) is represented by a 0 or 1.

• A byte is an 8-bit number and is typically the smallest binary number
  represented on a computer.

01001011₂ = 0·2⁷ + 1·2⁶ + 0·2⁵ + 0·2⁴ + 1·2³ + 0·2² + 1·2¹ + 1·2⁰
          = 64 + 8 + 2 + 1
          = 75₁₀

Longer words (16-bit, 32-bit, 64-bit) are constructed from 8-bit bytes.
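As a quick check of the base-2 expansion above, Java's Integer.parseInt with radix 2 performs the same conversion (a minimal sketch; the class name is illustrative):

```java
public class BinaryDemo {
    public static void main(String[] args) {
        // parseInt with radix 2 applies the positional expansion shown above.
        int value = Integer.parseInt("01001011", 2);
        System.out.println(value); // prints 75
    }
}
```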


Unicode and ASCII


• Unicode is an International Standard that defines a universal character set
  (16-bit unsigned integers).

• Unicode characters range from 0 to 65,535 (\u0000 to \uFFFF) and
  incorporate all languages (English, Russian, Asian, etc.).

• Java stores characters (char) as Unicode.

• The standard set of ASCII characters still ranges from 32 to 127
  (\u0020 to \u007F in Unicode).

• ASCII characters occupy the lowest 7 bits of the Unicode set, with the
  upper 9 bits set to zero.

ASCII Character Set

        0    1    2    3    4    5    6    7
   0   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL
   8   BS   HT   NL   VT   NP   CR   SO   SI
  16   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB
  24   CAN  EM   SUB  ESC  FS   GS   RS   US
  32   SP   !    "    #    $    %    &    '
  40   (    )    *    +    ,    -    .    /
  48   0    1    2    3    4    5    6    7
  56   8    9    :    ;    <    =    >    ?
  64   @    A    B    C    D    E    F    G
  72   H    I    J    K    L    M    N    O
  80   P    Q    R    S    T    U    V    W
  88   X    Y    Z    [    \    ]    ^    _
  96   `    a    b    c    d    e    f    g
 104   h    i    j    k    l    m    n    o
 112   p    q    r    s    t    u    v    w
 120   x    y    z    {    |    }    ~    DEL
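The relationship between Java char values and the ASCII table can be checked directly (a small illustrative sketch):

```java
public class AsciiDemo {
    public static void main(String[] args) {
        // A Java char is a 16-bit Unicode code unit; for ASCII characters the
        // upper 9 bits are zero, so the numeric value matches the table above.
        char c = 'A';
        System.out.println((int) c);        // row 64, column 1 of the table
        System.out.println((int) '\u0041'); // same character via a Unicode escape
    }
}
```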


Variable-length Encoding
• Unicode and ASCII are fixed-length encoding schemes. All characters
require the same amount of storage (16 bits and 8 bits, respectively).

• Huffman coding is a variable-length encoding scheme. The number of bits
  required to store a coded character varies according to the relative
  frequency, or weight, of the character.

• A significant space savings is achieved for frequently used characters
  (requiring only one, two, or three bits).

• Little space saving is achieved for infrequent characters.

Letter  Frequency
E       120
I        10

A frequent letter such as E receives a much shorter Huffman code than an
infrequent letter such as I.


Huffman Coding Tree

• A Huffman Coding Tree is built from the observed frequencies of characters
  in a document.

• The document is scanned and the occurrence of each character is recorded.

• Next, a binary tree is built in which the external nodes store the
  characters and the corresponding character frequencies observed in the
  document.

• Often, pre-scanning a document and generating a custom Huffman Coding Tree
  is impractical. Instead, typical frequencies for the language are used in
  place of frequencies measured from the particular document.


Building a Huffman Coding Tree


• Consider the observed frequency of characters in a string that requires
encoding:

Character C D E F K L U Z
Frequency 32 42 120 24 7 42 37 2

• The first step is to construct a priority queue and insert each
  frequency-character (key-element) pair into the queue.

• Step 1:

2 7 24 32 37 42 42 120
Z K F C U L D E

Sorted, sequence-based, priority queue.
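Step 1 can be sketched with java.util.PriorityQueue (an illustrative sketch; ties between equal frequencies, such as D and L, are broken alphabetically here, whereas the slide's ordering of the two 42s is arbitrary):

```java
import java.util.PriorityQueue;

public class Step1Demo {
    public static void main(String[] args) {
        // Frequency-character pairs from the slide.
        int[]  freq  = {32, 42, 120, 24, 7, 42, 37, 2};
        char[] chars = {'C', 'D', 'E', 'F', 'K', 'L', 'U', 'Z'};

        // Order by frequency (the key); break frequency ties alphabetically.
        PriorityQueue<int[]> q = new PriorityQueue<>(
            (a, b) -> a[0] != b[0] ? a[0] - b[0] : a[1] - b[1]);
        for (int i = 0; i < freq.length; i++)
            q.add(new int[]{freq[i], chars[i]});

        // Removing minimums yields the sorted sequence shown on the slide
        // (with D before L here because of the alphabetical tie-break).
        StringBuilder sb = new StringBuilder();
        while (!q.isEmpty()) sb.append((char) q.remove()[1]);
        System.out.println(sb); // prints ZKFCUDLE
    }
}
```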


Building a Huffman Coding Tree


• In the second step, the two Items with the lowest key values are
removed from the priority queue.

• A new Binary Tree is created with the lowest-key Item as the left
external node, and the second lowest-key Item as the right external
node.
• The new Tree is then inserted back into the priority queue.

• Step 2:

24 32 37 42 42 120
9 F C U L D E

2 7
Z K


Building a Huffman Coding Tree


• The process is continued until only one node (the Binary Tree) is left in
the priority queue.

37 42 42 120
• Step 3: 32
C
33 U L D E

9 24
F

2 7
Z K

• Step 4: 37 42 42 120
U L D 65 E

32
33
C

9 24
F

2 7
Z K


Building a Huffman Coding Tree


• Step 5:

42 120
D 65 79 E

32 37 42
33
C U L

9 24
F

2 7
Z K


Building a Huffman Coding Tree


• Final tree, after n = 8 steps:

306

120 186
E

79 107

37 42 42 65
U L D

32 33
C

9 24
F

2 7
Z K


Building a Huffman Coding Tree


Algorithm Huffman( X ):
Input: String X of length n.
Output: Coding tree for X.

Compute frequency f(c) of each character c in X.
Initialize a priority queue Q.
for each character c in X do
    Create a single-node tree T storing c.
    Insert T into Q with key f(c).
while Q.size() > 1 do
    f1 ← Q.minKey()
    T1 ← Q.removeMinElement()
    f2 ← Q.minKey()
    T2 ← Q.removeMinElement()
    Create a new tree T with left subtree T1 and right subtree T2.
    Insert T into Q with key f1 + f2.
return Q.removeMinElement()    // the Huffman coding tree
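The algorithm above can be sketched in Java (an illustrative sketch using java.util.PriorityQueue in place of the minKey/removeMinElement interface; the heap breaks ties between equal frequencies arbitrarily, so codes for equal-frequency letters such as D and L may differ from the slides, though the total cost does not):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanBuild {
    // External (leaf) nodes store a character; internal nodes have two children.
    static class Node {
        int freq; char ch; Node left, right;
        Node(int freq, char ch) { this.freq = freq; this.ch = ch; }
        Node(Node left, Node right) {
            this.freq = left.freq + right.freq;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null; }
    }

    // Insert one single-node tree per character, then repeatedly merge the
    // two minimum-frequency trees until one tree remains.
    static Node build(Map<Character, Integer> freqs) {
        PriorityQueue<Node> q = new PriorityQueue<>((a, b) -> a.freq - b.freq);
        for (Map.Entry<Character, Integer> e : freqs.entrySet())
            q.add(new Node(e.getValue(), e.getKey()));
        while (q.size() > 1) {
            Node t1 = q.remove();    // lowest key becomes the left subtree
            Node t2 = q.remove();    // second lowest becomes the right subtree
            q.add(new Node(t1, t2));
        }
        return q.remove();
    }

    // Walk the tree: 0 for a left branch, 1 for a right branch.
    static void codes(Node n, String path, Map<Character, String> out) {
        if (n.isLeaf()) { out.put(n.ch, path); return; }
        codes(n.left, path + "0", out);
        codes(n.right, path + "1", out);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freqs = new HashMap<>();
        freqs.put('C', 32); freqs.put('D', 42); freqs.put('E', 120);
        freqs.put('F', 24); freqs.put('K', 7);  freqs.put('L', 42);
        freqs.put('U', 37); freqs.put('Z', 2);

        Map<Character, String> code = new HashMap<>();
        codes(build(freqs), "", code);

        // E (120) is only merged in the final step, as the lighter child.
        System.out.println(code.get('E')); // prints 0

        // Total bits = sum of frequency x code length; any Huffman tree for
        // this distribution is optimal with the same total cost.
        int total = 0;
        for (Map.Entry<Character, Integer> e : freqs.entrySet())
            total += e.getValue() * code.get(e.getKey()).length();
        System.out.println(total); // prints 785
    }
}
```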

Decoding
• To decode a bit stream (from the leftmost bit), start at the root node of the Tree:
• move to the left child if the bit is a “0”.
• move to the right child if the bit is a “1”.
• When an external node is reached, the character at the node is sent to the
decoded string.
• The next bit is then decoded from the root of the tree.

Decode: 1011001110111101

  101    → L    (remaining: 1001110111101)
  100    → U    (remaining: 1110111101)
  1110   → C    (remaining: 111101)
  111101 → K

Decoded string: LUCK

(The tree used is the final tree from the previous slide, with each left
branch labeled 0 and each right branch labeled 1.)
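The decoding walk can also be sketched without an explicit tree by exploiting the prefix property: accumulate bits until they match a complete code (an illustrative sketch; the code table is read off the tree above):

```java
import java.util.HashMap;
import java.util.Map;

public class HuffmanDecode {
    public static void main(String[] args) {
        // Code table read off the tree (0 = left branch, 1 = right branch).
        Map<String, Character> code = new HashMap<>();
        code.put("0", 'E');      code.put("100", 'U');   code.put("101", 'L');
        code.put("110", 'D');    code.put("1110", 'C');  code.put("11111", 'F');
        code.put("111100", 'Z'); code.put("111101", 'K');

        String bits = "1011001110111101";
        StringBuilder decoded = new StringBuilder(), buf = new StringBuilder();
        for (char b : bits.toCharArray()) {
            buf.append(b);                    // extend the current path from the root
            Character c = code.get(buf.toString());
            if (c != null) {                  // reached an external node
                decoded.append(c);
                buf.setLength(0);             // restart at the root for the next bit
            }
        }
        System.out.println(decoded); // prints LUCK
    }
}
```

Because no code is a prefix of another, the first complete match is always the correct character.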

Encoding
• Create a lookup table storing the binary code corresponding to the path
to each letter.
• If encoding ASCII text, a 128-element array suffices:

  String[] encoder = new String[128];
  encoder['C'] = "1110";

Encode DEED:

  DEED → 110 EED
       → 110 0 ED
       → 110 0 0 D
       → 110 0 0 110 = 11000110

Character  Frequency  Code    # bits
C          32         1110    4
D          42         110     3
E          120        0       1
F          24         11111   5
K          7          111101  6
L          42         101     3
U          37         100     3
Z          2          111100  6

• The ASCII representation of DEED would require 32 bits.

• The Huffman encoding requires only 8 bits.
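The lookup-table encoding can be sketched as follows (illustrative class name; the table entries are the codes from the slide):

```java
public class HuffmanEncode {
    public static void main(String[] args) {
        // Lookup table indexed directly by ASCII character value.
        String[] encoder = new String[128];
        encoder['C'] = "1110";   encoder['D'] = "110";    encoder['E'] = "0";
        encoder['F'] = "11111";  encoder['K'] = "111101"; encoder['L'] = "101";
        encoder['U'] = "100";    encoder['Z'] = "111100";

        // Concatenate the code for each character of the message.
        StringBuilder bits = new StringBuilder();
        for (char c : "DEED".toCharArray()) bits.append(encoder[c]);

        System.out.println(bits);          // prints 11000110
        System.out.println(bits.length()); // 8 bits, versus 32 in 8-bit ASCII
    }
}
```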

Analysis
• Define fᵢ = frequency of letter lᵢ, i = 1, …, n.
• Define cᵢ = cost of letter lᵢ (number of bits in its code).

• Expected cost per character:

      ECPC = ( Σᵢ cᵢ·fᵢ ) / ( Σᵢ fᵢ )  bits/character, summing over i = 1, …, n.

• Actual message length, ML = ECPC · N bits, where N is the total number of
  characters in the message.

Character  Frequency  Code    # bits
C          32         1110    4
D          42         110     3
E          120        0       1
F          24         11111   5
K          7          111101  6
L          42         101     3
U          37         100     3
Z          2          111100  6

ECPC = (4·32 + 3·42 + 1·120 + 5·24 + 6·7 + 3·42 + 3·37 + 6·2) bits
       / (32 + 42 + 120 + 24 + 7 + 42 + 37 + 2) characters
     = 785 / 306
     ≈ 2.57 bits/character.

A fixed-length encoding of 8 characters would require 3 bits per character,
giving an ML of 3 · 306 = 918 bits (versus 785 bits for the Huffman encoding).
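The ECPC computation above can be checked with a short sketch (illustrative class name):

```java
public class Ecpc {
    public static void main(String[] args) {
        int[] freq = {32, 42, 120, 24, 7, 42, 37, 2};  // C D E F K L U Z
        int[] bits = { 4,  3,   1,  5, 6,  3,  3, 6};  // code lengths from the table

        int cost = 0, total = 0;
        for (int i = 0; i < freq.length; i++) {
            cost  += bits[i] * freq[i];  // numerator: sum of c_i * f_i
            total += freq[i];            // denominator: total characters
        }

        // 785 / 306, versus 3 bits/character for a fixed-length code.
        System.out.printf("%.2f%n", (double) cost / total); // prints 2.57
    }
}
```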


Summary
• Huffman codes are variable length and are based on the observed
frequency of characters.

• No Huffman code for a character in the set is a prefix of another
  character's code (the codes are prefix-free).

• The best space savings for Huffman Coding compression is when the
variation in the frequencies of the letters is large.
