How Huffman Coding Compresses Data by Assigning Variable-Length Codes

Creating a Huffman Tree

By the end of this worksheet you should be able to:

• Explain how data can be compressed using Huffman coding.

A Huffman Tree can look confusing at first, as most
binary trees do.

The nice thing is that we only have to remember that it is
called a binary tree because each node can have at most
two branches.

A node without any branches (so the end of a branch) is
called a leaf node.

Yet, how could we create a Huffman Tree in the first
place?

Well, let’s take a sentence:

THE CAT SAT ON THE MAT

Now let’s go through the steps…

1. Create a table showing the frequency of each
character in the sentence (including spaces,
punctuation and any other special characters):

Character   Frequency
T           5
H           2
E           2
SPACE       5
C           1
A           3
S           1
O           1
N           1
M           1
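
The frequency table in step 1 can also be produced by a short program. The following is a minimal sketch using Python's standard `collections.Counter`; the variable names are illustrative only.

```python
from collections import Counter

sentence = "THE CAT SAT ON THE MAT"

# Counter tallies every character, including the spaces.
frequencies = Counter(sentence)

for char, freq in frequencies.items():
    label = "SPACE" if char == " " else char
    print(label, freq)
```

The characters are printed in the order they first appear in the sentence, which matches the table above.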

2. Order the list by frequency, highest value at the
top. The order of items with the same frequency
(alphabetical, for example) is not important; only
the frequency order matters:

Character   Frequency
T           5
SPACE       5
A           3
H           2
E           2
C           1
S           1
O           1
N           1
M           1
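
The ordering in step 2 can be sketched in Python as well. This assumes the built-in `sorted` function with `reverse=True`; because Python's sort is stable, characters with equal frequencies stay in the order they first appeared in the sentence.

```python
from collections import Counter

sentence = "THE CAT SAT ON THE MAT"
frequencies = Counter(sentence)

# Sort by frequency only, highest first. Ties keep their original order.
ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

for char, freq in ordered:
    print("SPACE" if char == " " else char, freq)
```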

3. Now comes the drawing part.

Start from the bottom of the list (the least
frequent items) and put these into circles, starting
at the bottom right of your paper (best to have
the paper as landscape; gives you more room):

4. Now take the two items furthest to the right, pair
them together, joining the branches by summing
the frequencies together.
Continue to do so until you have matched all the
pairs you can, leaving any spares for the moment:

5. Include the next layer of characters, putting them
higher in the chart, so it’s clear that they appear
more frequently than the last lot of characters.
Start from the right-most spare item (even if this
came from the last lot of characters) and join
them as before:

6. Continue adding each character, and joining them
together, until all characters are shown on the
diagram:

7. You should now have all the characters on your
diagram. These are known as the leaf nodes.
Now return to the right-most side of the diagram
and begin joining the paired values together:

8. You now need to look at each set of nodes next to
each other, working out the sum of their two
frequencies.
Connect those neighbouring nodes that add up to
the lowest value (if you have two choices, it is best
to choose the right-most pair).
In this example, we could combine 4 and 3, or 3
and 5. 4 + 3 = 7 whilst 3 + 5 = 8.
Since 7 is a lower value, we have chosen to join
those two together:
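
The pairing rule in step 8 can be sketched in Python. This is only an illustration: the function name `lowest_neighbour_pair` is made up for this example, and the row `[5, 3, 4]` assumes a left-to-right layout for the three nodes, since the diagram is not reproduced here. The same pair (4 and 3) is chosen whichever way round the row is drawn.

```python
def lowest_neighbour_pair(weights):
    """Return (index of the left node, sum) for the neighbouring pair
    with the lowest sum. Ties go to the right-most pair."""
    best_index = 0
    best_sum = weights[0] + weights[1]
    for i in range(1, len(weights) - 1):
        pair_sum = weights[i] + weights[i + 1]
        if pair_sum <= best_sum:  # <= so that a later pair wins a tie
            best_index, best_sum = i, pair_sum
    return best_index, best_sum

row = [5, 3, 4]  # assumed layout of the nodes from the example
index, total = lowest_neighbour_pair(row)
print(row[index], "+", row[index + 1], "=", total)
```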

9. Continue pairing up the tree until you come to a
single node, called the root node:

10. Look at the root node’s value; it is 22.
Now add all of the frequencies in the table
together. What do you get?
1 + 1 + 1 + 1 + 1 + 2 + 2 + 3 + 5 + 5 = 22!

So you can check your tree by summing the
frequencies and comparing the total to your root
node. If the two values differ, a character has been
missed or a sum has gone wrong.
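
The root-node check can be tried in code. The sketch below does not follow the paper layout used in this worksheet; it assumes the usual priority-queue form of Huffman's algorithm (always join the two lowest-weight nodes), built on Python's standard `heapq` module. The function name `build_tree` is illustrative. The shape of the tree may differ from the hand-drawn one, but the root value should still equal the number of characters in the sentence.

```python
import heapq
from collections import Counter
from itertools import count

def build_tree(text):
    """Return (root weight, tree). A tree is either a single character
    (a leaf node) or a (left, right) pair of subtrees."""
    tie = count()  # tie-breaker so the heap never has to compare trees
    heap = [(freq, next(tie), char) for char, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # lowest weight
        w2, _, t2 = heapq.heappop(heap)  # next lowest weight
        heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
    weight, _, tree = heap[0]
    return weight, tree

sentence = "THE CAT SAT ON THE MAT"
root_weight, tree = build_tree(sentence)
print(root_weight, len(sentence))
```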

11. Start at the root node and put 0 next to each
left-hand branch and 1 next to each right-hand
branch:

12. Again, start from the root node, find the path to
each leaf node (each character), identifying the 0
and 1 branches you have to use to get to each
one.
Place these in the Code column of your table:

Character   Code
T           00
SPACE       01
A           100
H           101
E           1100
C           1101
S           11100
O           11101
N           11110
M           11111
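
Steps 11 and 12 can be sketched as a short recursive walk over the tree. The nested tuples below are an assumption: they are inferred from the Code column above (each pair is a left branch and a right branch), since the diagram itself is not reproduced here. The function name `assign_codes` is illustrative.

```python
# Tree inferred from the Code column: each pair is (left, right).
tree = (("T", " "), (("A", "H"), (("E", "C"), (("S", "O"), ("N", "M")))))

def assign_codes(node, prefix=""):
    """Walk the tree, adding 0 for a left branch and 1 for a right branch."""
    if isinstance(node, str):  # a leaf node holds a single character
        return {node: prefix}
    left, right = node
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

codes = assign_codes(tree)
for char, code in codes.items():
    print("SPACE" if char == " " else char, code)
```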

13. You now have the code for each character.
Simply replace each character with its code and
you have your Huffman Coded data:

T   H    E     SP  C     A    T   SP
00  101  1100  01  1101  100  00  01

S      A    T   SP  O      N      SP
11100  100  00  01  11101  11110  01

T   H    E     SP  M      A    T
00  101  1100  01  11111  100  00

The data is…

0010111000111011000001111001000001111011111001
001011100011111110000

…which adds up to 67 bits.
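
Step 13 can be checked with a few lines of Python. This is a minimal sketch that copies the code table from step 12 into a dictionary (the name `codes` is illustrative) and replaces each character with its code.

```python
# Code table copied from step 12 (" " is the SPACE character).
codes = {
    "T": "00", " ": "01", "A": "100", "H": "101", "E": "1100",
    "C": "1101", "S": "11100", "O": "11101", "N": "11110", "M": "11111",
}

sentence = "THE CAT SAT ON THE MAT"

# Replace each character with its code and join the results together.
encoded = "".join(codes[char] for char in sentence)

print(encoded)
print(len(encoded))  # 67
```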

Yet, why bother?

What does Huffman Coding do for us?

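Well, remember that computers use character sets such as
ASCII and Unicode to store characters in binary.

For the characters in our sentence, the UTF-8 encoding of
Unicode uses one 8-bit binary pattern per character (the
same values as ASCII), so our sentence would look like this: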

THE CAT SAT ON THE MAT

becomes

T         H         E         SP
01010100  01001000  01000101  00100000

C         A         T         SP
01000011  01000001  01010100  00100000

S         A         T         SP
01010011  01000001  01010100  00100000

O         N         SP
01001111  01001110  00100000

T         H         E         SP
01010100  01001000  01000101  00100000

M         A         T
01001101  01000001  01010100

Each character is represented by 8 bits.

We require 22 eight-bit codes to represent our sentence
in memory (or in storage). That is 22 × 8 = 176 binary
digits (bits).
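
The 8-bit patterns and the 176-bit total can be reproduced with Python's built-in `ord` and `format` functions. This is a small sketch; the variable names are illustrative.

```python
sentence = "THE CAT SAT ON THE MAT"

# ord() gives the character's code number; "08b" writes it as 8 binary digits.
patterns = [format(ord(char), "08b") for char in sentence]

print(patterns[0])  # 01010100 (the pattern for T)
total_bits = sum(len(pattern) for pattern in patterns)
print(total_bits)   # 176
```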

The size of our Huffman Coded data can be calculated by
multiplying the frequency by the number of bits for each
character, then adding up the totals:

Character   Frequency   No. of Bits   Total Bits
T           5           2             10
SPACE       5           2             10
A           3           3             9
H           2           3             6
E           2           4             8
C           1           4             4
S           1           5             5
O           1           5             5
N           1           5             5
M           1           5             5
TOTAL                                 67

So, our Huffman Coded sentence only uses 67 bits, whilst
our Unicode version uses 176 bits.

We have saved 109 bits; yet have not lost any of our
original sentence (we can always get our complete
sentence back)!
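
The saving can be worked out in code using the same frequency × bits calculation as the table. This sketch assumes the code table from step 12 and 8 bits per character for the uncompressed version; the variable names are illustrative.

```python
from collections import Counter

codes = {
    "T": "00", " ": "01", "A": "100", "H": "101", "E": "1100",
    "C": "1101", "S": "11100", "O": "11101", "N": "11110", "M": "11111",
}
sentence = "THE CAT SAT ON THE MAT"
frequencies = Counter(sentence)

# Frequency multiplied by code length, summed over every character.
huffman_bits = sum(freq * len(codes[char]) for char, freq in frequencies.items())
fixed_bits = len(sentence) * 8

print(huffman_bits)               # 67
print(fixed_bits)                 # 176
print(fixed_bits - huffman_bits)  # 109
```

In other words, the Huffman Coded sentence needs about 38% of the original number of bits.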

Huffman Coding allows us to reduce the size of the data
without losing any of the detail.

Huffman Coding is a form of lossless compression.

Why not use Huffman Coding instead of ASCII or Unicode?

If you think about this, the question itself does not make
sense.

The purpose of Huffman Coding is to reduce the file size,
but there is no relationship between the code and the
actual character (the lowercase letter ‘a’ will not always
have the same Huffman Code allocated to it, for
example).

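In fact, the Huffman Code we just worked out, and the
tree we created to work out the code, were built from
the character frequencies of one sentence:

THE CAT SAT ON THE MAT

Text containing any other characters could not be encoded
with this tree, and text with different frequencies may
not be compressed as well by it.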

To convert Huffman Coded data back into text you would
need to:

a) Read in the Huffman Code
b) Compare the value of each bit pattern to the
stored Huffman Tree data
c) When the bits match a Huffman Code, identify the
Unicode value for that character
d) Display the Unicode character
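
Steps a) to d) can be sketched in Python. This version is a simplification: instead of walking a stored tree, it assumes the code table from step 12 is available and looks up bit patterns in a reversed copy of it (the names `decode_table` and `decode` are illustrative). That works because no code in the table is the start of another code, so the first match is always the right one.

```python
codes = {
    "T": "00", " ": "01", "A": "100", "H": "101", "E": "1100",
    "C": "1101", "S": "11100", "O": "11101", "N": "11110", "M": "11111",
}

# Reverse the table so that a bit pattern leads back to its character.
decode_table = {code: char for char, code in codes.items()}

def decode(bits):
    """Read one bit at a time; output a character whenever the bits
    read so far match a code, then start again on the next code."""
    text = []
    current = ""
    for bit in bits:
        current += bit
        if current in decode_table:
            text.append(decode_table[current])
            current = ""
    return "".join(text)

sentence = "THE CAT SAT ON THE MAT"
encoded = "".join(codes[char] for char in sentence)
print(decode(encoded))
```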

You should be able to see that this takes time; more time
than simply taking the Unicode value and displaying the
corresponding character.

To display a Huffman Coded file you would first need to
decompress the data and then display the original data.

Huffman Coded files save storage space, but they slow
down the processing of the data.

EXAM ALERT

The exam might ask you to calculate how many bits some
text needs when it is stored in ASCII.

Remember that ASCII uses 7 bits, and not 8 bits per
character.

However, since all computers store data in bytes, and
when ASCII is transmitted it is always transmitted in
blocks of 8 bits, the exam will accept any calculation
worked out using 8 bits instead of 7.
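
As a worked example of the two possible answers, here is the calculation for the sentence used in this worksheet, sketched in Python. The variable names are illustrative.

```python
text = "THE CAT SAT ON THE MAT"

bits_7 = len(text) * 7  # 7 bits per character (pure ASCII)
bits_8 = len(text) * 8  # 8 bits per character (one byte each)

print(bits_7)  # 154
print(bits_8)  # 176
```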

Why warn you?

Only so that if you see an answer given with 7 bits instead
of 8 you will know why!