Introduction
Data compression can reduce the cost of dealing with data or enable things that would not otherwise be possible with available resources.
Compressed data is often useful only for storage and transmission and needs to be decompressed before doing anything else with it.
• Compression (encoding) is the process of transforming the
original (uncompressed) data into compressed data.
• Decompression (decoding) is the reverse process of recovering the original data from the compressed data.
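As a concrete illustration (a sketch using Python's standard zlib library; any lossless compressor would serve), compression and decompression are inverse transformations:

```python
import zlib

# Original (uncompressed) data: highly repetitive, so it compresses well.
original = b"abracadabra " * 100

# Compression (encoding): transform the original data into compressed data.
compressed = zlib.compress(original)

# Decompression (decoding): recover the original data.
restored = zlib.decompress(compressed)

assert restored == original              # lossless: an exact copy is recovered
assert len(compressed) < len(original)   # repetitive data shrinks
```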
There are two broad types of compression:
• Lossless compression preserves all information. The data produced by decompression is an exact copy of the original data.
• Lossy compression throws away some information. The data produced by decompression is similar to but not an exact copy of the original data.
– Very effective with image, video and audio compression.
– Removes unessential information such as noise and undetectable details.
– Throwing away more information improves compression but reduces the quality of the data.
Bit is the basic unit of information and of data or computer storage size.
The International Electrotechnical Commission (IEC) recommends decimal prefixes (k = 1000, M = 1000², . . . ) for multiples of 1000 and binary prefixes (Ki = 1024, Mi = 1024², . . . ) for multiples of 1024.
For example, 8 kbit means 8000 bits and 1 KiB means 8192 bits.
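The distinction is easy to check in code; a quick arithmetic sketch of the two prefix conventions:

```python
# Decimal (SI) prefix: k = 1000.
kbit = 1000                # one kilobit, in bits
print(8 * kbit)            # 8 kbit = 8000 bits

# Binary (IEC) prefix: Ki = 1024, so 1 KiB = 1024 bytes = 8192 bits.
KiB_in_bits = 1024 * 8     # one kibibyte, in bits
print(1 * KiB_in_bits)     # 1 KiB = 8192 bits
```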
1. Variable-length encoding
Definition 1.1: Let Σ be the source alphabet and Γ the code alphabet. A code is defined by an injective mapping
C : Σ → Γ∗
The code is extended to sequences over Σ by concatenation:
C : Σ∗ → Γ∗ : s1 s2 . . . sn ↦ C(s1) C(s2) . . . C(sn)
• We will usually assume that Γ is the binary
alphabet {0, 1}. Then the code is called a binary
code.
• Σ is an arbitrary set, possibly even an infinite
set such as the set of natural numbers.
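Definition 1.1 can be sketched in Python as a dictionary, with the extension to sequences given by concatenation (the symbol codes here are illustrative, not taken from the text):

```python
# An injective mapping C : Σ → Γ* as a dictionary, with Γ = {0, 1}.
C = {"a": "0", "b": "10", "c": "110"}

def encode(s):
    """Extend C to sequences over Σ by concatenating symbol codes."""
    return "".join(C[symbol] for symbol in s)

print(encode("abc"))   # "010110" = C(a) C(b) C(c)
print(encode("cab"))   # "110010" = C(c) C(a) C(b)
```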
Example 1.2: Morse code is a type of variable-length code.
Let us construct a binary code based on Morse code.
First attempt: Use zero for dot and one for dash.
A ↦ 01 B ↦ 1000 C ↦ 1010
D ↦ 100
...
However, when applied to sequences, we have a problem: the encoding is ambiguous. For example, 1000 could be C(B) or C(D)C(E), since E (a single dot in Morse code) maps to 0.
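The ambiguity is easy to demonstrate. In the sketch below, the codes for E and T follow from standard Morse (E is a single dot, T a single dash); extending the first attempt to these symbols is our addition:

```python
# First-attempt code: dot = 0, dash = 1.
C = {"A": "01", "B": "1000", "C": "1010", "D": "100", "E": "0", "T": "1"}

def encode(s):
    """Concatenate the symbol codes of a sequence."""
    return "".join(C[x] for x in s)

# Two different messages can produce the same bit sequence,
# so the code is not uniquely decodable:
assert encode("A") == encode("ET") == "01"
assert encode("B") == encode("DE") == "1000"
```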
The kind of code we need can be formalized as follows: a code C is a prefix code if no codeword is a proper prefix of another codeword. Every prefix code is uniquely decodable.
Proof. Exercise.
Let us return to the Morse code example and define a binary prefix code based on the Morse code:
• Encode dots and dashes with ones and the spaces with zeros.
In any encoded sequence, three consecutive zeros always mark the end of a symbol code, acting as a kind of delimiter and ensuring that the code is a prefix code.
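The slide does not pin down the exact element encoding, so the sketch below uses one instantiation consistent with the 000 delimiter (our assumption: dot ↦ 1, dash ↦ 11, a single 0 between elements, and 000 closing each codeword); the prefix property and the delimiter-based decoding can then be checked directly:

```python
# Morse patterns for a few symbols (standard Morse code).
MORSE = {"A": ".-", "B": "-...", "C": "-.-.", "D": "-.."}

def codeword(morse):
    """Encode one Morse symbol as bits ending in the 000 delimiter.
    Element encoding (dot -> 1, dash -> 11, 0 between elements) is
    our assumption; only the 000 delimiter comes from the text."""
    elements = ["1" if e == "." else "11" for e in morse]
    return "0".join(elements) + "000"

C = {s: codeword(m) for s, m in MORSE.items()}

# Prefix property: no codeword is a proper prefix of another.
words = list(C.values())
assert all(not w2.startswith(w1)
           for w1 in words for w2 in words if w1 != w2)

# Decoding is easy: 000 delimits symbols, so split the stream on it.
encoded = "".join(C[s] for s in "CAB")
decode = {v: k for k, v in C.items()}
symbols = [decode[w + "000"] for w in encoded.split("000") if w]
assert symbols == ["C", "A", "B"]
```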
The code on the previous slide is not ideal for compression because the encoded sequences tend to be quite long.
CAB ›→ 110101101001011001101010100
• In the original Morse code, the length of dashes and long spaces is a compromise between compression (for faster transmission) and avoiding errors.
• Some amount of redundancy is necessary if we want to be able to detect and correct errors. However, such error-correcting codes are not covered in this course.
Huffman coding
• Thus we can reduce the average code length of any code that violates this principle (a more probable symbol should never have a longer codeword than a less probable one).
An optimal code can be computed with the Huffman algorithm.
Example 1.7: Huffman algorithm
Source distribution:
P({e}) = 0.3, P({a}) = 0.2, P({o}) = 0.2, P({i}) = 0.1, P({u}) = 0.1, P({y}) = 0.1
At each step, merge the two sets with the smallest probabilities:
P({u, y}) = 0.1 + 0.1 = 0.2
P({i, u, y}) = 0.1 + 0.2 = 0.3
P({a, o}) = 0.2 + 0.2 = 0.4
P({e, i, u, y}) = 0.3 + 0.3 = 0.6
P({e, a, o, i, u, y}) = 0.6 + 0.4 = 1.0
Huffman code:
C(e) = 00, C(a) = 10, C(o) = 11, C(i) = 011, C(u) = 0100, C(y) = 0101
Huffman tree: the root (1.0) splits into {e, i, u, y} (0.6, bit 0) and {a, o} (0.4, bit 1); {e, i, u, y} splits into e (0.3) and {i, u, y} (0.3); {i, u, y} splits into {u, y} (0.2) and i (0.1); {u, y} splits into u (0.1) and y (0.1); {a, o} splits into a (0.2) and o (0.2).
• The collection of sets T can be implemented using a priority queue so that the minimum elements are found quickly. The total number of operations on T is O(σ) and each operation needs O(log σ) time. Thus the time complexity is O(σ log σ).
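The priority-queue implementation can be sketched with Python's heapq module (names are ours). On the distribution of Example 1.7 it achieves the same expected code length, 2.5 bits per symbol, although the exact bits assigned may differ because ties can be broken differently:

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a {symbol: probability} dict."""
    # Priority queue of (probability, tiebreak, tree); a tree is either
    # a symbol or a pair (left, right). The unique tiebreak integer
    # keeps heap comparisons away from the trees themselves.
    heap = [(p, i, s) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Remove the two sets with the smallest probabilities...
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        # ...and add a new set representing their combination.
        heapq.heappush(heap, (p1 + p2, count, (t1, t2)))
        count += 1
    # Read codewords off the tree: left edge = 0, right edge = 1.
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

probs = {"e": 0.3, "a": 0.2, "o": 0.2, "i": 0.1, "u": 0.1, "y": 0.1}
code = huffman_code(probs)
# Expected code length matches Example 1.7: 2.5 bits per symbol.
assert abs(sum(p * len(code[s]) for s, p in probs.items()) - 2.5) < 1e-9
```

Since each symbol enters the heap once and each merge pushes one new entry, there are O(σ) heap operations of O(log σ) each, matching the stated O(σ log σ) bound.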
Let us prove the optimality of the Huffman code.
Lemma 1.8: Let Σ = {s1, s2, . . . , sσ−1, sσ} be an alphabet with a probability distribution P satisfying P(s1) ≥ P(s2) ≥ · · · ≥ P(sσ−1) ≥ P(sσ) > 0. Then there exists an optimal binary prefix code C satisfying:
• No codeword C(si) is longer than C(sσ).
• C(sσ−1) is the same as C(sσ) except that the last bit is different.
Proof. Each iteration of the main loop of the Huffman algorithm performs an operation that is essentially identical to the operation in Lemma 1.9: remove the two symbols/sets with the lowest probabilities and add a new symbol/set representing the combination of the removed symbols/sets. Lemma 1.9 shows that each iteration computes the optimal code provided that the following iterations compute the optimal code. Since the code in the last iteration is trivially optimal, the algorithm computes the optimal code. □