
Algorithms, Fourth Edition
ROBERT SEDGEWICK | KEVIN WAYNE

5.5 DATA COMPRESSION

‣ introduction
‣ run-length coding
‣ Huffman compression
‣ LZW compression

http://algs4.cs.princeton.edu

Last updated on Dec 9, 2015, 4:43 PM


Data compression

Compression reduces the size of a file:


・To save space when storing it.
・To save time when transmitting it.
・Most files have lots of redundancy.
Who needs compression?
・Moore's law: # transistors on a chip doubles every 18–24 months.
・Parkinson's law: data expands to fill space available.
・Text, images, sound, video, …
“ Every day, we create 2.5 quintillion bytes of data—so much that
90% of the data in the world today has been created in the last
two years alone. ” — IBM report on big data (2011)

Basic concepts ancient (1950s), best technology recently developed.


Applications

Generic file compression.


・Files: GZIP, BZIP, 7z.
・Archivers: PKZIP.
・File systems: NTFS, ZFS, HFS+, ReFS, GFS.
Multimedia.
・Images: GIF, JPEG.
・Sound: MP3.
・Video: MPEG, DivX™, HDTV.
Communication.
・ITU-T T4 Group 3 Fax.
・V.42bis modem.
・Skype, Google hangout.
Databases. Google, Facebook, NSA, ....
Compression and expansion

Message. Bitstream B we want to compress.


Compress. Generates a "compressed" representation C(B), which uses fewer bits (you hope).
Expand. Reconstructs a bitstream B′.

Basic model for data compression:
bitstream B (0110110101...) → Compress → compressed version C(B) (1101011111...) → Expand → reconstructed bitstream B′ (0110110101...)
Lossy vs. Lossless

Lossless compression. Reconstructed bitstream equals original B.


・Suitable for text, generic data files, etc.
Compression ratio. Bits in C (B) / bits in B.
Ex. 50–75% or better compression ratio for natural language.


Lossy compression. [not covered in this course]


・Suitable for images, video, audio: JPEG, MPEG, MP3, …
・Greater compression by throwing away “imperceptible” parts of signal
・FFT/DCT, wavelets, fractals, …

Rdenudcany in Enlgsih lnagugae

Q. How much redundancy in the English language?


A. Quite a bit.

“ ... randomising letters in the middle of words [has] little or no


effect on the ability of skilled readers to understand the text. This
is easy to denmtrasote. In a pubiltacion of New Scnieitst you
could ramdinose all the letetrs, keipeng the first two and last two
the same, and reibadailty would hadrly be aftcfeed. My ansaylis
did not come to much beucase the thoery at the time was for
shape and senqeuce retigcionon. Saberi's work sugsegts we may
have some pofrweul palrlael prsooscers at work. The resaon for
this is suerly that idnetiyfing coentnt by paarllel prseocsing
speeds up regnicoiton. We only need the first and last two letetrs
to spot chganes in meniang. ” — Graham Rawlinson

The gaol of data cmperisoson is to inetdify rdenudcany and epxloit it.


Food for thought

Data compression has been omnipresent since antiquity:
・Number systems.
・Natural languages.
・Mathematical notation. $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$

has played a central role in communications technology,
・Grade 2 Braille. [Figure: the cells b r a i l l e, which also read as "but rather a I like like every"]
・Morse code.
・Telephone system.

and is part of modern life.
・JPEG.
・MP3.
・MPEG.

Q. What role will it play in the future?
Universal data compression

ZeoSync. Announced 100:1 lossless compression of random data using Zero Space Tuner™ and BinaryAccelerator™ technology.

ZeoSync corporation folds after issuing $40 million in private stock.
Universal data compression

Proposition. No algorithm can compress every bitstring.

Pf 1. [by contradiction]
・Suppose you have a universal data compression algorithm U that can compress every bitstream.
・Given bitstring $B_0$, compress it to get a smaller bitstring $B_1$.
・Compress $B_1$ to get a smaller bitstring $B_2$.
・Continue until reaching bitstring of size 0.
・Implication: all bitstrings can be compressed to 0 bits!

Pf 2. [by counting]
・Suppose your algorithm can compress all 1,000-bit strings.
・$2^{1000}$ possible bitstrings with 1,000 bits.
・Only $1 + 2 + 4 + \dots + 2^{998} + 2^{999} = 2^{1000} - 1$ can be encoded with ≤ 999 bits, so two different inputs would have to share an encoding.
・Similarly, only 1 in $2^{499}$ bitstrings can be encoded with ≤ 500 bits!
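The counting argument can be checked numerically. The following is a small sketch, not part of the original slides; the class and method names are made up for illustration, and it uses only `java.math.BigInteger`.

```java
import java.math.BigInteger;

public class CountingArgument {
    // Number of distinct bitstrings of length exactly n: 2^n.
    public static BigInteger ofLength(int n) {
        return BigInteger.ONE.shiftLeft(n);
    }

    // Number of distinct bitstrings of length at most n:
    // 1 + 2 + 4 + ... + 2^n = 2^(n+1) - 1.
    public static BigInteger ofLengthAtMost(int n) {
        return BigInteger.ONE.shiftLeft(n + 1).subtract(BigInteger.ONE);
    }

    public static void main(String[] args) {
        BigInteger inputs  = ofLength(1000);        // all 1,000-bit strings
        BigInteger shorter = ofLengthAtMost(999);   // all possible shorter encodings

        // There is one fewer encoding than inputs, so no lossless scheme
        // can shrink every 1,000-bit string.
        System.out.println(inputs.subtract(shorter));   // prints 1

        // Roughly 1 in 2^499 inputs can get an encoding of at most 500 bits.
        BigInteger ratio = inputs.divide(ofLengthAtMost(500));
        System.out.println(ratio.bitLength() - 1);      // prints 499
    }
}
```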
Data representation: genomic code

Genome. String over the alphabet { A, T, C, G }.

Goal. Encode an N-character genome: A T A G A T G C A T A G . . .

Standard ASCII encoding.
・8 bits per char.
・8 N bits.

Two-bit encoding.
・2 bits per char.
・2 N bits (25% compression ratio).

char   hex   binary (ASCII)   binary (two-bit)
'A'    41    01000001         00
'T'    54    01010100         01
'C'    43    01000011         10
'G'    47    01000111         11

Fixed-length code. k-bit code supports alphabet of size $2^k$.
Reading and writing binary data

[...] 8-bit bytestreams, so we might decide to read and write bytestreams to match I/O formats with the internal representations of primitive types, encoding an 8-bit char with 1 byte, a 16-bit short with 2 bytes, a 32-bit int with 4 bytes, and so forth. Since bitstreams are the primary abstraction for data compression, we go a bit further to allow clients to read and write individual bits, intermixed with data of various types (primitive types and String). The goal is to minimize the necessity for type conversion in client programs and also to take care of operating-system conventions for representing data. We use the following API for reading a bitstream from standard input:

Binary standard input. Read bits from standard input.

public class BinaryStdIn
    boolean readBoolean()       read 1 bit of data and return as a boolean value
    char    readChar()          read 8 bits of data and return as a char value
    char    readChar(int r)     read r bits of data and return as a char value
    [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)]
    boolean isEmpty()           is the bitstream empty?
    void    close()             close the bitstream

API for static methods that read from a bitstream on standard input

A key feature of the abstraction is that, in marked contrast to StdIn, the data on standard input is not necessarily aligned on byte boundaries. If the input stream is a single byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close() method is not essential, but, for clean termination, clients should call close() to indicate that no more bits are to be read. As with StdIn/StdOut, we use the following complementary API for writing bitstreams to standard output:

Binary standard output. Write bits to standard output.

public class BinaryStdOut
    void write(boolean b)       write the specified bit
    void write(char c)          write the specified 8-bit char
    void write(char c, int r)   write the r least significant bits of the specified char
    [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)]
    void close()                close the bitstream

API for static methods that write to a bitstream on standard output
Writing binary data

Date representation. Four different ways to represent 12/31/1999.

A character stream (StdOut)
    StdOut.print(month + "/" + day + "/" + year);
    00110001 00110010 00101111 00110011 00110001 00101111 00110001 00111001 00111001 00111001
    1        2        /        3        1        /        1        9        9        9
    80 bits

Three ints (BinaryStdOut)
    BinaryStdOut.write(month);
    BinaryStdOut.write(day);
    BinaryStdOut.write(year);
    12:    00000000 00000000 00000000 00001100
    31:    00000000 00000000 00000000 00011111
    1999:  00000000 00000000 00000111 11001111
    96 bits

Two chars and a short (BinaryStdOut)
    BinaryStdOut.write((char) month);
    BinaryStdOut.write((char) day);
    BinaryStdOut.write((short) year);
    12:    00001100
    31:    00011111
    1999:  00000111 11001111
    32 bits

A 4-bit field, a 5-bit field, and a 12-bit field (BinaryStdOut)
    BinaryStdOut.write(month, 4);
    BinaryStdOut.write(day, 5);
    BinaryStdOut.write(year, 12);
    12:    1100
    31:    11111
    1999:  011111001111
    21 bits (+ 3 bits for byte alignment at close): 110011111011111001111000

Four ways to put a date onto standard output
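To make the bit-field variant concrete, here is a minimal self-contained sketch that produces the same bits as a string of 0 and 1 characters. It does not use BinaryStdOut; the class `DateBits` and its methods are hypothetical names chosen for this illustration, and the zero padding mimics the byte alignment that happens at close.

```java
public class DateBits {
    // Append the r least significant bits of value, most significant bit first.
    public static void writeBits(StringBuilder out, int value, int r) {
        for (int i = r - 1; i >= 0; i--)
            out.append(((value >>> i) & 1) == 1 ? '1' : '0');
    }

    // Encode a date as a 4-bit month, a 5-bit day, and a 12-bit year,
    // then pad with 0s up to the next byte boundary.
    public static String encode(int month, int day, int year) {
        StringBuilder out = new StringBuilder();
        writeBits(out, month, 4);
        writeBits(out, day, 5);
        writeBits(out, year, 12);
        while (out.length() % 8 != 0) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(12, 31, 1999));   // 110011111011111001111000
    }
}
```

The field widths are only large enough for months up to 15, days up to 31, and years up to 4095; wider values would be silently truncated in this sketch.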
Genomic code

public class Genome {

    // Alphabet data type converts between symbols { A, C, T, G } and integers 0-3.

    // Read genomic string from stdin; write to stdout using 2-bit code.
    public static void compress() {
        Alphabet DNA = new Alphabet("ACTG");
        String s = BinaryStdIn.readString();
        int N = s.length();
        BinaryStdOut.write(N);
        for (int i = 0; i < N; i++) {
            int d = DNA.toIndex(s.charAt(i));
            BinaryStdOut.write(d, 2);
        }
        BinaryStdOut.close();
    }

    // Read 2-bit code from stdin; write genomic string to stdout.
    public static void expand() {
        Alphabet DNA = new Alphabet("ACTG");
        int N = BinaryStdIn.readInt();
        for (int i = 0; i < N; i++) {
            char c = BinaryStdIn.readChar(2);
            BinaryStdOut.write(DNA.toChar(c));
        }
        BinaryStdOut.close();
    }
}
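The same two-bit idea can be tried without the BinaryStdIn/BinaryStdOut and Alphabet classes. The sketch below is an illustration under stated assumptions: it packs four bases per byte into a `byte[]`, and the class name `GenomePack` and the in-memory interface are hypothetical, not the slides' Genome class.

```java
public class GenomePack {
    // Symbol order gives the 2-bit index of each base (0 to 3).
    private static final String ALPHABET = "ACTG";

    // Pack each base into 2 bits, 4 bases per byte; the last byte is zero-padded.
    public static byte[] compress(String s) {
        byte[] out = new byte[(s.length() + 3) / 4];
        for (int i = 0; i < s.length(); i++) {
            int d = ALPHABET.indexOf(s.charAt(i));
            if (d < 0) throw new IllegalArgumentException("not a base: " + s.charAt(i));
            out[i / 4] |= d << (6 - 2 * (i % 4));
        }
        return out;
    }

    // Unpack n bases. The count n must be kept separately because of the padding,
    // which is why the slides' compress() writes N before the 2-bit codes.
    public static String expand(byte[] packed, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int d = (packed[i / 4] >> (6 - 2 * (i % 4))) & 3;
            sb.append(ALPHABET.charAt(d));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String genome = "ATAGATGCATAG";
        byte[] packed = compress(genome);
        System.out.println(packed.length);                     // 3 bytes for 12 bases
        System.out.println(expand(packed, genome.length()));   // ATAGATGCATAG
    }
}
```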
Binary dumps

Q. How to examine the contents of a bitstream?

[...] standard input, encoded with the characters 0 and 1. This program is useful for debugging when working with small inputs. We use a slightly more complicated version that just prints the count when the width argument is 0 (see Exercise 5.5.X). The similar client HexDump groups the data into 8-bit bytes and prints each as two hexadecimal digits that each represent 4 bits. The client PictureDump displays the bits in a Picture. You can download HexDump and PictureDump from the booksite. Typically, we use piping and redirection at the command-line level when working with binary files: we can pipe the output of an encoder to BinaryDump, HexDump, or PictureDump, or redirect it to a file.

Standard character stream
    % more abra.txt
    ABRACADABRA!

Bitstream represented as 0 and 1 characters
    % java BinaryDump 16 < abra.txt
    0100000101000010
    0101001001000001
    0100001101000001
    0100010001000001
    0100001001010010
    0100000100100001
    96 bits

Bitstream represented with hex digits
    % java HexDump 4 < abra.txt
    41 42 52 41
    43 41 44 41
    42 52 41 21
    12 bytes

Bitstream represented as pixels in a Picture
    % java PictureDump 16 6 < abra.txt
    [Figure: 16-by-6 pixel window, magnified]
    96 bits

Four ways to look at a bitstream

ASCII encoding. When you HexDump a bitstream that contains ASCII-encoded characters, the table below is useful for reference. Given a 2-digit hex number, use the first hex digit as a row index and the second hex digit as a column reference to find the character that it encodes. For example, 31 encodes the digit 1, 4A encodes the letter J, and so forth. This table is for 7-bit ASCII, so the first hex digit must be 7 or less. Hex numbers starting with 0 and 1 (and the numbers 20 and 7F) correspond to non-printing control characters.

     0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

Hexadecimal to ASCII conversion table
Run-length encoding

Simple type of redundancy in a bitstream. Long runs of repeated bits.


0000000000000001111111000000011111111111
40 bits

Representation. 4-bit counts to represent alternating runs of 0s and 1s:


15 0s, then 7 1s, then 7 0s, then 11 1s.

1111011101111011 16 bits (instead of 40)


15 7 7 11

Q. How many bits to store the counts?


A. We typically use 8 (but 4 in the example above for brevity).

Q. What to do when run length exceeds max count?


A. Intersperse runs of length 0.

Applications. JPEG, ITU-T T4 Group 3 Fax, ...
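The representation described above (alternating run counts, starting with 0s, with runs of length 0 interspersed when a run exceeds the maximum count) can be sketched as follows. This is an illustration on strings of 0 and 1 characters, not the textbook's compress(); the class name `RunLengthSketch` and the `maxCount` parameter are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class RunLengthSketch {
    // Counts of alternating runs, starting with a run of 0s.
    // A run longer than maxCount is split by interspersing runs of length 0.
    public static List<Integer> encode(String bits, int maxCount) {
        List<Integer> counts = new ArrayList<>();
        char current = '0';
        int run = 0;
        for (int i = 0; i < bits.length(); i++) {
            char b = bits.charAt(i);
            if (b != current) {
                counts.add(run);       // the current run ends here
                run = 0;
                current = b;
            } else if (run == maxCount) {
                counts.add(run);       // emit a full run...
                counts.add(0);         // ...then a 0-length run of the other bit
                run = 0;
            }
            run++;
        }
        counts.add(run);
        return counts;
    }

    // Inverse of encode(): expand alternating counts back into bits.
    public static String decode(List<Integer> counts) {
        StringBuilder sb = new StringBuilder();
        char bit = '0';
        for (int run : counts) {
            for (int i = 0; i < run; i++) sb.append(bit);
            bit = (bit == '0') ? '1' : '0';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = "0000000000000001111111000000011111111111";
        System.out.println(encode(bits, 15));   // [15, 7, 7, 11]
    }
}
```

With 4-bit counts, `maxCount` is 15; with the 8-bit counts used in practice it would be 255.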


Run-length encoding: Java implementation

public class RunLength
{
    private final static int R   = 256;   // maximum run-length count
    private final static int lgR = 8;     // number of bits per count

    public static void compress()
    { /* see textbook */ }

    public static void expand()
    {
        boolean bit = false;
        while (!BinaryStdIn.isEmpty())
        {
            int run = BinaryStdIn.readInt(lgR);   // read 8-bit count from standard input
            for (int i = 0; i < run; i++)
                BinaryStdOut.write(bit);          // write 1 bit to standard output
            bit = !bit;
        }
        BinaryStdOut.close();                     // pad 0s for byte alignment
    }
}
Data compression: quiz 1

What is the best compression ratio achievable from run-length coding when using 8-bit counts?

A. 1 / 256

B. 1 / 16

C. 8 / 255

D. 24 / 510 = 4 / 85

E. I don't know.


David Huffman
Variable-length codes

Use different number of bits to encode different chars.

Ex. Morse code: • • • − − − • • •

Issue. Ambiguity.
・SOS ?
・V7 ?
・IAMIE ?
・EEWNI ?

Note. Codeword for S is a prefix of codeword for V.

In practice. Use a medium gap to separate codewords.
Variable-length codes

Q. How do we avoid ambiguity?
A. Ensure that no codeword is a prefix of another.

Ex 1. Fixed-length code.
Ex 2. Append special stop character to each codeword.
Ex 3. General prefix-free code.

Two prefix-free codes

key   value (code 1)   value (code 2)
!     101              101
A     0                11
B     1111             00
C     110              010
D     100              100
R     1110             011

Compressed bitstring for A B R A C A D A B R A !
code 1:  011111110011001000111111100101   30 bits
code 2:  11000111101011100110001111101    29 bits

[Figure: trie representation of each code]
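Whether a given codeword table satisfies the prefix-free condition can be checked mechanically. The following sketch is an illustration only; the class and method names are made up here. It relies on the fact that, after lexicographic sorting, a codeword that is a prefix of any other codeword is also a prefix of its immediate successor.

```java
import java.util.Arrays;

public class PrefixFreeCheck {
    // True if no codeword is a prefix of another (duplicates count as prefixes).
    public static boolean isPrefixFree(String... codewords) {
        String[] sorted = codewords.clone();
        Arrays.sort(sorted);
        for (int i = 0; i + 1 < sorted.length; i++)
            if (sorted[i + 1].startsWith(sorted[i])) return false;
        return true;
    }

    public static void main(String[] args) {
        // First codeword table from the slides: ! A B C D R
        System.out.println(isPrefixFree("101", "0", "1111", "110", "100", "1110"));   // true
        // Morse code: S (...) is a prefix of V (...-)
        System.out.println(isPrefixFree("...", "...-"));                              // false
    }
}
```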
Prefix-free codes: trie representation

Q. How to represent the prefix-free code?
A. A binary trie!
・Characters in leaves.
・Codeword is path from root to leaf.

[Figure: codeword tables, trie representations, and compressed bitstrings for two prefix-free codes]
Prefix-free codes: expansion

Expansion.
・Start at root.
・Go left if bit is 0; go right if 1.
・If leaf node, write character; return to root node; repeat.

[Figure: codeword tables, trie representations, and compressed bitstrings for two prefix-free codes]
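The expansion steps above can be sketched in a few lines. This is an illustrative, self-contained version that works on strings of 0 and 1 characters rather than a real bitstream; the class `PrefixFreeCodec`, its helper `slideCode()`, and the in-memory interface are assumptions for this example. It also includes the symbol-table approach to compression so the round trip can be checked, and it assumes the input bits are a valid encoding.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixFreeCodec {
    private static class Node {
        char ch;             // meaningful only in leaves
        Node left, right;    // left = bit 0, right = bit 1
        boolean isLeaf() { return left == null && right == null; }
    }

    // The first codeword table from the slides, hard-coded for illustration.
    public static Map<Character, String> slideCode() {
        Map<Character, String> code = new HashMap<>();
        code.put('!', "101");  code.put('A', "0");    code.put('B', "1111");
        code.put('C', "110");  code.put('D', "100");  code.put('R', "1110");
        return code;
    }

    // Build a binary trie whose root-to-leaf paths are the codewords.
    private static Node buildTrie(Map<Character, String> code) {
        Node root = new Node();
        for (Map.Entry<Character, String> e : code.entrySet()) {
            Node x = root;
            for (char bit : e.getValue().toCharArray()) {
                if (bit == '0') { if (x.left == null)  x.left = new Node();  x = x.left; }
                else            { if (x.right == null) x.right = new Node(); x = x.right; }
            }
            x.ch = e.getKey();
        }
        return root;
    }

    // Compression with a symbol table: look up and concatenate codewords.
    public static String compress(Map<Character, String> code, String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) bits.append(code.get(c));
        return bits.toString();
    }

    // Expansion: walk the trie bit by bit; at a leaf, emit and restart at the root.
    public static String expand(Map<Character, String> code, String bits) {
        Node root = buildTrie(code);
        StringBuilder out = new StringBuilder();
        Node x = root;
        for (int i = 0; i < bits.length(); i++) {
            x = (bits.charAt(i) == '0') ? x.left : x.right;
            if (x.isLeaf()) { out.append(x.ch); x = root; }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = "011111110011001000111111100101";
        System.out.println(expand(slideCode(), bits));   // ABRACADABRA!
    }
}
```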
Prefix-free codes: compression

Compression.
・Method 1: start at leaf; follow path up to the root; print bits in reverse.
・Method 2: create ST of key-value pairs.

[Figure: codeword tables, trie representations, and compressed bitstrings for two prefix-free codes]
Data compression: quiz 2

Consider the following trie representation of a prefix-free code.
Expand the compressed bitstring 1 0 0 1 0 1 0 0 0 1 1 1 0 1 1 ?

A. PEED
B. PESDEY
C. SPED
D. SPEEDY
E. I don't know.

[Figure: binary trie with leaf E at depth 1, leaf D at depth 2, leaf S at depth 3, and leaves P and Y at depth 4]
Huffman coding overview

Static model. Use the same prefix-free code for all messages.
Dynamic model. Use a custom prefix-free code for each message.

Compression.
・Read message.
・Build best prefix-free code for message. How? [ahead]
・Write prefix-free code (as a trie).
・Compress message using prefix-free code.
Expansion.
・Read prefix-free code (as a trie) from file.
・Read compressed message and expand using trie.

Huffman trie node data type

private static class Node implements Comparable<Node>
{
    private final char ch;     // used only for leaf nodes
    private final int freq;    // used only by compress()
    private final Node left, right;

    // initializing constructor
    public Node(char ch, int freq, Node left, Node right)
    {
        this.ch    = ch;
        this.freq  = freq;
        this.left  = left;
        this.right = right;
    }

    // is Node a leaf?
    public boolean isLeaf()
    { return left == null && right == null; }

    // compare Nodes by frequency (stay tuned)
    public int compareTo(Node that)
    { return this.freq - that.freq; }
}
Prefix-free codes: expansion

public void expand()
{
    Node root = readTrie();              // read in encoding trie
    int N = BinaryStdIn.readInt();       // read in number of chars

    for (int i = 0; i < N; i++)
    {
        Node x = root;
        while (!x.isLeaf())              // expand codeword for ith char
        {
            if (!BinaryStdIn.readBoolean())
                x = x.left;
            else
                x = x.right;
        }
        BinaryStdOut.write(x.ch, 8);
    }
    BinaryStdOut.close();
}

Running time. Linear in input size N.
Prefix-free codes: how to transmit

Q. How to write the trie?
A. Write preorder traversal of trie; mark leaf and internal nodes with a bit.

private static void writeTrie(Node x)
{
    if (x.isLeaf())
    {
        BinaryStdOut.write(true);
        BinaryStdOut.write(x.ch, 8);
        return;
    }
    BinaryStdOut.write(false);
    writeTrie(x.left);
    writeTrie(x.right);
}

[Figure: Using preorder traversal to encode a trie as a bitstream. Internal nodes are numbered 1 to 5 in preorder; the leaves in preorder are A D ! C R B.]
01010000010010100010001000010101010000110101010010101000010

Note. If message is long, overhead of transmitting trie is small.
Prefix-free codes: how to transmit

Q. How to read in the trie?
A. Reconstruct from preorder traversal of trie.

private static Node readTrie()
{
    if (BinaryStdIn.readBoolean())
    {
        char c = BinaryStdIn.readChar(8);
        return new Node(c, 0, null, null);
    }
    Node x = readTrie();
    Node y = readTrie();
    return new Node('\0', 0, x, y);   // '\0' is an arbitrary value (not used with internal nodes)
}

[Figure: Using preorder traversal to encode a trie as a bitstream. Internal nodes are numbered 1 to 5 in preorder; the leaves in preorder are A D ! C R B.]
01010000010010100010001000010101010000110101010010101000010
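The preorder scheme can be exercised end to end without BinaryStdIn/BinaryStdOut. The sketch below is a hypothetical in-memory variant: it writes the trie to a string of 0 and 1 characters and reads it back. The class `TrieSerializer`, its `Node` type (without the frequency field), and the `example()` trie are assumptions made for this illustration.

```java
public class TrieSerializer {
    public static class Node {
        public final char ch;
        public final Node left, right;
        public Node(char ch, Node left, Node right) {
            this.ch = ch; this.left = left; this.right = right;
        }
        public boolean isLeaf() { return left == null && right == null; }
    }

    private static Node leaf(char c) { return new Node(c, null, null); }

    // Trie for the code A=0, D=100, !=101, C=110, R=1110, B=1111.
    public static Node example() {
        Node n4 = new Node('\0', leaf('!'), leaf('C'));
        Node n3 = new Node('\0', leaf('D'), n4);
        Node n5 = new Node('\0', leaf('R'), leaf('B'));
        Node n2 = new Node('\0', n3, n5);
        return new Node('\0', leaf('A'), n2);
    }

    // Preorder: '1' followed by the 8-bit char for a leaf, '0' for an internal node.
    public static void write(Node x, StringBuilder out) {
        if (x.isLeaf()) {
            out.append('1');
            for (int i = 7; i >= 0; i--) out.append((x.ch >> i) & 1);
            return;
        }
        out.append('0');
        write(x.left, out);
        write(x.right, out);
    }

    // Inverse of write(); pos[0] holds the index of the next unread bit.
    public static Node read(String bits, int[] pos) {
        if (bits.charAt(pos[0]++) == '1') {
            int c = 0;
            for (int i = 0; i < 8; i++) c = (c << 1) | (bits.charAt(pos[0]++) - '0');
            return new Node((char) c, null, null);
        }
        Node left  = read(bits, pos);
        Node right = read(bits, pos);
        return new Node('\0', left, right);
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        write(example(), out);
        System.out.println(out.length());   // 59 = 5 internal nodes + 6 leaves * 9 bits
    }
}
```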
Huffman codes

Q. How to find best prefix-free code?

Huffman algorithm:
・Count frequency freq[i] for each char i in input.
・Start with one node corresponding to each char i (with weight freq[i]).
・Repeat until single trie formed:
– select two tries with min weight freq[i] and freq[j]
– merge into single trie with weight freq[i] + freq[j]

Applications:

Constructing a Huffman encoding trie: Java implementation

private static Node buildTrie(int[] freq)
{
    // initialize PQ with singleton tries
    MinPQ<Node> pq = new MinPQ<Node>();
    for (char i = 0; i < R; i++)
        if (freq[i] > 0)
            pq.insert(new Node(i, freq[i], null, null));

    // merge two smallest tries
    while (pq.size() > 1)
    {
        Node x = pq.delMin();
        Node y = pq.delMin();
        // '\0': not used for internal nodes; x.freq + y.freq: total frequency; x, y: two subtries
        Node parent = new Node('\0', x.freq + y.freq, x, y);
        pq.insert(parent);
    }

    return pq.delMin();
}
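For experimenting outside the algs4 library, the construction can be sketched with `java.util.PriorityQueue` in place of MinPQ. This is an illustrative variant, not the textbook's Huffman class: the class `HuffmanSketch`, the `buildCode` and `encodedLength` helpers, and the use of a `Map` for the code table are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanSketch {
    private static class Node implements Comparable<Node> {
        final char ch; final int freq; final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node that) { return Integer.compare(this.freq, that.freq); }
    }

    // Build a Huffman code (character -> bitstring) for the given text.
    public static Map<Character, String> buildCode(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        // one singleton trie per character that occurs
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));

        // repeatedly merge the two tries of minimum weight
        while (pq.size() > 1) {
            Node x = pq.poll();
            Node y = pq.poll();
            pq.add(new Node('\0', x.freq + y.freq, x, y));
        }

        Map<Character, String> code = new HashMap<>();
        if (!pq.isEmpty()) collect(pq.poll(), "", code);
        return code;
    }

    // Record the root-to-leaf path of every leaf as its codeword.
    private static void collect(Node x, String path, Map<Character, String> code) {
        if (x.isLeaf()) { code.put(x.ch, path); return; }
        collect(x.left,  path + '0', code);
        collect(x.right, path + '1', code);
    }

    // Number of bits needed to encode text with its own Huffman code (trie not included).
    public static int encodedLength(String text) {
        Map<Character, String> code = buildCode(text);
        int bits = 0;
        for (char c : text.toCharArray()) bits += code.get(c).length();
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("ABRACADABRA!"));   // 28
    }
}
```

The individual codewords depend on how ties are broken in the priority queue, but the total encoded length does not. A text with a single distinct character gets the empty codeword in this sketch, an edge case that a real implementation would need to handle separately.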
Huffman compression summary

Proposition. Huffman's algorithm produces an optimal prefix-free code (no prefix-free code uses fewer bits).
Pf. See textbook.

Two-pass implementation (for compression).
・Pass 1: tabulate character frequencies; build trie.
・Pass 2: encode file by traversing trie (or symbol table).

Running time (for compression). Using a binary heap ⇒ N + R log R, where N is the input size and R is the alphabet size.
Running time (for expansion). Using a binary trie ⇒ N.

Q. Can we do better? [stay tuned]

Abraham Lempel Jacob Ziv


Statistical methods

Static model. Same model for all texts.


・Fast.
・Not optimal: different texts have different statistical properties.
・Ex: ASCII, Morse code.
Dynamic model. Generate model based on text.
・Preliminary pass needed to generate model.
・Must transmit the model.
・Ex: Huffman code.
Adaptive model. Progressively learn and update model as you read text.
・More accurate modeling produces better compression.
・Decoding must start from beginning.
・Ex: LZW.

LZW compression demo

input     A   B   R   A   C   A   D   A B   R A   B R   A B R   A
matches   A   B   R   A   C   A   D   AB    RA    BR    ABR     A
value     41  42  52  41  43  41  44  81    83    82    88      41   80

LZW compression for A B R A C A D A B R A B R A B R A

codeword table
key    value
⋮      ⋮
A      41
B      42
C      43
D      44
⋮      ⋮
AB     81
BR     82
RA     83
AC     84
CA     85
AD     86
DA     87
ABR    88
RAB    89
BRA    8A
ABRA   8B
Lempel-Ziv-Welch compression

LZW compression.
・Create ST associating W-bit codewords with string keys.
・Initialize ST with codewords for single-character keys.
・Find longest string s in ST that is a prefix of unscanned part of input (longest prefix match).
・Write the W-bit codeword associated with s.
・Add s + c to ST, where c is next character in the input.

Q. How to represent LZW compression code table?
A. A trie to support longest prefix match.

[Figure: trie for the code table. First level: A 41, B 42, C 43, D 44, R 52. Second level: AB 81, AC 84, AD 86, BR 82, CA 85, DA 87, RA 83. Third level: ABR 88, BRA 8A, RAB 89. Fourth level: ABRA 8B.]
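The steps above can be sketched with a `HashMap` standing in for the trie. This is an illustration under assumptions taken from the demo rather than a definitive implementation: a 7-bit ASCII alphabet (R = 128), 8-bit codewords (L = 256), codeword hex 80 reserved as the end-of-message marker, and output as a list of integers instead of a bitstream. The class name `LZWCompressSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LZWCompressSketch {
    private static final int R = 128;   // alphabet size; codeword R (hex 80) marks end of message
    private static final int L = 256;   // number of codewords = 2^W with W = 8

    // Assumes every character of input is 7-bit ASCII.
    public static List<Integer> compress(String input) {
        Map<String, Integer> st = new HashMap<>();
        for (int i = 0; i < R; i++) st.put("" + (char) i, i);
        int code = R + 1;               // next unused codeword (hex 81)

        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Longest prefix match: extend while the longer string is still a key.
            // This works because every prefix of a key is itself a key.
            int j = i + 1;
            while (j < input.length() && st.containsKey(input.substring(i, j + 1))) j++;
            out.add(st.get(input.substring(i, j)));

            // Add s + c, where c is the next character, if one exists and the table has room.
            if (j < input.length() && code < L)
                st.put(input.substring(i, j + 1), code++);
            i = j;
        }
        out.add(R);
        return out;
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (int c : compress("ABRACADABRABRABRA"))
            hex.append(Integer.toHexString(c).toUpperCase()).append(' ');
        System.out.println(hex.toString().trim());
        // 41 42 52 41 43 41 44 81 83 82 88 41 80
    }
}
```

The repeated `substring` calls make this sketch quadratic in the worst case; the trie mentioned in the slide avoids that cost.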
LZW expansion demo

value    41  42  52  41  43  41  44  81    83    82    88      41   80
output   A   B   R   A   C   A   D   A B   R A   B R   A B R   A

LZW expansion for 41 42 52 41 43 41 44 81 83 82 88 41 80

codeword table
key   value
⋮     ⋮
41    A
42    B
43    C
44    D
⋮     ⋮
81    AB
82    BR
83    RA
84    AC
85    CA
86    AD
87    DA
88    ABR
89    RAB
8A    BRA
8B    ABRA
LZW expansion

LZW expansion.
・Create ST associating string values with W-bit keys.
・Initialize ST to contain single-character values.
・Read a W-bit key.
・Find associated string value in ST and write it out.
・Update ST.

Q. How to represent LZW expansion code table?
A. An array of length $2^W$.

key   value
⋮     ⋮
65    A
66    B
67    C
68    D
⋮     ⋮
129   AB
130   BR
131   RA
132   AC
133   CA
134   AD
135   DA
136   ABR
137   RAB
138   BRA
139   ABRA
⋮     ⋮
Data compression: quiz 3

What is the LZW compression of A B A B A B A ?

A. 41 42 41 42 41 42 80

B. 41 42 41 81 81

C. 41 42 81 81 41

D. 41 42 81 83 80

E. I don't know.

LZW tricky case: compression

input     A   B   A B   A B A
matches   A   B   AB    ABA
value     41  42  81    83     80

LZW compression for ABABABA

codeword table
key   value
⋮     ⋮
A     41
B     42
C     43
D     44
⋮     ⋮
AB    81
BA    82
ABA   83
LZW tricky case: expansion

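value    41  42  81    83     80
output   A   B   A B   A B x

LZW expansion for 41 42 81 83 80

・Need to know code for 83 before it is in codeword table!
・We can deduce that the code for 83 is ABx for some character x: entry 83 is the previous output (AB) followed by the first character of the current output.
・The current output is the string for 83 itself, which starts with A, so x = A and the code for 83 is ABA. Now, we have deduced x!

codeword table
key   value
⋮     ⋮
41    A
42    B
43    C
44    D
⋮     ⋮
81    AB
82    BA
83    ABA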
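A sketch of expansion that handles the tricky case follows. It makes the same assumptions as the demo (7-bit ASCII with R = 128, 8-bit codewords, hex 80 as the end-of-message marker) and reads codewords from an `int[]` rather than a bitstream; the class name `LZWExpandSketch` is hypothetical, and the input is assumed to be well formed and terminated by the marker.

```java
public class LZWExpandSketch {
    private static final int R = 128;   // alphabet size; codeword R (hex 80) marks end of message
    private static final int L = 256;   // number of codewords = 2^W with W = 8

    public static String expand(int[] codewords) {
        String[] st = new String[L];
        int next;                        // next codeword to be assigned
        for (next = 0; next < R; next++) st[next] = "" + (char) next;
        st[next++] = "";                 // slot R is reserved for the end marker

        StringBuilder out = new StringBuilder();
        int k = 0;
        int codeword = codewords[k++];
        if (codeword == R) return "";
        String val = st[codeword];
        while (true) {
            out.append(val);
            codeword = codewords[k++];
            if (codeword == R) break;
            String s = st[codeword];
            // Tricky case: the codeword is the one about to be added, so its
            // string is the previous string plus that string's first character.
            if (codeword == next) s = val + val.charAt(0);
            if (next < L) st[next++] = val + s.charAt(0);
            val = s;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand(new int[] { 0x41, 0x42, 0x81, 0x83, 0x80 }));   // ABABABA
    }
}
```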
LZW implementation details

How big to make ST?


・How long is message?
・Whole message similar model?
・[many other variations]
What to do when ST fills up?
・Throw away and start over. [GIF]
・Throw away when not effective. [Unix compress]
・[many other variations]
Why not put longer substrings in ST?
・[many variations have been developed]

LZW in the real world

Lempel-Ziv and friends.


・LZ77.
・LZ78.
・LZW.
・Deflate / zlib = LZ77 variant + Huffman.

Unix compress, GIF, TIFF, V.42bis modem: LZW. [previously under patent]
zip, 7zip, gzip, jar, png, pdf: deflate / zlib. [not patented; widely used in open source]
iPhone, Wii, Apache HTTP server: deflate / zlib.
Lossless data compression benchmarks

year   scheme            bits / char
1967   ASCII             7.00
1950   Huffman           4.70
1977   LZ77              3.94
1984   LZMW              3.32
1987   LZH               3.30
1987   move-to-front     3.24
1987   LZB               3.18
1987   gzip              2.71
1988   PPMC              2.48
1994   SAKDC             2.47
1994   PPM               2.34
1995   Burrows-Wheeler   2.29   [next programming assignment]
1997   BOA               1.99
1999   RK                1.89

data compression using Calgary corpus
Data compression summary

Lossless compression.
・Represent fixed-length symbols with variable-length codes. [Huffman]
・Represent variable-length symbols with fixed-length codes. [LZW]

Lossy compression. [not covered in this course]


・JPEG, MPEG, MP3, …
・FFT/DCT, wavelets, fractals, …

Theoretical limits on compression. Shannon entropy: $H(X) = -\sum_{i=1}^{n} p(x_i) \lg p(x_i)$
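As a rough illustration of the entropy formula, the sketch below computes the entropy of a string in bits per character, taking each probability to be the character's relative frequency in that string. That choice of probabilities (an order-0 empirical model) and the class name `EntropySketch` are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class EntropySketch {
    // Empirical entropy in bits per character: -sum over symbols of p * lg p,
    // where p is the relative frequency of the symbol in text.
    public static double entropy(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        double h = 0.0;
        for (int count : freq.values()) {
            double p = (double) count / text.length();
            h -= p * Math.log(p) / Math.log(2);   // lg p = ln p / ln 2
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy("ABCD"));           // four equally likely symbols: 2 bits each
        System.out.println(entropy("ABRACADABRA!"));   // skewed frequencies: fewer bits per char
    }
}
```

Multiplying the result by the text length gives a lower bound, under this model, on the number of bits any code that encodes characters independently can achieve.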

Practical compression. Exploit extra knowledge whenever possible.
