Professional Documents
Culture Documents
http://algs4.cs.princeton.edu
http://algs4.cs.princeton.edu
Data compression
Compress Expand
bitstream B compressed version C(B) original bitstream
reconstructed B B′
bitstream
0110110101... 1101011111... 0110110101...
5
Lossy vs. Lossless
Compress Expand
bitstream B compressed version C(B) original bitstream B
0110110101... 1101011111... 0110110101...
6
Rdenudcany in Enlgsih lnagugae
・Morse code.
・Telephone system. but rather a I like like every
Pf 1. [by contradiction]
・Suppose you have a universal data compression algorithm U
U
Pf 2. [by counting]
・Suppose your algorithm that can compress all 1,000-bit strings. U
10
Data representation: genomic code
11
8-bit bytestreams,
streams so we might
are the primary decidefor
abstraction todata
readcompression,
and write bytestreams
we go a bit to further
match I/O for-
to allow
mats with the internal representations of intermixed
primitive types, encoding an 8-bit char with
Reading and writing
clients
1tive
byte,
to read
a 16-bit
types binary
and write
short
and String ). The2data
individual bits,
with bytes,isato32-bit
goal
with data of various types (primi-
int with 4 bytes, and so forth. Since bit-
minimize the necessity for type conversion in
streams are the primary abstraction for data compression,
client programs and also to take care of operating-system conventions we go a bit further to allow
for representing
clients
data.We to use
readthe
and write individual
following bits, intermixed
API for reading a bitstreamwith datastandard
from of various types (primi-
input:
tive types
Binary standard input. and StringRead). Thebits
goal is
fromto minimize the necessity
standard input. for type conversion in
public
client classand
programs BinaryStdIn
also to take care of operating-system conventions for representing
data.We
boolean use the following API for reading a bitstream from standard input:
readBoolean() read 1 bit of data and return as a boolean value
public class
char BinaryStdIn
readChar() read 8 bits of data and return as a char value
boolean
char readBoolean() read
readChar(int r) read1rbit
bitsofofdata
dataand
andreturn
returnasasa aboolean value
char value
char readChar()
[similar read 8(16
methods for byte (8 bits); short bitsbits);
of data and return as a char value
int (32 bits); long and double (64 bits)]
char readChar(int
boolean isEmpty() r) read bits of data
is ther bitstream and return as a char value
empty?
voidmethods
[similar for byte (8 bits); short
close() close(16
thebits); int (32 bits); long and double (64 bits)]
bitstream
boolean isEmpty() is the
API for static methods that bitstream
read empty? on standard input
from a bitstream
void close() close the bitstream
A key feature of the abstraction is that, in marked constrast to StdIn, the data on stan-
API for static methods that read from a bitstream on standard input
dard input is not necessarily aligned on byte boundaries. If the input stream is a single
byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close()
Amethod
key feature
is notofessential,
the abstraction
but, isclean
for that, in marked constrast
termination, clients to StdIn, the data on stan-
should call close() to in-
Binary standard
dard output. Write bits to standard output
dicateinput
thatisnonot necessarily
more bits arealigned on byte
to be read. As boundaries. If the input
with StdIn/StdOut , we stream
use theisfollowing
a single
byte, a client could
complementary APIread
foritwriting
1 bit atbitstreams
a time with to 8standard
calls to readBoolean()
output: . The close()
method is not essential, but, for clean termination, clients should call close() to in-
public
dicate thatclass
no moreBinaryStdOut
bits are to be read. As with StdIn/StdOut, we use the following
complementary API for writingb)bitstreams to standard output:
void write(boolean write the specified bit
public class
void BinaryStdOut
write(char c) write the specified 8-bit char
void
void write(boolean
write(char c, b)
int r) write
writethe
thespecified bit
r least significant bits of the specified char
voidmethods
[similar write(char (8 bits); shortwrite
for byte c) the specified
(16 bits); int (32 8-bit
bits);char
long and double (64 bits)]
close()
void write(char
void close the rbitstream
c, int r) write least significant bits of the specified char
[similar methods
APIfor static(8
forbyte bits); short
methods (16 bits);
that write int (32 bits);
to a bitstream long and
on standard double (64 bits)]
output
12
void close() close the bitstream
Writing binary data
14
standard input, encoded with the characters 0 and 1. This program is useful for debug-
ging when
ging when working
working with
with small
small inputs.
inputs. We We use
use aa slightly
slightly more
more complicated
complicated version
version that
that
Exercise 5.5.X).
5.5.X). The
Binaryjust
dumps
just prints
prints the
client HexDump
the count
count when
when the
groups the
HexDump groups
the width
the data
width argument
data into
argument is
into 8-bit
8-bit bytes
is 00 (see
bytes and
(see Exercise
and prints
prints each
each as
as two
The similar
similar
two hexadecimal
hexadecimal
client
digits that
digits that each
each represent
represent 44 bits.
bits. The
The client
client PictureDump displays the
PictureDump displays the bits
bits in
in aa Picture
Picture..
Q. HowYou
You can
tocan download HexDump
download
examine HexDump andand PictureDump
the contents PictureDump from from the
of a bitstream? the booksite.
booksite. Typically,
Typically, we
we use
use pip-
pip-
ing and
ing and redirection
redirection at
at the
the command-line
command-line level level when
when working
working with
with binary
binary files:
files: we
we can
can
pipe the
pipe the output
output of
of an
an encoder
encoder to to BinaryDump
BinaryDump,, HexDump
HexDump,, or or PictureDump
PictureDump,, or or redirect
redirect
it to
it to aa file.
file.
Standard character
Standard character stream
character stream
stream Bitstream represented
Bitstream represented with
represented with hex
with hex digits
hex digits
digits
Standard Bitstream
% more
% more abra.txt
more abra.txt
abra.txt % java
% java HexDump
java HexDump
HexDump 4 4
4 << abra.txt
< abra.txt
abra.txt
% %
ABRACADABRA!
ABRACADABRA! 41
41 4242 52
42 52 41
52 41
41
ABRACADABRA! 41
43
43 4141 44
41 44 41
44 41
41
43
Bitstream represented as 42
42 5252 41
52 41 21
41 21
21
Bitstream represented
Bitstream as 00
represented as and
and 11
0 and characters
1 characters
characters 42
12 bytes
12 bytes
bytes
% java BinaryDump 16 < abra.txt 12
% java
% java BinaryDump
BinaryDump 16 16 << abra.txt
abra.txt
0100000101000010
0100000101000010 Bitstream represented as pixels in
0100000101000010 Bitstream represented
Bitstream represented as
as pixels in aa
pixels in Picture
a Picture
Picture
0101001001000001
0101001001000001
0101001001000001 % java PictureDump 16 6 < abra.txt
0100001101000001 % java
% java PictureDump
PictureDump 16 16 6 6 < < abra.txt
abra.txt
0100001101000001
0100001101000001
0100010001000001
0100010001000001
0100010001000001 16-by-6 pixel
pixel
0100001001010010
0100001001010010
16-by-6
window, magnified
magnified
0100001001010010 window,
0100000100100001
0100000100100001 6.5 n Data Compression 667
0100000100100001
96 bits
96 bits
bits 96 bits
96 bits
bits
96 96
Four ways
Four ways
Four to
ways to look
to look at
at aa
look at bitstream
a bitstream
bitstream
ASCII encoding. When you HexDump a bit- 0 1 2 3 4 5 6 7 8 9 A B C D E F
stream that contains ASCII-encoded charac- 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
ters, the table at right is useful for reference. 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
http://algs4.cs.princeton.edu
Run-length encoding
18
Data compression: quiz 1
A. 1 / 256
B. 1 / 16
C. 8 / 255
D. 24 / 510 = 4 / 85
E. I don't know.
19
5.5 D ATA C OMPRESSION
‣ introduction
‣ run-length coding
‣ Huffman compression
Algorithms
‣ LZW compression
http://algs4.cs.princeton.edu
David Huffman
Variable-length codes
Issue. Ambiguity.
SOS ?
V7 ?
IAMIE ?
EEWNI ?
21
Variable-length codes
key value
Ex 1. Fixed-length code. ! 101 0 1
A 0
A
Ex 2. Append special stop characater to each codeword. B 1111 0 1
C 110
0 1
Ex 3. General prefix-free code. D 100 0 1
R 1110 D ! C
0 1
R B
Compressed bitstring
011111110011001000111111100101 30 bits
A B RA CA DA B RA !
R B
Compressed bitstring
Compressed bitstring
011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits
A B R A C A D A B R A !
A B RA CA DA B RA !
Two prefix-free codes
Codeword table Trie representation 22
Prefix-free codes: trie representation
key value
Trie representation
A
1
B 1111 0 1
C 110
0 1 0 1
D 100
R 1110 D ! C
0 1
R B
Compressed bitstring
011111110011001000111111100101 30 bits
A B RA CA DA B RA !
R B
Compressed bitstring
Compressed bitstring
011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits
A B R A C A D A B R A !
A B RA CA DA B RA !
Two prefix-free codes
Codeword table Trie representation 23
Prefix-free codes: expansion
Expansion.
・Start at root.
・Go left if bit is 0; go right if 1. Codeword table
key value
Trie representation
A
1
B 1111 0 1
C 110
0 1 0 1
D 100
R 1110 D ! C
0 1
R B
Compressed bitstring
011111110011001000111111100101 30 bits
A B RA CA DA B RA !
R B
Compressed bitstring
Compressed bitstring
011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits
A B R A C A D A B R A !
A B RA CA DA B RA !
Two prefix-free codes
Codeword table Trie representation 24
Prefix-free codes: compression
Compression.
・Method 1: start at leaf; follow path up to the root; print bits in reverse.
・Method 2: create ST of key-value pairs.Codeword table
key value
Trie representation
! 101 0 1
A 0
A
B 1111 0 1
C 110
0 1 0 1
D 100
R 1110 D ! C
0 1
R B
Compressed bitstring
011111110011001000111111100101 30 bits
A B RA CA DA B RA !
R B
Compressed bitstring
Compressed bitstring
011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits
A B R A C A D A B R A !
A B RA CA DA B RA !
Two prefix-free codes
Codeword table Trie representation 25
Data compression: quiz 2
A. PEED
0 1
B. PESDEY
E
C. SPED
0 1
D. SPEEDY D
E. I don't know. 0 1
S
0 1
P Y
26
Huffman coding overview
Static model. Use the same prefix-free code for all messages.
Dynamic model. Use a custom prefix-free code for each message.
Compression.
・Read message.
・Build best prefix-free code for message. How? [ahead]
・Write prefix-free code (as a trie).
・Compress message using prefix-free code.
Expansion.
・Read prefix-free code (as a trie) from file.
・Read compressed message and expand using trie.
27
Huffman trie node data type
28
Prefix-free codes: expansion
31
Huffman codes
Huffman algorithm:
・Count frequency freq[i] for each char i in input.
・Start with one node corresponding to each char i (with weight freq[i]).
・Repeat until single trie formed:
– select two tries with min weight freq[i] and freq[j]
– merge into single trie with weight freq[i] + freq[j]
Applications:
32
Constructing a Huffman encoding trie: Java implementation
merge two
while (pq.size() > 1) smallest tries
{
Node x = pq.delMin();
Node y = pq.delMin();
Node parent = new Node('\0', x.freq + y.freq, x, y);
pq.insert(parent);
}
33
Huffman compression summary
34
5.5 D ATA C OMPRESSION
‣ introduction
‣ run-length coding
‣ Huffman compression
Algorithms
‣ LZW compression
http://algs4.cs.princeton.edu
36
LZW compression demo
input A B R A C A D A B R A B R A B RR A
matches A B R A C A D A B R A B R A B R A
value 41 42 52 41 43 41 44 81 83 82 88 41 80
⋮ ⋮ AB 81 DA 87
A 41 BR 82 ABR 88
B 42 RA 83 RAB 89
C 43 AC 84 BRA 8A
D 44 CA 85 ABRA 8B
⋮ ⋮ AD 86
codeword table 37
Lempel-Ziv-Welch compression
LZW compression.
・Create ST associating W-bit codewords with string keys.
・Initialize ST with codewords for single-character keys.
・Find longest string s in ST that is a prefix of unscanned part of input.
・Write the W-bit codeword associated with s. longest prefix match
・Add s + c to ST, where c is next character in the input.
Q. How to represent LZW compression code table?
A. A trie to support longest prefix match.
A 41 B 42 C 43 D 44 R 52
B 81 C 84 D 86 R 82 A 85 A 87 A 83
R 88 A 8A B 89
A 8B
38
LZW expansion demo
value 41 42 52 41 43 41 44 81 83 82 88 41 80
output A B R A C A D A B R A B R A B R A
⋮ ⋮ 81 AB 87 DA
41 A 82 BR 88 ABR
42 B 83 RA 89 RAB
43 C 84 AC 8A BRA
44 D 85 CA 8B ABRA
⋮ ⋮ 86 AD
codeword table 39
LZW expansion
66
A
B
・Read a W-bit key. 67 C
・Find associated string value in ST and write it out. 68 D
・Update ST. ⋮ ⋮
129 AB
130 BR
Q. How to represent LZW expansion code table?
131 RA
A. An array of length 2W.
132 AC
133 CA
134 AD
135 DA
136 ABR
137 RAB
138 BRA
139 ABRA
⋮ ⋮
40
Data compression: quiz 3
A. 41 42 41 42 41 42 80
B. 41 42 41 81 81
C. 41 42 81 81 41
D. 41 42 81 83 80
E. I don't know.
41
LZW tricky case: compression
input A B A B A B A
matches A B A B A B A
value 41 42 81 83 80
⋮ ⋮ AB 81
A 41 BA 82
B 42 ABA 83
C 43
D 44
⋮ ⋮
codeword table 42
LZW tricky case: expansion
value 41 42 81 83 80
need to know code for 83
output A B A B A B A
x
before it is in codeword table!
⋮ ⋮ 81 AB
41 A 82 BA
42 B 83 ABx
ABA
?
43 C
44 D
⋮ ⋮
codeword table 43
LZW implementation details
44
LZW in the real world
Unix compress, GIF, TIFF, V.42bis modem: LZW. previously under patent
zip, 7zip, gzip, jar, png, pdf: deflate / zlib. not patented
(widely used in open source)
iPhone, Wii, Apache HTTP server: deflate / zlib.
45
Lossless data compression benchmarks
1999 RK 1.89
Lossless compression.
・Represent fixed-length symbols with variable-length codes. [Huffman]
・Represent variable-length symbols with fixed-length codes. [LZW]
n
X
Theoretical limits on compression. Shannon entropy: H(X) = p(xi ) lg p(xi )
i
47