Dictionary Techniques (Lempel-Ziv Codes)
Acknowledgement: this lecture uses notes by Dr Antonios
Symvonis of the University of Sydney
Incorporate the structure in the data in order to
increase the amount of compression
Build a list of commonly occurring patterns, the
dictionary, and encode these patterns by transmitting
their index in the list
In block coding, the datavector is partitioned into
blocks of equal length. In Lempel-Ziv coding the
datavector is partitioned into variable-length blocks
(Lempel-Ziv parsing)
Can be static or dynamic
Based on the seminal work of Jacob Ziv and Abraham
Lempel
- LZ77
- LZ78
- LZW
Applications include:
- Unix “compress”
- V.42 bis
- Graphics Interchange Format (GIF)
- Tape/disc drives
- ARC, PKARC, PKZIP, LHArc, ARJ
Lawyers have been involved…
Unisys holds a patent for the LZW algorithm
Microsoft was forced to remove Stacker-like
compression from MS-DOS (Version 6) after a lawsuit
by Stac Electronics
The static dictionary technique
- Frequently occurring patterns are
kept in the dictionary and encoded
by their index in the list. Other
patterns are encoded by some other
less efficient method.
Appropriate when considerable prior knowledge about
the source is available
Suitable for particular applications
Digram coding
The dictionary consists of all letters of the source alphabet
followed by as many pairs of letters, called digrams, as can
be accommodated by the dictionary
Example
Assume the alphabet A = {a, b, c, d, r}. Code the
sequence abracadabra based on the dictionary:

Code  Entry    Code  Entry
0     a        4     r
1     b        5     ab
2     c        6     ac
3     d        7     ad

Sequence:  ab  r  ac  ad  ab  r  a
Code:      5   4  6   7   5   4  0
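
The digram example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name digram_encode and the greedy "try a pair first, otherwise fall back to a single letter" rule are assumptions made for illustration, and the dictionary is the eight-entry table above.

```python
def digram_encode(text, dictionary):
    """Greedy digram coder: use a two-letter entry when one matches,
    otherwise fall back to the single-letter entry."""
    # Map each dictionary entry to its code (its position in the list).
    index = {entry: code for code, entry in enumerate(dictionary)}
    codes = []
    pos = 0
    while pos < len(text):
        pair = text[pos:pos + 2]
        if len(pair) == 2 and pair in index:
            codes.append(index[pair])
            pos += 2
        else:
            codes.append(index[text[pos]])
            pos += 1
    return codes


# Dictionary from the example: codes 0-4 are letters, 5-7 are digrams.
DICTIONARY = ["a", "b", "c", "d", "r", "ab", "ac", "ad"]
print(digram_encode("abracadabra", DICTIONARY))  # [5, 4, 6, 7, 5, 4, 0]
```

With 8 entries each code fits in 3 bits, so the 11-letter sequence is sent as 7 codes.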
Adaptive dictionary techniques
- techniques that adapt to the
characteristics of the source file
Based on the work of Ziv and Lempel
LZ77 - Assumes patterns recur close together
LZ78 - Based on the entire previously coded
sequence
The LZ77 (LZ1) approach
J. Ziv, A. Lempel, A universal algorithm for sequential data
compression, IEEE Transactions on Information
Theory, Vol. 23(3), pp. 337-343, May 1977.
Asymptotically, the performance of the algorithm
approaches the best that could be obtained by using
a scheme that had full knowledge about the statistics
of the source
The encoder examines the input sequence through a
sliding window which is partitioned into:
- Search buffer: contains a portion of the most recently
encoded sequence
- Look-ahead buffer: contains the next portion of the
sequence to be encoded
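Example
[Figure: the sequence cabracadabrarrarrad seen through the
sliding window. Search buffer: abracad. Look-ahead buffer:
abrarr. The search pointer marks a match at offset o = 7
with length l = 4.]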
The coding process
To encode the sequence in the look-ahead buffer, the
encoder moves a search pointer back through the
search buffer until it encounters a match to the first
symbol in the look-ahead buffer.
The distance of the pointer from the look-ahead buffer
is called the offset.
The encoder then examines the symbols following the
symbol at the pointer location to see if they match
consecutive symbols in the look-ahead buffer.
The number of consecutive symbols in the search buffer
that match consecutive symbols in the look-ahead buffer
is called the length of the match.
The encoder searches the search buffer for the longest
match and encodes it with a triple ⟨o, l, c⟩, where
o is the offset, l is the length of the match, and c is the
codeword corresponding to the symbol in the look-ahead
buffer that follows the match.
Note: the string starting in the search buffer can extend
into the look-ahead buffer
The encoder repeats:
- Find the longest match
- Transmit the triple ⟨o, l, c⟩
- Advance the window by l + 1 positions
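
The encoder loop can be sketched in Python as follows. This is a minimal, brute-force sketch under stated assumptions: the name lz77_encode and its parameters are illustrative, each triple carries the raw symbol instead of its codeword, and the text before position start is assumed to have been encoded already.

```python
def lz77_encode(data, start, search_size=7, lookahead_size=6):
    """Encode data[start:] as (offset, length, symbol) triples.

    data[:start] is assumed to be known to the decoder already.
    """
    triples = []
    pos = start
    while pos < len(data):
        best_offset, best_length = 0, 0
        # Keep one look-ahead position free for the explicit symbol
        # that follows the match.
        max_length = min(lookahead_size, len(data) - pos) - 1
        for offset in range(1, min(search_size, pos) + 1):
            length = 0
            # The comparison may run past pos, so a match that starts in
            # the search buffer can extend into the look-ahead buffer.
            while (length < max_length
                   and data[pos - offset + length] == data[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = offset, length
        triples.append((best_offset, best_length, data[pos + best_length]))
        pos += best_length + 1  # advance the window by l + 1 positions
    return triples


print(lz77_encode("cabracadabrarrarrad", 7))
# [(0, 0, 'd'), (7, 4, 'r'), (3, 5, 'd')]
```

The linear scan over every offset keeps the sketch short; practical implementations usually index the search buffer (for example with hashing) instead of scanning it.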
The decoding process
Similar to the coding: the decoder rebuilds the search buffer
Faster and simpler, since no searching is required
Example
Encode the sequence:
...cabracadabrarrarrad...
Size of search buffer = 7 letters
Size of look-ahead buffer = 6 letters
Assume that the 7 leading letters have been encoded

Before window   Search buffer   Look-ahead   Remaining   Triple
                cabraca         dabrar       rarrad      ⟨0, 0, c(d)⟩
c               abracad         abrarr       arrad       ⟨7, 4, c(r)⟩
cabrac          adabrar         rarrad                   ⟨3, 5, c(d)⟩
END
Example (continued)
Decode the sequence:
⟨0, 0, c(d)⟩, ⟨7, 4, c(r)⟩, ⟨3, 5, c(d)⟩
Size of search buffer = 7 letters
Assume that the sequence cabraca has been decoded

Triple          Buffer before   Letters decoded   Buffer after
⟨0, 0, c(d)⟩    cabraca         d                 abracad
⟨7, 4, c(r)⟩    abracad         abrar             adabrar
⟨3, 5, c(d)⟩    adabrar         rarrad            rrarrad
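
The decoding steps above can be sketched in Python. This is a minimal sketch: the name lz77_decode is an illustrative assumption, triples carry raw symbols instead of codewords, and the whole output is kept in memory instead of only the last 7 letters.

```python
def lz77_decode(triples, prefix=""):
    """Rebuild the text from (offset, length, symbol) triples.

    prefix is the part of the sequence assumed to be decoded already.
    """
    out = list(prefix)
    for offset, length, symbol in triples:
        start = len(out) - offset
        for k in range(length):
            # Copy one letter at a time so that a match may overlap the
            # letters it is producing (match extending into look-ahead).
            out.append(out[start + k])
        out.append(symbol)
    return "".join(out)


triples = [(0, 0, "d"), (7, 4, "r"), (3, 5, "d")]
print(lz77_decode(triples, prefix="cabraca"))  # cabracadabrarrarrad
```

The third triple shows why the copy is done letter by letter: with offset 3 and length 5, the last two copied letters are ones the same copy has just written.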
Variations of the LZ77 scheme
Encode triples with a variable-length code
- PKZip, Zip, LHArc, ARJ
Variable size of search and look-ahead buffers
Eliminate the third member of the triple by using a flag bit to
indicate whether what follows is the codeword for a
single letter. This also avoids spending a full
triple on a single character (LZSS algorithm)
The Achilles' heel of LZ77
a b c d e f g h i a b c d e f g h i a b c d e f g h i
search buffer | look-ahead buffer
A periodic sequence with a period longer than the search
buffer
None of the new symbols will have a match in the search
buffer, so each will have to be represented by a separate
codeword
The LZ77 approach assumes that like patterns will
occur close together. Any pattern that recurs over a
period longer than that covered by the coder window
will not be captured.
The LZ78 algorithm solves this problem.

The LZ78 approach
J. Ziv, A. Lempel, Compression of individual
sequences via variable-rate coding, IEEE
Transactions on Information Theory, Vol. 24(5), pp.
530-536, September 1978
Problem with LZ77: assumes that like patterns occur
close together
LZ78 keeps an explicit dictionary containing “all”
distinct patterns seen during the encoding
Both the encoder and the decoder have to build the
dictionary
The input sequence is coded as a sequence of tuples
⟨i, c⟩, where
- i is the index of the dictionary entry that is the
longest match to the input
- c is the codeword for the input character that
follows the matched portion of the input
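
A short Python sketch may make the tuple scheme concrete. The names lz78_encode and lz78_decode are illustrative assumptions, as are three conventions the slides do not fix: dictionary indices start at 1, index 0 means "no match", and an input that ends in the middle of a known pattern is flushed as a tuple with an empty character.

```python
def lz78_encode(data):
    """Encode data as (index, symbol) pairs; index 0 means 'no match'."""
    dictionary = {}  # pattern -> index, indices start at 1
    pairs = []
    match = ""
    for symbol in data:
        if match + symbol in dictionary:
            match += symbol  # keep extending the current match
        else:
            pairs.append((dictionary.get(match, 0), symbol))
            dictionary[match + symbol] = len(dictionary) + 1
            match = ""
    if match:
        # Input ended inside a known pattern (assumed convention:
        # send its index with an empty symbol).
        pairs.append((dictionary[match], ""))
    return pairs


def lz78_decode(pairs):
    """Rebuild the text while growing the same dictionary as the encoder."""
    entries = [""]  # entry 0 is the empty pattern
    out = []
    for index, symbol in pairs:
        pattern = entries[index] + symbol
        entries.append(pattern)
        out.append(pattern)
    return "".join(out)


pairs = lz78_encode("abab")
print(pairs)               # [(0, 'a'), (0, 'b'), (1, 'b')]
print(lz78_decode(pairs))  # abab
```

Both sides add the same entry after every tuple, so the dictionary itself is never transmitted.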
The compression performance of the Lempel-Ziv code
Let X = (X_1, X_2, ..., X_n) denote the datavector to be
compressed.
Let LZ(X) denote the length of the codeword assigned
by the Lempel-Ziv code to X.
For arbitrary j,
\[ \frac{LZ(X)}{n} \le H_j(X) + c_j \, \frac{\log_2 \log_2 n}{\log_2 n}, \]
where H_j(X) is the entropy of the j-th order blocks for X
and c_j is a constant that depends on j.
The Lempel-Ziv algorithm yields a compression rate
approximately no worse than that of the block code,
provided the datavector is long enough relative to the
order of the block.
The LZW approach
A variation of LZ78 which avoids the transmission of
the input character that follows the “match”
The dictionary, at both the coder and the decoder
sides, initially contains all alphabet symbols
Assume that string m is the match and that a is the
input character that follows it.
The encoder repeats:
- Transmit the index of m in the dictionary
- Insert ma (m followed by a) into the dictionary
- Build the next match starting with a
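
The three encoder steps can be sketched in Python. This is a minimal sketch: the name lzw_encode is an illustrative assumption, the dictionary is unbounded, and every input symbol is assumed to be in the supplied alphabet.

```python
def lzw_encode(data, alphabet):
    """Encode data as a list of dictionary indices.

    Assumes every symbol of data appears in alphabet.
    """
    # The dictionary initially contains all alphabet symbols.
    dictionary = {symbol: index for index, symbol in enumerate(alphabet)}
    codes = []
    match = ""
    for symbol in data:
        if match + symbol in dictionary:
            match += symbol
        else:
            codes.append(dictionary[match])                # transmit index of m
            dictionary[match + symbol] = len(dictionary)   # insert ma
            match = symbol                                 # next match starts with a
    if match:
        codes.append(dictionary[match])  # flush the final match
    return codes


print(lzw_encode("abababa", "ab"))  # [0, 1, 2, 4]
```

In this run the last code, 4, refers to the entry aba that the encoder inserted only one step earlier.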
LZW decoding
A complication arises when the input sequence contains
KwKwK, where Kw is already in the dictionary (K is a
letter of the alphabet and w a string)
The coder:
- Sends the code for Kw
- Inserts KwK in the dictionary
- Sends the code for KwK and …
The decoder:
- On receiving the code for KwK, it will not yet
have added that entry to its dictionary, since it
does not yet know the last character to add to the
previously received string
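
One common way to resolve the KwKwK case is to note that the only code the decoder can receive before defining it is the very next free index, and that its string must be the previously decoded string followed by that string's own first letter. The Python sketch below follows this rule; the name lzw_decode is an illustrative assumption, and the code list [0, 1, 2, 4] was derived by hand as an LZW encoding of abababa over the alphabet {a, b}.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the text from LZW indices, including the KwKwK case."""
    entries = list(alphabet)  # same initial dictionary as the encoder
    if not codes:
        return ""
    previous = entries[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(entries):
            current = entries[code]
        elif code == len(entries):
            # KwKwK case: the code names the entry still being built,
            # which must be the previous string plus its first letter.
            current = previous + previous[0]
        else:
            raise ValueError("code %d cannot occur at this point" % code)
        # The pending entry is the previous string plus the first
        # letter of the string just decoded.
        entries.append(previous + current[0])
        out.append(current)
        previous = current
    return "".join(out)


print(lzw_decode([0, 1, 2, 4], "ab"))  # abababa
```

Here code 4 arrives when the decoder holds only entries 0 to 3, so it is rebuilt as ab followed by a.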
Applications of LZW coding
Unix “compress”
- Adaptive dictionary size: 2^9 to 2^16 entries
- Codewords increase in length as dictionary size
increases
- When the dictionary reaches maximum size:
o It switches to static coding
o It monitors the compression ratio and
flushes the dictionary if the compression ratio
drops below a threshold
Graphics Interchange Format (GIF)
- Developed by Compuserve Information Service
- An implementation of LZW similar to “compress”
- “Unclear” future due to the patent held by Unisys for
LZW
- Works well with computer-generated images
- It is an unfortunate choice for continuous-tone
images
Compression over modems: V.42 bis
- Consultative Committee on International
Telephone and Telegraph (CCITT)
Recommendation V.42 bis
- Operates in two modes:
o Transparent mode
No compression; used when the sequence does
not contain repeating patterns (usually
previously compressed files)
o Compressed mode
LZW algorithm; forbids the transmission of
an entry immediately after its insertion into the dictionary
- Recommends periodic testing to detect whether data
expansion takes place; does not specify the test
- Variable size dictionary
o Negotiated when the link is established
o Minimum dictionary size: 512 entries
o Recommended dictionary size: 2048 entries
o When the dictionary is full, the “oldest” entry
that has not been encountered since its creation
is removed
- Maximum string size
o Used to reduce errors during transmission
over phone lines
o Negotiated when the link is established
o Recommended string length: 6 to 250
