
Name of the Teacher Zohaib Hasan Khan

Mobile Number 7752846666


Email ID zhkhan@iul.ac.in
Designation Assistant Professor
University Name Integral University, Lucknow
Stream Engineering
Faculty Name Engineering
Department Name Electronics and Communication Engineering

Subject Name Data Compression


Program Name BCA
Program Duration 3 years
Subtopic Lossless and Lossy Algorithms

Content Type Presentation


Search Keywords Coding a Sequence, Generating a Binary Code, Static
Dictionary: Digram Coding, The LZ77 Approach, The LZ78 Approach
Unit 3
Data Compression (CA209)
by
Zohaib Hasan Khan
Assistant Professor
Department of Electronics and Communication Engg.
Integral University, Lucknow
UNIT-III Syllabus

Arithmetic Coding: Coding a Sequence, Generating a Binary Code, Comparison of Arithmetic
and Huffman Coding.
Dictionary Techniques: Introduction, Static Dictionary: Digram Coding, Adaptive Dictionary:
The LZ77 Approach, The LZ78 Approach.
Applications: File Compression, Image Compression
Lossless Image Compression: Multi-resolution Approaches.
Context Based Compression: Dynamic Markov Compression.

Coding rate is the average number of bits used to represent a symbol from a
source.
 For a given probability model, the entropy is the lowest rate at which the
source can be coded.
Huffman coding will generate a code whose rate is within p_max + 0.086 of the
entropy, where p_max is the probability of the most frequently occurring symbol.
Therefore, in Huffman coding, when the alphabet size is large, p_max is typically
small and the deviation from the entropy is quite small; for small alphabets
(large p_max), the deviation can be significant.
One solution to this problem is blocking: it is more efficient to generate
codewords for groups or sequences of symbols rather than a separate codeword
for each symbol in a sequence.
 In order to find the Huffman code for a sequence of length m, we need
codewords for all possible sequences of length m.
 This causes an exponential growth in the size of the codebook.
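To make the blocking trade-off concrete, here is a minimal sketch (the two-symbol source and its probabilities are made up for illustration) comparing the Huffman rate for single symbols and for blocks of two against the entropy:

import heapq, itertools, math

def avg_huffman_length(probs):
    # Build a Huffman tree over `probs`; track each symbol's code length.
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)      # merge the two least
        p2, _, d2 = heapq.heappop(heap)      # probable subtrees
        tiebreak += 1
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
    lengths = heap[0][2]
    return sum(probs[s] * lengths[s] for s in probs)

probs = {"a": 0.8, "b": 0.2}                 # a skewed two-symbol source
entropy = -sum(p * math.log2(p) for p in probs.values())
rate1 = avg_huffman_length(probs)            # one codeword per symbol
pairs = {s1 + s2: p1 * p2                    # block two symbols together:
         for (s1, p1), (s2, p2)              # the codebook now has 2**m
         in itertools.product(probs.items(), repeat=2)}   # entries for blocks of m
rate2 = avg_huffman_length(pairs) / 2        # bits per original symbol
print(f"entropy={entropy:.3f}  rate(m=1)={rate1:.3f}  rate(m=2)={rate2:.3f}")
# entropy=0.722  rate(m=1)=1.000  rate(m=2)=0.780

Blocking two symbols already brings the rate from 1.0 down to 0.78 bits per symbol, at the cost of a codebook that grows exponentially in the block length.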

Arithmetic Coding
We need a way of assigning codewords to particular sequences without having to
generate codes for all sequences of that length.
 Rather than separating the input into component symbols and replacing each with a code,
arithmetic coding encodes the entire message as a single number (tag).
 First, a unique identifier or tag is generated for the sequence; second, this tag is
given a unique binary code.
Arithmetic coding is based on the concept of interval subdivision:
 In arithmetic coding, a source ensemble is represented by an interval between 0 and 1 on the real number line.
 Each symbol of the ensemble narrows this interval.
 As the interval becomes smaller, the number of bits needed to specify it grows.
 Arithmetic coding assumes an explicit probabilistic model of the source.
 It uses the probabilities of the source messages to successively narrow the interval used to represent the
ensemble.
 A high-probability message narrows the interval less than a low-probability message, so high-probability
messages contribute fewer bits to the coded ensemble.
 Assume we know the probabilities of each symbol of the data source.
 We can then allocate to each symbol an interval whose width is proportional to its
probability, such that no two intervals overlap.
 This can be done by using the cumulative probabilities as the two ends of each
interval: the two ends of the interval for symbol x are Q[x-1] and Q[x], and
symbol x is said to own the range [Q[x-1], Q[x]).
 We begin with the interval [0, 1) and subdivide it iteratively: for each symbol
read, the current interval is divided according to the probabilities of the
alphabet, and the sub-interval corresponding to that symbol becomes the new
current interval.
 The procedure continues until all symbols in the message have been processed.
 Since the symbols' intervals do not overlap, each possible message is assigned
a unique interval.
 We can represent the message by the interval's two ends [L, H). In fact, taking
any single value in the interval as the encoded code is enough, and usually the
left end L is selected.
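A minimal sketch of this procedure (the function name and the three-symbol model are mine, for illustration): it builds each symbol's owned range from the cumulative probabilities, subdivides [0, 1) symbol by symbol, and returns the final interval [L, H):

def arithmetic_interval(message, probs):
    # Build each symbol's owned range [Q[x-1], Q[x]) from the probabilities.
    ranges, q = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (q, q + p)
        q += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        sym_low, sym_high = ranges[sym]
        # Narrow the current interval to the symbol's sub-range.
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

# Example with a made-up three-symbol model:
low, high = arithmetic_interval("aab", {"a": 0.6, "b": 0.3, "c": 0.1})
print(low, high)   # -> 0.216 0.324; any value in [low, high) encodes "aab"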
Once the character probabilities are known, the individual symbols need to be
assigned a range along a "probability line," which is nominally 0 to 1. It
doesn't matter which characters are assigned which segment of the range, as
long as it is done in the same manner by both the encoder and the decoder.
The nine-character symbol set used here would look like Figure 2.

Each character is assigned the portion of the 0-1 range that corresponds to its
probability of appearance. Note also that the character "owns" everything up
to, but not including, the higher number. So the letter T in fact has the range
0.90-0.9999....

After the first character is encoded, we also know that the range for our output
number is bounded by the low and high numbers. During the rest of the encoding
process, each new symbol to be encoded will further restrict the possible range
of the output number. The next character to be encoded, I, owns the range 0.50
through 0.60. If this were the first character in our message, we would set
these as our low- and high-range values. But I is the second character, so we
say that I owns the range corresponding to 0.50-0.60 within the new subrange of
0.2-0.3. This means that the new encoded number will have to fall somewhere in
the 50th to 60th percentile of the currently established range.
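A quick numeric check of this narrowing step (the two ranges are the ones quoted above; the helper function is mine):

def narrow(low, high, sym_low, sym_high):
    # Restrict [low, high) to the symbol's sub-range within it.
    width = high - low
    return low + width * sym_low, low + width * sym_high

low, high = 0.0, 1.0
low, high = narrow(low, high, 0.20, 0.30)  # encode 'B': interval is [0.2, 0.3)
low, high = narrow(low, high, 0.50, 0.60)  # encode 'I': 50th-60th percentile of it
print(low, high)                           # -> 0.25 0.26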

Comparison of Arithmetic and Huffman Coding

Arithmetic Coding                                  | Huffman Coding
---------------------------------------------------|---------------------------------------------------
Does not need the probability distribution         | Needs the probability distribution in advance
in advance (the model can adapt on the fly)        |
No need to keep and send a codeword table          | Needs to store and send the codeword table
Decompression speed is slow                        | Decompression speed is fast
Compression speed is slow                          | Compression speed is fast
Compression ratio is very good                     | Compression ratio is poorer
No compressed-pattern matching                     | Supports compressed-pattern matching
Fractional codeword length                         | Minimum codeword length is 1 bit
Does not produce a prefix code                     | Produces a prefix code

Statistical methods: each symbol or group of symbols is encoded with a
variable-length code, according to some probability distribution.
 Examples: Huffman coding, Dynamic Markov Compression

Dictionary methods: based on the use of a dictionary, which can be static or
dynamic; they code each symbol or group of symbols with an element of the
dictionary.
 Example: Lempel-Ziv-Welch (LZW)
Dictionary Coding
A dictionary coder, also sometimes known as a substitution coder, is a class of
lossless data compression algorithms which operate by searching for matches
between the text to be compressed and a set of strings contained in a data
structure (called the 'dictionary') maintained by the encoder.
When the encoder finds such a match, it substitutes a reference to the string's
position in the data structure.
This coding technique is useful when the data contains long, frequently
repeated phrases or sentences.
Two types:
I. Static: Digram coding
II. Dynamic (adaptive): LZ77 / LZ78 / LZW
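As a rough illustration of the adaptive-dictionary idea, here is a minimal LZ77-style encoder sketch (the function, window size, and example string are mine, for illustration only): it emits (offset, length, next character) triples describing the longest match found in a sliding window of already-seen text:

def lz77_encode(data, window=16):
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):   # candidate match starts
            length = 0
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1                      # extend the match
            if length > best_len:
                best_off, best_len = i - j, length
        # Emit the match plus the first non-matching character.
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("abababc"))
# -> [(0, 0, 'a'), (0, 0, 'b'), (2, 4, 'c')]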
Lempel-Ziv-Welch (LZW)
Created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by
Welch in 1984 as an improved implementation of the LZ78 algorithm, published by
Lempel and Ziv in 1978.
A universal, adaptive(1) lossless data compression algorithm.
Builds a translation table (also called a dictionary) from the text being compressed.
The string translation table maps the message strings to fixed-length codes.

(1) The coding scheme used for the kth character of a message is based on the
characteristics of the preceding k − 1 characters in the message.

Lempel-Ziv-Welch (LZW) Compression Algorithm
 As mentioned earlier, static coding schemes require some knowledge about
the data before encoding takes place.

 Universal coding schemes, like LZW, do not require advance knowledge and
can build such knowledge on-the-fly.

 LZW is the foremost technique for general purpose data compression due to
its simplicity and versatility.

 It is the basis of many PC utilities that claim to "double the capacity of
your hard drive".

 LZW compression uses a code table, with 4096 as a common choice for the
number of table entries.
LZW (cont'd)
 Codes 0-255 in the code table are always assigned to represent single bytes
from the input file.

When encoding begins, the code table contains only the first 256 entries, with
the remainder of the table being blanks.

 Compression is achieved by using codes 256 through 4095 to represent
sequences of bytes.

 As the encoding continues, LZW identifies repeated sequences in the data
and adds them to the code table.

 Decoding is achieved by taking each code from the compressed file, and
translating it through the code table to find what character or characters it
represents.

LZW Encoding Algorithm
initialize table with single-character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output the code for P
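A runnable Python version of this pseudocode (a sketch; it keeps the slides' P and C naming and assumes the usual 256 single-byte initial entries):

def lzw_encode(data):
    table = {chr(i): i for i in range(256)}  # codes 0-255: single characters
    next_code, out = 256, []
    p = data[0]
    for c in data[1:]:
        if p + c in table:
            p = p + c                        # extend the current match
        else:
            out.append(table[p])             # emit code for the longest match
            table[p + c] = next_code         # learn the new string
            next_code += 1
            p = c
    out.append(table[p])                     # emit code for the final match
    return out

print(lzw_encode("BABAABAAA"))               # -> [66, 65, 256, 257, 65, 260]

Running it on the string of Example 1 below reproduces the slides' output sequence.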

Example 1: Compression using LZW

Example 1: Use the LZW algorithm to compress the string

BABAABAAA

Example 1: LZW Compression Step 1

BABAABAAA                      P = A, C = empty

ENCODER OUTPUT                 STRING TABLE
code   represents              codeword   string
66     B                       256        BA

Example 1: LZW Compression Step 2

BABAABAAA                      P = B, C = empty

ENCODER OUTPUT                 STRING TABLE
code   represents              codeword   string
66     B                       256        BA
65     A                       257        AB

Example 1: LZW Compression Step 3

BABAABAAA                      P = A, C = empty

ENCODER OUTPUT                 STRING TABLE
code   represents              codeword   string
66     B                       256        BA
65     A                       257        AB
256    BA                      258        BAA

Example 1: LZW Compression Step 4

BABAABAAA                      P = A, C = empty

ENCODER OUTPUT                 STRING TABLE
code   represents              codeword   string
66     B                       256        BA
65     A                       257        AB
256    BA                      258        BAA
257    AB                      259        ABA

Example 1: LZW Compression Step 5

BABAABAAA                      P = A, C = A

ENCODER OUTPUT                 STRING TABLE
code   represents              codeword   string
66     B                       256        BA
65     A                       257        AB
256    BA                      258        BAA
257    AB                      259        ABA
65     A                       260        AA

Example 1: LZW Compression Step 6

BABAABAAA                      P = AA, C = empty

ENCODER OUTPUT                 STRING TABLE
code   represents              codeword   string
66     B                       256        BA
65     A                       257        AB
256    BA                      258        BAA
257    AB                      259        ABA
65     A                       260        AA
260    AA
LZW Decompression

 The LZW decompressor creates the same string table during decompression.

 It starts with the first 256 table entries initialized to single characters.

 The string table is updated for each input code, except the first one.

 Decoding is achieved by reading codes and translating them through the code
table as it is being built.

LZW Decompression Algorithm
initialize table with single-character strings
OLD = first input code
output translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add translation of OLD + C to the string table
    OLD = NEW
END WHILE
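A runnable Python version of this pseudocode (a sketch keeping the slides' OLD, NEW, S, and C naming):

def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}  # codes 0-255: single characters
    next_code = 256
    old = codes[0]
    out = [table[old]]                       # output translation of OLD
    c = table[old][0]
    for new in codes[1:]:
        if new not in table:                 # special case: code not yet known
            s = table[old] + c
        else:
            s = table[new]
        out.append(s)                        # output S
        c = s[0]                             # C = first character of S
        table[next_code] = table[old] + c    # add translation of OLD + C
        next_code += 1
        old = new
    return "".join(out)

print(lzw_decode([66, 65, 256, 257, 65, 260]))  # -> "BABAABAAA"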
Example 2: LZW Decompression

Example 2: Use LZW to decompress the output sequence of Example 1:

<66><65><256><257><65><260>

Example 2: LZW Decompression Step 1

<66><65><256><257><65><260>    OLD = 65, NEW = 65, S = A, C = A

DECODER OUTPUT                 STRING TABLE
string                         codeword   string
B
A                              256        BA

Example 2: LZW Decompression Step 2

<66><65><256><257><65><260>    OLD = 256, NEW = 256, S = BA, C = B

DECODER OUTPUT                 STRING TABLE
string                         codeword   string
B
A                              256        BA
BA                             257        AB

Example 2: LZW Decompression Step 3

<66><65><256><257><65><260>    OLD = 257, NEW = 257, S = AB, C = A

DECODER OUTPUT                 STRING TABLE
string                         codeword   string
B
A                              256        BA
BA                             257        AB
AB                             258        BAA

Example 2: LZW Decompression Step 4

<66><65><256><257><65><260>    OLD = 65, NEW = 65, S = A, C = A

DECODER OUTPUT                 STRING TABLE
string                         codeword   string
B
A                              256        BA
BA                             257        AB
AB                             258        BAA
A                              259        ABA

Example 2: LZW Decompression Step 5

<66><65><256><257><65><260>    OLD = 260, NEW = 260, S = AA, C = A

DECODER OUTPUT                 STRING TABLE
string                         codeword   string
B
A                              256        BA
BA                             257        AB
AB                             258        BAA
A                              259        ABA
AA                             260        AA
LZW: Some Notes
 This algorithm compresses repetitive sequences of data well.

 Since the codewords are 12 bits, any single encoded character will expand the
data size rather than reduce it.

 In this example, the nine 8-bit input characters (72 bits) are represented by
six 12-bit codes, i.e. 72 bits again. After a reasonable string table is built,
compression improves dramatically.

 Advantages of LZW over Huffman:

 LZW requires no prior information about the input data stream.
 LZW can compress the input stream in a single pass.
 Another advantage of LZW is its simplicity, allowing fast execution.

LZW: Limitations
 What happens when the dictionary gets too large (i.e., when all the 4096 locations have been
used)?
 Here are some options usually implemented:

 Simply forget about adding any more entries and use the table as is.

 Throw the dictionary away when it reaches a certain size.

 Throw the dictionary away when it is no longer effective at compression.

 Clear entries 256-4095 and start building the dictionary again.

 Some clever schemes rebuild a string table from the last N input characters.

Lossless Image Compression: Multi-resolution Approaches.
Image compression is a type of data compression applied to digital images to
reduce the cost of storing or transmitting them.
Image compression may be lossy or lossless. Lossless compression is preferred for
archival purposes and often for medical imaging, technical drawings, clip art, or comics.
Methods for lossless compression:
Run-length encoding – used as the default method in PCX and as one of the
possible methods in BMP, TGA, and TIFF (a minimal sketch follows this list)
Area image compression
Predictive coding – used in DPCM
Entropy encoding – the two most common entropy encoding techniques are arithmetic
coding and Huffman coding
Adaptive dictionary algorithms such as LZW – used in GIF and TIFF
DEFLATE – used in PNG, MNG, and TIFF
Chain codes
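As a minimal sketch of the first of these methods (the code and the example string are mine, for illustration), run-length encoding turns each run of identical symbols into a (symbol, count) pair:

def rle_encode(data):
    out = []
    for sym in data:
        if out and out[-1][0] == sym:
            out[-1][1] += 1                # extend the current run
        else:
            out.append([sym, 1])           # start a new run
    return [(s, n) for s, n in out]

def rle_decode(pairs):
    return "".join(s * n for s, n in pairs)

encoded = rle_encode("WWWWBBBWWW")
print(encoded)                             # -> [('W', 4), ('B', 3), ('W', 3)]
print(rle_decode(encoded))                 # -> "WWWWBBBWWW"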
Context Based Compression: Dynamic Markov Compression
Developed by Gordon Cormack and Nigel Horspool (1987).
 An adaptive lossless data compression algorithm.
Based on modelling the binary source to be encoded by means of a Markov chain,
which describes the transition probabilities between the symbol "0" and the symbol "1".
The model that is built is used to predict the next bit of the message; the
predicted bit is then coded using arithmetic coding.

Dynamic Markov compression (DMC) uses predictive arithmetic coding similar to
prediction by partial matching (PPM), except that the input is predicted one bit
at a time (rather than one byte at a time). DMC has a good compression ratio and
moderate speed, similar to PPM, but requires somewhat more memory and is not
widely implemented. It is a relatively obscure form of compression that uses
Markov chains to model the patterns represented in a file.
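As a rough illustration (this toy fragment is mine; it is not Cormack and Horspool's algorithm and omits DMC's defining state-cloning step), the core idea is that each model state keeps counts of the 0s and 1s that followed it, and those counts give the bit probability handed to the arithmetic coder:

class MarkovBitModel:
    def __init__(self, n_states=256):
        # counts[state][bit]: how often `bit` followed `state` (Laplace-smoothed).
        self.counts = [[1, 1] for _ in range(n_states)]
        self.n_states = n_states
        self.state = 0

    def predict(self):
        # Estimated probability that the next bit is 1, in the current state.
        n0, n1 = self.counts[self.state]
        return n1 / (n0 + n1)

    def update(self, bit):
        self.counts[self.state][bit] += 1
        # Here the state is just a shift register of recent bits; real DMC
        # instead follows (and dynamically grows) a state-transition graph.
        self.state = ((self.state << 1) | bit) % self.n_states

model = MarkovBitModel()
for bit in [1, 0, 1, 0, 1, 0, 1, 0]:
    p1 = model.predict()   # this probability would drive the arithmetic coder
    model.update(bit)
print(round(model.predict(), 3))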

Each circle represents a state, and each arrow represents a transition. In this
example, we have two states, raining and sunny, a perfect representation of true
weather. Each state has two possible transitions: it can transition to itself
again, or it can transition to the other state. The likelihood of each
transition is defined by a percentage representing the probability that the
transition occurs.

Now let's say it's sunny and we're following this model. According to the model
there's a 50% chance it's sunny again tomorrow or a 50% chance it's rainy
tomorrow. If it becomes rainy, then there's a 25% chance it's rainy the day
after that or a 75% chance it's sunny the day after that.
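A small sketch (mine) that propagates the day-by-day probabilities of this two-state chain, using the transition numbers quoted above:

P = {
    "sunny": {"sunny": 0.50, "rainy": 0.50},   # sunny -> 50/50
    "rainy": {"rainy": 0.25, "sunny": 0.75},   # rainy -> 25% rainy, 75% sunny
}

def step(dist):
    # Propagate a probability distribution over states one day forward.
    out = {s: 0.0 for s in P}
    for s, p in dist.items():
        for t, q in P[s].items():
            out[t] += p * q
    return out

dist = {"sunny": 1.0, "rainy": 0.0}            # today is sunny
for day in range(1, 4):
    dist = step(dist)
    print(day, {s: round(p, 4) for s, p in dist.items()})
# day 1: sunny 0.5,   rainy 0.5
# day 2: sunny 0.625, rainy 0.375  (= 0.5*0.5 + 0.5*0.75, etc.)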
