Unit 3 CA209
Coding rate is the average number of bits used to represent a symbol from a
source.
For a given probability model, the entropy is the lowest rate at which the
source can be coded.
Huffman coding will generate a code whose rate is within p_max + 0.086 of the entropy, where p_max is the probability of the most frequent symbol.
Therefore, in Huffman coding, when the alphabet size is large, p_max tends to be small and the deviation from the entropy is quite small; when the alphabet is small, p_max can be large and the deviation can be significant.
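As a quick numeric illustration of this bound (values chosen only for illustration): if p_max = 0.5, the Huffman rate may exceed the entropy by up to 0.5 + 0.086 = 0.586 bits/symbol, while if p_max = 0.01 the overhead is at most 0.096 bits/symbol.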
One solution to this problem is blocking in Huffman coding: rather than generating a separate codeword for each symbol, it is more efficient to generate codewords for groups or sequences of symbols.
However, in order to find the Huffman code for sequences of length m, we need codewords for all possible sequences of length m.
This causes exponential growth in the size of the codebook: for an alphabet of k symbols there are k^m sequences (e.g., 26^4 = 456,976 codewords for four-letter blocks over a 26-letter alphabet).
Arithmetic coding
We need a way of assigning codewords to particular sequences without having to generate codes for all sequences of that length.
Rather than splitting the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message as a single number (the tag).
First, a unique identifier or tag is generated for the sequence; second, this tag is given a unique binary code.
Arithmetic coding is based on the concept of interval subdivision.
– In arithmetic coding, a source ensemble is represented by an interval between 0 and 1 on the real number line.
– Each symbol of the ensemble narrows this interval.
– As the interval becomes smaller, the number of bits needed to specify it grows.
– Arithmetic coding assumes an explicit probabilistic model of the source.
– It uses the probabilities of the source messages to successively narrow the interval used to represent the ensemble.
A high-probability message narrows the interval less than a low-probability message, so high-probability messages contribute fewer bits to the coded ensemble.
Assume we know the probabilities of each symbol of the data source.
We can then allocate to each symbol an interval whose width is proportional to its probability, such that no two intervals overlap.
This can be done by using the cumulative probabilities as the two ends of each interval: Q[x] = P(1) + P(2) + … + P(x).
Therefore, the two ends of the interval for symbol x are Q[x−1] and Q[x].
Symbol x is said to own the range [Q[x−1], Q[x]).
We begin with the interval [0, 1) and subdivide the interval iteratively.
For each symbol read, the current interval is subdivided according to the probabilities of the alphabet.
The subinterval corresponding to that symbol is picked as the interval to be subdivided further.
The procedure continues until all symbols in the message have been processed.
Since each symbol's interval does not overlap with others, for each possible message there is
a unique interval assigned.
We can represent the message by the interval's two ends [L, H). In fact, taking any single value in the interval as the encoded tag is enough, and usually the left end L is selected.
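To make the procedure concrete, here is a minimal Python sketch of the interval-subdivision encoder described above; the alphabet, probabilities, and function name are invented for illustration:

def arith_interval(message, prob):
    # Build each symbol's [Q[x-1], Q[x]) range from cumulative probabilities.
    cum, Q = 0.0, {}
    for s in sorted(prob):
        Q[s] = (cum, cum + prob[s])
        cum += prob[s]
    low, high = 0.0, 1.0                      # begin with the interval [0, 1)
    for s in message:
        width = high - low                    # subdivide the current interval
        low, high = low + width * Q[s][0], low + width * Q[s][1]
    return low, high                          # any value in [low, high) is a valid tag

low, high = arith_interval("AAB", {"A": 0.8, "B": 0.2})
print(low, high)   # approximately 0.512 0.64; the left end low can serve as the tag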
Once the character probabilities are known, the individual symbols need to be assigned a range along a "probability line," which is nominally 0 to 1. It does not matter which characters are assigned which segment of the range, as long as it is done in the same manner by both the encoder and the decoder. The nine-character symbol set used here would look like Figure 2.
Each character is assigned the portion of the 0 to 1 range that corresponds to its probability of appearance. Note also that a character "owns" everything up to, but not including, the higher number. So the letter T in fact has the range 0.90 to 0.9999....
After the first character is encoded, we also know that the range for our output number is bounded by the low and high numbers. During the rest of the encoding process, each new symbol to be encoded will further restrict the possible range of the output number. The next character to be encoded, I, owns the range 0.50 through 0.60. If it were the first character in our message, we would set these as our low- and high-range values. But I is the second character; therefore, we say that I owns the range corresponding to 0.50 to 0.60 within the new subrange of 0.2 to 0.3. This means that the new encoded number will have to fall somewhere in the 50th to 60th percentile of the currently established range.
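Concretely, the update rule is low' = low + (high − low) × symbol_low and high' = low + (high − low) × symbol_high. With the current range 0.2 to 0.3 and I owning 0.50 to 0.60 (figures from the example above), this gives low' = 0.2 + 0.1 × 0.50 = 0.25 and high' = 0.2 + 0.1 × 0.60 = 0.26.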
Arithmetic Coding vs. Huffman Coding
– Arithmetic coding does not need the probability distribution fixed in advance (it can work with an adaptive model); Huffman coding needs the probability distribution before the code tree can be built.
– Arithmetic coding has no codeword table to keep or send; Huffman coding must store and transmit the codeword table.
Statistical methods: each symbol or group of symbols is encoded with a variable-length code, according to some probability distribution.
– Examples: Huffman coding, Dynamic Markov Compression
Dictionary methods: based on the use of a dictionary, which can be static or dynamic; they code each symbol or group of symbols with an element of the dictionary.
– Example: Lempel-Ziv-Welch
Dictionary Coding
A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the 'dictionary') maintained by the encoder.
When the encoder finds such a match, it substitutes a reference to the string's position in the data structure; a short sketch of this substitution idea follows the list below.
This coding technique is effective when the data contains long, frequently repeated phrases or sentences.
Two types:
I. Static (e.g., digram coding)
II. Dynamic (adaptive): LZ77/LZ78/LZW
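As an illustration of the substitution idea, here is a minimal Python sketch of a static dictionary coder; the dictionary contents and names are invented for illustration:

dictionary = ["the", "and", "compression", "data"]     # fixed, known to both sides
index = {s: i for i, s in enumerate(dictionary)}

def encode(words):
    # Replace each word found in the dictionary with a reference
    # (its position); pass unknown words through as literals.
    return [("ref", index[w]) if w in index else ("lit", w) for w in words]

def decode(tokens):
    return [dictionary[v] if kind == "ref" else v for kind, v in tokens]

msg = "the data compression unit".split()
assert decode(encode(msg)) == msg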
Lempel-Ziv-Welch (LZW)
Created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm, published by Lempel and Ziv in 1978.
A universal, adaptive¹ lossless data compression algorithm.
Builds a translation table (also called a dictionary) from the text being compressed.
The string translation table maps the message strings to fixed-length codes.
¹ Adaptive: the coding scheme used for the kth character of a message is based on the characteristics of the preceding k − 1 characters in the message.
Lempel-Ziv-Welch (LZW) Compression Algorithm
As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place.
Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on the fly.
LZW is the foremost technique for general-purpose data compression due to its simplicity and versatility.
It is the basis of many PC utilities that claim to "double the capacity of your hard drive".
LZW compression uses a code table, with 4096 entries (i.e., 12-bit codes) being a common choice.
LZW (cont'd)
Codes 0-255 in the code table are always assigned to represent single bytes
from the input file.
When encoding begins, the code table contains only the first 256 entries, with the remainder of the table being blanks.
Decoding is achieved by taking each code from the compressed file, and
translating it through the code table to find what character or characters it
represents.
LZW Encoding Algorithm
initialize table with single-character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output code for P
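The pseudocode above translates directly into Python; this is a sketch (the function name and in-memory table are my own choices), and it reproduces the worked example that follows:

def lzw_encode(data):
    table = {chr(i): i for i in range(256)}   # single-character strings
    next_code = 256
    p = data[0]                               # P = first input character
    out = []
    for c in data[1:]:                        # C = next input character
        if p + c in table:
            p = p + c
        else:
            out.append(table[p])              # output the code for P
            table[p + c] = next_code          # add P + C to the string table
            next_code += 1
            p = c
    out.append(table[p])                      # output code for P
    return out

print(lzw_encode("BABAABAAA"))                # [66, 65, 256, 257, 65, 260]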
Example 1: Compression using LZW
BABAABAAA
Example 1: LZW Compression Step 1
BABAABAAA    P = A, C = empty

ENCODER OUTPUT                STRING TABLE
output code    representing   codeword    string
66             B              256         BA
Example 1: LZW Compression Step 2
BABAABAAA    P = B, C = empty

ENCODER OUTPUT                STRING TABLE
output code    representing   codeword    string
66             B              256         BA
65             A              257         AB
Example 1: LZW Compression Step 3
BABAABAAA    P = A, C = empty

ENCODER OUTPUT                STRING TABLE
output code    representing   codeword    string
66             B              256         BA
65             A              257         AB
256            BA             258         BAA
Example 1: LZW Compression Step 4
BABAABAAA    P = A, C = empty

ENCODER OUTPUT                STRING TABLE
output code    representing   codeword    string
66             B              256         BA
65             A              257         AB
256            BA             258         BAA
257            AB             259         ABA
Example 1: LZW Compression Step 5
BABAABAAA    P = A, C = A

ENCODER OUTPUT                STRING TABLE
output code    representing   codeword    string
66             B              256         BA
65             A              257         AB
256            BA             258         BAA
257            AB             259         ABA
65             A              260         AA
Example 1: LZW Compression Step 6
BABAABAAA    P = AA, C = empty

ENCODER OUTPUT                STRING TABLE
output code    representing   codeword    string
66             B              256         BA
65             A              257         AB
256            BA             258         BAA
257            AB             259         ABA
65             A              260         AA
260            AA

The encoder has reached the end of input, so it outputs the code for P. The final encoded stream is <66><65><256><257><65><260>: six codes for the nine input characters.
LZW Decompression
The LZW decompressor creates the same string table during decompression.
It starts with the first 256 table entries initialized to single characters.
The string table is updated for each character in the input stream, except the
first one.
Decoding is achieved by reading codes and translating them through the code table as it is being built.
LZW Decompression Algorithm
initialize table with single-character strings
OLD = first input code
output translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add translation of OLD + C to the string table
    OLD = NEW
END WHILE
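Again as a sketch, here is the decompression pseudocode in Python (names are my own); note how the special case rebuilds the one entry the encoder knew but the decoder has not added yet:

def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}   # single-character strings
    next_code = 256
    old = codes[0]
    out = [table[old]]                        # output translation of OLD
    for new in codes[1:]:
        if new in table:
            s = table[new]
        else:
            # NEW is not in the table yet: its translation is OLD's
            # translation plus that translation's first character.
            s = table[old] + table[old][0]
        out.append(s)                         # output S
        c = s[0]                              # C = first character of S
        table[next_code] = table[old] + c     # add translation of OLD + C
        next_code += 1
        old = new                             # OLD = NEW
    return "".join(out)

print(lzw_decode([66, 65, 256, 257, 65, 260]))   # BABAABAAA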
Example 2: Decompression using LZW
<66><65><256><257><65><260>
Example 2: LZW Decompression Step 1
<66><65><256><257><65><260>    OLD = 66, NEW = 65, S = A, C = A

DECODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
Example 2: LZW Decompression Step 2
<66><65><256><257><65><260>    OLD = 65, NEW = 256, S = BA, C = B

DECODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
BA                257         AB
Example 2: LZW Decompression Step 3
<66><65><256><257><65><260>    OLD = 256, NEW = 257, S = AB, C = A

DECODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
BA                257         AB
AB                258         BAA
Example 2: LZW Decompression Step 4
<66><65><256><257><65><260>    OLD = 257, NEW = 65, S = A, C = A

DECODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
BA                257         AB
AB                258         BAA
A                 259         ABA
Example 2: LZW Decompression Step 5
<66><65><256><257><65><260>    OLD = 65, NEW = 260, S = AA, C = A
(260 is not yet in the decoder's table, so the special case of the algorithm applies.)

DECODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
BA                257         AB
AB                258         BAA
A                 259         ABA
AA                260         AA
LZW: Some Notes
This algorithm compresses repetitive sequences of data well.
Since the codewords are 12 bits, any single encoded character will expand the data size rather than reduce it.
In this example, the 72-bit input (9 characters × 8 bits) is represented with 72 bits of output (6 codes × 12 bits), so nothing is gained yet. After a reasonable string table is built, compression improves dramatically.
LZW: Limitations
What happens when the dictionary gets too large (i.e., when all the 4096 locations have been
used)?
Here are some options usually implemented:
Simply forget about adding any more entries and use the table as is.
Some clever schemes rebuild a string table from the last N input characters.
Lossless Image Compression: Multi-resolution Approaches
Image compression is a type of data compression applied to digital images, to reduce their cost of storage or transmission.
Image compression may be lossy or lossless. Lossless compression is preferred for archival purposes and often for medical imaging, technical drawings, clip art, or comics.
Methods for lossless compression:
Run-length encoding – used as the default method in PCX and as one of the possible methods in BMP, TGA, and TIFF (a minimal sketch follows this list)
Area image compression
Predictive coding – used in DPCM
Entropy encoding – the two most common entropy encoding techniques are arithmetic coding and Huffman coding
Adaptive dictionary algorithms such as LZW – used in GIF and TIFF
DEFLATE – used in PNG, MNG, and TIFF
Chain codes
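As promised above, a minimal run-length encoding sketch in Python (illustrative only; real formats such as PCX use their own byte-level encodings):

def rle_encode(data):
    out = []                                  # list of (symbol, run length)
    for ch in data:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)    # extend the current run
        else:
            out.append((ch, 1))               # start a new run
    return out

print(rle_encode("AAAABBBCC"))                # [('A', 4), ('B', 3), ('C', 2)]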
Context-Based Compression: Dynamic Markov Compression
Dynamic Markov Compression (DMC) is an adaptive lossless data compression algorithm developed by Gordon Cormack and Nigel Horspool (1987).
It is based on modelling the binary source to be encoded by means of a Markov chain, which describes the transition probabilities between the symbol "0" and the symbol "1".
The model built this way is used to predict the next bit of the message; the predicted bit is then coded using arithmetic coding.
DMC uses predictive arithmetic coding similar to prediction by partial matching (PPM), except that the input is predicted one bit at a time rather than one byte at a time.
DMC has a good compression ratio and moderate speed, similar to PPM, but requires somewhat more memory and is not widely implemented.
Each circle represents a state, and each arrow represents a transition. In this example, we have two states, rainy and sunny, a simple model of the weather. Each state has two possible transitions: it can transition to itself again, or it can transition to the other state. The likelihood of each transition is defined by a percentage representing the probability that the transition occurs.
Now let's say it's sunny and we're following this model. According to the model, there's a 50% chance it's sunny again tomorrow and a 50% chance it's rainy tomorrow. If it becomes rainy, then there's a 25% chance it's rainy the day after that and a 75% chance it's sunny the day after that.
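The two-state chain just described can be written down directly; here is a small Python sketch (the state names and data structure are my own, the probabilities come from the example above):

transitions = {
    "sunny": {"sunny": 0.50, "rainy": 0.50},
    "rainy": {"rainy": 0.25, "sunny": 0.75},
}

state = "sunny"
print(transitions[state])     # tomorrow: 50% sunny, 50% rainy
print(transitions["rainy"])   # after a rainy day: 25% rainy, 75% sunny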