
Information & Coding Theory (ICT)

Module 1

By
Dr Akriti Nigam
Computer Science & Engineering Department
BIT, Mesra
Compression
Definition
Reduce size of data
(number of bits needed to represent data)
Benefits
Reduce storage needed
Reduce transmission cost / latency / bandwidth
Sources of Compressibility
Redundancy
Recognize repeating patterns
Exploit using
Dictionary
Variable length encoding
Human perception
Less sensitive to some information
Can discard less important data
Types of Compression
Lossless
Preserves all information
Exploits redundancy in data
Applied to general data
Lossy
May lose some information
Exploits redundancy & human perception
Applied to audio, image, video
Effectiveness of Compression
Metrics
Bits per byte (8 bits)
2 bits / byte → ¼ original size
8 bits / byte → no compression
Percentage
75% compression → ¼ original size
Effectiveness of Compression
Depends on data
Random data → hard
Example: 1001110100 → ?
Organized data → easy
Example: 1111111111 → 1×10 (the digit 1 repeated ten times)
Corollary
No universally best compression algorithm
1. Huffman Coding
Approach
Variable length encoding of symbols
Exploit statistical frequency of symbols
Efficient when symbol probabilities vary widely
Principle
Use fewer bits to represent frequent symbols
Use more bits to represent infrequent symbols

Example sequence: A A B A (A is frequent, B is rare)
Huffman Coding Example

Symbol              A        B        C        D
Frequency           13%      25%      50%      12%
Original encoding   00       01       10       11
                    (2 bits) (2 bits) (2 bits) (2 bits)
Huffman encoding    110      10       0        111
                    (3 bits) (2 bits) (1 bit)  (3 bits)

Expected size
Original → 1/8 × 2 + 1/4 × 2 + 1/2 × 2 + 1/8 × 2 = 2 bits / symbol
Huffman → 1/8 × 3 + 1/4 × 2 + 1/2 × 1 + 1/8 × 3 = 1.75 bits / symbol
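
A quick check of this calculation, as a small Python sketch (the probabilities are the table's frequencies written as exact fractions):

# expected bits/symbol = sum over symbols of probability x code length
probs           = {'A': 1/8, 'B': 1/4, 'C': 1/2, 'D': 1/8}
original_length = {'A': 2, 'B': 2, 'C': 2, 'D': 2}
huffman_length  = {'A': 3, 'B': 2, 'C': 1, 'D': 3}

expected = lambda length: sum(probs[s] * length[s] for s in probs)
print(expected(original_length), expected(huffman_length))   # 2.0 1.75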
Huffman Coding Data Structures
Binary (Huffman) tree
Represents Huffman code
Edge → code (0 or 1)
Leaf → symbol
Path to leaf → encoding
Example: A = "110", B = "10", C = "0"
Priority queue
To efficiently build binary tree
Huffman Coding Algorithm Overview
Encoding
Calculate frequency of symbols in file
Create binary tree representing “best” encoding
Use binary tree to encode compressed file
For each symbol, output path from root to leaf
Size of encoding = length of path
Save binary tree
Huffman Coding – Creating Tree
Algorithm
Place each symbol in leaf
Weight of leaf = symbol frequency
Select two trees L and R (initially leaves)
Such that L, R have lowest frequencies in tree
Create new (internal) node
Left child → L
Right child → R
New frequency → frequency( L ) + frequency( R )
Repeat until all nodes merged into one tree
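
A minimal sketch of this greedy construction using a priority queue (Python's heapq). The insertion counter is only a tie-breaker so that heap entries compare cleanly, and the exact 0/1 edge labels may come out different from the figures below, although the code lengths are the same:

import heapq

def build_huffman_code(freq):
    """freq: {symbol: frequency} -> {symbol: bit string}."""
    # Heap entries are (weight, tie_breaker, tree); a tree is a symbol or a (left, right) pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w_l, _, left = heapq.heappop(heap)       # the two lowest-frequency trees
        w_r, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w_l + w_r, counter, (left, right)))
        counter += 1
    codes = {}
    def assign(tree, path):
        if isinstance(tree, tuple):              # internal node: recurse into both children
            assign(tree[0], path + "0")
            assign(tree[1], path + "1")
        else:                                    # leaf: the path from the root is the code
            codes[tree] = path or "0"
    assign(heap[0][2], "")
    return codes

# Frequencies from the construction example on the next slides:
print(build_huffman_code({'A': 3, 'C': 5, 'E': 8, 'H': 2, 'I': 7}))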
Huffman Tree Construction (Steps 1–5)
Symbols and frequencies: A = 3, C = 5, E = 8, H = 2, I = 7
Step 1: merge the two lowest-frequency leaves, H (2) and A (3), into a node of weight 5
Step 2: merge C (5) with that node (5) into a node of weight 10
Step 3: merge I (7) and E (8) into a node of weight 15
Step 4: merge the two remaining trees (10 and 15) into the root of weight 25
Resulting codes: E = 01, I = 00, C = 10, A = 111, H = 110
Huffman Coding Example
Huffman code: E = 01, I = 00, C = 10, A = 111, H = 110
Input: ACE
Output: (111)(10)(01) = 1111001
Huffman Coding Algorithm Overview
Decoding
Read compressed file & binary tree
Use binary tree to decode file
Follow path from root to leaf
Huffman Decoding Example
Input: 1111001 (decoded with the tree built above)
Bits 1, 1, 1 lead from the root to leaf A; restart at the root
Bits 1, 0 lead to leaf C; restart at the root
Bits 0, 1 lead to leaf E
Output: ACE
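
A small sketch of table-driven decoding in Python, using the code table from the example above; it relies on the prefix property discussed on the next slide:

def huffman_decode(bits, codes):
    """bits: string of '0'/'1'; codes: {symbol: code}."""
    lookup = {code: sym for sym, code in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:        # prefix property: a complete code can be emitted at once
            decoded.append(lookup[current])
            current = ""
    return "".join(decoded)

codes = {'E': '01', 'I': '00', 'C': '10', 'A': '111', 'H': '110'}
print(huffman_decode("1111001", codes))   # -> ACE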
Huffman Code Properties
Prefix code
No code is a prefix of another code
Example
Huffman("I") → 00
Huffman("X") → 001 // not legal prefix code
Can stop as soon as complete code found
No need for end-of-code marker
Huffman Code Properties
Greedy algorithm
Chooses best local solution at each step
Combines 2 trees with lowest frequency
Still yields overall best solution
Optimal prefix code
Based on statistical frequency
2. Shannon Fano Coding
Working of Shannon Fano Coding Algorithm
1. Calculate the number of times each symbol appears, then find the probability of each symbol by dividing its count by the total number of symbols.
2. Sort the symbols in decreasing order of their probability.
3. Divide the symbols into two subparts, with the sum of probabilities in each part being as close to each other as possible.
4. Assign the value '0' to the first subpart and '1' to the second subpart.
5. Repeat steps 3 and 4 for each subpart until each subpart contains a single symbol (see the sketch below).
Ensure the probabilities are accurate; otherwise, the resulting binary code may
not be optimal for compression.
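
A recursive Python sketch of these steps (the names shannon_fano and split are just for illustration; the split point is chosen greedily as the point where the two probability sums are closest):

def shannon_fano(probs):
    """probs: {symbol: probability} -> {symbol: bit string}."""
    symbols = sorted(probs, key=probs.get, reverse=True)     # step 2: decreasing probability
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) <= 1:                                  # step 5: stop at single symbols
            return
        total, running = sum(probs[s] for s in group), 0.0
        best_i, best_diff = 1, float("inf")
        for i in range(1, len(group)):                       # step 3: balance the two parts
            running += probs[group[i - 1]]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_i, best_diff = i, diff
        for s in group[:best_i]:                             # step 4: '0' to the first part
            codes[s] += "0"
        for s in group[best_i:]:                             # ... and '1' to the second part
            codes[s] += "1"
        split(group[:best_i])
        split(group[best_i:])

    split(symbols)
    return codes

print(shannon_fano({'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}))   # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}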
3. Arithmetic Coding

Unlike the variable-length codes described previously, arithmetic coding generates non-block codes.
In arithmetic coding, a one-to-one correspondence between source symbols and
code words does not exist. Instead, an entire sequence of source symbols (or
message) is assigned a single arithmetic code word.
The code word itself defines an interval of real numbers between 0 and 1.
As the number of symbols in the message increases, the interval used to
represent it becomes smaller and the number of information units (say, bits)
required to represent the interval becomes larger.
Each symbol of the message reduces the size of the interval in accordance with
its probability of occurrence.
3. Arithmetic Coding

Here, a five-symbol sequence or message, a1a2a3a3a4, from a four-symbol source is coded.


At the start of the coding process, the message is assumed to occupy the entire half-open interval
[0, 1).
This interval is initially subdivided into four regions based on the probabilities of each source
symbol.
Symbol a1, for example, is associated with subinterval [0, 0.2). Because it is the first symbol of the
message being coded, the message interval is initially narrowed to [0, 0.2).
Thus [0, 0.2) is expanded to the full height of the figure and its end points labeled by the values of
the narrowed range.
The narrowed range is then subdivided in accordance with the original source symbol probabilities
and the process continues with the next message symbol.
3. Arithmetic Coding
In this manner, symbol a2 narrows the subinterval to [0.04, 0.08), a3 further narrows it to [0.056,
0.072), and so on.
The final message symbol, which must be reserved as a special end-of-message indicator,
narrows the range to [0.06752, 0.0688).
Of course, any number within this subinterval—for example, 0.068—can be used to represent the
message.
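
A Python sketch of this interval narrowing for the example above. The symbol subintervals (a1: [0, 0.2), a2: [0.2, 0.4), a3: [0.4, 0.8), a4: [0.8, 1.0)) are inferred from the ranges quoted in the text; a practical coder would use integer arithmetic and emit bits incrementally rather than return floats:

# Cumulative probability subintervals for the four source symbols.
intervals = {'a1': (0.0, 0.2), 'a2': (0.2, 0.4), 'a3': (0.4, 0.8), 'a4': (0.8, 1.0)}

def arithmetic_encode(message):
    low, high = 0.0, 1.0                         # the whole half-open interval [0, 1)
    for sym in message:
        width = high - low
        sym_low, sym_high = intervals[sym]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high                             # any number in [low, high) represents the message

print(arithmetic_encode(['a1', 'a2', 'a3', 'a3', 'a4']))   # approximately (0.06752, 0.0688)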
In practice, two factors cause coding performance to fall short of the bound: (1) the addition of
the end-of-message indicator that is needed to separate one message from another; and (2) the
use of finite precision arithmetic.
Practical implementations of arithmetic coding address the latter problem by introducing a scaling
strategy and a rounding strategy (Langdon and Rissanen [1981]).
The scaling strategy renormalizes each subinterval to the [0, 1) range before subdividing it in
accordance with the symbol probabilities.
The rounding strategy guarantees that the truncations associated with finite precision arithmetic
do not prevent the coding subintervals from being represented accurately.
4. Lempel Ziv Welch encoding

Lempel Ziv Welch (LZW) encoding is an example of a category of algorithms called dictionary-
based encoding.
The idea is to create a dictionary (a table) of strings used during the communication session.
If both the sender and the receiver have a copy of the dictionary, then previously-encountered
strings can be substituted by their index in the dictionary to reduce the amount of information
transmitted.
Compression

In this phase there are two concurrent events: building an indexed dictionary and
compressing a string of symbols.
The algorithm extracts the smallest substring that cannot be found in the dictionary
from the remaining uncompressed string.
It then stores a copy of this substring in the dictionary as a new entry and assigns it an
index value.
Compression occurs when the substring, except for the last character, is replaced with
the index found in the dictionary.
The process then inserts the index and the last character of the substring into the
compressed string.
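
A minimal sketch of this dictionary-building loop in Python, assuming 8-bit source symbols so the dictionary starts with the single-symbol entries 0-255 (as in the image example that follows):

def lzw_compress(symbols):
    """symbols: sequence of values 0-255 -> list of code words."""
    dictionary = {(i,): i for i in range(256)}   # initial single-symbol entries
    next_code = 256
    output, current = [], ()
    for s in symbols:
        candidate = current + (s,)
        if candidate in dictionary:
            current = candidate                  # keep growing the recognized sequence
        else:
            output.append(dictionary[current])   # emit the code of the recognized sequence
            dictionary[candidate] = next_code    # new entry: recognized sequence + current symbol
            next_code += 1
            current = (s,)                       # restart from the current symbol
    if current:
        output.append(dictionary[current])       # flush the final recognized sequence
    return output

pixels = [39, 39, 126, 126] * 4                  # the 4 x 4 vertical-edge image, row by row
print(lzw_compress(pixels))                      # [39, 39, 126, 126, 256, 258, 260, 259, 257, 126]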
Example: Consider the following 4 × 4, 8-bit image of a vertical edge:

39 39 126 126
39 39 126 126
39 39 126 126
39 39 126 126

The image is encoded by processing its pixels in a left-to-right, top-to-bottom manner.
Currently Recognized Sequence | Pixel Being Processed | Encoded Output | Dictionary Location (Code Word) | Dictionary Entry
              | 39  |     |     |
39            | 39  | 39  | 256 | 39-39
39            | 126 | 39  | 257 | 39-126
126           | 126 | 126 | 258 | 126-126
126           | 39  | 126 | 259 | 126-39
39            | 39  |     |     |
39-39         | 126 | 256 | 260 | 39-39-126
126           | 126 |     |     |
126-126       | 39  | 258 | 261 | 126-126-39
39            | 39  |     |     |
39-39         | 126 |     |     |
39-39-126     | 126 | 260 | 262 | 39-39-126-126
126           | 39  |     |     |
126-39        | 39  | 259 | 263 | 126-39-39
39            | 126 |     |     |
39-126        | 126 | 257 | 264 | 39-126-126
126           | (end of image) | 126 |     |
5. Run Length Coding
• Run Length Encoding is a lossless data compression algorithm. It compresses data by reducing
repetitive, and consecutive data called runs. It does so by storing the number of these runs followed
by the data.
• Before we understand RLE, let’s have a look at few examples:
• For the text AAAAAAAAAAAAAHHHEEM (19 characters), RLE will encode it to 13A3H2EM (8 characters).
• For the text AAAAHHHEEM, HAHA. (17 characters), it will be encoded as 4A3H2E1M1,1 1H1A1H1A1. (22 characters), which is longer than the original.
• From these examples, we see that RLE is suited to data made up of a small number of long runs, e.g., image pixel information.
• RLE is suited for compressing any type of data regardless of its information content, but the content of
the data will affect the compression ratio achieved by RLE.
5. Run Length Coding
• Although most RLE algorithms cannot achieve the high compression ratios of the more advanced
compression methods, RLE is both easy to implement and quick to execute, making it a good
alternative to either using a complex compression algorithm or leaving your image data
uncompressed.
• RLE works by reducing the physical size of a repeating string of characters. This repeating string,
called a run, is typically encoded into two bytes.
• The first byte represents the number of characters in the run and is called the run count.
• In practice, an encoded run may contain 1 to 128 or 256 characters; the run count usually
stores the number of characters minus one (a value in the range of 0 to 127 or 255).
• The second byte is the value of the character in the run, which is in the range of 0 to 255, and is
called the run value.
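
A small Python sketch of this two-byte scheme, assuming runs are capped at 256 characters so that the count-minus-one always fits in a single byte:

def rle_encode(data: bytes) -> bytes:
    """Encode each run as two bytes: (run count - 1, run value)."""
    out, i = bytearray(), 0
    while i < len(data):
        run_value, run_len = data[i], 1
        # Extend the run, but never beyond 256 characters (the count byte holds 0-255).
        while i + run_len < len(data) and data[i + run_len] == run_value and run_len < 256:
            run_len += 1
        out.append(run_len - 1)                  # run count byte: number of characters minus one
        out.append(run_value)                    # run value byte
        i += run_len
    return bytes(out)

print(rle_encode(b"AAAAAAAAAAAAAHHHEEM"))        # b'\x0cA\x02H\x01E\x00M'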
Run-length encoding variants
