Module 1
By
Dr Akriti Nigam
Computer Science & Engineering Department
BIT, Mesra
Compression
Definition
Reduce size of data
(number of bits needed to represent data)
Benefits
Reduce storage needed
Reduce transmission cost / latency / bandwidth
Sources of Compressibility
Redundancy
Recognize repeating patterns
Exploit using
Dictionary
Variable length encoding
Human perception
Less sensitive to some information
Can discard less important data
Types of Compression
Lossless
Preserves all information
Exploits redundancy in data
Applied to general data
Lossy
May lose some information
Exploits redundancy & human perception
Applied to audio, image, video
Effectiveness of Compression
Metrics
Bits per byte (8 bits)
2 bits / byte → ¼ original size
8 bits / byte → no compression
Percentage
75% compression → ¼ original size
Effectiveness of Compression
Depends on data
Random data → hard
Example: 1001110100 → ?
Organized data → easy
Example: 1111111111 → 1×10
Corollary
No universally best compression algorithm
1. Huffman Coding
Approach
Variable length encoding of symbols
Exploit statistical frequency of symbols
Efficient when symbol probabilities vary widely
Principle
Use fewer bits to represent frequent symbols
Use more bits to represent infrequent symbols
Example: in the sequence A A B A, the frequent symbol A receives a shorter code than the rare symbol B
Huffman Coding Example
Symbol             A       B       C       D
Frequency          13%     25%     50%     12%
Original encoding  00      01      10      11
                   2 bits  2 bits  2 bits  2 bits
Huffman encoding   110     10      0       111
                   3 bits  2 bits  1 bit   3 bits
Expected size
Original: 1/8×2 + 1/4×2 + 1/2×2 + 1/8×2 = 2 bits / symbol
Huffman: 1/8×3 + 1/4×2 + 1/2×1 + 1/8×3 = 1.75 bits / symbol
Huffman Coding Data Structures
Binary (Huffman) tree
Represents Huffman code
Edge → code (0 or 1)
Leaf → symbol
Path to leaf → encoding
Example
A = “110”, B = “10”, C = “0”
(figure: Huffman tree with edges labelled 1/0 and leaves A, B, C, D)
Priority queue
To efficiently build binary tree
Huffman Coding Algorithm Overview
Encoding
Calculate frequency of symbols in file
Create binary tree representing “best” encoding
Use binary tree to encode compressed file
For each symbol, output path from root to leaf
Size of encoding = length of path
Save binary tree
Huffman Coding – Creating Tree
Algorithm
Place each symbol in leaf
Weight of leaf = symbol frequency
Select two trees L and R (initially leaves)
Such that L, R have lowest frequencies in tree
Create new (internal) node
Left child → L
Right child → R
New frequency = frequency( L ) + frequency( R )
Repeat until all nodes merged into one tree
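The merging procedure above can be sketched in Python using a priority queue (`heapq`). The left/right edge labels are an arbitrary choice here, so the exact bit patterns may differ from the slides' codes, but the code lengths come out the same:

```python
import heapq
from collections import Counter

def build_huffman_codes(text):
    """Build a Huffman code table for the symbols in `text`."""
    freq = Counter(text)
    # Heap entries are (frequency, tie-breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node). The integer
    # tie-breaker keeps heapq from ever comparing two trees directly.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two lowest-frequency trees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):          # internal node: recurse down
            walk(tree[0], path + "0")
            walk(tree[1], path + "1")
        else:                                # leaf: path from root is the code
            codes[tree] = path or "0"
    walk(heap[0][2], "")
    return codes
```

With the frequencies used in the construction example below (A:3, C:5, E:8, H:2, I:7), the resulting code lengths are 2 bits for C, E, I and 3 bits for A, H.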
Huffman Tree Construction
Start with one leaf per symbol (symbol : frequency):
A : 3, C : 5, E : 8, H : 2, I : 7
Step 1: merge the two lowest-frequency trees, H (2) and A (3), into a node of weight 5
Step 2: merge that node (5) with C (5) into a node of weight 10
Step 3: merge I (7) and E (8) into a node of weight 15
Step 4: merge the nodes of weight 10 and 15 into the root (weight 25)
Reading edge labels from root to leaf gives the codes:
E = 01, I = 00, C = 10, A = 111, H = 110
Huffman Coding Example
Huffman code: E = 01, I = 00, C = 10, A = 111, H = 110
Input: ACE
Output: (111)(10)(01) = 1111001
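The encoding step is just a lookup and concatenation; a minimal sketch, assuming the code table from the example above:

```python
def huffman_encode(text, codes):
    """Concatenate the code for each input symbol."""
    return "".join(codes[ch] for ch in text)

# Code table from the slides' construction example.
codes = {"E": "01", "I": "00", "C": "10", "A": "111", "H": "110"}
print(huffman_encode("ACE", codes))  # → 1111001
```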
Huffman Coding Algorithm Overview
Decoding
Read compressed file & binary tree
Use binary tree to decode file
Follow path from root to leaf
Huffman Decoding
Decode 1111001 by repeatedly following the path from the root:
1, 1, 1 → leaf A (output: A)
1, 0 → leaf C (output: AC)
0, 1 → leaf E (output: ACE)
Decoded output: ACE
Huffman Code Properties
Prefix code
No code is a prefix of another code
Example
Huffman(“I”) = 00
Huffman(“X”) = 001 // not a legal prefix code: “00” is a prefix of “001”
Can stop as soon as complete code found
No need for end-of-code marker
Huffman Code Properties
Greedy algorithm
Chooses best local solution at each step
Combines 2 trees with lowest frequency
Still yields overall best solution
Optimal prefix code
Based on statistical frequency
2. Shannon Fano Coding
Working of Shannon Fano Coding Algorithm
1. Calculate the number of times each symbol appears and then find out the
probability of each symbol by dividing it by the total number of symbols.
2. Now, Sort the symbols in decreasing order of their probability.
3. Divide the symbols into two subparts, with the sum of probabilities in each part
being as close to each other as possible.
4. Assign the value '0' to the first subpart and '1' to the second subpart.
5. Repeat steps 3 and 4 for each subpart until each symbol is in a subpart of its own.
Ensure the probabilities are accurate; otherwise, the resulting binary code may
not be optimal for compression.
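The five steps above can be sketched in Python as follows; the symbol probabilities are assumed to be given, and ties in the split point are broken toward the first position whose halves are closest in probability:

```python
def shannon_fano(symbols):
    """Shannon-Fano coding. `symbols` is a list of (symbol, probability)
    pairs; returns a {symbol: code} dict built by the split procedure."""
    table = sorted(symbols, key=lambda sp: -sp[1])   # step 2: sort by falling probability
    codes = {}

    def split(group, prefix):
        if len(group) == 1:                          # step 5: one symbol left, done
            codes[group[0][0]] = prefix or "0"
            return
        # Step 3: find the split point where the two subparts'
        # probability sums are as close to each other as possible.
        total = sum(p for _, p in group)
        running, best_i, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(2 * running - total)
            if diff < best_diff:
                best_diff, best_i = diff, i
        split(group[:best_i], prefix + "0")          # step 4: first subpart gets '0'
        split(group[best_i:], prefix + "1")          # second subpart gets '1'

    split(table, "")
    return codes

print(shannon_fano([("A", 0.5), ("B", 0.25), ("C", 0.125), ("D", 0.125)]))
# → {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```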
4. Lempel Ziv Welch (LZW) Coding
Lempel Ziv Welch (LZW) encoding is an example of a category of algorithms called dictionary-based encoding.
The idea is to create a dictionary (a table) of strings used during the communication session.
If both the sender and the receiver have a copy of the dictionary, then previously-encountered
strings can be substituted by their index in the dictionary to reduce the amount of information
transmitted.
Compression
In this phase there are two concurrent events: building an indexed dictionary and
compressing a string of symbols.
The algorithm extracts the smallest substring that cannot be found in the dictionary
from the remaining uncompressed string.
It then stores a copy of this substring in the dictionary as a new entry and assigns it an
index value.
Compression occurs when the substring, except for the last character, is replaced with
the index found in the dictionary.
The process then inserts the index and the last character of the substring into the
compressed string.
Example: Consider the following 4 × 4, 8-bit image of a vertical edge:
39 39 126 126
39 39 126 126
39 39 126 126
39 39 126 126
The encoder scans the pixels row by row (39 39 126 126 39 39 126 126 ...), building the dictionary as it goes:

Currently recognized   Pixel   Encoded output   Dictionary entry
39                     39      39               256 : 39-39
39                     126     39               257 : 39-126
126                    126     126              258 : 126-126
126                    39      126              259 : 126-39
39-39                  126     256              260 : 39-39-126
126-126                39      258              261 : 126-126-39
39-39-126              126     260              262 : 39-39-126-126
126-39                 39      259              263 : 126-39-39
39-126                 126     257              264 : 39-126-126
126                    (end)   126

Encoded output: 39 39 126 126 256 258 260 259 257 126 (10 codes for 16 pixels)
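A minimal sketch of the compression phase described above, for sequences of 8-bit values; the dictionary is initialised with the 256 single values, and new strings receive codes from 256 upward:

```python
def lzw_encode(pixels):
    """LZW-encode a sequence of byte values (0-255)."""
    dictionary = {(v,): v for v in range(256)}   # all single values pre-loaded
    next_code = 256
    current = ()                                 # currently recognized string
    out = []
    for v in pixels:
        candidate = current + (v,)
        if candidate in dictionary:
            current = candidate                  # keep growing the match
        else:
            out.append(dictionary[current])      # emit index of longest match
            dictionary[candidate] = next_code    # store new entry
            next_code += 1
            current = (v,)                       # restart from last character
    if current:
        out.append(dictionary[current])          # flush the final match
    return out

image = [39, 39, 126, 126] * 4                   # the 4 × 4 edge image, row by row
print(lzw_encode(image))
# → [39, 39, 126, 126, 256, 258, 260, 259, 257, 126]
```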
5. Run Length Coding
• Run Length Encoding is a lossless data compression algorithm. It compresses data by replacing
repetitive, consecutive runs of identical data with the length of each run followed by the data value.
• Before we understand RLE, let’s have a look at a few examples:
• For the text AAAAAAAAAAAAAHHHEEM (19 characters), RLE will encode it to 13A3H2E1M (9
characters).
• For the text AAAAHHHEEM, HAHA. (17 characters), it will be encoded as 4A3H2E1M1,1 1H1A1H1A1. (22 characters), which is longer than the original.
• From these examples, we see that RLE is suitable for compressing large amounts of data consisting of a few
long runs, e.g., image pixel information, and can actually expand data with many short runs.
• RLE is suited for compressing any type of data regardless of its information content, but the content of
the data will affect the compression ratio achieved by RLE.
• Although most RLE algorithms cannot achieve the high compression ratios of the more advanced
compression methods, RLE is both easy to implement and quick to execute, making it a good
alternative to either using a complex compression algorithm or leaving your image data
uncompressed.
• RLE works by reducing the physical size of a repeating string of characters. This repeating string,
called a run, is typically encoded into two bytes.
• The first byte represents the number of characters in the run and is called the run count.
• In practice, an encoded run may contain 1 to 128 or 256 characters; the run count usually
stores the number of characters minus one (a value in the range of 0 to 127 or 0 to 255).
• The second byte is the value of the character in the run, which is in the range of 0 to 255, and is
called the run value.
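The two-byte scheme described above can be sketched as follows; this sketch caps runs at 256 characters so the run count (stored as length minus one) fits in one byte:

```python
def rle_encode(data):
    """Encode bytes as (run_count - 1, run_value) byte pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the same value repeats, up to 256 characters.
        while i + run < len(data) and data[i + run] == data[i] and run < 256:
            run += 1
        out += bytes([run - 1, data[i]])   # run count byte, then run value byte
        i += run
    return bytes(out)

def rle_decode(data):
    """Invert rle_encode: repeat each run value (count + 1) times."""
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1]]) * (data[i] + 1)
    return bytes(out)

print(rle_encode(b"AAAB"))  # → b'\x02A\x00B'  (run of 3 'A', run of 1 'B')
```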
Run-length encoding variants