
DATA COMPRESSION TECHNIQUES
Chapter-02
What is compression?
What is data compression?

■ Data compression is the reduction in number of bits needed to represent data.


■ Compression refers to the ways in which the amount of data needed to store an
image or other file can be reduced. This helps to reduce the storage space needed
for files such as animations, and also the bandwidth needed to play them back.
■ The process of coding that will effectively reduce the total number of bits needed
to represent certain information.
Encoding and decoding

■ Encoding is the process of putting a sequence of characters (letters, numbers,
punctuation, and certain symbols) into a specialized format for efficient transmission or
storage.
– E.g. A → 000

■ Decoding is the opposite process – the conversion of an encoded format back into the
original sequence of characters.
– E.g. 000 → A

■ Encoding and decoding are used in data communications, networking, storage, security.

Note: Encoding is different from hashing and encryption, which are concerned with data
integrity and confidentiality rather than with reducing data size.


Why do we need Data compression?
■ Data compression becomes particularly important when we send large data such as
audio and video.
■ Even with very fast transmission speeds, data often has to be delivered in a short
time; we need to compress data for this purpose.
■ Virtually all forms of data contain redundancy, i.e. wasted "space" used to represent
or transmit the data.
■ By making use of more efficient data representation methods, this redundancy can be
reduced.
■ The goal of data compression is to represent an information source (e.g. a data
file, a speech signal, an image, or a video signal) as accurately as possible using
the fewest bits.
Why do we need Data compression?
– Reduce storage space
– Save transmission time
– Reduce computation
Compression ratio

Compression ratio = Uncompressed size / Compressed size

• Basically, the higher the compression ratio, the better.

• Thus a representation that compresses a 10 MB file to 2 MB has a compression ratio
of 10/2 = 5, often notated as an explicit ratio, 5:1 (read "five to one"), or as an
implicit ratio, 5/1.
Space savings
Space savings = 1 − (Compressed size / Uncompressed size)

Thus, a representation that compresses the storage size of a file from 10MB
to 2MB yields a space saving of 1 - 2/10 = 0.8, often notated as a
percentage, 80%.
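As a quick illustration, here is a minimal Python sketch of the two metrics above (the function names are ours; sizes are in MB, matching the examples):

```python
# Minimal sketch of the two metrics above (sizes in MB, as in the examples).
def compression_ratio(uncompressed_size, compressed_size):
    return uncompressed_size / compressed_size

def space_savings(uncompressed_size, compressed_size):
    return 1 - compressed_size / uncompressed_size

print(compression_ratio(10, 2))   # 5.0 -> "5:1"
print(space_savings(10, 2))       # 0.8 -> 80%
```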
Data Compression methods
Lossless compression
– Output data is exactly the same as the input data → no information is lost

Same as Original

– Essential for encoding computer processed data


– It uses algorithms to identify statistical redundancy in the data, and then re-encodes
the data so that the same information is delivered in a much smaller size.
lossy compression

– Output data is not the same as the input data → some information is lost
– Acceptable for data that is only viewed or heard (image, video, sound)

Not Same as
Original

• Lossy techniques usually achieve higher compression rates than lossless ones, but
the latter are more accurate.
lossy compression

■ The best example is a videoconference where there is an acceptable amount of
frame loss in order to deliver the image in real time. People may appear jerky in
their movements, but you still have a grasp of what is happening on the other
end of the conference.
■ In the case of graphics files, some resolution may be lost in order to create a
smaller file. The loss may be in the form of color depth or graphic detail. For
example, high-resolution details can be lost if a picture is going to be displayed
on a low-resolution device. Loss is also acceptable in voice and audio
compression, depending on the desired quality.
lossy compression
• Degree of loss is usually a parameter of the compression algorithm
• Tradeoff - loss versus compression
higher compression → more loss
lower compression → less loss
Lossless vs. Lossy
Questions:

1. What does Lossy Compression do to files?


a. Increases the file size and keeps the same quality
b. Eliminates no information at all
c. Decreases the file size and keeps the same quality
d. Eliminates unnecessary information in a file to reduce file size
2. Which of the following is not in a compressed format?
a. JPEG
b. MPEG
c. Bitmap
d. MP3
Lossless compression algorithms
■ Repetitive Sequence Suppression
■ Run-Length Encoding (RLE)
■ Pattern Substitution
■ Entropy Encoding:
– Shannon-Fano Algorithm
– Huffman Coding
– Arithmetic Coding
■ Lempel-Ziv-Welch (LZW) Algorithm
Simple Repetition Suppression
■ If a series of n successive identical tokens appears in a sequence, we can replace
them with a single token and a count of the number of occurrences.

■ We usually need a special flag to denote where the repeated token appears.

■ E.g.:
– 89400000000000000000000000000000000
– we can replace with 894f32, where f is the flag for zero.
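A minimal Python sketch of this idea (the function name and the trailing-run restriction are our own simplifications; a fuller implementation would handle runs anywhere in the sequence):

```python
# Suppress a trailing run of zeros: replace it with a flag ('f', as in the
# slide's example) followed by the run length.
def suppress_zero_run(s, flag="f"):
    stripped = s.rstrip("0")
    run = len(s) - len(stripped)
    return f"{stripped}{flag}{run}" if run else s

print(suppress_zero_run("894" + "0" * 32))   # -> '894f32'
```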
Run-length encoding
■ This encoding method is frequently applied to images (or pixels in a scan line). It is a small
compression component used in JPEG compression.

■ A sequence of image elements x1, x2, …, xn is mapped to pairs
(c1, l1), (c2, l2), …, (cn, ln), where ci represents an image intensity or color and li is
the length of the i-th run of pixels.
– Not dissimilar to the zero-run suppression above

■ E.g.: Original Sequence : 111122233333311112222


RLE : (1,4),(2,3),(3,6),(1,4),(2,4)

■ The savings depend on the data. In the worst case (e.g. random noise), the encoding
can actually be larger than the original.
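A minimal run-length encoder in Python (an illustrative sketch; the function name is ours), reproducing the example above:

```python
# Encode a sequence as (value, run-length) pairs.
def rle_encode(seq):
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1              # extend the current run
        else:
            runs.append([x, 1])           # start a new run
    return [(value, length) for value, length in runs]

print(rle_encode("111122233333311112222"))
# [('1', 4), ('2', 3), ('3', 6), ('1', 4), ('2', 4)]
```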
Run-length encoding for text data

■ Consider the text "aaaabbbbbbcddddd", which needs 16 bytes when each character is
stored in one byte. RLE can store the same data using fewer bytes.
■ RLE for this example: (a,4) (b,6) (c,1) (d,5)
■ If this text is compressed using RLE we end up with the byte sequence:

97 4 98 6 99 1 100 5

(the letters stored as their ASCII codes, each followed by its run length)

■ As we can see, this compressed version only requires 8 bytes - a reduction from the original 16
bytes (assuming each number is also represented using one byte).


Run-length encoding for text data

E.g.: Original Sequence : 1,2,3,1,1,2,5


RLE : (1,1),(2,1),(3,1),(1,2),(2,1),(5,1)

No of bytes for the original sequence = 7


No of bytes after compression = 12

Negative compression - compressed size is larger than the


uncompressed size
RLE for bitmapped image data

Bit Level RLE (for black and white images)

A binary image has only two pixel values, 0 and 1.


When compressed using bit-level RLE, the first six pixels are white, so the first byte of
the encoding has its left-most bit set to 1 (the flag for white) followed by the 7-bit binary
number representing 6, i.e. 10000110.

The next ten pixels are black, so the next byte of the encoding has its left-most bit set to
0 (the flag for black) followed by the 7-bit binary number representing 10, i.e. 00001010.
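A small Python sketch of this byte packing (run lengths assumed to fit in 7 bits; the function name is ours):

```python
# Pack (pixel_value, run_length) pairs into bytes:
# most significant bit = pixel value (1 = white, 0 = black), low 7 bits = run length.
def pack_runs(runs):
    return bytes((value << 7) | length for value, length in runs)

encoded = pack_runs([(1, 6), (0, 10)])          # 6 white pixels, then 10 black
print([format(b, "08b") for b in encoded])      # ['10000110', '00001010']
```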

Pattern Substitution
■ This is a simple form of statistical encoding.
■ Here we substitute frequently repeating patterns with shorter codes.
■ A simple pattern substitution scheme could employ predefined codes.
■ Example:
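Below is a minimal Python sketch of pattern substitution with a predefined code table (the patterns and codes are assumed for illustration, not taken from the original slide):

```python
# Substitute frequently occurring patterns with shorter predefined codes.
CODES = {"the": "#", "and": "&"}

def substitute(text, codes=CODES):
    for pattern, code in codes.items():
        text = text.replace(pattern, code)
    return text

print(substitute("the cat and the dog"))   # '# cat & # dog'
```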

Fixed length encoding

■ Σ = {A,B,C,D,E,F}
■ Number of bits required to represent each character = ⌈log2 |Σ|⌉ = ⌈log2 6⌉ = 3 bits
– E.g. A = 000, B = 001, …, F = 101
■ Each character is represented by a fixed number of bits.
■ Optimum when characters appear equally likely – at similar frequency
Variable length encoding
■ If character appearances are NOT equally probable:
– Use shorter descriptions for characters which appear frequently
– Use longer descriptions for characters which appear rarely

■ Example: A occurs more frequently than E


– Code for A = 00, and E = 1101
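A quick illustrative bit count for such a skewed source (the character counts below are assumed; the codeword lengths follow the example above):

```python
# Compare fixed-length (3 bits/character) with variable-length coding
# (2 bits for the frequent 'A', 4 bits for the rare 'E').
counts = {"A": 90, "E": 10}                        # assumed character counts
fixed = 3 * sum(counts.values())                   # 3 bits each -> 300 bits
variable = 2 * counts["A"] + 4 * counts["E"]       # 180 + 40   -> 220 bits
print(fixed, variable)                             # 300 220
```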

Entropy Encoding
■ A term used to denote compression algorithms which work based on the probability
distribution of the source symbols.

■ Entropy coding is a type of lossless coding to compress digital data by representing
– frequently occurring patterns with few bits and
– rarely occurring patterns with many bits.

Information
■ Information:
I(p) = −log_b(p)
– p – probability of the event happening
– b – base of the logarithm
■ Unit of information is determined by base
– 𝑏 = 2 → bits
– 𝑏 = 𝑒 → nats
– 𝑏 = 10 → Hartleys
■ Base 2 is mostly used in information theory

Information: Certain and uncertain events

Information = −log2(p) bits


■ Certain events (p=1)
– In this case, there is no surprise upon learning that the event
occurred → we receive no information from its occurrence
(since we knew the event would occur).
■ Uncertain events (p=1/2)
– In this case, we receive exactly 1 bit of information upon
learning the event occurred
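A quick numerical check of these two cases (and one more) in Python:

```python
# Information content I(p) = -log2(p), in bits.
from math import log2

for p in (1.0, 0.5, 0.25):
    print(p, -log2(p))   # 1.0 -> 0.0 bits, 0.5 -> 1.0 bit, 0.25 -> 2.0 bits
```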

The Shannon-Fano Algorithm
It is a variable length encoding scheme
A top-down approach

1. Sort the symbols according to the frequency count of their occurrences.
2. Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts contain
only one symbol.

An Example: coding of “HELLO”


Symbol H E L O
Count 1 1 2 1
Frequency count of the symbols in ”HELLO”.

The Shannon-Fano Algorithm
After sorting : LHEO

Coding Tree for HELLO by Shannon-Fano


The Shannon-Fano Algorithm

Table 7.1: Result of Performing Shannon-Fano on HELLO

Symbol Count Log2 (1/P) Code # of bits used

L 2 1.32 0 2
H 1 2.32 10 2
E 1 2.32 110 3
O 1 2.32 111 3
TOTAL # of bits: 10
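A minimal Python sketch of the algorithm (recursive top-down splitting; the split point is chosen so that the two halves' counts are as equal as possible, taking the earliest such point on ties). Under these assumptions it reproduces the codes of Table 7.1 for "HELLO":

```python
from collections import Counter

def shannon_fano(freqs):
    """freqs: dict symbol -> count. Returns dict symbol -> codeword."""
    symbols = sorted(freqs, key=lambda s: -freqs[s])   # sort by decreasing count
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(freqs[s] for s in group)
        best_i, best_diff, running = 1, float("inf"), 0
        for i in range(1, len(group)):                 # find the most balanced split
            running += freqs[group[i - 1]]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_i, best_diff = i, diff
        left, right = group[:best_i], group[best_i:]
        for s in left:
            codes[s] += "0"
        for s in right:
            codes[s] += "1"
        split(left)
        split(right)

    split(symbols)
    return codes

codes = shannon_fano(Counter("HELLO"))
print(codes)                                  # {'L': '0', 'H': '10', 'E': '110', 'O': '111'}
print(sum(len(codes[c]) for c in "HELLO"))    # 10 bits in total
```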

The Shannon-Fano Algorithm

Another coding tree for HELLO by Shannon-Fano.

The Shannon-Fano Algorithm

Another Result of Performing Shannon-Fano on HELLO

Symbol Count Log2 (1/P) Code # of bits used

L 2 1.32 00 4
H 1 2.32 01 2
E 1 2.32 10 2
O 1 2.32 11 2
TOTAL # of bits: 10

Exercise:
Consider a finite symbol stream:
ACABADADEAABBAAAEDCACDEAAABCDBBEDCBACAE

■ What is the compression ratio if you use the Shannon-Fano algorithm?


■ Find out the optimal number of bits (on average)/ minimum number of bits to
represent each character in this alphabet.

The Shannon-Fano Algorithm

Compression ratio = bits required before compression / bits required after compression

Compression ratio = 312 bits / 89 bits ≈ 3.5
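A quick check of these figures in Python (the codeword lengths used here, 2 bits for A, B, C and 3 bits for D, E, correspond to one possible Shannon-Fano code for this alphabet and are an assumption of this sketch):

```python
from collections import Counter

stream = "ACABADADEAABBAAAEDCACDEAAABCDBBEDCBACAE"
code_len = {"A": 2, "B": 2, "C": 2, "D": 3, "E": 3}     # assumed codeword lengths

before = 8 * len(stream)                                 # 8-bit characters: 312 bits
after = sum(code_len[ch] * n for ch, n in Counter(stream).items())   # 89 bits
print(before, after, round(before / after, 2))           # 312 89 3.51
```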

Prefix (free) codes
■ A prefix code is a code in which no codeword is a prefix of any other
codeword
– Prefix codes are uniquely decodable
– Prefix codes are instantaneously decodable
Examples:

• 00 is the code for A, and there is no other code which starts with 00.
• 10 is the code for C, and there is no other code which starts with 10.

Shannon-Fano - decoding
0 0 0 0 0 1 1 1 0

How to decode the above bit string? (Using the prefix code above: A = 00, B = 01, D = 110.)

• Start from the left and scan towards the right, emitting a symbol as soon as a
complete codeword has been read.
• Scan 0: 000001110 – no codeword is just 0
• Scan the next bit: 000001110 – 00 = A
• A 0001110 – no codeword for 0
• A 0001110 – 00 = A
• A A 01110 – no codeword for 0
• A A 01110 – 01 = B
• A A B 110 – 110 = D
• A A B D

0 0 0 0 0 1 1 1 0 = AABD
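The same decoding as a small Python sketch (the code table is the one used in the trace above; instantaneous decoding works because no codeword is a prefix of another):

```python
# Decode a bit string with a prefix code by matching codewords left to right.
CODE = {"00": "A", "01": "B", "10": "C", "110": "D"}

def decode(bits, code=CODE):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in code:          # a complete codeword has been read
            out.append(code[buf])
            buf = ""
    return "".join(out)

print(decode("000001110"))       # 'AABD'
```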
Shannon-Fano - Exercise
A source emits symbols 𝑋𝑖 , 1 ≤ 𝑖 ≤ 6, in the Binary Coded Decimal (BCD)
format with probabilities 𝑃(𝑋𝑖 ) as given in the Table.

1. Calculate the optimum number of bits needed to represent these symbols.


2. Apply Shannon-Fano coding to the source signal characterised in Table 1.
3. What is the original symbol sequence of the Shannon-Fano coded signal
110011110000110101100?
4. What compression factor has been achieved?

Shannon-Fano - Exercise
1. Calculate the optimum number of bits needed to represent these symbols.

2. Apply Shannon-Fano coding to the source signal.

Shannon-Fano - Exercise
3. Shannon-Fano encoded sequence:

4. What compression factor has been achieved?

Average number of bits per symbol for Shannon-Fano
= 0.4x1 + 0.3x2 + 0.15x3 + 0.1x4 + 0.03x5 + 0.02x5
= 2.1
Average number of bits per symbol for BCD
= (0.4 + 0.3 + 0.15 + 0.1 + 0.03 + 0.02)x3
= 3
Compression factor = 3 / 2.1 ≈ 1.43
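The same calculation as a short Python sketch (probabilities and codeword lengths taken from the worked answer above):

```python
probs   = [0.4, 0.3, 0.15, 0.1, 0.03, 0.02]
lengths = [1, 2, 3, 4, 5, 5]                    # Shannon-Fano codeword lengths

avg_sf  = sum(p * l for p, l in zip(probs, lengths))   # 2.1 bits/symbol
avg_bcd = 3                                            # fixed 3-bit BCD
print(round(avg_sf, 2), round(avg_bcd / avg_sf, 2))    # 2.1 1.43
```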

Huffman coding
• Can we do better than Shannon-Fano?
Huffman! It always produces the best binary coding tree for the given probabilities.
• A bottom-up approach.
• Of the two encoding methods, Huffman coding is the more efficient, and it is optimal,
unlike Shannon-Fano coding.
Steps to build Huffman Tree:

I. Initialization: put all the nodes in a list L, keep it sorted at all the time based on
their frequencies.
II. Repeat while the list L has more than one node left.
a. From L pick two nodes having the lowest frequencies/ probabilities, create a parent
node of them.
b. Assign the sum of the children’s frequencies/probabilities to the parent node and
insert it into L.
c. Assign code 0/1 to the two branches of the tree, and delete the children from L.
III. Assign a codeword for each leaf based on the path from the root.

Huffman coding
• Consider some text consisting of only 'A', 'B', 'C', 'D', and 'E' characters, with
frequencies 15, 7, 6, 6, and 5, respectively. The algorithm repeatedly merges the two
lowest-frequency nodes; a sketch of these steps is shown below:
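A minimal heap-based Huffman sketch in Python (the function name and tie-breaking rule are ours; with ties broken differently the tree shape can vary, but the total cost stays optimal):

```python
import heapq
from itertools import count

def huffman(freqs):
    """freqs: dict symbol -> frequency. Returns dict symbol -> codeword."""
    tick = count()                                   # tie-breaker for the heap
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)            # two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))

    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):                  # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                        # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2])
    return codes

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = huffman(freqs)
print(codes)                                          # e.g. A -> '0', the rest -> 3 bits each
print(sum(freqs[s] * len(codes[s]) for s in freqs))   # 87 bits in total
```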

Exercise:
■ Qu1: Draw the Huffman coding tree for the following characters and frequencies.

Character A B C D E
Frequency 17 12 12 27 32
■ Qu 2:
A source emits symbols 𝑋𝑖 , 1 ≤ 𝑖 ≤ 6, in the Binary Coded Decimal (BCD)
format with probabilities 𝑃(𝑋𝑖 ) as given in the Table.

1. Calculate the optimum number of bits needed to represent these symbols.


2. Apply Huffman coding to the source signal characterised in Table 1.
3. What is the original symbol sequence of the Huffman-coded signal
110011110000110101100?
4. What compression factor has been achieved?
Answer 2:

1. Calculate the optimum number of bits needed to represent these symbols.


2. Apply Huffman coding to the source signal.
3. What is the original symbol sequence of the Huffman-coded signal
110011110000110101100?

4. What compression factor has been achieved?

Average number of bits per symbol for Huffman coding
= 0.4x1 + 0.3x2 + 0.15x3 + 0.1x4 + 0.03x5 + 0.02x5
= 2.1
Average number of bits per symbol for BCD
= (0.4 + 0.3 + 0.15 + 0.1 + 0.03 + 0.02)x3
= 3
Compression factor = 3 / 2.1 ≈ 1.43
Qu3:

Consider the following sentence (you don’t need to consider the spaces between
characters).

“MADAM I AM ADAM”

1. Assume that 8-bit extended ASCII encoding is used to represent the above sentence.
Find out how many bytes are required to represent the above sentence.

2. What is the minimum/optimum number of bits needed to represent the above sentence?

3. Calculate the number of bits that are necessary to represent the above sentence using
Huffman encoding.

4. Using Huffman encoding how would you represent the following sequence of
characters? “DAMMA”
