
Lossless Compression

Algorithms
Chapter 5

By Temesgen T. (MSc)
Introduction
• Lossless compression algorithms are used in multimedia to reduce the
size of digital files without compromising their quality.
• These algorithms are particularly useful when dealing with data that
cannot be modified, such as medical images or legal documents,
where the information must remain intact.
Cont’d…
• If the total number of bits required to represent the data before compression is B0 and the total number of bits required to represent the data after compression is B1, then we define the compression ratio as:
• Compression ratio = B0 / B1
• The higher the compression ratio, the better the lossless
compression scheme, as long as it is computationally feasible.
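As a quick check of the formula (a minimal sketch in Python; the bit counts below are made-up illustration values, not from the slides):

def compression_ratio(bits_before: int, bits_after: int) -> float:
    # Compression ratio = B0 / B1.
    return bits_before / bits_after

# Example: a 1,000,000-bit file compressed to 250,000 bits gives a 4:1 ratio.
print(compression_ratio(1_000_000, 250_000))  # 4.0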
Information theory
• Information theory is a field of study that deals with the
quantification, storage, and communication of information.
• It was developed by Claude Shannon in 1948, and it has since become
a fundamental part of the study of communication and computer
science.
• Here are some basic concepts in information theory:
• Bit: The basic unit of information in information theory is the bit.
• A bit is a binary digit, which means it can have one of two possible
values, usually represented as 0 or 1.
Cont’d…
• Entropy: Entropy is a measure of the amount of uncertainty or randomness in a
system.
• In science, entropy is a measure of the disorder of a system – the more entropy, the
more disorder.
• In information theory, entropy is used to measure the amount of information in a
message or signal.
• A message with high entropy contains a lot of information, while a message with low
entropy contains little information.
• Information content: The information content of a message is the amount of information it contains.
• This is related to entropy, but it is not the same thing: information content refers to an individual message, while entropy is the average information content over all possible messages.
Cont’d…
• Channel capacity: The channel capacity of a communication channel
is the maximum amount of information that can be transmitted over
the channel per unit of time.
• The channel capacity depends on the bandwidth of the channel and
the signal-to-noise ratio.
• Compression: Compression is the process of reducing the size of a
message or signal without losing information. Compression is
important for efficient storage and transmission of information.
• Error correction: Error correction is the process of detecting and
correcting errors in a message or signal.
Cont’d…
• According to the famous scientist Claude E. Shannon of Bell Labs, the entropy of an information source with alphabet S = {s₁, s₂, s₃, …, sₙ} is defined as:

• H = − Σ pᵢ log₂ pᵢ, where pᵢ is the probability of occurrence of symbol sᵢ
• The formula works by taking the negative sum of the probability of
each symbol multiplied by the logarithm of that probability.
Cont’d…
• The logarithm is used to compress the range of probabilities and make
the formula more useful for practical purposes.
• For example, if there are only two possible symbols, each with a 50%
probability of occurrence, the entropy is 1 bit because there is only one
binary decision required to determine which symbol was transmitted.
• Overall, the entropy of information formula is a powerful tool for
understanding and quantifying the amount of information in a message
or signal.
• It has important applications in fields such as communication theory,
cryptography, and data compression.
Example
• Suppose we have a message consisting of the letters A, B, C, and D,
and each letter has an equal probability of occurring.
• Then, the entropy of the message can be calculated as follows:
• H = - Σ p(x) log₂ p(x)
• = - (1/4 * log₂(1/4) + 1/4 * log₂(1/4) + 1/4 * log₂(1/4) + 1/4 * log₂(1/4))
• = - (-0.5 - 0.5 - 0.5 - 0.5)
• = 2 bits
Example 2
• Suppose we have a message consisting of the letters A, B, C, and D, but now
the probability of each letter is not equal. Let's say the probabilities are as
follows:
• P(A) = 0.4
• P(B) = 0.3
• P(C) = 0.2
• P(D) = 0.1
• The entropy of this message can be calculated as follows:
• H = - Σ p(x) log₂ p(x) = - (0.4 * log₂(0.4) + 0.3 * log₂(0.3) + 0.2 * log₂(0.2) + 0.1
* log₂(0.1)) ≈ 1.8464 bits
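The two worked examples above can be checked with a short script (a minimal sketch; the probabilities are the ones given in the examples):

import math

def entropy(probabilities):
    # Shannon entropy H = -Σ p·log2(p), in bits.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Example 1: four equally likely symbols A, B, C, D
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits

# Example 2: unequal probabilities
print(entropy([0.4, 0.3, 0.2, 0.1]))       # ≈ 1.8464 bits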
Run-length coding
• Run-length coding is a lossless data compression technique that is
commonly used to reduce the size of data files.
• It works by encoding sequences of repeated data values as a single
value, followed by a count of the number of times that value occurs.
• For example, consider the following sequence of data:
• AAAAABBBCCCCCCCCCDDD
• Using run-length coding, this sequence can be compressed to:
• 5A3B9C3D
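A straightforward encoder for this scheme can be sketched as follows (a minimal sketch; the count-then-symbol output format matches the example above):

from itertools import groupby

def run_length_encode(data: str) -> str:
    # Encode each run of repeated characters as <count><symbol>.
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(data))

print(run_length_encode("AAAAABBBCCCCCCCCCDDD"))  # 5A3B9C3D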
VARIABLE LENGTH CODING (VLC)
• Variable Length Coding (VLC) is a lossless data compression technique that compresses data by assigning shorter codes to frequently occurring data values and longer codes to less frequently occurring data values.
• It is a form of entropy coding, where the length of the code assigned to each
data value is proportional to the amount of information it carries.
• In VLC, a dictionary or codebook is created that maps each input symbol to a
unique code.
• The codes are variable in length, meaning that each symbol may be encoded
using a different number of bits. The codebook is typically constructed based
on the frequency of occurrence of each symbol in the input data.
Example
• For example, consider the following input sequence of data:
• AABACBCABD
• Assume that the frequency of occurrence of each symbol in the input
sequence is as follows:
• A: 4 B: 3 C: 2 D: 1
• Using VLC, the symbols in the input sequence can be encoded using variable
length codes as follows:
• A: 0 B: 10 C: 110 D: 111
• Therefore, the compressed output for the input sequence would be:
• 0 0 10 0 110 10 110 0 10 111 → 0010011010110010111
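Encoding with such a codebook is a table lookup per symbol (a minimal sketch; the codebook is the one given above):

codebook = {"A": "0", "B": "10", "C": "110", "D": "111"}

def vlc_encode(message: str, codes: dict) -> str:
    # Concatenate the variable-length code word of each symbol.
    return "".join(codes[symbol] for symbol in message)

print(vlc_encode("AABACBCABD", codebook))  # 0010011010110010111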
Example 2
Cont’d…
• In this example, the symbol A, which occurs most frequently in the
input sequence, is assigned the shortest code, which is 0.
• Conversely, the symbol D, which occurs least frequently, is assigned
the longest code, which is 111.
• This results in a reduction in the size of the encoded data compared to using fixed-length codes (19 bits here versus the 20 bits a fixed 2-bit code would need), and the savings grow as the symbol frequencies become more skewed.
• VLC is widely used in various data compression applications, including
audio and video compression, where it is used to encode the
frequency components of the data.
Cont’d…
• VLC (Variable Length Coding) is a lossless data compression technique that is commonly used in multimedia applications, including video and audio encoding.
• Here, we will study the Shannon–Fano algorithm, Huffman coding, and adaptive Huffman coding.
The Shannon-Fano algorithm
• The Shannon-Fano algorithm is a lossless data compression algorithm
that was proposed by Claude Shannon and Robert Fano in the 1940s.
• The algorithm works by assigning a variable-length code to each
symbol in the input data, with the goal of minimizing the total
number of bits required to encode the data.
Cont’d…
• Here's how the algorithm works:
1. Calculate the probability of occurrence for each symbol in the input data.
2. Sort the symbols in descending order based on their probabilities.
3. Divide the symbols into two groups so that the total probability (or count) of each group is as close to equal as possible, with the higher-probability symbols in the first group and the lower-probability symbols in the second group.
4. Assign a '0' bit to all the symbols in the first group and a '1' bit to all the symbols in the
second group.
5. Recursively repeat steps 3-4 for each group until each group contains only one symbol.
6. The final code for each symbol is the concatenation of the bits assigned to it in each
step.
Cont’d…
• Example 1: Given five symbols A to E with their frequencies being 15,
7, 6, 6 & 5; encode them using Shannon-Fano entropy encoding

• Solution:
• Step 1: Say we are given five symbols (A to E) that can occur in a source, with frequencies 15, 7, 6, 6 and 5. First, sort the symbols in decreasing order of frequency.
Cont’d…
• Step 2: Divide the list into two halves so that the total counts of both halves are as close as possible to each other. Therefore, in this case we split the list between B and C, and assign 0 to the first group and 1 to the second.
• Step 3: We recursively repeat the steps of splitting and assigning codes until each symbol becomes a code leaf on the tree.
• That is, treat each half as a new list and apply splitting and code assignment until you are left with lists of single elements.
• Step 4: Note that we split the list containing C, D and E between C and D because the difference between the two halves is 11 − 6 = 5; if we had divided between D and E instead, the difference would have been 12 − 5 = 7.
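This top-down splitting can be sketched as a short recursive function (an illustrative sketch; ties between equally balanced split points may be broken differently than in the worked example):

def shannon_fano(symbols):
    # symbols: list of (symbol, frequency) pairs, sorted by descending frequency.
    # Returns a dict mapping each symbol to its code string.
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(freq for _, freq in symbols)
    # Find the split point that makes the totals of the two halves as close as possible.
    running, split, best_diff = 0, 1, float("inf")
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs((total - running) - running)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code   # first (higher-frequency) group gets a leading 0
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code   # second group gets a leading 1
    return codes

print(shannon_fano([("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}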
Cont’d…
Huffman Coding
• Huffman coding is a lossless data compression algorithm that was invented by
David A. Huffman in 1952.
• The basic idea behind Huffman coding is to use shorter codes to represent
frequently occurring characters in a message, and longer codes to represent less
frequently occurring characters.
• The algorithm works by first analyzing the frequency of each character in the
message that needs to be compressed.
• It then uses this information to build a binary tree, known as a Huffman tree,
where each leaf node represents a character, and the frequency of the character
determines the weight of the corresponding leaf node.
• In contrast to Shannon–Fano, which works top-down, the encoding steps of the Huffman algorithm are described in a bottom-up manner.
Cont’d…
• The algorithm then proceeds to build the Huffman tree by repeatedly
combining the two lowest weight nodes into a single parent node
until only one node remains, which is the root of the Huffman tree.
• During this process, the binary digit 0 is assigned to the left child of
each parent node, and the binary digit 1 is assigned to the right child.
• Once the Huffman tree is built, the next step is to generate the
Huffman codes for each character in the message.
Cont’d…
I. Create a leaf node for each unique character and build a min heap of all leaf nodes. (The min heap is used as a priority queue: the value of the frequency field is used to compare two nodes, so the least frequent character is initially at the root.)
II. Extract the two nodes with the minimum frequency from the min heap.
III. Create a new internal node with a frequency equal to the sum of the two nodes' frequencies. Make the first extracted node its left child and the other extracted node its right child. Add this node to the min heap.
IV. Repeat steps II and III until the heap contains only one node. The remaining node is the root node and the tree is complete.
Let us understand the algorithm with an example:
Example
• character Frequency
• a 5
• b 9
• c 12
• d 13
• e 16
• f 45
Cont’d…
• Step 1. Build a min heap that contains 6 nodes where each node
represents root of a tree with single node.
• Step 2 Extract two minimum frequency nodes from min heap. Add a
new internal node with frequency 5 + 9 = 14.
Now min heap contains 5 nodes
• character Frequency
• c 12
• d 13
• Internal Node 14
• e 16
• f 45
Cont’d…
• Step 3: Extract two minimum frequency nodes from heap. Add a new
internal node with frequency 12 + 13 = 25

• Now min heap contains 4 nodes, where two nodes are roots of trees with a single element each, and two nodes are roots of trees with more than one node.
Cont’d…
• character Frequency
• Internal Node 14
• e 16
• Internal Node 25
• f 45
Cont’d…
• Step 4: Extract two minimum frequency nodes. Add a new internal
node with frequency 14 + 16 = 30
Now min heap contains 3 nodes.
• character Frequency
• Internal Node 25
• Internal Node 30
• f 45
Cont’d…
• Step 5: Extract two minimum frequency nodes. Add a new internal
node with frequency 25 + 30 = 55
Now min heap contains 2 nodes.
• character Frequency
• f 45
• Internal Node 55
Cont’d…
• Step 6: Extract two minimum frequency nodes. Add a new internal
node with frequency 45 + 55 = 100
Assigning code words for characters
The code words are as follows:
• character code-word
• f 0
• c 100
• d 101
• a 1100
• b 1101
• e 111
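The same construction can be reproduced with a small heap-based script (a minimal sketch; tie-breaking in the heap can yield different but equally optimal codes):

import heapq
from itertools import count

def huffman_codes(frequencies):
    # Build Huffman codes from a {symbol: frequency} map.
    tiebreak = count()  # keeps heap comparisons well defined when frequencies are equal
    # Each heap item is (frequency, tiebreak, tree); a tree is a symbol or a (left, right) pair.
    heap = [(freq, next(tiebreak), sym) for sym, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # lowest frequency -> left child (bit 0)
        f2, _, right = heapq.heappop(heap)   # next lowest -> right child (bit 1)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2])
    return codes

print(huffman_codes({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}))
# {'f': '0', 'c': '100', 'd': '101', 'a': '1100', 'b': '1101', 'e': '111'}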
Arithmetic Coding
• Arithmetic coding (AC) is a form of entropy encoding used in lossless data
compression.
• Normally, a string of characters is represented using a fixed number of
bits per character, as in the ASCII code.
• An arithmetic coding algorithm encodes an entire sequence of symbols into a single fractional number between 0 and 1.
• The input symbols are processed one at a time, one per iteration.
• The interval derived at the end of this division process is used to decide
the code word for the entire sequence of symbols.
• Example: Arithmetic coding of the word “BELBA”
Cont’d…
Cont’d…
• The upper limit of each letter's sub-interval is computed as UL = LL + d(u, l) × CF, where LL is the lower limit of the current interval, d(u, l) is the difference between its upper and lower limits (the interval width), and CF is the cumulative probability up to and including that letter.
• For "BELBA" the letter probabilities come from the letter frequencies in the word (B = 2/5 = 0.4, E = L = A = 1/5 = 0.2), so the cumulative values are B: 0.4, E: 0.6, L: 0.8 and A: 1.0.
• For the first letter B, the lower limit is zero and the upper limit is 0.4, so the sub-interval upper limits inside [0, 0.4) are:
B = 0 + (0.4 − 0) × 0.4 = 0.16
E = 0 + (0.4 − 0) × 0.6 = 0.24
L = 0 + (0.4 − 0) × 0.8 = 0.32
A = 0 + (0.4 − 0) × 1.0 = 0.40
• The remaining letters are handled in the same way.
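The full interval-narrowing process for "BELBA" can be sketched as follows (a minimal illustration; the per-letter ranges are the cumulative probabilities derived above from the letter frequencies in the word):

# Each symbol owns a fixed sub-range of [0, 1): (low cumulative, high cumulative).
ranges = {"B": (0.0, 0.4), "E": (0.4, 0.6), "L": (0.6, 0.8), "A": (0.8, 1.0)}

def arithmetic_encode_interval(message, ranges):
    # Narrow [low, high) once per symbol; any number in the final interval encodes the message.
    low, high = 0.0, 1.0
    for symbol in message:
        width = high - low
        sym_low, sym_high = ranges[symbol]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

print(arithmetic_encode_interval("BELBA", ranges))
# ≈ (0.21312, 0.2144): any value in this interval, e.g. 0.214, identifies "BELBA"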
Cont’d…
• A message is represented by a half-open interval [a, b) where a and b are real numbers between 0 and 1.
• Initially, the interval is [0, 1).
• When the message becomes longer, the length of the interval
shortens, and the number of bits needed to represent the interval
increases.
• Suppose the alphabet is [A, B, C, D, E, F, $], in which $ is a special
symbol used to terminate the message, and the known probability
distribution is listed below.
LZW algorithm
• The LZW algorithm is a very common compression technique. It is typically used in the GIF image format, optionally in PDF and TIFF, and in the widely used Unix file compression utility 'compress'.
• It is lossless, meaning no data is lost when compressing.
• The algorithm is simple to implement and has the potential for very high throughput in hardware implementations.
• The idea relies on recurring patterns to save data space.
Cont’d…
• LZW is the foremost technique for general-purpose data compression
due to its simplicity and versatility.
• It is the basis of many PC utilities that claim to “double the capacity
of your hard drive”. 
• LZW compression works by reading a sequence of symbols, grouping
the symbols into strings, and converting the strings into codes.
• Because the codes take up less space than the strings they replace,
we get compression
Cont’d…
• Characteristic features of LZW include:
• LZW compression uses a code table, with 4096 as a common choice
for the number of table entries.
• Codes 0-255 in the code table are always assigned to represent single
bytes from the input file.
• When encoding begins the code table contains only the first 256
entries, with the remainder of the table being blanks.
• Compression is achieved by using codes 256 through 4095 to
represent sequences of bytes.
Cont’d…
• As the encoding continues, LZW identifies repeated sequences in the
data and adds them to the code table.
• Decoding is achieved by taking each code from the compressed file
and translating it through the code table to find what character or
characters it represents.
• Example: ASCII code. Typically, every character is stored with 8 binary bits, allowing up to 256 unique symbols for the data.
• This algorithm extends the code width to 9–12 bits per code, so the additional codes can represent repeated sequences of characters.
Example
• Suppose we have the following string: "ABBABABBAABABAABABAA"
• We want to compress this string using LZW.
• Step1: Dictionary:
• 0: A
• 1: B
• Step 2: Scan the input string from left to right and find the longest substring that is
already in the dictionary.
• Once you find the longest substring, add the next character to it and check if it is
already in the dictionary.
• Keep doing this until you find a substring that is not in the dictionary. Output the code
for the last substring found and add the new substring to the dictionary.
String: A B B A B A B B A A B A B A A B A B A A

Longest match (w)   Output code   New dictionary entry : index
A                   0             AB : 2
B                   1             BB : 3
B                   1             BA : 4
AB                  2             ABA : 5
AB                  2             ABB : 6
BA                  4             BAA : 7
ABA                 5             ABAB : 8
BAA                 7             BAAB : 9
BA                  4             BAB : 10
BAA                 7             (end of input)

• Encoded output: 0 1 1 2 2 4 5 7 4 7 (ten codes for the twenty input symbols)
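The same trace can be produced by a compact encoder (a minimal sketch; it starts from the two-entry dictionary of this example rather than the 256 single-byte codes used in practice):

def lzw_encode(data: str, initial_dictionary: dict) -> list:
    # LZW encoding: emit the code of the longest known prefix, then extend the dictionary.
    dictionary = dict(initial_dictionary)  # copy, so the caller's dictionary is untouched
    w, output = "", []
    for k in data:
        if w + k in dictionary:
            w += k
        else:
            output.append(dictionary[w])
            dictionary[w + k] = len(dictionary)  # next free code
            w = k
    if w:
        output.append(dictionary[w])  # flush the last match
    return output

print(lzw_encode("ABBABABBAABABAABABAA", {"A": 0, "B": 1}))
# [0, 1, 1, 2, 2, 4, 5, 7, 4, 7]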
Lossless Image Compression
• Lossless image compression refers to a method of reducing the size of digital
image files without losing any information or quality.
• In other words, the compressed image can be reconstructed to its original form
without any loss of data or degradation in image quality.
• Here are some image formats that use lossless compression algorithms:
• TIFF (Tagged Image File Format): This is a popular format used for storing high-
quality images and is often used in the printing industry.
• It uses LZW (Lempel–Ziv–Welch) compression.
• PNG (Portable Network Graphics): This format is commonly used for web graphics
and images that require transparency.
• It uses DEFLATE compression (a combination of the LZ77 algorithm and Huffman coding).
Cont’d…
• BMP (Bitmap): This format is a simple, uncompressed format that is
commonly used for Windows graphics.
• GIF (Graphics Interchange Format): This format is commonly used for
animated images and has a limited color palette.
• It is compressed by LZW (Lempel–Ziv–Welch) compression
• RAW: This is a format used by many high-end digital cameras that
captures all the information from the camera's sensor and is often used
for professional photography.
• PSD (Adobe Photoshop Document): This is a proprietary format used by
Adobe Photoshop to store layered images.
The End
• Thank you for your attention!

Next Chapter: Lossy compression algorithms and video compression techniques
