
Assignment on Data Compression (CS-442N)

Submitted by: Ankur Gangwar (17CS12)
Assignment - 01

Solution - 01:

Definition of Lossy Compression:

The lossy compression method eliminates some amount of data that is not noticeable to the user. This technique does not allow a file to be restored to its original form, but it significantly reduces the size. Lossy compression is useful when exact fidelity of the data is not a priority: it slightly degrades the quality of the file, but it is convenient when one wants to send or store the data compactly. This type of compression is typically used for perceptual data such as audio signals and images.

JPEG Compression Example:

JPEG stands for Joint Photographic Experts Group, the committee that created the standard. JPEG files use the .jpg or .jpeg extension, and JPEG is the most common image format used by digital cameras and on the World Wide Web. It is a lossy compression format for digital images: lossy image compression reduces file size by eliminating redundant information. The user decides how much loss to introduce, trading off storage size against quality. For example, the compression quality is typically expressed on a scale from 1 to 100; a lower value compresses the raster image more strongly but also reduces its quality. JPEG 2000 (JP2) is the newer version of JPEG. It slightly improves image compression performance over baseline JPEG by using two different wavelet transforms, and users can choose low to high levels of compression.
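As a small illustration of the quality trade-off, the snippet below re-saves an image at two JPEG quality settings using the third-party Pillow library; Pillow and the file names are assumptions for the sake of the example, not part of the assignment.

    from PIL import Image  # Pillow, assumed to be installed

    # Hypothetical input file; any raster image works.
    img = Image.open("photo.png").convert("RGB")
    img.save("photo_q30.jpg", "JPEG", quality=30)  # stronger compression, lower quality
    img.save("photo_q90.jpg", "JPEG", quality=90)  # weaker compression, higher quality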

Definition of Lossless Compression:

The lossless compression method is capable of reconstituting the original form of the data; the quality of the data is not compromised. This technique allows a file to be restored to its original form exactly. Lossless compression can be applied to any file format, though the compression ratio achieved depends on how much redundancy the data contains.

LZ77 Compression Example:

LZ77 is a lossless compression method, so when applied to images it preserves raster values exactly. Abraham Lempel and Jacob Ziv introduced this format in 1977, and we still use it today; the acronym LZ77 combines the first letters of their last names (LZ) with the year it was invented (1977). It uses the same compression algorithm as PNG (Portable Network Graphics), and it is the default raster compression that ArcGIS uses. The idea behind LZ77 compression is that repeated values in the data are stored by reference to their position and length: instead of storing single values for each cell, LZ77 simply records where the value was previously seen and how long the repeated string of values is.
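To make the position/length idea concrete, here is a minimal LZ77-style encoder and decoder that emits (offset, length, next-character) triples. It is a toy sketch of the technique described above, not the exact variant used by PNG or ArcGIS.

    def lz77_encode(data, window=255):
        # Emit (offset, length, next_char) triples; offset/length point
        # back into the already-encoded text (the sliding window).
        i, out = 0, []
        while i < len(data):
            best_off, best_len = 0, 0
            for j in range(max(0, i - window), i):
                k = 0
                while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                    k += 1
                if k > best_len:
                    best_off, best_len = i - j, k
            out.append((best_off, best_len, data[i + best_len]))
            i += best_len + 1
        return out

    def lz77_decode(triples):
        out = []
        for off, length, nxt in triples:
            for _ in range(length):      # copy the back-reference byte by byte
                out.append(out[-off])
            out.append(nxt)              # then the literal that follows the match
        return "".join(out)

    msg = "ABABABABA AGAIN ABABABABA"
    assert lz77_decode(lz77_encode(msg)) == msg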

Solution - 02:

It totally depends on our choice. In Linux we have three compression tools available by default (a Python sketch comparing them follows the list):

1. gzip {filename}: gzip compresses the given files using Lempel-Ziv coding (LZ77). Whenever possible, each file is replaced by one with the extension .gz.

2. bzip2 {filename}: bzip2 compresses files using the Burrows-Wheeler block-sorting text compression algorithm together with Huffman coding. Compression is generally considerably better than that achieved by conventional LZ77/LZ78-based compressors. Whenever possible, each file is replaced by one with the extension .bz2.

3. zip {zip-filename} {filename-to-compress}: zip is a compression and file-packaging utility for Unix/Linux. Each file is stored in a single archive with the extension .zip.
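The same three formats are exposed by Python's standard library, so the sizes can be compared without leaving Python. A minimal sketch, assuming a file named notes.txt exists in the current directory (the file name is just an example):

    import bz2, gzip, zipfile
    from pathlib import Path

    data = Path("notes.txt").read_bytes()

    Path("notes.txt.gz").write_bytes(gzip.compress(data))   # DEFLATE (LZ77 + Huffman)
    Path("notes.txt.bz2").write_bytes(bz2.compress(data))   # Burrows-Wheeler + Huffman
    with zipfile.ZipFile("notes.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("notes.txt")

    for name in ("notes.txt", "notes.txt.gz", "notes.txt.bz2", "notes.zip"):
        print(name, Path(name).stat().st_size, "bytes")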

Solution - 03:
As we can see, the file size increased in all of the encodings. The reason is that compression always adds some overhead: the compressed file must carry headers and the metadata needed for decompression (inflation), such as the tables that map codewords back to the original data. For a sufficiently large file, the amount of data saved by compression greatly outweighs the space required by these overheads, so most of the time (when the file is bigger) the compressed file is smaller than the original; for a very small file, the overhead can dominate, as it does here.
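The overhead is easy to observe on a tiny input, where the gzip header and trailer outweigh any savings; the exact output length may vary slightly by zlib version:

    import gzip

    tiny = b"hello"
    packed = gzip.compress(tiny)
    # The compressed form is larger: roughly 25 bytes of output for 5 bytes
    # of input, because the gzip header, CRC and length trailer are fixed costs.
    print(len(tiny), "->", len(packed))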

Solution - 04:

Compression ratio = (size after compression) / (size before compression) = 20/40 = 1:2

Compression factor = (size before compression) / (size after compression) = 40/20 = 2:1

Saving percentage = (size before compression − size after compression) / (size before compression) × 100% = (40 − 20)/40 × 100% = 50%
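A quick sanity check of the three figures, with the sizes 40 and 20 taken from the problem statement:

    before, after = 40, 20                       # sizes from the problem
    print(after / before)                        # 0.5, i.e. a ratio of 1:2
    print(before / after)                        # 2.0, i.e. a factor of 2:1
    print((before - after) / before * 100, "%")  # 50.0 %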

Solution - 05:

Run-length coding:

A run-length algorithm assigns codewords to consecutive recurrent symbols (called runs) instead of coding individual symbols. The main idea is to replace a number of consecutive repeating symbols by a short codeword unit containing three parts: a single symbol, a run-length count and an interpreting indicator.

Run-length algorithms are very effective if the data source contains many runs of consecutive symbols. The symbols can be characters in a text file, 0s or 1s in a binary file, or black-and-white pixels in an image. Although simple, run-length algorithms have served well in practice. For example, the so-called HDC (Hardware Data Compression) algorithm, used by tape drives connected to IBM computer systems, and a similar algorithm used in the IBM SNA (Systems Network Architecture) standard for data communications are still in use today.

Example: GGG      BCDEFG  55GHJKULM777777777777 (six spaces after GGG, two after BCDEFG)

can be compressed to r3Gr6n6BCDEFGr2n955GHJKULMr127

Solution:
1. The first 3 Gs are read and encoded as r3G.

2. The next 6 spaces are found and encoded as r6.

3. The non-repeating symbols BCDEFG are found and encoded as n6BCDEFG.

4. The next 2 spaces are found and encoded as r2.

5. The next 9 non-repeating symbols (55GHJKULM) are found and encoded as n955GHJKULM.

6. The next 12 '7's are found and encoded as r127.

Therefore the encoded output is: r3Gr6n6BCDEFGr2n955GHJKULMr127 (a code sketch of this scheme follows).
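A minimal sketch of this r/n scheme, with the thresholds implied by the worked example (space runs of two or more, other runs of three or more); it does not attempt to reproduce the exact HDC control codes.

    def rle_encode(text):
        # r<k><c> -> run of k copies of symbol c (k >= 3)
        # r<k>    -> run of k spaces, symbol omitted (k >= 2)
        # n<k><s> -> k non-repeating symbols s
        out, pending, i = [], [], 0

        def flush():
            if pending:
                out.append("n%d%s" % (len(pending), "".join(pending)))
                pending.clear()

        while i < len(text):
            j = i
            while j < len(text) and text[j] == text[i]:
                j += 1
            k = j - i
            if text[i] == " " and k >= 2:
                flush()
                out.append("r%d" % k)
            elif k >= 3:
                flush()
                out.append("r%d%s" % (k, text[i]))
            else:
                pending.extend(text[i:j])
            i = j
        flush()
        return "".join(out)

    print(rle_encode("GGG      BCDEFG  55GHJKULM777777777777"))
    # -> r3Gr6n6BCDEFGr2n955GHJKULMr127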

Solution - 06:
Huffman code derivation for string AAABEDBBTGGG:

1. Constructing the frequency table:

character:  A  B  E  D  T  G
frequency:  3  3  1  1  1  3

2. Sort the table in descending order of frequency:

character:  A  B  G  E  D  T
frequency:  3  3  3  1  1  1

3. Building the binary tree:

(a) Combine D, T and sort the table:

    A  B  G  (DT)  E
    3  3  3   2    1

(b) Combine E, (DT) and sort the table:

    A  B  G  ((DT)E)
    3  3  3     3

(c) Combine G, ((DT)E) and sort the table:

    (G((DT)E))  A  B
         6      3  3

(d) Combine A, B and sort the table:

    (G((DT)E))  (AB)
         6        6

(e) Combine (G((DT)E)), (AB):

    ((G((DT)E))(AB))
           12

4. Deriving Huffman tree:

((G((DT)E))(AB))
/0 \1
(G((DT)E)) (AB)
/0 \1 /0 \1
G ((DT)E) A B
/0 \1
(DT) E
/0 \1
D T
5. Generating Huffman code:

G 00

D 0100

T 0101

E 011

A 10

B 11
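For comparison, a compact Huffman-code builder is sketched below. Because ties between equal frequencies can be broken differently than in the hand derivation, it may output different codewords, but with the same optimal total length (29 bits for AAABEDBBTGGG).

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Min-heap of (frequency, tiebreak, tree); leaves are single characters.
        heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        nxt = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, nxt, (left, right)))
            nxt += 1
        codes = {}
        def walk(node, prefix=""):
            if isinstance(node, str):
                codes[node] = prefix or "0"
            else:
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
        walk(heap[0][2])
        return codes

    codes = huffman_codes("AAABEDBBTGGG")
    print(codes)
    print(sum(len(codes[ch]) for ch in "AAABEDBBTGGG"), "bits total")  # 29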

Solution - 07:

The probabilities for each character are arranged in descending order and, by using minimum-variance Huffman coding, we obtain the corresponding Huffman tree (not reproduced here).

The codewords generated are as follows:

A 01

B 10

C 11

D 000

E 0010

F 00110

G 001110

H 001111
Solution - 08:

Starting from the source alphabet:

Letter     Probability
a1         0.7
a2         0.2
a3         0.1

Expanding the most probable entry, a1, into its three two-letter extensions:

Letter     Probability
a2         0.2
a3         0.1
a1a1       0.49
a1a2       0.14
a1a3       0.07

Expanding the now most probable entry, a1a1, gives seven entries, which receive fixed 3-bit codewords (a sketch of this construction follows the tables):

Letter     Probability    Codeword
a2         0.2            000
a3         0.1            001
a1a2       0.14           010
a1a3       0.07           011
a1a1a1     0.343          100
a1a1a2     0.098          101
a1a1a3     0.049          110
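This is a Tunstall-style construction: repeatedly split the most probable dictionary entry until there are 2^3 − 1 = 7 entries, then assign fixed 3-bit codewords. A short sketch of that procedure, which reproduces the table above:

    # Twice replace the most probable entry by its three single-letter
    # extensions: 3 -> 5 -> 7 entries, each getting a fixed 3-bit codeword.
    base = {"a1": 0.7, "a2": 0.2, "a3": 0.1}
    entries = dict(base)
    for _ in range(2):
        best = max(entries, key=entries.get)
        p = entries.pop(best)
        for sym, q in base.items():
            entries[best + sym] = p * q
    for i, s in enumerate(sorted(entries, key=len)):
        print(s, round(entries[s], 3), format(i, "03b"))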

Solution - 09:

m = 7, n = 0, 1, 2, ..., 22

In a Golomb code with parameter m, n is split as n = qm + r. The quotient q is sent in unary (q ones followed by a zero) and the remainder r in truncated binary: since ⌈log2 7⌉ = 3 and 2^3 − 7 = 1, the single remainder r = 0 gets the 2-bit code 00, while r = 1, ..., 6 get the 3-bit binary codes for r + 1 (010 through 111). An encoder sketch follows the table.

n    q    r    codeword
0    0    0    000
1    0    1    0010
2    0    2    0011
3    0    3    0100
4    0    4    0101
5    0    5    0110
6    0    6    0111
7    1    0    1000
8    1    1    10010
9    1    2    10011
10   1    3    10100
11   1    4    10101
12   1    5    10110
13   1    6    10111
14   2    0    11000
15   2    1    110010
16   2    2    110011
17   2    3    110100
18   2    4    110101
19   2    5    110110
20   2    6    110111
21   3    0    111000
22   3    1    1110010
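A minimal encoder implementing exactly the rule stated above:

    import math

    def golomb_encode(n, m=7):
        # q in unary (q ones then a zero), r in truncated binary.
        q, r = divmod(n, m)
        b = math.ceil(math.log2(m))     # 3 for m = 7
        short = 2 ** b - m              # number of "short" remainders: 1
        if r < short:
            tail = format(r, "0%db" % (b - 1))    # r = 0 -> "00"
        else:
            tail = format(r + short, "0%db" % b)  # r = 1..6 -> "010".."111"
        return "1" * q + "0" + tail

    for n in range(23):
        print(n, golomb_encode(n))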

Solution - 10:

Expanding on the Huffman algorithm, Faller and Gallager, and later Knuth and Vitter, developed a way to perform Huffman coding as a one-pass procedure.

Adaptive Huffman coding (also called dynamic Huffman coding) is an adaptive technique based on Huffman coding. It permits building the code as the symbols are being transmitted, with no initial knowledge of the source distribution, which allows one-pass encoding and adaptation to changing conditions in the data. The benefit of a one-pass procedure is that the source can be encoded in real time, though it becomes more sensitive to transmission errors, since a single loss can ruin the whole code.
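A naive way to see the one-pass idea in code: encoder and decoder both start from a flat model, derive the Huffman code from the counts seen so far, and update the counts after every symbol, so they stay in sync with no header. This sketch recomputes the tree from scratch at each step for clarity; the Faller-Gallager-Knuth and Vitter algorithms instead update the existing tree incrementally.

    import heapq

    def build_codes(counts):
        # Deterministic Huffman build (ties broken by symbol) so that the
        # encoder and decoder always derive the same code table.
        heap = [(f, sym, sym) for sym, f in sorted(counts.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, k1, t1 = heapq.heappop(heap)
            f2, k2, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, min(k1, k2), (t1, t2)))
        codes = {}
        def walk(t, p=""):
            if isinstance(t, str):
                codes[t] = p or "0"
            else:
                walk(t[0], p + "0")
                walk(t[1], p + "1")
        walk(heap[0][2])
        return codes

    def adaptive_encode(msg, alphabet):
        counts = {ch: 1 for ch in alphabet}      # flat initial model
        bits = []
        for ch in msg:
            bits.append(build_codes(counts)[ch]) # code from counts seen so far
            counts[ch] += 1                      # then update the model
        return "".join(bits)

    def adaptive_decode(bits, alphabet):
        counts = {ch: 1 for ch in alphabet}
        table = {v: k for k, v in build_codes(counts).items()}
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in table:                     # prefix-free: unique match
                ch = table[buf]
                out.append(ch)
                counts[ch] += 1
                table = {v: k for k, v in build_codes(counts).items()}
                buf = ""
        return "".join(out)

    msg = "AAABEDBBTGGG"
    assert adaptive_decode(adaptive_encode(msg, "ABDEGT"), "ABDEGT") == msg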
