
Assignment on Data Compression (CS-442N)

Submitted by: Ankur Gangwar (17CS12)
Assignment - 01

Solution - 01:

Definition of Lossy Compression:

The lossy compression method eliminates some amount of data that is not noticeable to the user. This technique does not allow a file to be restored to its original form, but it significantly reduces the size. Lossy compression is useful when exact fidelity of the data is not a priority: it slightly degrades the quality of the file, but it is convenient when one wants to send or store the data compactly. This type of compression is typically used for perceptual data such as audio signals and images.

JPEG Compression Example:

JPEG stands for Joint Photographic Experts Group, the committee that created the standard. JPEG files use the .jpg or .jpeg extension, and JPEG is the most common image format used by digital cameras and on the World Wide Web. It is a lossy compression format for digital images: lossy image compression reduces file size by eliminating redundant information. The user decides how much loss to introduce, trading off storage size against quality. For example, the compression quality is typically expressed on a scale from 1 to 100; a lower value compresses the raster image more strongly but also reduces its quality. JPEG 2000 (JP2) is the newer version of JPEG. It slightly improves image compression performance over baseline JPEG by using two different wavelet transforms, and users can choose low to high levels of compression.
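As a small illustration of the quality trade-off, the snippet below re-saves an image at two JPEG quality settings using the third-party Pillow library; Pillow and the file names are assumptions for the sake of the example, not part of the assignment.

    from PIL import Image  # Pillow, assumed to be installed

    # Hypothetical input file; any raster image works.
    img = Image.open("photo.png").convert("RGB")
    img.save("photo_q30.jpg", "JPEG", quality=30)  # stronger compression, lower quality
    img.save("photo_q90.jpg", "JPEG", quality=90)  # weaker compression, higher quality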

Definition of Lossless Compression:

The lossless compression method is capable of reconstituting the original form of the data; the quality of the data is not compromised. This technique allows a file to be restored to its original form exactly. Lossless compression can be applied to any file format, though the compression ratio achieved depends on how much redundancy the data contains.

LZ77 Compression Example:

LZ77 is a lossless compression method, so when applied to images it preserves raster values exactly. Abraham Lempel and Jacob Ziv introduced this format in 1977, and we still use it today; the acronym LZ77 combines the first letters of their last names (LZ) with the year it was invented (1977). It uses the same compression algorithm as PNG (Portable Network Graphics), and it is the default raster compression that ArcGIS uses. The idea behind LZ77 compression is that repeated values in the data are stored by reference to their position and length: instead of storing single values for each cell, LZ77 simply records where the value was previously seen and how long the repeated string of values is.
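To make the position/length idea concrete, here is a minimal LZ77-style encoder and decoder that emits (offset, length, next-character) triples. It is a toy sketch of the technique described above, not the exact variant used by PNG or ArcGIS.

    def lz77_encode(data, window=255):
        # Emit (offset, length, next_char) triples; offset/length point
        # back into the already-encoded text (the sliding window).
        i, out = 0, []
        while i < len(data):
            best_off, best_len = 0, 0
            for j in range(max(0, i - window), i):
                k = 0
                while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                    k += 1
                if k > best_len:
                    best_off, best_len = i - j, k
            out.append((best_off, best_len, data[i + best_len]))
            i += best_len + 1
        return out

    def lz77_decode(triples):
        out = []
        for off, length, nxt in triples:
            for _ in range(length):      # copy the back-reference byte by byte
                out.append(out[-off])
            out.append(nxt)              # then the literal that follows the match
        return "".join(out)

    msg = "ABABABABA AGAIN ABABABABA"
    assert lz77_decode(lz77_encode(msg)) == msg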

Solution - 02:

It totally depends on our choice. In Linux we have three compression tools available by default (a Python sketch comparing them follows the list):

1. gzip {filename}: gzip compresses the given files using Lempel-Ziv coding (LZ77). Whenever possible, each file is replaced by one with the extension .gz.

2. bzip2 {filename}: bzip2 compresses files using the Burrows-Wheeler block-sorting text compression algorithm together with Huffman coding. Compression is generally considerably better than that achieved by conventional LZ77/LZ78-based compressors. Whenever possible, each file is replaced by one with the extension .bz2.

3. zip {zip-filename} {filename-to-compress}: zip is a compression and file-packaging utility for Unix/Linux. Each file is stored in a single archive with the extension .zip.
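The same three formats are exposed by Python's standard library, so the sizes can be compared without leaving Python. A minimal sketch, assuming a file named notes.txt exists in the current directory (the file name is just an example):

    import bz2, gzip, zipfile
    from pathlib import Path

    data = Path("notes.txt").read_bytes()

    Path("notes.txt.gz").write_bytes(gzip.compress(data))   # DEFLATE (LZ77 + Huffman)
    Path("notes.txt.bz2").write_bytes(bz2.compress(data))   # Burrows-Wheeler + Huffman
    with zipfile.ZipFile("notes.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("notes.txt")

    for name in ("notes.txt", "notes.txt.gz", "notes.txt.bz2", "notes.zip"):
        print(name, Path(name).stat().st_size, "bytes")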

Solution - 03:
As we can see, the file size increased in all of the encodings. The reason is that compression always adds some overhead: the compressed file must carry headers and the metadata needed for decompression (inflation), such as the tables that map codewords back to the original data. For a sufficiently large file, the amount of data saved by compression greatly outweighs the space required by these overheads, so most of the time (when the file is bigger) the compressed file is smaller than the original; for a very small file, the overhead can dominate, as it does here.
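The overhead is easy to observe on a tiny input, where the gzip header and trailer outweigh any savings; the exact output length may vary slightly by zlib version:

    import gzip

    tiny = b"hello"
    packed = gzip.compress(tiny)
    # The compressed form is larger: roughly 25 bytes of output for 5 bytes
    # of input, because the gzip header, CRC and length trailer are fixed costs.
    print(len(tiny), "->", len(packed))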

Solution - 04:

Compression ratio = (size after compression) / (size before compression) = 20/40 = 1:2

Compression factor = (size before compression) / (size after compression) = 40/20 = 2:1

Saving percentage = (size before compression − size after compression) / (size before compression) × 100% = (40 − 20)/40 × 100% = 50%
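A quick sanity check of the three figures, with the sizes 40 and 20 taken from the problem statement:

    before, after = 40, 20                       # sizes from the problem
    print(after / before)                        # 0.5, i.e. a ratio of 1:2
    print(before / after)                        # 2.0, i.e. a factor of 2:1
    print((before - after) / before * 100, "%")  # 50.0 %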

Solution - 05:

Run-length coding:

A run-length algorithm assigns codewords to consecutive recurrent symbols (called runs) instead of coding individual symbols. The main idea is to replace a number of consecutive repeating symbols by a short codeword unit containing three parts: a single symbol, a run-length count and an interpreting indicator.

Run-length algorithms are very effective if the data source contains many runs of consecutive symbols. The symbols can be characters in a text file, 0s or 1s in a binary file, or black-and-white pixels in an image. Although simple, run-length algorithms have served well in practice. For example, the so-called HDC (Hardware Data Compression) algorithm, used by tape drives connected to IBM computer systems, and a similar algorithm used in the IBM SNA (Systems Network Architecture) standard for data communications are still in use today.

Example: GGG      BCDEFG  55GHJKULM777777777777 (six spaces after GGG, two after BCDEFG)

can be compressed to r3Gr6n6BCDEFGr2n955GHJKULMr127

Solution:
1. The first 3 Gs are read and encoded as r3G.

2. The next 6 spaces are found and encoded as r6.

3. The non-repeating symbols BCDEFG are found and encoded as n6BCDEFG.

4. The next 2 spaces are found and encoded as r2.

5. The next 9 non-repeating symbols (55GHJKULM) are found and encoded as n955GHJKULM.

6. The next 12 '7's are found and encoded as r127.

Therefore the encoded output is: r3Gr6n6BCDEFGr2n955GHJKULMr127 (a code sketch of this scheme follows).
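A minimal sketch of this r/n scheme, with the thresholds implied by the worked example (space runs of two or more, other runs of three or more); it does not attempt to reproduce the exact HDC control codes.

    def rle_encode(text):
        # r<k><c> -> run of k copies of symbol c (k >= 3)
        # r<k>    -> run of k spaces, symbol omitted (k >= 2)
        # n<k><s> -> k non-repeating symbols s
        out, pending, i = [], [], 0

        def flush():
            if pending:
                out.append("n%d%s" % (len(pending), "".join(pending)))
                pending.clear()

        while i < len(text):
            j = i
            while j < len(text) and text[j] == text[i]:
                j += 1
            k = j - i
            if text[i] == " " and k >= 2:
                flush()
                out.append("r%d" % k)
            elif k >= 3:
                flush()
                out.append("r%d%s" % (k, text[i]))
            else:
                pending.extend(text[i:j])
            i = j
        flush()
        return "".join(out)

    print(rle_encode("GGG      BCDEFG  55GHJKULM777777777777"))
    # -> r3Gr6n6BCDEFGr2n955GHJKULMr127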

Solution - 06:
Huffman code derivation for string AAABEDBBTGGG:

1. Constructing the frequency table:

character:  A  B  E  D  T  G
frequency:  3  3  1  1  1  3

2. Sort the table in descending order of frequency:

character:  A  B  G  E  D  T
frequency:  3  3  3  1  1  1

3. Building the binary tree:

(a) Combine D, T and sort the table:

    A  B  G  (DT)  E
    3  3  3   2    1

(b) Combine E, (DT) and sort the table:

    A  B  G  ((DT)E)
    3  3  3     3

(c) Combine G, ((DT)E) and sort the table:

    (G((DT)E))  A  B
         6      3  3

(d) Combine A, B and sort the table:

    (G((DT)E))  (AB)
         6        6

(e) Combine (G((DT)E)), (AB):

    ((G((DT)E))(AB))
           12

4. Deriving Huffman tree:

((G((DT)E))(AB))
/0 \1
(G((DT)E)) (AB)
/0 \1 /0 \1
G ((DT)E) A B
/0 \1
(DT) E
/0 \1
D T
5. Generating Huffman code:

G 00

D 0100

T 0101

E 011

A 10

B 11
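For comparison, a compact Huffman-code builder is sketched below. Because ties between equal frequencies can be broken differently than in the hand derivation, it may output different codewords, but with the same optimal total length (29 bits for AAABEDBBTGGG).

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Min-heap of (frequency, tiebreak, tree); leaves are single characters.
        heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        nxt = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, nxt, (left, right)))
            nxt += 1
        codes = {}
        def walk(node, prefix=""):
            if isinstance(node, str):
                codes[node] = prefix or "0"
            else:
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
        walk(heap[0][2])
        return codes

    codes = huffman_codes("AAABEDBBTGGG")
    print(codes)
    print(sum(len(codes[ch]) for ch in "AAABEDBBTGGG"), "bits total")  # 29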

Solution - 07:

The probabilities for each character are arranged in descending order and, by using minimum-variance Huffman coding, we obtain the corresponding Huffman tree (not reproduced here).

The codewords generated are as follows:

A 01

B 10

C 11

D 000

E 0010

F 00110

G 001110

H 001111
Solution - 08:

Starting from the source alphabet:

Letter     Probability
a1         0.7
a2         0.2
a3         0.1

Expanding the most probable entry, a1, into its three two-letter extensions:

Letter     Probability
a2         0.2
a3         0.1
a1a1       0.49
a1a2       0.14
a1a3       0.07

Expanding the now most probable entry, a1a1, gives seven entries, which receive fixed 3-bit codewords (a sketch of this construction follows the tables):

Letter     Probability    Codeword
a2         0.2            000
a3         0.1            001
a1a2       0.14           010
a1a3       0.07           011
a1a1a1     0.343          100
a1a1a2     0.098          101
a1a1a3     0.049          110
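This is a Tunstall-style construction: repeatedly split the most probable dictionary entry until there are 2^3 − 1 = 7 entries, then assign fixed 3-bit codewords. A short sketch of that procedure, which reproduces the table above:

    # Twice replace the most probable entry by its three single-letter
    # extensions: 3 -> 5 -> 7 entries, each getting a fixed 3-bit codeword.
    base = {"a1": 0.7, "a2": 0.2, "a3": 0.1}
    entries = dict(base)
    for _ in range(2):
        best = max(entries, key=entries.get)
        p = entries.pop(best)
        for sym, q in base.items():
            entries[best + sym] = p * q
    for i, s in enumerate(sorted(entries, key=len)):
        print(s, round(entries[s], 3), format(i, "03b"))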

Solution - 09:

m = 7, n = 0, 1, 2, ..., 22

In a Golomb code with parameter m, n is split as n = qm + r. The quotient q is sent in unary (q ones followed by a zero) and the remainder r in truncated binary: since ⌈log2 7⌉ = 3 and 2^3 − 7 = 1, the single remainder r = 0 gets the 2-bit code 00, while r = 1, ..., 6 get the 3-bit binary codes for r + 1 (010 through 111). An encoder sketch follows the table.

n    q    r    codeword
0    0    0    000
1    0    1    0010
2    0    2    0011
3    0    3    0100
4    0    4    0101
5    0    5    0110
6    0    6    0111
7    1    0    1000
8    1    1    10010
9    1    2    10011
10   1    3    10100
11   1    4    10101
12   1    5    10110
13   1    6    10111
14   2    0    11000
15   2    1    110010
16   2    2    110011
17   2    3    110100
18   2    4    110101
19   2    5    110110
20   2    6    110111
21   3    0    111000
22   3    1    1110010
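A minimal encoder implementing exactly the rule stated above:

    import math

    def golomb_encode(n, m=7):
        # q in unary (q ones then a zero), r in truncated binary.
        q, r = divmod(n, m)
        b = math.ceil(math.log2(m))     # 3 for m = 7
        short = 2 ** b - m              # number of "short" remainders: 1
        if r < short:
            tail = format(r, "0%db" % (b - 1))    # r = 0 -> "00"
        else:
            tail = format(r + short, "0%db" % b)  # r = 1..6 -> "010".."111"
        return "1" * q + "0" + tail

    for n in range(23):
        print(n, golomb_encode(n))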

Solution - 10:

Expanding on the Huffman algorithm, Faller and Gallager, and later Knuth and Vitter, developed a way to perform Huffman coding as a one-pass procedure.

Adaptive Huffman coding (also called dynamic Huffman coding) is an adaptive technique based on Huffman coding. It permits building the code as the symbols are being transmitted, with no initial knowledge of the source distribution, which allows one-pass encoding and adaptation to changing conditions in the data. The benefit of a one-pass procedure is that the source can be encoded in real time, though it becomes more sensitive to transmission errors, since a single loss can ruin the whole code.
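A naive way to see the one-pass idea in code: encoder and decoder both start from a flat model, derive the Huffman code from the counts seen so far, and update the counts after every symbol, so they stay in sync with no header. This sketch recomputes the tree from scratch at each step for clarity; the Faller-Gallager-Knuth and Vitter algorithms instead update the existing tree incrementally.

    import heapq

    def build_codes(counts):
        # Deterministic Huffman build (ties broken by symbol) so that the
        # encoder and decoder always derive the same code table.
        heap = [(f, sym, sym) for sym, f in sorted(counts.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, k1, t1 = heapq.heappop(heap)
            f2, k2, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, min(k1, k2), (t1, t2)))
        codes = {}
        def walk(t, p=""):
            if isinstance(t, str):
                codes[t] = p or "0"
            else:
                walk(t[0], p + "0")
                walk(t[1], p + "1")
        walk(heap[0][2])
        return codes

    def adaptive_encode(msg, alphabet):
        counts = {ch: 1 for ch in alphabet}      # flat initial model
        bits = []
        for ch in msg:
            bits.append(build_codes(counts)[ch]) # code from counts seen so far
            counts[ch] += 1                      # then update the model
        return "".join(bits)

    def adaptive_decode(bits, alphabet):
        counts = {ch: 1 for ch in alphabet}
        table = {v: k for k, v in build_codes(counts).items()}
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in table:                     # prefix-free: unique match
                ch = table[buf]
                out.append(ch)
                counts[ch] += 1
                table = {v: k for k, v in build_codes(counts).items()}
                buf = ""
        return "".join(out)

    msg = "AAABEDBBTGGG"
    assert adaptive_decode(adaptive_encode(msg, "ABDEGT"), "ABDEGT") == msg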
