Assignment - 01
Data Compression (CS-442N)
Submitted by: Ankur Gangwar (17CS12)
Solution - 01:
The lossy compression method eliminates some amount of data that is (ideally) not noticeable. This
technique does not allow a file to be restored to its original form, but it reduces the size significantly.
Lossy compression is useful when exact fidelity is not the priority: it slightly degrades the quality of
the file, but that is a convenient trade when one wants to send or store the data compactly. This type
of compression is used for natural, perceptual data such as audio signals and images.
JPEG stands for Joint Photographic Experts Group, the group that created the standard. JPEG files
use the .jpg or .jpeg extension, and it is the most common image format used by digital cameras and
on the World Wide Web. It is a lossy compression format for digital images: it reduces file size by
eliminating redundant information. The user decides how much loss to introduce, trading storage
size against quality. For example, the compression quality is typically expressed on a scale from 1 to
100; a lower value compresses the raster image more strongly but also reduces the quality. JPEG
2000 (JP2) is the newer version of JPEG. It slightly improves image compression performance over
JPEG by using two different wavelet transforms, and users can choose low to high levels of
compression.
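For illustration, the quality trade-off can be observed with the Pillow imaging library in Python (a minimal sketch, assuming Pillow is installed; the file names here are hypothetical):

    # Minimal sketch using Pillow (pip install Pillow); file names are hypothetical.
    from PIL import Image

    img = Image.open("photo.png").convert("RGB")   # JPEG stores RGB, no alpha channel
    img.save("photo_q90.jpg", "JPEG", quality=90)  # high quality, larger file
    img.save("photo_q20.jpg", "JPEG", quality=20)  # strong compression, visible artifacts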
The lossless compression method is capable of reconstructing the original form of the data exactly,
so the quality of the data is never compromised: this technique allows a file to be restored to its
original form. Lossless compression can be applied to any file format, and choosing an algorithm
suited to the data can improve the compression ratio.
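A minimal sketch with Python's built-in zlib module shows the defining property of lossless compression: decompression returns a byte-for-byte copy of the original.

    import zlib

    data = b"lossless compression restores the original exactly " * 100
    packed = zlib.compress(data, level=9)       # level 9 = maximum compression
    assert zlib.decompress(packed) == data      # byte-for-byte identical
    print(len(data), "->", len(packed))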
Solution - 02:
It depends on our choice. In Linux, several compression utilities are available by default; two
common ones are:
1. gzip {filename}: gzip compresses the given files using Lempel-Ziv coding (LZ77). Whenever
possible, each file is replaced by one with the extension .gz.
2. bzip2 {filename}: bzip2 compresses files using the Burrows-Wheeler block-sorting text
compression algorithm together with Huffman coding. Compression is generally considerably better
than that achieved by more conventional LZ77/LZ78-based compressors. Whenever possible,
each file is replaced by one with the extension .bz2.
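The same two algorithms are also exposed through Python's built-in gzip and bz2 modules, so their output sizes can be compared directly (an illustrative sketch, not a rigorous benchmark; any reasonably large text file will do):

    import bz2
    import gzip

    data = open("/etc/services", "rb").read()     # any reasonably large text file
    print("original:", len(data))
    print("gzip    :", len(gzip.compress(data)))  # LZ77-based (DEFLATE)
    print("bzip2   :", len(bz2.compress(data)))   # Burrows-Wheeler + Huffman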
Solution - 03:
As we can see, the file size increased with all the encodings. The reason is that compression always
adds some overhead: metadata needed for inflation (decompression) and, depending on the format,
lookup/dictionary structures that map pointers to the locations of duplicates. For a very small file
this overhead exceeds the savings. Usually, though, the amount of data removed by compression far
outweighs the space required by these overheads, so when the file is bigger the compressed file is
smaller than the original.
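A quick sketch with Python's gzip module demonstrates this: a tiny input grows because the fixed header and metadata outweigh the savings, while a large repetitive input shrinks dramatically.

    import gzip

    tiny = b"hi"
    big = b"abcdefgh" * 10_000
    print(len(tiny), "->", len(gzip.compress(tiny)))  # grows: overhead dominates
    print(len(big), "->", len(gzip.compress(big)))    # shrinks: savings dominate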
Solution - 04:
Size before compression = 40, size after compression = 20.

Compression ratio = size after compression / size before compression = 20/40 = 1:2

Compression factor = size before compression / size after compression = 40/20 = 2:1

Saving percentage = (size before compression − size after compression) / size before compression × 100% = (40 − 20)/40 × 100% = 50%
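The same arithmetic as a small Python sketch:

    size_before, size_after = 40, 20
    ratio = size_after / size_before                          # 0.5, i.e. 1:2
    factor = size_before / size_after                         # 2.0, i.e. 2:1
    saving = (size_before - size_after) / size_before * 100   # 50.0 %
    print(ratio, factor, saving)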
Solution - 05:
Run-length coding:
Run-length algorithms are very effective if the data source contains many runs of consecutive
symbols. The symbols can be characters in a text file, 0s and 1s in a binary file, or
black-and-white pixels in an image. Although simple, run-length algorithms have worked
well in practice. For example, the so-called HDC (Hardware Data Compression) algorithm
used by tape drives connected to IBM computer systems, and a similar algorithm used in the
IBM SNA (Systems Network Architecture) standard for data communications, are still in use
today.
Solution:
1. The first three Gs are read and encoded as r3G.
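A minimal run-length encoder in Python, using the rNX notation from the step above (an illustrative sketch of the idea, not the exact HDC algorithm):

    def rle_encode(text):
        # Emit "r<count><symbol>" for runs of length 2 or more,
        # and the symbol itself for runs of length 1.
        out, i = [], 0
        while i < len(text):
            j = i
            while j < len(text) and text[j] == text[i]:
                j += 1                      # extend the current run
            run = j - i
            out.append(f"r{run}{text[i]}" if run > 1 else text[i])
            i = j
        return "".join(out)

    print(rle_encode("GGGAABTTTT"))         # -> r3Gr2ABr4T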
Solution - 06:
Huffman code derivation for the string AAABEDBBTGGG:

Character:  A  B  E  D  T  G
Frequency:  3  3  1  1  1  3

Repeatedly merging the two lowest-frequency nodes:

A  B  G  E  D  T
3  3  3  1  1  1
3  3  3  2  1       (D and T merged)
3  3  3  3          ((DT) and E merged)
6  3  3             (A and B merged)
6  6                (G and ((DT)E) merged)
12                  (root)

The resulting tree (0 = left branch, 1 = right branch):

              ((G((DT)E))(AB))
             /0              \1
      (G((DT)E))            (AB)
      /0      \1           /0  \1
     G     ((DT)E)        A     B
           /0    \1
         (DT)     E
         /0 \1
        D    T

Reading off the branch labels from the root gives the Huffman codes:
G  00
D  0100
T  0101
E  011
A  10
B  11
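The derivation can be cross-checked with a short heapq-based sketch. Ties between equal frequencies may be broken differently than in the hand-built tree, so individual bit patterns can differ, but the code lengths remain optimal.

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)     # two lowest-frequency subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in c1.items()}        # 0 = left branch
            merged.update({s: "1" + c for s, c in c2.items()})  # 1 = right branch
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    print(huffman_codes("AAABEDBBTGGG"))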
Solution - 07:
The probabilities of the characters are arranged in descending order, and using minimum-variance
Huffman coding we obtain the following codes:
A  01
B  10
C  11
D  000
E  0010
F  00110
G  001110
H  001111
Solution - 08:
Original alphabet:

Letter  Probability
a1      0.7
a2      0.2
a3      0.1

Extending the alphabet by grouping a1 with the letter that follows it gives five symbols. Assuming
independent letters, each grouped symbol's probability is the product of the letter probabilities
(e.g., P(a1a3) = 0.7 × 0.1 = 0.07), and the five probabilities sum to 1. Applying Huffman coding to
this extended alphabet:

Letter  Probability  Codeword
a1a1    0.49         0
a2      0.2          10
a1a2    0.14         110
a3      0.1          1110
a1a3    0.07         1111
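The grouped-symbol probabilities can be checked with a small sketch:

    p = {"a1": 0.7, "a2": 0.2, "a3": 0.1}
    ext = {
        "a1a1": p["a1"] * p["a1"],   # 0.49
        "a1a2": p["a1"] * p["a2"],   # 0.14
        "a1a3": p["a1"] * p["a3"],   # 0.07
        "a2":   p["a2"],             # 0.20
        "a3":   p["a3"],             # 0.10
    }
    print(ext, sum(ext.values()))    # the probabilities sum to 1.0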
Solution - 09:
For m = 7 and n = 0, 1, 2, ..., 22. Here q = ⌊n/7⌋ is coded in unary (q 1s followed by a 0), and the
remainder r = n mod 7 is coded in truncated binary: since ⌈log2 7⌉ = 3 and 2^3 − 7 = 1, the
remainder r = 0 is coded in 2 bits and r = 1, ..., 6 are coded as r + 1 in 3 bits.

n   q  r  codeword
0   0  0  000
1   0  1  0010
2   0  2  0011
3   0  3  0100
4   0  4  0101
5   0  5  0110
6   0  6  0111
7   1  0  1000
8   1  1  10010
9   1  2  10011
10  1  3  10100
11  1  4  10101
12  1  5  10110
13  1  6  10111
14  2  0  11000
15  2  1  110010
16  2  2  110011
17  2  3  110100
18  2  4  110101
19  2  5  110110
20  2  6  110111
21  3  0  111000
22  3  1  1110010
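A short sketch of this Golomb encoder (unary quotient followed by a truncated-binary remainder), which reproduces the table above for m = 7:

    def golomb(n, m=7):
        q, r = divmod(n, m)               # quotient and remainder
        b = m.bit_length()                # = ceil(log2 m) when m is not a power of 2
        u = (1 << b) - m                  # remainders 0..u-1 get the short form
        if r < u:
            tail = format(r, f"0{b - 1}b")    # short remainder: b-1 bits
        else:
            tail = format(r + u, f"0{b}b")    # long remainder: b bits
        return "1" * q + "0" + tail       # unary quotient, then remainder

    for n in range(23):
        print(n, golomb(n))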
Solution - 10:
Expanding on the Huffman algorithm, Faller and Gallager, and later Knuth and Vitter, developed
ways to perform Huffman coding as a one-pass procedure.
Adaptive Huffman coding (also called dynamic Huffman coding) is an adaptive technique based on
Huffman coding. It permits building the code as the symbols are being transmitted, with no initial
knowledge of the source distribution, which allows one-pass encoding and adaptation to changing
conditions in the data. The benefit of a one-pass procedure is that the source can be encoded in
real time, though it becomes more sensitive to transmission errors, since a single loss can ruin the
whole code.
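A naive adaptive sketch (not the FGK/Vitter tree-update algorithm itself): encoder and decoder both start from the same counts and update them after every symbol, so the code adapts to the data without a first pass. All names here are illustrative, and a real implementation would update the tree incrementally instead of rebuilding it each time.

    import heapq

    def codes_from_counts(counts):
        # Rebuild a Huffman code from the current counts (deterministic tie-breaking).
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(counts.items()))]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in c1.items()}
            merged.update({s: "1" + c for s, c in c2.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    def adaptive_encode(text, alphabet):
        counts = {s: 1 for s in alphabet}       # encoder and decoder share these
        bits = []
        for s in text:
            bits.append(codes_from_counts(counts)[s])  # code reflects history so far
            counts[s] += 1                      # decoder performs the same update
        return "".join(bits)

    print(adaptive_encode("AAABEDBBTGGG", "ABDEGT"))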