
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

“JNANA SANGAMA”, BELAGAVI - 590 018

A MINI PROJECT REPORT


on

“Text File Data Compression using Huffman Encoding”

Submitted by

Suhan Acharya 4SF18IS103


Swasthik Shetty 4SF18IS109

BACHELOR OF ENGINEERING

in

INFORMATION SCIENCE & ENGINEERING

Under the Guidance of

Ms. Jayapadmini Kanchan,


Assistant Professor,
Department of ISE,
at

SAHYADRI
College of Engineering and Management
Adyar, Mangaluru - 575 007
2020 - 21
Department of Information Science & Engineering

CERTIFICATE

This is to certify that the mini project entitled “Text File Data Compression
using Huffman Encoding” has been carried out by Suhan Acharya (4SF18IS103)
and Swasthik Shetty (4SF18IS109), bona fide students of Sahyadri College
of Engineering and Management, for the Bachelor of Engineering in Information
Science & Engineering of Visvesvaraya Technological University, Belagavi, during the
year 2020-21. It is certified that all corrections / suggestions indicated for internal
assessment have been incorporated in the report deposited in the departmental library.
The mini project report has been approved as it satisfies the academic requirements
in respect of mini project work prescribed in File Structures Laboratory with Mini
Project (18ISL67) for the said degree in sixth semester.

Signature of the Guide 1: Ms. Jayapadmini Kanchan
Signature of the Guide 2: Mrs. Masooda
Signature of the HOD: Dr. Shamanth Rai

External Viva:
Examiner’s Name and Signature with Date

1.

2.
DECLARATION

We hereby declare that the entire work embodied in this Mini Project Report titled
“Text File Data Compression using Huffman Encoding” has been carried
out by us at Sahyadri College of Engineering and Management, Mangaluru under
the supervision of Ms. Jayapadmini Kanchan, for Bachelor of Engineering in
Information Science & Engineering. This report has not been submitted to this
or any other University for the award of any other degree.

Suhan Acharya (4SF18IS103)

Swasthik Shetty (4SF18IS109)

Dept. of ISE, SCEM, Mangaluru


Abstract

Everyone depends on the internet for day-to-day activities these days, so the time
required to transfer data through the network should be as short as possible. Trivial
tasks should take less time and important tasks should be given higher priority;
this creates a balance in the network traffic. Data fetched from a back-end server
can be large. Such data should be compressed to reduce the time required to fetch
it from the server, as well as the time required to format it for receipt and display
on the client’s computer system. Huffman coding is a lossless data compression
algorithm. The idea is to assign variable-length codes to input characters, with the
length of each code based on the frequency of the corresponding character. The
most frequent character gets the shortest code and the least frequent character gets
the longest code. The variable-length codes assigned to input characters are prefix
codes; that is, the codes (bit sequences) are assigned in such a way that the code
assigned to one character is never a prefix of the code assigned to any other character.
In a real-world scenario, files fetched from a server are often delivered to the user
compressed with gzip. Gzip is based on the DEFLATE algorithm, which is a
combination of LZ77 and Huffman coding.

Acknowledgement

It is with great satisfaction and euphoria that we are submitting the Mini Project
Report on “Text File Data Compression using Huffman Encoding”. We have
completed it as a part of the curriculum of Visvesvaraya Technological University,
Belagavi for the award of Bachelor of Engineering in Information Science & Engineering.

We are profoundly indebted to our guide, Ms. Jayapadmini Kanchan, Assistant
Professor, Department of Information Science & Engineering, for innumerable acts of
timely advice and encouragement, and we sincerely express our gratitude.

We express our sincere gratitude to Dr. Shamanth Rai, Head and Associate
Professor, Department of Information Science & Engineering, for his invaluable support
and guidance.

We sincerely thank Dr. Rajesha S, Principal, Sahyadri College of Engineering and
Management, and Dr. D. L. Prabhakara, Director, Sahyadri Educational
Institutions, who have always been a great source of inspiration.

Finally, yet importantly, we express our heartfelt thanks to our family and friends for
their wishes and encouragement throughout the work.

Suhan Acharya (4SF18IS103)


Swasthik Shetty (4SF18IS109)

Table of Contents

Abstract

Acknowledgement

Table of Contents

List of Figures

1 Introduction
1.1 Purpose
1.2 Scope
1.3 Overview

2 Requirements Specification
2.1 Hardware Specification
2.2 Software Specification

3 System Design
3.1 Architecture Diagram

4 Implementation

5 Results and Discussion

6 Conclusion
List of Figures

3.1 Architecture Diagram of Huffman Coding
3.2 Bit structure of Compressed File

4.1 Pseudo code of Compression Code
4.2 Pseudo code of Decompression Code
4.3 Pseudo code of Making the heap for Huffman Tree
4.4 Pseudo code of Heap Node class for node of a Huffman Tree

5.1 Compression in process
5.2 Compression Complete
5.3 Compression Summary (a)
5.4 Compression Summary (b)
Chapter 1

Introduction

Huffman coding is a lossless data compression algorithm. The idea is to assign
variable-length codes to input characters, with the length of each code based on
the frequency of the corresponding character. The most frequent character gets the
shortest code and the least frequent character gets the longest code. The
variable-length codes assigned to input characters are prefix codes; that is, the codes
(bit sequences) are assigned in such a way that the code assigned to one character is
never a prefix of the code assigned to any other character.
Huffman coding requires two steps: building a Huffman tree from the input characters,
and traversing the tree to assign codes to the characters. We use the object-oriented
features of Python to generate the tree and produce the Huffman codes used to
encode the plain text.
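
The prefix property described above can be checked mechanically. The following is a minimal sketch, not code from the project: the function name is_prefix_free and both code tables are hypothetical, with 'a' assumed to be the most frequent symbol.

```python
from itertools import permutations


def is_prefix_free(codes):
    """Return True if no code word is a prefix of a different code word."""
    words = list(codes.values())
    # permutations yields every ordered pair of distinct table entries.
    return not any(a.startswith(b) for a, b in permutations(words, 2))


# Hypothetical table: the most frequent symbol 'a' gets the shortest code.
example_codes = {"a": "0", "b": "10", "c": "110", "d": "111"}
# Counter-example: "0" is a prefix of "01", so decoding would be ambiguous.
ambiguous_codes = {"a": "0", "b": "01"}

print(is_prefix_free(example_codes))
print(is_prefix_free(ambiguous_codes))
```

Because no code word begins another, a decoder can read the bit stream left to right and emit a character as soon as the bits read so far match a code, without needing separators.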

1.1 Purpose
The main purpose of this project is to compress text files into binary files in order
to reduce the file size. Data fetched from a back-end server can be large. Such data
should be compressed to reduce the time required to fetch it from the server, as well
as the time required to format it for receipt and display on the client’s computer
system.

1.2 Scope
In a real-world scenario, files fetched from a server are often delivered to the user
compressed with gzip. Gzip is based on the DEFLATE algorithm, which is a
combination of LZ77 and Huffman coding. This application reports information about
the file, such as its original size and its size after compression. It displays the
compression ratio along with the real-time progress of the compression process and
the estimated time for the completion of the process.

1.3 Overview
Everyone depends on the internet for day-to-day activities these days, so the time
required to transfer data through the network should be as short as possible.
Trivial tasks should take less time and important tasks should be given higher
priority; this creates a balance in the network traffic. This application uses
Huffman encoding to compress a text file into a binary file. With the help of tree
and heap data structures, we create nodes that are compared with one another and
used to generate the binary output file. The text file is thereby compressed into a
smaller binary file.



Chapter 2

Requirements Specification

2.1 Hardware Specification


• Processor : Intel(R) Core(TM) i3-1005G1 CPU @ 1.20GHz

• RAM : 4GB

• Hard Disk : 500GB

• Input Device : Standard keyboard and Mouse

• Output Device : Monitor

2.2 Software Specification


• Programming Language : Python

• IDE : Python IDE

Chapter 3

System Design

3.1 Architecture Diagram


If the user chooses Decompression, the system executes the decompression algorithm
by extracting the required data from the compressed file.

Figure 3.1: Architecture Diagram of Huffman Coding

The bit structure of the encoded file is given below. The size k of the Huffman code
mappings is stored in the first 16 bits of the compressed file, which gives the number
of bits to be read into the reverse mapping. The next k bits represent the Huffman
code mappings. After the Huffman code mappings, the next 8 bits store the padding
information of the encoded file, so that that many bits can be ignored later during
decompression. The rest of the data up to EOF is the encoded text, which is used to
reconstruct the original data.

Figure 3.2: Bit structure of Compressed File
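
The layout described above (16-bit size k, k bits of mappings, 8 bits of padding information, then the encoded text) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, the mappings are treated as an opaque bit string because their serialization is not specified here, and the padding bits are assumed to be zeros appended at the end so that the whole stream fills complete bytes.

```python
def build_bitstream(mapping_bits, encoded_bits):
    """Assemble header and payload as a string of '0'/'1' characters."""
    k = len(mapping_bits)
    if k >= 1 << 16:
        raise ValueError("mapping does not fit in a 16-bit length field")
    used = 16 + k + 8 + len(encoded_bits)
    padding = (8 - used % 8) % 8  # assumption: pad the end to a byte boundary
    return (format(k, "016b") + mapping_bits + format(padding, "08b")
            + encoded_bits + "0" * padding)


def parse_bitstream(bits):
    """Split a bit string produced by build_bitstream back into its parts."""
    k = int(bits[:16], 2)                    # first 16 bits: size of mappings
    mapping_bits = bits[16:16 + k]           # next k bits: the mappings
    padding = int(bits[16 + k:24 + k], 2)    # next 8 bits: padding info
    payload = bits[24 + k:]
    return mapping_bits, payload[:len(payload) - padding]


def bits_to_bytes(bits):
    """Pack a bit string whose length is a multiple of 8 into bytes."""
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def bytes_to_bits(data):
    """Expand bytes into a bit string, 8 bits per byte."""
    return "".join(format(byte, "08b") for byte in data)


stream = build_bitstream("1011001", "1101")
restored = parse_bitstream(bytes_to_bits(bits_to_bytes(stream)))
print(restored == ("1011001", "1101"))
```

Storing the padding count explicitly is what lets the reader tell real trailing code bits apart from filler, since a file can only hold whole bytes.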



Chapter 4

Implementation

The compression function takes the input file path and reads the entire file. It
then counts the frequency of occurrence of each individual character symbol. After
counting, it generates the Huffman tree using a heap data structure.

Figure 4.1: Pseudo code of Compression Code

After generating the Huffman tree, we obtain the Huffman codes by traversing the
tree and storing them in a dictionary. The code then passes through the entire text,
replacing each occurrence of a symbol by its Huffman code. To make decompression
possible, we store the Huffman code dictionary along with the compressed file.
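
The traversal and substitution steps can be sketched as below. This is a hedged illustration rather than the project's pseudo code: the names assign_codes and encode are hypothetical, the tree is represented as nested tuples instead of node objects, and example_tree is built by hand to be consistent with the character frequencies of "abracadabra".

```python
def assign_codes(tree, prefix="", table=None):
    """Walk the tree, appending '0' for a left branch and '1' for a right one."""
    if table is None:
        table = {}
    if isinstance(tree, str):
        # Leaf: record the path taken; a lone symbol still needs one bit.
        table[tree] = prefix or "0"
    else:
        left, right = tree
        assign_codes(left, prefix + "0", table)
        assign_codes(right, prefix + "1", table)
    return table


def encode(text, table):
    """Replace every character of the text by its code from the dictionary."""
    return "".join(table[ch] for ch in text)


# Hand-built tree (assumption): 'a' occurs most often, so it sits nearest the root.
example_tree = ("a", (("b", "r"), ("c", "d")))
table = assign_codes(example_tree)
bits = encode("abracadabra", table)
print(table)
print(bits, len(bits))
```

For this input the encoded text takes 23 bits, compared with 88 bits at one byte per character, before counting the stored dictionary.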
The decompression function performs the reverse operation of compression. The file
is read byte by byte and converted to a bit string, which is then converted back to
the original string by looking up the Huffman code reverse mapping that was included
in the beginning bytes of the compressed file.

Figure 4.2: Pseudo code of Decompression Code
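
A minimal sketch of the decoding loop follows, assuming the reverse mapping has already been read from the file header and the padding bits have been stripped. The function name decode and the example mapping are hypothetical.

```python
def decode(bits, reverse_mapping):
    """Rebuild the text by matching accumulated bits against known codes."""
    out = []
    current = ""
    for bit in bits:
        current += bit
        # Prefix codes guarantee the first match is the only possible one.
        if current in reverse_mapping:
            out.append(reverse_mapping[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


# Hypothetical reverse mapping (code -> character).
reverse_mapping = {"0": "a", "100": "b", "101": "r", "110": "c", "111": "d"}
text = decode("01001010110011101001010", reverse_mapping)
print(text)
```

The dictionary lookup works only because the codes are prefix codes; with an ambiguous table the loop could emit a character too early.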


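The two lowest-frequency nodes are removed from the heap at a time, merged into a
single node, and the merged node is pushed back onto the heap (Python's heapq
module), which compares it with the existing heap data to keep the heap ordered.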

Figure 4.3: Pseudo code of Making the heap for Huffman Tree
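
The merge loop can be sketched with Python's heapq module as follows. This is an illustration under assumptions, not the pseudo code from the figure: the function name build_tree is hypothetical, heap entries are plain tuples rather than node objects, and a running counter is added as a tie-breaker so that entries with equal frequency never fall back to comparing subtrees.

```python
import heapq
from collections import Counter
from itertools import count


def build_tree(text):
    """Return (total_frequency, tree); leaves are characters, branches are pairs."""
    tie = count()
    heap = [(freq, next(tie), symbol) for symbol, freq in Counter(text).items()]
    if not heap:
        return 0, None
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two least frequent entries and merge them into one branch.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    total, _, tree = heap[0]
    return total, tree


total, tree = build_tree("abracadabra")
print(total)
print(tree)
```

The frequency stored at the root equals the length of the input, which is a quick sanity check that every character was counted exactly once.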

Each node of the heap used to generate the Huffman tree stores its two children, its
frequency, and the character that the node represents.

Figure 4.4: Pseudo code of Heap Node class for node of a Huffman Tree
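
A node class of this kind might look like the sketch below. The class name HeapNode follows the figure caption, but the attribute names, the is_leaf helper, and the choice to order nodes by frequency through __lt__ are assumptions made for illustration.

```python
import heapq


class HeapNode:
    """One node of a Huffman tree; internal nodes carry no character."""

    def __init__(self, char, freq, left=None, right=None):
        self.char = char      # None for merged (internal) nodes
        self.freq = freq
        self.left = left
        self.right = right

    def __lt__(self, other):
        # heapq orders items with '<', so compare nodes by frequency.
        return self.freq < other.freq

    def is_leaf(self):
        return self.left is None and self.right is None


heap = [HeapNode("a", 5), HeapNode("b", 2), HeapNode("c", 1)]
heapq.heapify(heap)
first = heapq.heappop(heap)
second = heapq.heappop(heap)
merged = HeapNode(None, first.freq + second.freq, first, second)
heapq.heappush(heap, merged)
print(first.char, second.char, merged.freq)
```

Defining the comparison on the node itself lets node objects be pushed onto the heap directly, without wrapping them in tuples.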



Chapter 5

Results and Discussion

Initially, a menu-based user interface is displayed, asking the user to select an
option. Once the user selects the compression option, the application asks the user
to enter the file path. Once the file path is pasted, the compression process begins,
and the user can track the progress in real time. The user can also see the estimated
time remaining to complete the process.

Figure 5.1: Compression in process


Once the compression process is complete, the progress bar shows the process as
complete. The binary output file is then created and the binary data is appended
to the file.

Figure 5.2: Compression Complete


After the compression process is complete, a summary of the compression is displayed
in table form. The summary contains the size of the original file in bytes, the size
in bytes of the binary file created by compression, and the compression factor.

Figure 5.3: Compression Summary (a)
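
The figures in such a summary can be derived from the two file sizes alone. The sketch below is illustrative: the function name compression_summary is hypothetical, the sizes are example values rather than results from this project, and the compression factor is assumed to mean original size divided by compressed size.

```python
def compression_summary(original_bytes, compressed_bytes):
    """Return the summary values for one compression run."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("file sizes must be positive")
    saved = original_bytes - compressed_bytes
    return {
        "original_bytes": original_bytes,
        "compressed_bytes": compressed_bytes,
        # Assumption: factor = original size / compressed size.
        "compression_factor": original_bytes / compressed_bytes,
        "space_saving_percent": saved * 100 / original_bytes,
    }


# Example sizes only; real values could come from os.path.getsize(path).
summary = compression_summary(1000, 600)
for label, value in summary.items():
    print(f"{label:<22}{value}")
```

A factor above 1 means the output is smaller than the input; very small files can come out below 1 because the stored code mappings add overhead.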


At the end, the user can again select any option. The user can decompress the binary
file back into the text file. If they select the decompress option, the user has to
provide the path of the binary file. The decompression process then starts and updates
the progress bar with the remaining time. After the process is complete, the text
file is recreated without any data loss.

Figure 5.4: Compression Summary (b)



Chapter 6

Conclusion

Data compression is used throughout File Structures and plays a very important role
in networking applications where bandwidth is limited. Without compression, it
becomes impractical to send large amounts of data without compromising speed; with
compression, it becomes possible to send a smaller representation of the data that
needs to be received. The receiver is then able to decode the data and recover the
original exactly, which makes the compression lossless. Future work could combine
multiple encoding algorithms and analyze how different algorithms fare against
different types of data files.
