BACHELOR OF ENGINEERING
in
SAHYADRI
College of Engineering and Management
Adyar, Mangaluru - 575 007
2020 - 21
CERTIFICATE
This is to certify that the mini project entitled “Text File Data Compression
using Huffman Encoding” has been carried out by Suhan Acharya (4SF18IS103)
and Swasthik Shetty (4SF18IS109), the bonafide students of Sahyadri College
of Engineering and Management, Bachelor of Engineering in Information Science
& Engineering of Visvesvaraya Technological University, Belagavi during the year
2020-21. It is certified that all corrections / suggestions indicated for internal
assessment have been incorporated in the report deposited in the departmental library.
The mini project report has been approved as it satisfies the academic requirements
in respect of mini project work prescribed in File Structures Laboratory with Mini
Project (18ISL67) for the said degree in sixth semester.
External Viva:
Examiner’s Name Signature with Date
1. . . . . . . . . . . . . . . . . . . . . . .....................
2. . . . . . . . . . . . . . . . . . . . . . .....................
DECLARATION
We hereby declare that the entire work embodied in this Mini Project Report titled
“Text File Data Compression using Huffman Encoding” has been carried
out by us at Sahyadri College of Engineering and Management, Mangaluru under
the supervision of Ms. Jayapadmini Kanchan, for Bachelor of Engineering in
Information Science & Engineering. This report has not been submitted to this
or any other University for the award of any other degree.
Abstract
Everyone depends on the internet for day-to-day activities these days, so the time
required to transfer data over the network should be as small as possible. Trivial
tasks should take less time and important tasks should be given higher priority;
this helps balance network traffic. Data fetched from a back-end server can be
large. Such data should be compressed to reduce the time required to fetch it from
the server, as well as the time required to format it for the client and display it
on the client’s computer system. Huffman coding is a lossless data compression
algorithm. The idea is to assign variable-length codes to input characters, where
the length of each code depends on the frequency of the corresponding character. The
most frequent character gets the shortest code and the least frequent character gets
the longest code. The variable-length codes assigned to input characters are prefix
codes: the codes (bit sequences) are assigned in such a way that the code assigned
to one character is never a prefix of the code assigned to any other character. In
real-world scenarios, files fetched from a server are often compressed using gzip
before they reach the user. Gzip is based on the DEFLATE algorithm, which is a
combination of LZ77 and Huffman coding.
Acknowledgement
It is with great satisfaction and euphoria that we are submitting the Mini Project
Report on “Text File Data Compression using Huffman Encoding”. We have
completed it as a part of the curriculum of Visvesvaraya Technological University,
Belagavi for the award of Bachelor of Engineering in Information Science & Engineering.
We express our sincere gratitude to Dr. Shamanth Rai, Head and Associate Professor,
Department of Information Science & Engineering for his invaluable support
and guidance.
Finally, yet importantly, we express our heartfelt thanks to our family and friends for
their wishes and encouragement throughout the work.
Table of Contents
Abstract
Acknowledgement
List of Figures
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Overview
2 Requirements Specification
2.1 Hardware Specification
2.2 Software Specification
3 System Design
3.1 Architecture Diagram
4 Implementation
6 Conclusion
List of Figures
Chapter 1
Introduction
1.1 Purpose
The main purpose of this project is to compress text files into binary files in order
to reduce file size. Data fetched from a backend server can be large. Such data
should be compressed to reduce the time required to fetch it from the server, as
well as the time required to format it for the client and display it on the client’s
computer system.
1.2 Scope
In real-world scenarios, files fetched from a server are often compressed using
gzip before they reach the user. Gzip is based on the DEFLATE algorithm, which is
a combination of LZ77 and Huffman coding.
1.3 Overview
Everyone depends on the internet for day-to-day activities these days, so the time
required to transfer data over the network should be as small as possible. Trivial
tasks should take less time and important tasks should be given higher priority;
this helps balance network traffic. This application uses Huffman encoding to
compress a text file into a binary file. Using tree and heap data structures, it
creates nodes that are compared with one another to generate the binary output
file, which is smaller than the original text file.
Chapter 2
Requirements Specification
• RAM : 4GB
Chapter 3
System Design
the compressed file, which gives the number of bits to be read into reverse mapping.
The next k bits represent the Huffman code mappings. After the Huffman code
mappings, the next 8 bits store the padding information for the encoded file, so
that the same number of padding bits can be ignored during decompression. The rest
of the data up to EOF is the encoded text, which is used to reconstruct the original data.
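The layout described above can be sketched as a pair of pack/unpack helpers. This is only an illustration under stated assumptions: the width of the leading length field is not given in the text, so a 4-byte big-endian integer is assumed here, and the mapping is assumed to be serialized as JSON. The names pack_container and unpack_container are hypothetical.

```python
import json
import struct


def pack_container(codes, padding, payload):
    """Lay out: [mapping length in bits][mapping][8-bit padding count][payload]."""
    mapping = json.dumps(codes).encode("utf-8")  # assumption: JSON serialization
    # Assumption: a 4-byte big-endian field holds k, the mapping length in bits.
    header = struct.pack(">I", len(mapping) * 8)
    return header + mapping + bytes([padding]) + payload


def unpack_container(blob):
    """Split a container produced by pack_container back into its parts."""
    (mapping_bits,) = struct.unpack(">I", blob[:4])
    mapping_end = 4 + mapping_bits // 8
    codes = json.loads(blob[4:mapping_end].decode("utf-8"))
    padding = blob[mapping_end]          # the 8 bits of padding information
    payload = blob[mapping_end + 1:]     # everything up to EOF is encoded text
    return codes, padding, payload


codes = {"a": "0", "b": "10", "c": "11"}
blob = pack_container(codes, 2, bytes([0b01001100]))
restored = unpack_container(blob)
```

Storing the length first lets the reader know exactly where the mapping ends before it has to interpret any of the encoded text.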
Chapter 4
Implementation
The compression function takes the input file path and reads the entire file.
It then counts the frequency of occurrence of each character. After counting, it
builds the Huffman tree using a heap data structure. Once the tree is built, the
Huffman codes are obtained by traversing the tree and are stored in a dictionary.
The code then walks through the entire text, replacing each occurrence of a symbol
with its Huffman code. So that the file can be decompressed later, the Huffman
code dictionary is stored along with the compressed data.
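The tree traversal and symbol replacement steps might look roughly like the sketch below. It assumes a hand-built tree represented as nested tuples (a leaf is a one-character string, an internal node is a (left, right) pair); the project's own node type is not shown in the text, and the names build_codes and encode_text are hypothetical.

```python
def build_codes(tree, prefix="", table=None):
    """Walk the tree, appending '0' for a left branch and '1' for a right branch."""
    if table is None:
        table = {}
    if isinstance(tree, str):
        # Leaf: record its code. A one-symbol input still needs a 1-bit code.
        table[tree] = prefix or "0"
    else:
        left, right = tree
        build_codes(left, prefix + "0", table)
        build_codes(right, prefix + "1", table)
    return table


def encode_text(text, codes):
    """Replace each symbol with its code, pad to a whole byte, return (bytes, padding)."""
    bits = "".join(codes[ch] for ch in text)
    padding = (8 - len(bits) % 8) % 8
    bits += "0" * padding
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data, padding


tree = ("a", ("b", "c"))          # hypothetical tree: 'a' is the most frequent symbol
codes = build_codes(tree)         # {'a': '0', 'b': '10', 'c': '11'}
data, padding = encode_text("abac", codes)
```

Here "abac" becomes the six bits 010011, so two padding bits are appended to fill one byte; the padding count has to be stored so the decoder can discard those bits.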
The decompression function reverses the compression operation. The file is read
byte by byte and converted to a bit string, which is then decoded back to the
original string using the Huffman code reverse mapping stored in the leading bytes
of the compressed file.
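A minimal sketch of this decoding step is given below, assuming the code dictionary and padding count have already been read from the file header. The function name decode_bytes and the sample code table are hypothetical.

```python
def decode_bytes(data, padding, codes):
    """Turn bytes into a bit string, drop the padding, and decode via reverse mapping."""
    bits = "".join(f"{byte:08b}" for byte in data)
    if padding:
        bits = bits[:-padding]
    # Reverse mapping: code -> symbol. Prefix-free codes make the match unambiguous.
    reverse = {code: symbol for symbol, code in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            decoded.append(reverse[current])
            current = ""
    return "".join(decoded)


codes = {"a": "0", "b": "10", "c": "11"}
text = decode_bytes(bytes([0b01001100]), 2, codes)
```

Because no code is a prefix of another, the decoder can emit a symbol as soon as the accumulated bits match an entry in the reverse mapping.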
The two lowest-frequency nodes are popped from the heap, merged into a new parent
node, and the parent is pushed back onto the heap (heapq), which positions it by
comparing it with the existing heap entries.
Figure 4.3: Pseudo code of Making the heap for Huffman Tree
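Since the original figure is not reproduced here, the merge loop can be approximated as follows. To keep the sketch self-contained it stores plain tuples on the heap instead of a node class, and uses a running counter as a tie-breaker so that entries with equal frequency never compare their subtrees. The name build_tree is hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def build_tree(text):
    """Return a nested-tuple Huffman tree: leaves are symbols, nodes are (left, right)."""
    tiebreak = count()
    # Each heap entry is (frequency, insertion order, subtree).
    heap = [(freq, next(tiebreak), symbol) for symbol, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq1, _, left = heapq.heappop(heap)    # lowest frequency
        freq2, _, right = heapq.heappop(heap)   # second lowest
        heapq.heappush(heap, (freq1 + freq2, next(tiebreak), (left, right)))
    return heap[0][2] if heap else None


tree = build_tree("aaaabbc")
```

The loop ends when a single entry remains; its subtree is the root of the Huffman tree.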
Each heap node used to build the Huffman tree holds references to its two children,
together with the frequency and the character that the node represents.
Figure 4.4: Pseudo code of Heap Node class for node of a Huffman Tree
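One plausible shape for such a node class is sketched below; the original figure is not available, so the attribute names and the is_leaf helper are assumptions. Defining __lt__ on frequency is what allows heapq to order the nodes.

```python
import heapq


class HeapNode:
    """A Huffman tree node: character, frequency, and two children."""

    def __init__(self, char, freq, left=None, right=None):
        self.char = char      # None for internal (merged) nodes
        self.freq = freq
        self.left = left
        self.right = right

    def __lt__(self, other):
        # heapq only needs '<' to keep the lowest-frequency node on top.
        return self.freq < other.freq

    def is_leaf(self):
        return self.left is None and self.right is None


heap = [HeapNode("a", 4), HeapNode("b", 2), HeapNode("c", 1)]
heapq.heapify(heap)
first = heapq.heappop(heap)
second = heapq.heappop(heap)
merged = HeapNode(None, first.freq + second.freq, first, second)
heapq.heappush(heap, merged)
```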
Initially, a menu-based user interface is displayed, asking the user to select an
option. When the user selects the compression option, the application asks for the
file path. Once the file path is entered, the compression process begins and the
user can track its progress in real time. The user can also see the estimated time
remaining to complete the process.
Once the compression process is complete, the progress bar shows it as complete.
The binary output file is then created and the binary data is appended to it.
After the compression process is complete, a summary of the compression is
displayed in table form. The summary contains the size of the original file in
bytes, the size in bytes of the binary file created by compression, and the
compression factor.
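The text does not define the compression factor; a common convention, assumed in the sketch below, is the original size divided by the compressed size. The function name compression_summary and the sample sizes are hypothetical.

```python
def compression_summary(original_bytes, compressed_bytes):
    """Return the summary values, assuming factor = original size / compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return {
        "original_bytes": original_bytes,
        "compressed_bytes": compressed_bytes,
        "factor": original_bytes / compressed_bytes,
    }


summary = compression_summary(1000, 400)
for name, value in summary.items():
    print(f"{name:<18}{value}")
```

Under this convention a factor above 1 means the binary file is smaller than the original text file.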
At the end, the user is again asked to select an option. The user can decompress
the binary file back into a text file. If the decompress option is selected, the
user has to enter the path of the binary file. The decompression process then
starts and updates the progress bar with the remaining time. After the process is
complete, the text file is recreated without any data loss.
Chapter 6
Conclusion
Data compression is used throughout File Structures and plays a very important role
in networking applications. Where bandwidth is limited, it becomes impractical to
send large amounts of data without compromising speed; with compression, it becomes
possible to send a smaller representation of the data instead. The receiver can
then decode that representation to recover the original data, which is what makes
the compression lossless. Future work could combine multiple encoding algorithms
and analyze how different algorithms fare against different types of data files.