
File compression

Using Huffman coding

KABILAN D - 19IT043
GOPINATH K S - 19IT030
Problem definition
Data Transfer Speed: Even a system that can handle large
amounts of data transfer slows down when many users
connect to it at once.
Storage Space: File compression reduces storage requirements
by removing data that provides no additional
information, such as white space on a page.
Solution methodology
MESSAGE - BCCABBDDAECCBBAEDDCC

LENGTH OF MESSAGE - 20

EACH CHARACTER IS STORED AS 8 BITS

SO THE SIZE OF THE MESSAGE IS 20 * 8 = 160 BITS

"FROM THE ABOVE WE CAN SAY THAT THE SIZE OF THE DATA IS BULKY"
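The arithmetic above can be checked with a short sketch. It assumes, as the slide does, that every character occupies one 8-bit byte; the name `BITS_PER_CHAR` is only illustrative.

```python
from collections import Counter

message = "BCCABBDDAECCBBAEDDCC"
BITS_PER_CHAR = 8  # assumption taken from the slide: one byte per character

# Size of the message with a fixed-length encoding
fixed_size_bits = len(message) * BITS_PER_CHAR

# How often each character occurs; these counts drive the Huffman codes
frequencies = Counter(message)

print(len(message))                 # 20
print(fixed_size_bits)              # 160
print(sorted(frequencies.items()))  # [('A', 3), ('B', 5), ('C', 6), ('D', 4), ('E', 2)]
```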


SOLUTION METHODOLOGY
WHAT HUFFMAN CODING DOES:
Huffman coding is a greedy algorithm that assigns variable-length codes to
input characters. The length of each code depends on the frequency of the
corresponding character: the most frequent character gets the shortest code
and the least frequent character gets the longest code.
SOLUTION METHODOLOGY
THE CHARACTERS ARE KEPT IN ASCENDING ORDER OF FREQUENCY (IN A MIN HEAP), AND A
GREEDY TECHNIQUE IS USED TO

● Extract the two nodes with the minimum frequency from
the min heap.
● Create a new internal node with a frequency equal to
the sum of the two nodes' frequencies. Make the first
extracted node its left child and the other extracted
node its right child. Add this node to the min heap.
● Repeat the two steps above until the heap contains only
one node. The remaining node is the root node and the
tree is complete.
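The steps above can be sketched with Python's `heapq` module. This is a minimal illustration, not the presenters' implementation: the function name `build_huffman_tree`, the tuple-based node layout, and the tie-breaking counter are assumptions made for the example, and the input is assumed to be non-empty.

```python
import heapq
from itertools import count

def build_huffman_tree(frequencies):
    """Return the root of a Huffman tree built from a {symbol: freq} dict.

    A leaf is the tuple (freq, symbol); an internal node is (freq, left, right).
    """
    tie = count()  # tie-breaker so heapq never has to compare the nodes themselves
    heap = [(freq, next(tie), (freq, sym))
            for sym, freq in sorted(frequencies.items())]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Extract the two nodes with the minimum frequency
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # New internal node: first extracted is the left child, second the right
        merged = (f1 + f2, left, right)
        heapq.heappush(heap, (f1 + f2, next(tie), merged))

    return heap[0][2]  # the single remaining node is the root

root = build_huffman_tree({"A": 3, "B": 5, "C": 6, "D": 4, "E": 2})
print(root[0])  # 20, the total number of characters in the message
```

When two nodes have the same frequency, the order in which they are extracted depends on the tie-breaking rule, so the exact codes may differ between implementations while the total encoded size stays the same.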
SOLUTION METHODOLOGY
To assign the codes, traverse the tree
starting from the root while maintaining an
auxiliary array. When moving to the
left child, write 0 to the array.
When moving to the right child,
write 1 to the array. When a leaf is reached,
the contents of the array are the code for that character.
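A minimal sketch of this traversal follows. The tree is written by hand as nested tuples, where an internal node is `(left, right)` and a leaf is a single-character string; this layout and the function name `assign_codes` are assumptions for the example, and a string prefix plays the role of the auxiliary array.

```python
def assign_codes(node, prefix="", codes=None):
    """Walk the tree, appending 0 for a left move and 1 for a right move."""
    if codes is None:
        codes = {}
    if isinstance(node, str):
        # Leaf reached: the path taken so far is this character's code
        codes[node] = prefix
        return codes
    left, right = node
    assign_codes(left, prefix + "0", codes)
    assign_codes(right, prefix + "1", codes)
    return codes

# Hand-built tree for the message BCCABBDDAECCBBAEDDCC
tree = ((("E", "A"), "D"), ("B", "C"))
codes = assign_codes(tree)
print(codes)  # {'E': '000', 'A': '001', 'D': '01', 'B': '10', 'C': '11'}
```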
SOLUTION METHODOLOGY
Now the size of the message becomes:

CHAR   FREQ   CODE   BITS
A      3      001    3 * 3 = 9
B      5      10     5 * 2 = 10
C      6      11     6 * 2 = 12
D      4      01     4 * 2 = 8
E      2      000    2 * 3 = 6
TOTAL  20            45 BITS
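The total can be reproduced by encoding the message with the codes from the table. The sketch below also checks that no code is a prefix of another, which is what makes the bit stream decodable without separators. The helper name `is_prefix_free` is illustrative, and the saving shown ignores the space needed to store the code table itself.

```python
message = "BCCABBDDAECCBBAEDDCC"
codes = {"A": "001", "B": "10", "C": "11", "D": "01", "E": "000"}

def is_prefix_free(codes):
    # After sorting, a code that is a prefix of another sits directly before
    # some code it prefixes, so checking neighbours is enough.
    words = sorted(codes.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

encoded = "".join(codes[ch] for ch in message)
original_bits = len(message) * 8

print(is_prefix_free(codes))             # True
print(len(encoded))                      # 45
print(1 - len(encoded) / original_bits)  # 0.71875, i.e. about 72% smaller
```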
FLOWCHART
code
import heapq
import os


class HuffmanCoding:
    def __init__(self, path):
        self.path = path
        self.heap = []             # min heap of HeapNode objects
        self.codes = {}            # character -> bit string
        self.reverse_mapping = {}  # bit string -> character

    class HeapNode:
        def __init__(self, char, freq):
            self.char = char       # None for internal nodes
            self.freq = freq
            self.left = None
            self.right = None

        # heapq orders nodes with "<", so compare by frequency
        def __lt__(self, other):
            return self.freq < other.freq

        def __eq__(self, other):
            # HeapNode is nested, so it must be referenced through the outer class
            if not isinstance(other, HuffmanCoding.HeapNode):
                return False
            return self.freq == other.freq
    def make_frequency_dict(self, text):
        # count how often each character occurs
        frequency = {}
        for character in text:
            if character not in frequency:
                frequency[character] = 0
            frequency[character] += 1
        return frequency

    def make_heap(self, frequency):
        # one leaf node per distinct character
        for key in frequency:
            node = self.HeapNode(key, frequency[key])
            heapq.heappush(self.heap, node)

    def merge_nodes(self):
        # repeatedly merge the two least frequent nodes until one root remains
        while len(self.heap) > 1:
            node1 = heapq.heappop(self.heap)
            node2 = heapq.heappop(self.heap)

            merged = self.HeapNode(None, node1.freq + node2.freq)
            merged.left = node1
            merged.right = node2

            heapq.heappush(self.heap, merged)

    def make_codes_helper(self, root, current_code):
        if root is None:
            return

        if root.char is not None:
            # leaf: record the code in both directions; a text with a single
            # distinct character has an empty path, so give it the code "0"
            current_code = current_code or "0"
            self.codes[root.char] = current_code
            self.reverse_mapping[current_code] = root.char
            return

        self.make_codes_helper(root.left, current_code + "0")
        self.make_codes_helper(root.right, current_code + "1")

    def make_codes(self):
        root = heapq.heappop(self.heap)
        current_code = ""
        self.make_codes_helper(root, current_code)

    def get_encoded_text(self, text):
        encoded_text = ""
        for character in text:
            encoded_text += self.codes[character]
        return encoded_text

    def pad_encoded_text(self, encoded_text):
        # pad with zeros up to a byte boundary (a full extra byte is added
        # when the length is already a multiple of 8)
        extra_padding = 8 - len(encoded_text) % 8
        for i in range(extra_padding):
            encoded_text += "0"

        # the first byte records how many padding bits were appended
        padded_info = "{0:08b}".format(extra_padding)
        encoded_text = padded_info + encoded_text
        return encoded_text
    def get_byte_array(self, padded_encoded_text):
        if len(padded_encoded_text) % 8 != 0:
            raise ValueError("Encoded text not padded properly")

        # pack each group of 8 bits into one byte
        b = bytearray()
        for i in range(0, len(padded_encoded_text), 8):
            byte = padded_encoded_text[i:i+8]
            b.append(int(byte, 2))
        return b

    def compress(self):
        filename, file_extension = os.path.splitext(self.path)
        output_path = filename + ".bin"

        # the input is only read, so plain 'r' mode is enough
        with open(self.path, 'r') as file, open(output_path, 'wb') as output:
            text = file.read()
            text = text.rstrip()  # note: trailing whitespace is dropped

            frequency = self.make_frequency_dict(text)
            self.make_heap(frequency)
            self.merge_nodes()
            self.make_codes()

            encoded_text = self.get_encoded_text(text)
            padded_encoded_text = self.pad_encoded_text(encoded_text)

            b = self.get_byte_array(padded_encoded_text)
            output.write(bytes(b))

        print("Compressed")
        return output_path
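The listing above covers compression only. As a hedged sketch of how the padded byte stream could be read back, the example below mirrors the layout produced by pad_encoded_text (one header byte holding the padding count, then the payload, then the padding zeros). The function names `pad`, `remove_padding`, and `decode` are hypothetical and do not appear in the slides. Since the `.bin` file written by compress does not contain the code table, the sketch assumes the reverse mapping is still available in memory.

```python
def pad(bits):
    # Same layout as pad_encoded_text: header byte, payload, then 1..8 zero bits
    extra = 8 - len(bits) % 8
    return "{0:08b}".format(extra) + bits + "0" * extra

def remove_padding(padded_bits):
    # The first 8 bits say how many trailing zero bits to discard
    extra = int(padded_bits[:8], 2)
    body = padded_bits[8:]
    return body[:len(body) - extra]

def decode(bits, reverse_mapping):
    # Grow the current bit string until it matches a code, then emit the character
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse_mapping:
            decoded.append(reverse_mapping[current])
            current = ""
    return "".join(decoded)

codes = {"A": "001", "B": "10", "C": "11", "D": "01", "E": "000"}
reverse_mapping = {code: ch for ch, code in codes.items()}
message = "BCCABBDDAECCBBAEDDCC"

encoded = "".join(codes[ch] for ch in message)
padded = pad(encoded)
packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

# Read the bytes back into a bit string and undo each step
bits_read_back = "".join("{0:08b}".format(byte) for byte in packed)
restored = decode(remove_padding(bits_read_back), reverse_mapping)

print(len(packed))          # 7 bytes: 1 header byte + 45 payload bits + 3 padding bits
print(restored == message)  # True
```

Greedy matching works here because the codes are prefix-free: as soon as the accumulated bits equal a code, no longer code can start with them.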
Result and discussion

Original text

Compressed file using Huffman coding
Result and discussion

We can see that the size of the file is reduced using Huffman coding. Among
prefix codes that assign one code per character, it gives the smallest encoded size.

Thus, file compression is achieved through Huffman encoding.


Time and space complexity
TIME COMPLEXITY: The time complexity of building the Huffman tree is analysed as follows:

● extractMin() is called 2 x (n-1) times if there are n unique characters.

● As extractMin() calls minHeapify(), each call takes O(log n) time.

Thus, the overall time complexity of Huffman coding becomes O(n log n).

SPACE COMPLEXITY: With n unique characters, the heap and the tree hold n leaf nodes and n-1 internal nodes, so

Space complexity = O(n)
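The 2 x (n-1) count can be checked empirically. The sketch below runs the merge loop on plain integer frequencies and counts the calls to `heapq.heappop`, which plays the role of extractMin() here; the function name `count_extract_min_calls` is illustrative.

```python
import heapq

def count_extract_min_calls(frequencies):
    """Run the Huffman merge loop and return how many times the minimum was extracted."""
    heap = list(frequencies)
    heapq.heapify(heap)
    pops = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)  # extractMin
        b = heapq.heappop(heap)  # extractMin
        pops += 2
        heapq.heappush(heap, a + b)  # each merge removes two nodes and adds one
    return pops

for n in (2, 5, 100):
    calls = count_extract_min_calls(range(1, n + 1))
    print(n, calls, 2 * (n - 1))
```

Each iteration reduces the heap by one node, so n nodes need n-1 merges and therefore 2 x (n-1) extractions, each costing O(log n).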
