
A Project Report On

COMPRESSION &
DECOMPRESSION
Submitted in partial fulfillment of the requirement for the
Award of the degree of

Bachelor of Technology
In
Information Technology
By
RAHUL SINGH
SHAKUN GARG

0407713057, 0407713042

Dr. K.N. MODI INSTITUTE OF ENGINEERING & TECHNOLOGY
Approved by A.I.C.T.E.
Affiliated to U. P. Technical University, Lucknow
Modinagar 201204
(Batch: 2004-2008)

CONTENTS

ACKNOWLEDGEMENT 4
CERTIFICATE 5
LIST OF TABLES 6
LIST OF FIGURES 6
ABSTRACT 7-13

SYNOPSIS OF THE PROJECT 14-18


1 OBJECTIVE 14
2 SCOPE 14
3 DESIGN PRINCIPLE & EXPLANATION 16-17
3.1 Module Description
3.1.1 Huffman Zip
3.1.2 Encoder
3.1.3 Decoder
3.1.4 Table
3.1.5 DLNode
3.1.6 Priority Queue
3.1.7 Huffman Node
4 HARDWARE & SOFTWARE REQUIREMENTS 18

MAIN REPORT 20-118

1 Objective & Scope of the Project 20


2 Theoretical Background 23
2.1 Introduction 23
2.2 Theory 23
2.3 Definition 24-35
2.3.1 Lossless vs Lossy Compression
2.3.2 Image Compression
2.3.3 Video Compression
2.3.4 Text Compression
2.3.5 LZW Algorithm
3 Problem Statement 36
4 System analysis and design 39
4.1 Analysis 39
4.2 Design 40-48
4.2.1 System design
4.2.2 Design objective
4.2.3 Design principle
5 Stages in System Life Cycle 49
5.1 Requirement Determination 49
5.2 Requirement Specifications 49
5.3 Feasibility Analysis 50
5.4 Final Specification 50
5.5 Hardware Study 51
5.6 System Design 51
5.7 System Implementation 52
5.8 System Evaluation 52
5.9 System Modification 52
5.10 System Planning 53
6 Hardware & Software Requirement 60
7 Project Description 61
7.1 Huffman Algorithm 61

7.2 Code Construction 68
7.3 Huffing Program 68
7.4 Building Table 69
7.5 Decompressing 70
7.6 Transmission & storage of Huffman encoded data 72
8 Working of Project 73
8.1 Module & their description 73
9 Data Flow Diagram 75
10 Print Layouts 82
11 Implementation 85
12 Testing 87
12.1 Test plan 87
12.2 Terms in testing fundamentals 88
13 Conclusion 94
14 Future Enhancement & New Direction 95
14.1 New Direction 95
14.2 Scope of future work 96
14.3 Scope of future application 96
15 Source Code 97-118
16 References 119

ACKNOWLEDGEMENT

Keep away from people who try to belittle your ambitions. Small people always do that,
but the really great make you feel that you too, can become great.

We take this opportunity to express our sincere thanks and deep gratitude to all those who extended their wholehearted co-operation and helped us complete this project successfully.

First of all, we would like to thank Mr. Gaurav Vajpai (Project Guide) for his strict supervision, constant encouragement, inspiration and guidance, which ensured the worthiness of our work. Working under him was an enriching experience. His inspiring suggestions and timely guidance enabled us to see the various aspects of the project in a new light.

We would also like to thank Prof. Jaideep Kumar, Head of the Department of IT, who guided us a great deal in completing this project. We would also like to thank our parents and project mates for guiding and encouraging us throughout the duration of the project.

We would be failing in our duty if we did not thank the other people who directly or indirectly helped us in the successful completion of this project. So, our heartfelt thanks go to all the teaching and non-teaching staff of the computer science and engineering department of our institution for their valuable guidance throughout this project.

RAHUL SINGH
SHAKUN GARG
MANISH SRIVASTAVA

Dr. K.N. Modi Institute of Engineering and Technology
Modinagar
Affiliated to UP Technical University, Lucknow

DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE
This is to certify that RAHUL SINGH (0407713057), SHAKUN GARG (0407713042) and MANISH SRIVASTAVA (0407713021) of the final year B. Tech. (IT) have carried out a project work on COMPRESSION & DECOMPRESSION under the guidance of Mr. GAURAV VAJPAI in the Department of IT, in partial fulfillment of the award of the degree of Bachelor of Technology in Computer Science & Engineering in Meerut Institute of Engineering & Technology, Meerut (Affiliated to U.P. Technical University, Lucknow), and that it is a bonafide record of work done by them during the year 2007-2008.

Head of the Department: Internal Guide:

(Mr. JAIDEEP KUMAR) Mr. GAURAV VAJPAI

Head, Department of IT

LIST OF TABLES

Table No.   Table Name      Page No.

1           FILE TABLE
2           DETAIL TABLE

LIST OF FIGURES

Figure No.  Figure Name              Page No.

1           ARCHITECTURE OF NETPOD   19
2           PERT CHART               36
3           GANTT CHART              38

COMPRESSION & DECOMPRESSION

STATEMENT ABOUT THE PROBLEM

In today's world of computing, it is hardly possible to do without graphics, images and sound. Applications all around us, such as the Internet, Video CDs (Compact Discs) and video conferencing, use graphics and sound intensively.

Many of us have surfed the Internet and become so frustrated waiting for a graphics-intensive web page to open that we stopped the transfer. Imagine what would happen if those graphics were not compressed.

Uncompressed graphics, audio and video data consume very large amounts of physical storage; in the case of uncompressed video, even present CD technology is unable to handle it.

WHY IS THE PARTICULAR TOPIC CHOSEN?

Files available for transfer from one host to another over a network (or via modem) are often stored in a compressed format or some other special format well suited to the storage medium and/or transfer method. There are many reasons for compressing or archiving files. The most common is that file compression can significantly reduce the size of a file (or group of files). Smaller files take up less storage space on the host and less time to transfer over the network, saving both time and money.

OBJECTIVE AND SCOPE OF THE PROJECT

The objective of this system is to compress and decompress files. The system will be used to compress files so that they take less memory for storage and for transmission from one computer to another. The system will work in the following ways:

To compress a text or image file using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.

Our project will be able to compress a message into a form that can be easily transmitted over the network or from one system to another. At the receiving end, after decompressing the message, the receiver will get the original message. This is how effective transmission of data takes place between sender and receiver.

Reusability:

Parts of this application can be reused as and when required, and the application can be updated in its next version. Reusable software reduces design, coding and testing cost by amortizing effort over several designs. Reducing the amount of code also simplifies understanding, which increases the likelihood that the code is correct. We follow both types of reusability: sharing of newly written code within a project and reuse of previously written code on new projects.

Extensibility:

This software can be extended in ways that its original developers may not expect. The following principles enhance extensibility: hide data structures, avoid traversing multiple links or methods, avoid case statements on object type, and distinguish public and private operations.

Robustness:

A method is robust if it does not fail even when it receives improper parameters. The guidelines followed here are: protect against errors, optimize only after the program runs, validate arguments, and avoid predefined limits.

Understandability:

A method is understandable if someone other than its creator can understand the code (as can the creator after a time lapse). Keeping methods small and coherent helps to accomplish this.

Cost-effectiveness:

The cost of the system is within budget and it can be built within the given time period. It is desirable to aim for a system with minimum cost, subject to the condition that it must satisfy all the requirements.

The scope of this document is to put down the requirements, clearly identifying the information needed by the user, the source of the information and the outputs expected from the system.

METHODOLOGY ADOPTED

The methodology used is the classic life-cycle model, the WATERFALL MODEL.

HARDWARE & SOFTWARE REQUIREMENTS

HARDWARE SPECIFICATIONS:

Processor: Pentium I/II/III or higher
RAM: 128 MB or higher
Monitor: 15 inch (digital) with 800 x 600 support
Keyboard: 101-key keyboard
Mouse: 2-button serial/PS-2

Tools / Platform Language Used:

Language: Java
OS: Any OS such as Windows XP/98/NT/Vista

TESTING TECHNOLOGIES

Some of the commonly used strategies for testing are as follows:

Unit testing
Module testing
Integration testing
System testing
Acceptance testing

UNIT TESTING

Unit testing is the testing of a single program module in an isolated environment. Testing of the processing procedures is the main focus.

MODULE TESTING

A module encapsulates related components, so it can be tested without other system modules.

INTEGRATION TESTING

Integration testing is the testing of the interfaces among the system modules. In other words, it ensures that the data passed between modules is handled as intended.

SYSTEM TESTING

System testing is the testing of the system against its initial objectives. It is done either in a simulated environment or in a live environment.

ACCEPTANCE TESTING

Acceptance testing is performed with realistic data of the client to demonstrate that the software is working satisfactorily. Testing here is focused on the external behavior of the system; the internal logic of the program is not emphasized.

WHAT CONTRIBUTION WOULD THE PROJECT MAKE?

The contributions of COMPRESSION & DECOMPRESSION are as follows:

Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth.

It involves trade-offs between various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and uncompress the data.

SYNOPSIS OF THE PROJECT

1. OBJECTIVE

The objective of this system is to compress and decompress files. The system will be used to compress files so that they take less memory for storage and for transmission from one computer to another. The system will work in the following ways:

To compress a text or image file using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.

2. SCOPE

Our project will be able to compress a message into a form that can be easily transmitted over the network or from one system to another. At the receiving end, after decompressing the message, the receiver will get the original message. This is how effective transmission of data takes place between sender and receiver.

Reusability:

Parts of this application can be reused as and when required, and the application can be updated in its next version. Reusable software reduces design, coding and testing cost by amortizing effort over several designs. Reducing the amount of code also simplifies understanding, which increases the likelihood that the code is correct. We follow both types of reusability: sharing of newly written code within a project and reuse of previously written code on new projects.

Extensibility:

This software can be extended in ways that its original developers may not expect. The following principles enhance extensibility: hide data structures, avoid traversing multiple links or methods, avoid case statements on object type, and distinguish public and private operations.

Robustness:

A method is robust if it does not fail even when it receives improper parameters. The guidelines followed here are: protect against errors, optimize only after the program runs, validate arguments, and avoid predefined limits.

Understandability:

A method is understandable if someone other than its creator can understand the code (as can the creator after a time lapse). Keeping methods small and coherent helps to accomplish this.

Cost-effectiveness:

The cost of the system is within budget and it can be built within the given time period. It is desirable to aim for a system with minimum cost, subject to the condition that it must satisfy all the requirements.

The scope of this document is to put down the requirements, clearly identifying the information needed by the user, the source of the information and the outputs expected from the system.

3. DESIGN PRINCIPLES & EXPLANATION

MODULE DESCRIPTION

The project consists of the following modules:

Huffman Zip
Encoder
Decoder
Table
DLNode
Priority Queue
Huffman Node

Huffman Zip is the main module, which uses an applet. It provides the user interface.

Encoder is the module for compressing the file. It implements the Huffman algorithm for compressing text and image files. It first calculates the frequencies of all the symbols that occur in the file. On the basis of these frequencies it builds the priority queue, which is used to find the symbols with the lowest frequencies. The two symbols with the lowest frequencies are removed from the queue, and a new symbol is added to the queue with frequency equal to the sum of the frequencies of these two symbols. At the same time we build a tree in which the two removed nodes are the children and the new node added to the queue is their parent. This is repeated until a single node, the root of the tree, remains in the queue. Finally we traverse the tree from the root node to each leaf node, assigning 0 to each left child and 1 to each right child. In this way we assign a code to every symbol in the file. These codes are binary; we group the bits, calculate the equivalent integers and store them in the output file, which is the compressed file.
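The first and last steps of the encoder described above, counting symbol frequencies and grouping the code bits into integers, might be sketched as follows. This is only an illustration and not the project's actual source: the class and method names (`EncoderSketch`, `countFrequencies`, `packBits`) are hypothetical, the bits are grouped eight to a byte, and padding the final byte with zero bits is an assumption, since the report does not say how a partial last group is handled.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the names are illustrative, not taken from the project.
final class EncoderSketch {

    /** Counts how often each byte value (0..255) occurs in the input. */
    static Map<Integer, Integer> countFrequencies(byte[] data) {
        Map<Integer, Integer> frequencies = new TreeMap<>();
        for (byte b : data) {
            frequencies.merge(b & 0xFF, 1, Integer::sum);
        }
        return frequencies;
    }

    /**
     * Packs a string of '0'/'1' characters into bytes, most significant bit first.
     * Assumption: the last byte is padded with zero bits if the length is not a multiple of 8.
     */
    static byte[] packBits(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                out[i / 8] |= (byte) (0x80 >> (i % 8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {97, 97, 98}; // the bytes of "aab"
        System.out.println(countFrequencies(data));
        System.out.println(Arrays.toString(packBits("111111111")));
    }
}
```

A real encoder would also have to record how many padding bits were added (or the original length), otherwise the decoder could mistake the padding for further codes.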

Decoder works in the reverse order of the encoder. It reads the input from the compressed file and converts it into the equivalent binary code. Its other input is the binary tree generated in the encoding process, and on the basis of these data it regenerates the original file. This project is based on lossless compression.
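The decoding step can be illustrated with a small sketch that walks the code tree bit by bit, going left on 0 and right on 1 and emitting a symbol whenever a leaf is reached. The `DecoderSketch` class, its `Node` type and the hand-built three-symbol tree are assumptions made for the example; the project's own Decoder and Huffman Node classes may be organised differently.

```java
// Hypothetical sketch of tree-walking decoding; names are illustrative only.
final class DecoderSketch {

    /** A tree node: a leaf carries a symbol, an internal node carries two children. */
    static final class Node {
        final char symbol;
        final Node left;
        final Node right;

        Node(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /**
     * Decodes a string of '0'/'1' characters using the given tree.
     * Assumption: the root is an internal node, i.e. there are at least two symbols.
     */
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.isLeaf()) {
                out.append(current.symbol);
                current = root; // start again at the root for the next codeword
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-built tree for the codes a = 0, b = 10, c = 11.
        Node root = new Node(new Node('a'), new Node(new Node('b'), new Node('c')));
        System.out.println(decode(root, "010110"));
    }
}
```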

Table is used for storing the code of each symbol. Priority Queue takes as input the symbols and their related frequencies, and on the basis of these frequencies it assigns a priority to each symbol. Huffman Node is used for creating the binary tree. It takes two symbols from the priority queue as input and creates two nodes by comparing the frequencies of these two symbols. It places the symbol with the lower frequency on the left and the symbol with the higher frequency on the right. It then deletes these two symbols from the priority queue and inserts a new symbol with frequency equal to the sum of the frequencies of the two deleted symbols. It also generates a parent node for the two nodes and assigns it a frequency equal to the sum of the frequencies of the two child nodes.
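How the priority queue, the Huffman node and the code table fit together can be sketched roughly as below. The sketch uses the standard `java.util.PriorityQueue` in place of the project's own Priority Queue and DLNode classes, which is an assumption, and the names `HuffmanTreeSketch`, `buildTree` and `buildTable` are hypothetical. When two nodes have equal frequency the order in which they are removed is not specified here, so the exact codes may differ between runs or implementations while the code lengths stay optimal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; java.util.PriorityQueue stands in for the project's own queue.
final class HuffmanTreeSketch {

    static final class Node implements Comparable<Node> {
        final char symbol;
        final int frequency;
        final Node left;
        final Node right;

        Node(char symbol, int frequency) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.left = null;
            this.right = null;
        }

        /** Parent node whose frequency is the sum of its two children. */
        Node(Node left, Node right) {
            this.symbol = '\0';
            this.frequency = left.frequency + right.frequency;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }

        @Override
        public int compareTo(Node other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    /** Repeatedly merges the two lowest-frequency nodes; returns null for an empty map. */
    static Node buildTree(Map<Character, Integer> frequencies) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : frequencies.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        while (queue.size() > 1) {
            Node lower = queue.poll();  // lowest frequency goes to the left
            Node higher = queue.poll(); // next lowest goes to the right
            queue.add(new Node(lower, higher));
        }
        return queue.poll();
    }

    /** Fills the code table: 0 for each left branch, 1 for each right branch. */
    static Map<Character, String> buildTable(Node root) {
        Map<Character, String> table = new HashMap<>();
        if (root != null) {
            fill(root, "", table);
        }
        return table;
    }

    private static void fill(Node node, String prefix, Map<Character, String> table) {
        if (node.isLeaf()) {
            // A file with a single distinct symbol still needs a one-bit code.
            table.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        fill(node.left, prefix + "0", table);
        fill(node.right, prefix + "1", table);
    }

    public static void main(String[] args) {
        Map<Character, Integer> frequencies = new HashMap<>();
        frequencies.put('a', 5);
        frequencies.put('b', 2);
        frequencies.put('c', 1);
        System.out.println(buildTable(buildTree(frequencies)));
    }
}
```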

4. HARDWARE & SOFTWARE REQUIREMENTS

Existing hardware will be used:

Intel Pentium-IV
128 MB RAM
SVGA color monitor on PCI with 1 MB RAM
101-key keyboard
1 Microsoft mouse with pad

Tools / Platform Language Used:

Language: Java
OS: Any OS such as Windows XP/98/NT
Database: MS Access

MAIN REPORT

OBJECTIVE AND SCOPE

The objective of this system is to compress and decompress files. The system will be used to compress files so that they take less memory for storage and for transmission from one computer to another. The system will work in the following ways:

To compress a text or image file using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.

SCOPE

Our project will be able to compress a message into a form that can be easily transmitted over the network or from one system to another. At the receiving end, after decompressing the message, the receiver will get the original message. This is how effective transmission of data takes place between sender and receiver.

Reusability:

Parts of this application can be reused as and when required, and the application can be updated in its next version. Reusable software reduces design, coding and testing cost by amortizing effort over several designs. Reducing the amount of code also simplifies understanding, which increases the likelihood that the code is correct. We follow both types of reusability: sharing of newly written code within a project and reuse of previously written code on new projects.

Extensibility:

This software can be extended in ways that its original developers may not expect. The following principles enhance extensibility: hide data structures, avoid traversing multiple links or methods, avoid case statements on object type, and distinguish public and private operations.

Robustness:

A method is robust if it does not fail even when it receives improper parameters. The guidelines followed here are: protect against errors, optimize only after the program runs, validate arguments, and avoid predefined limits.

Understandability:

A method is understandable if someone other than its creator can understand the code (as can the creator after a time lapse). Keeping methods small and coherent helps to accomplish this.

Cost-effectiveness:

The cost of the system is within budget and it can be built within the given time period. It is desirable to aim for a system with minimum cost, subject to the condition that it must satisfy all the requirements.

The scope of this document is to put down the requirements, clearly identifying the information needed by the user, the source of the information and the outputs expected from the system.

THEORETICAL BACKGROUND

Introduction

A brief introduction to information theory is provided in this section. The definitions and assumptions necessary for a comprehensive discussion and evaluation of data compression methods are discussed. The following string of characters is used to illustrate the concepts defined: EXAMPLE = aa bbb cccc ddddd eeeeee fffffffgggggggg.

Theory:

The theoretical background of compression is provided by information theory (which is

closely related to algorithmic information theory) and by rate-distortion theory. These fields

of study were essentially created by Claude Shannon, who published fundamental papers on

the topic in the late 1940s and early 1950s. Doyle and Carlson (2000) wrote that data

compression "has one of the simplest and most elegant design theories in all of engineering".

Cryptography and coding theory are also closely related. The idea of data compression is

deeply connected with statistical inference.

Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy

data compression systems typically include even more stages, including, for example,

prediction, frequency transformation, and quantization.

The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for

lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed

and compression ratio, although compression can be slow. LZW (Lempel-Ziv-Welch) is used

in GIF images. LZ methods utilize a table based compression model where table entries are

substituted for repeated strings of data. For most LZ methods, this table is generated

dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g.

SHRI, LZX).
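As one concrete instance of a table-based LZ method, the compression side of LZW can be sketched as follows. The sketch assumes the input contains only characters with codes below 256, lets the table grow without limit, and returns the codes as a list of integers rather than packing them into bits; the class name `LzwSketch` is hypothetical. Practical LZW implementations (such as the one used in GIF) add variable code widths and a table-size limit, which are omitted here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified LZW compressor for illustration only.
final class LzwSketch {

    /** Compresses the input into a list of table indices. Assumes all characters are below 256. */
    static List<Integer> compress(String input) {
        // The table starts with one entry per single character.
        Map<String, Integer> table = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            table.put(String.valueOf((char) i), i);
        }

        List<Integer> output = new ArrayList<>();
        String current = "";
        for (char c : input.toCharArray()) {
            String extended = current + c;
            if (table.containsKey(extended)) {
                // Keep growing the match while it is still in the table.
                current = extended;
            } else {
                // Emit the longest known string and learn the new one.
                output.add(table.get(current));
                table.put(extended, table.size());
                current = String.valueOf(c);
            }
        }
        if (!current.isEmpty()) {
            output.add(table.get(current));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(compress("ABABABA"));
    }
}
```

Because the table is rebuilt from the data already seen, the decoder can reconstruct it on its own and the table never has to be transmitted.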

The very best compressors use probabilistic models whose predictions are coupled to an

algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and

turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to

the better-known Huffman algorithm, and lends itself especially well to adaptive data

compression tasks where the predictions are strongly context-dependent.

Definition :

In computer science and information theory, data compression or source coding is the process

of encoding information using fewer bits (or other information-bearing units) than an

unencoded representation would use through use of specific encoding schemes. For example,

this report could be encoded with fewer bits if one were to accept the convention that the

word "compression" be encoded as "comp". One popular instance of compression with which

many computer users are familiar is the ZIP file format, which, as well as providing

compression, acts as an archiver, storing many files in a single output file.

As is the case with any form of communication, compressed data communication only works

when both the sender and receiver of the information understand the encoding scheme. For

example, this text makes sense only if the receiver understands that it is intended to be

interpreted as characters representing the English language. Similarly, compressed data can

only be understood if the decoding method is known by the receiver.

Compression is useful because it helps reduce the consumption of expensive resources, such

as hard disk space or transmission bandwidth. On the downside, compressed data must be

decompressed to be viewed (or heard), and this extra processing may be detrimental to some

applications. For instance, a compression scheme for video may require expensive hardware

for the video to be decompressed fast enough to be viewed as it's being decompressed (you

always have the option of decompressing the video in full before you watch it, but this is

inconvenient and requires storage space to put the uncompressed video). The design of data

compression schemes therefore involves trade-offs between various factors, including the

degree of compression, the amount of distortion introduced (if using a lossy compression

scheme), and the computational resources required to compress and uncompress the data.

A code is a mapping of source messages (words from the source alphabet alpha) into

codewords (words of the code alphabet beta). The source messages are the basic units into

which the string to be represented is partitioned. These basic units may be single symbols

from the source alphabet, or they may be strings of symbols. For string EXAMPLE, alpha = {

a, b, c, d, e, f, g, space}. For purposes of explanation, beta will be taken to be { 0, 1 }. Codes

can be categorized as block-block, block-variable, variable-block or variable-variable, where

block-block indicates that the source messages and codewords are of fixed length and

variable-variable codes map variable-length source messages into variable-length codewords.

A block-block code for EXAMPLE is shown in Figure 1.1 and a variable-variable code is

given in Figure 1.2. If the string EXAMPLE were coded using the Figure 1.1 code, the length

of the coded message would be 120; using Figure 1.2 the length would be 30.

Figure 1.1: A block-block code      Figure 1.2: A variable-variable code

source message   codeword           source message   codeword
a                000                aa               0
b                001                bbb              1
c                010                cccc             10
d                011                ddddd            11
e                100                eeeeee           100
f                101                fffffff          101
g                110                gggggggg         110
space            111                space            111
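The two coded lengths quoted above can be checked with a short sketch. It assumes EXAMPLE is the 40-character string with five spaces (no space between the run of f's and the run of g's), which is what the stated lengths of 120 and 30 bits imply; the class and method names are hypothetical.

```java
// Hypothetical sketch that recomputes the coded lengths for the two example codes.
final class CodeLengthSketch {

    // Assumption: 40 characters, five of them spaces.
    static final String EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg";

    // The variable-variable code of Figure 1.2: source messages and their codewords.
    private static final String[] MESSAGES =
            {"aa", "bbb", "cccc", "ddddd", "eeeeee", "fffffff", "gggggggg", " "};
    private static final String[] CODEWORDS =
            {"0", "1", "10", "11", "100", "101", "110", "111"};

    /** Block-block code of Figure 1.1: every source character costs 3 bits. */
    static int blockBlockLength(String text) {
        return 3 * text.length();
    }

    /** Parses the text into the Figure 1.2 source messages and sums the codeword lengths. */
    static int variableVariableLength(String text) {
        int bits = 0;
        int position = 0;
        while (position < text.length()) {
            boolean matched = false;
            for (int k = 0; k < MESSAGES.length; k++) {
                if (text.startsWith(MESSAGES[k], position)) {
                    bits += CODEWORDS[k].length();
                    position += MESSAGES[k].length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("No source message matches at index " + position);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(blockBlockLength(EXAMPLE));       // 120
        System.out.println(variableVariableLength(EXAMPLE)); // 30
    }
}
```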

The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block

codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords.

These are not discussed, as they do not provide compression. The codes featured in this

survey are of the block-variable, variable-variable, and variable-block types.

When source messages of variable length are allowed, the question of how a message

ensemble (sequence of messages) is parsed into individual messages arises. Many of the

algorithms described here are defined-word schemes. That is, the set of source messages is

determined prior to the invocation of the coding scheme. For example, in text file processing

each character may constitute a message, or messages may be defined to consist of

alphanumeric and non-alphanumeric strings.

In Pascal source code, each token may represent a message. All codes involving fixed-length

source messages are, by default, defined-word codes. In free-parse methods, the coding

algorithm itself parses the ensemble into variable-length sequences of symbols. Most of the

known data compression methods are defined-word schemes; the free-parse model differs in

a fundamental way from the classical coding paradigm.

A code is distinct if each codeword is distinguishable from every other (i.e., the mapping

from source messages to codewords is one-to-one). A distinct code is uniquely decodable if

every codeword is identifiable when immersed in a sequence of codewords. Clearly, each of

these features is desirable. The codes of Figure 1.1 and Figure 1.2 are both distinct, but the

code of Figure 1.2 is not uniquely decodable. For example, the coded message 11 could be

decoded as either ddddd or bbbbbb. A uniquely decodable code is a prefix code (or prefix-free

code) if it has the prefix property, which requires that no codeword is a proper prefix of any

other codeword. All uniquely decodable block-block and variable-block codes are prefix

codes. The code with codewords { 1, 100000, 00 } is an example of a code which is uniquely

decodable but which does not have the prefix property. Prefix codes are instantaneously

decodable; that is, they have the desirable property that the coded message can be parsed into

codewords without the need for lookahead. In order to decode a message encoded using the

codeword set { 1, 100000, 00 }, lookahead is required. For example, the first codeword of the

message 1000000001 is 1, but this cannot be determined until the last (tenth) symbol of the

message is read (if the string of zeros had been of odd length, then the first codeword would

have been 100000).
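The prefix property and unique decodability can both be tested mechanically for small codes. The sketch below is an illustration under assumed names (`PrefixCodeSketch`, `hasPrefixProperty`, `countParses`): the first method checks that no codeword is a proper prefix of another, and the second counts in how many ways a coded message can be split into codewords, so a count above one for some message shows that the code is not uniquely decodable.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch for checking the properties discussed above.
final class PrefixCodeSketch {

    /** Returns true if no codeword is a proper prefix of a different codeword. */
    static boolean hasPrefixProperty(List<String> codewords) {
        for (String a : codewords) {
            for (String b : codewords) {
                if (!a.equals(b) && b.startsWith(a)) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Counts the distinct ways the message can be parsed into a sequence of codewords. */
    static long countParses(String message, List<String> codewords) {
        long[] ways = new long[message.length() + 1];
        ways[0] = 1; // the empty prefix has exactly one parse
        for (int i = 0; i < message.length(); i++) {
            if (ways[i] == 0) {
                continue;
            }
            for (String w : codewords) {
                if (message.startsWith(w, i)) {
                    ways[i + w.length()] += ways[i];
                }
            }
        }
        return ways[message.length()];
    }

    public static void main(String[] args) {
        List<String> figure12 = Arrays.asList("0", "1", "10", "11", "100", "101", "110", "111");
        List<String> lookahead = Arrays.asList("1", "100000", "00");
        System.out.println(countParses("11", figure12));          // 2: "1","1" or "11"
        System.out.println(hasPrefixProperty(lookahead));         // false
        System.out.println(countParses("1000000001", lookahead)); // 1
    }
}
```

The single parse of 1000000001 is 1, 00, 00, 00, 00, 1, which agrees with the lookahead argument in the text.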

A minimal prefix code is a prefix code such that if x is a proper prefix of some codeword,

then x sigma is either a codeword or a proper prefix of a codeword, for each letter sigma in

beta. The set of codewords { 00, 01, 10 } is an example of a prefix code which is not

minimal. The fact that 1 is a proper prefix of the codeword 10 requires that 11 be either a

codeword or a proper prefix of a codeword, and it is neither. Intuitively, the minimality

constraint prevents the use of codewords which are longer than necessary. In the above

example the codeword 10 could be replaced by the codeword 1, yielding a minimal prefix

code with shorter codewords. The codes discussed in this paper are all minimal prefix codes.
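The minimality condition can be checked directly from its definition. The following sketch assumes the code alphabet is { 0, 1 } and that the input is already a prefix code; the name `MinimalPrefixSketch` is hypothetical. Note that the empty string counts as a proper prefix of every codeword.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the minimality test for a binary prefix code.
final class MinimalPrefixSketch {

    /**
     * Returns true if, for every proper prefix x of a codeword and each letter s in {0, 1},
     * x followed by s is itself a codeword or a proper prefix of a codeword.
     */
    static boolean isMinimal(List<String> codewords) {
        Set<String> words = new HashSet<>(codewords);
        Set<String> properPrefixes = new HashSet<>();
        for (String w : codewords) {
            for (int len = 0; len < w.length(); len++) {
                properPrefixes.add(w.substring(0, len)); // includes the empty string
            }
        }
        for (String x : properPrefixes) {
            for (char sigma : new char[] {'0', '1'}) {
                String extended = x + sigma;
                if (!words.contains(extended) && !properPrefixes.contains(extended)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isMinimal(Arrays.asList("00", "01", "10"))); // false: 11 is unused
        System.out.println(isMinimal(Arrays.asList("00", "01", "1")));  // true
    }
}
```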

In this section, a code has been defined to be a mapping from a source alphabet to a code

alphabet; we now define related terms. The process of transforming a source ensemble into a

coded message is coding or encoding. The encoded message may be referred to as an

encoding of the source ensemble. The algorithm which constructs the mapping and uses it to

transform the source ensemble is called the encoder. The decoder performs the inverse

operation, restoring the coded message to its original form.

Lossless vs. lossy compression:

Lossless compression algorithms usually exploit statistical redundancy in such a way as to

represent the sender's data more concisely, but nevertheless perfectly. Lossless compression is

possible because most real-world data has statistical redundancy. For example, in English

text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q'

will be followed by the letter 'z' is very small.

Another kind of compression, called lossy data compression, is possible if some loss of

fidelity is acceptable. For example, a person viewing a picture or television video scene might

not notice if some of its finest details are removed or not represented perfectly (i.e. may not

even notice compression artifacts). Similarly, two clips of audio may be perceived as the

same to a listener even though one is missing details found in the other. Lossy data

compression algorithms introduce relatively minor differences and represent the picture,

video, or audio using fewer bits.

Lossless compression schemes are reversible so that the original data can be reconstructed,

while lossy schemes accept some loss of data in order to achieve higher compression.

However, lossless data compression algorithms will always fail to compress some files;

indeed, any compression algorithm will necessarily fail to compress any data containing no

discernible patterns. Attempts to compress data that has been compressed already will

therefore usually result in an expansion, as will attempts to compress encrypted data.
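The effect of pattern-free input can be observed with the DEFLATE implementation that ships with Java (`java.util.zip.Deflater`). The sketch below is only a demonstration: pseudo-random bytes stand in for already compressed or encrypted data, which is an assumption, and the class name `RecompressionSketch` is hypothetical. On typical runs the repetitive input shrinks to a tiny fraction of its size while the random input comes out slightly larger than it went in.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical demonstration using the standard java.util.zip.Deflater.
final class RecompressionSketch {

    /** Returns the number of bytes DEFLATE (zlib format) produces for the input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] repetitive = new byte[10000];
        Arrays.fill(repetitive, (byte) 'a');

        // Pseudo-random bytes: a stand-in for data with no discernible patterns.
        byte[] random = new byte[10000];
        new Random(42).nextBytes(random);

        System.out.println("repetitive: " + repetitive.length + " -> " + deflatedSize(repetitive));
        System.out.println("random:     " + random.length + " -> " + deflatedSize(random));
    }
}
```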

In practice, lossy data compression will also come to a point where compressing again does

not work, although an extremely lossy algorithm, which for example always removes the last

byte of a file, will always compress a file up to the point where it is empty.

A good example of lossless vs. lossy compression is the string 888883333333, shown here in uncompressed form. You could save space by writing it as 8[5]3[7]. By saying "5 eights, 7 threes", you still have the original string, just written in a smaller form. In a lossy system, writing 83 instead, you cannot get the original data back (in exchange for a smaller file size).
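The lossless notation used in this example is a form of run-length encoding, which might be sketched as follows. The `symbol[count]` output format is taken from the example above; the class name `RunLengthSketch` is hypothetical, and the decoder assumes well-formed input in which the symbols themselves are never `[` or `]`.

```java
// Hypothetical run-length encoder and decoder using the symbol[count] notation.
final class RunLengthSketch {

    /** Encodes each run of identical characters as symbol[count]. */
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char symbol = input.charAt(i);
            int run = 1;
            while (i + run < input.length() && input.charAt(i + run) == symbol) {
                run++;
            }
            out.append(symbol).append('[').append(run).append(']');
            i += run;
        }
        return out.toString();
    }

    /** Reverses encode(). Assumes well-formed input. */
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            char symbol = encoded.charAt(i);
            int open = i + 1;                       // position of '['
            int close = encoded.indexOf(']', open); // position of the matching ']'
            int run = Integer.parseInt(encoded.substring(open + 1, close));
            for (int k = 0; k < run; k++) {
                out.append(symbol);
            }
            i = close + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("888883333333");
        System.out.println(encoded);         // 8[5]3[7]
        System.out.println(decode(encoded)); // 888883333333
    }
}
```

Because `decode(encode(s))` returns `s` exactly, the scheme is lossless; it only saves space when the input contains long runs.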

A brief overview of the different kinds of compression is presented below:

Image compression:

Image here refers not only to still images but also to motion pictures, and compression is the process used to reduce the physical size of a block of information.

Compression is simply representing information more efficiently; "squeezing the air" out of the data, so to speak. It takes advantage of three common qualities of graphical data: they are often redundant, predictable or unnecessary.

Today, compression has made a great impact on the storage of large volumes of image data. Hardware and software for compression and decompression are increasingly being made part of the computer platform. Compression does have its trade-offs. The more efficient the compression technique, the more complicated the algorithm will be, and thus the more computational resources or time it requires to decompress. This tends to affect speed. Speed is not of much importance for still images but matters a great deal for motion pictures. Surely you do not want to see your favourite movie appearing frame by frame in front of you.

Most methods for irreversible, or "lossy", digital image compression consist of three main steps: transform, quantizing and coding, as illustrated in the figure.

[Figure: The three steps of digital image compression.]

Image compression is the application of Data compression on digital images. In effect, the

objective is to reduce redundancy of the image data in order to be able to store or transmit

data in an efficient form.

Image compression can be lossy or lossless. Lossless compression is sometimes preferred for

artificial images such as technical drawings, icons or comics. This is because lossy

compression methods, especially when used at low bit rates, introduce compression artifacts.

Lossless compression methods may also be preferred for high value content, such as medical

imagery or image scans made for archival purposes. Lossy methods are especially suitable for

natural images such as photos in applications where minor (sometimes imperceptible) loss of

fidelity is acceptable to achieve a substantial reduction in bit rate.

The best image quality at a given bit-rate (or compression rate) is the main goal of image

compression. However, there are other important properties of image compression schemes:

Scalability generally refers to a quality reduction achieved by manipulation of the bitstream

or file (without decompression and re-compression). Other names for scalability are

progressive coding or embedded bitstreams. Despite its contrary nature, scalability can also

be found in lossless codecs, usually in form of coarse-to-fine pixel scans. Scalability is

especially useful for previewing images while downloading them (e.g. in a web browser) or

for providing variable quality access to e.g. databases. There are several types of scalability:

Region of interest coding: Certain parts of the image are encoded with higher quality than

others. This can be combined with scalability (encode these parts first, others later).

Meta information: Compressed data can contain information about the image which can be

used to categorize, search or browse images. Such information can include color and texture

statistics, small preview images and author/copyright information.

The quality of a compression method is often measured by the Peak signal-to-noise ratio. It

measures the amount of noise introduced through a lossy compression of the image.

However, the subjective judgement of the viewer is also regarded as an important, perhaps

the most important measure.
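
The peak signal-to-noise ratio mentioned above can be illustrated with a minimal Java sketch. It assumes 8-bit grayscale images stored as flat integer arrays (peak value 255) and uses the usual definition PSNR = 10 log10(MAX^2 / MSE); the class and method names are hypothetical and are not part of the project's modules.

```java
public class PsnrSketch {

    /** Mean squared error between two equally sized 8-bit grayscale images. */
    static double meanSquaredError(int[] original, int[] decoded) {
        if (original.length == 0 || original.length != decoded.length) {
            throw new IllegalArgumentException("images must be non-empty and the same size");
        }
        double sum = 0.0;
        for (int i = 0; i < original.length; i++) {
            double diff = original[i] - decoded[i];
            sum += diff * diff;
        }
        return sum / original.length;
    }

    /** PSNR in decibels, assuming a peak pixel value of 255 (8 bits per sample). */
    static double psnr(int[] original, int[] decoded) {
        double mse = meanSquaredError(original, decoded);
        if (mse == 0.0) {
            return Double.POSITIVE_INFINITY; // identical images: no noise introduced
        }
        double peak = 255.0;
        return 10.0 * Math.log10(peak * peak / mse);
    }

    public static void main(String[] args) {
        int[] original = {52, 55, 61, 66, 70, 61, 64, 73};
        int[] decoded  = {54, 55, 60, 66, 69, 62, 64, 72};
        System.out.println("PSNR (dB): " + psnr(original, decoded));
    }
}
```

A higher PSNR means less distortion; a lossless method gives an infinite value because the mean squared error is zero.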

Video Compression:

A raw video stream tends to be quite demanding when it comes to storage requirements, and

demand for network capacity when being transferred between computers. Before being stored

or transferred, the raw stream is usually transformed to a representation using compression.

When compressing an image sequence, one may consider the sequence a series of

independent images, and compress each frame using single image compression methods, or

one may use specialized video sequence compression schemes, taking advantage of

similarities in nearby frames. The latter will generally compress better, but may complicate

handling of variations in network transfer speed.

Compression algorithms may be classified into two main groups, reversible and irreversible.

If the result of compression followed by decompression gives a bitwise exact copy of the

original for every compressed image, the method is reversible. This implies that no

quantizing is done, and that the transform is accurately invertible, i.e. it does not introduce

round-off errors.

When compressing general data, like an executable program file or an accounting database, it

is extremely important that the data can be reconstructed exactly. For images and sound, it is

often convenient, or even necessary to allow a certain degradation, as long as it is not too

noticeable by an observer.

Text compression:

The following methods yield two basic data compression algorithms, which produce good

compression ratios and run in linear time.

The first strategy is a statistical encoding that takes into account the frequencies of symbols

to build a uniquely decipherable code that is optimal with respect to the compression criterion.

Huffman's method (1952) provides such an optimal statistical coding. It admits a dynamic

version where symbol counting is done at coding time. The command "compact" of UNIX

implements this version.
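
The statistical strategy can be sketched in Java as follows. This is a minimal illustration of the static variant only (symbol frequencies are counted before coding, not during it), and it uses java.util.PriorityQueue as a stand-in for the project's own Priority Queue module; the class name HuffmanSketch and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanSketch {

    /** Tree node: a leaf carries a symbol, an internal node carries two children. */
    static final class Node implements Comparable<Node> {
        final char symbol;
        final int weight;
        final Node left, right;

        Node(char symbol, int weight) {
            this.symbol = symbol; this.weight = weight;
            this.left = null; this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0'; this.weight = left.weight + right.weight;
            this.left = left; this.right = right;
        }

        boolean isLeaf() { return left == null; }

        @Override
        public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    /** Counts symbol frequencies, builds the tree and returns a symbol-to-code table. */
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        Map<Character, String> codes = new HashMap<>();
        if (queue.isEmpty()) {
            return codes;
        }
        if (queue.size() == 1) {          // a single distinct symbol still needs one bit
            codes.put(queue.peek().symbol, "0");
            return codes;
        }
        while (queue.size() > 1) {        // repeatedly merge the two lightest subtrees
            Node first = queue.poll();
            Node second = queue.poll();
            queue.add(new Node(first, second));
        }
        assign(queue.poll(), "", codes);
        return codes;
    }

    private static void assign(Node node, String prefix, Map<Character, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.symbol, prefix);
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        System.out.println(buildCodes("aaaabbc"));
    }
}
```

Frequent symbols end up near the root and so receive short codes, which is what makes the code optimal among codes that assign a whole number of bits to each symbol.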

Ziv and Lempel (1977) designed a compression method using encoding segments. These

segments are stored in a dictionary that is built during the compression process. When a

segment of the dictionary is encountered later while scanning the original text it is substituted

by its index in the dictionary. In the model where portions of the text are replaced by pointers

to previous occurrences, Ziv and Lempel's compression scheme can be proved to be

asymptotically optimal (on large enough texts satisfying good conditions on the probability

distribution of symbols). The dictionary is the central point of the algorithm. Furthermore, a

hashing technique makes its implementation efficient. This technique, improved by Welch

(1984), is implemented by the "compress" command of the UNIX operating system.

The problems and algorithms discussed above give a sample of text processing methods.

Several other algorithms improve on their performance when, for example, the memory space or the

number of processors of a parallel machine is considered. Methods also extend

to other discrete objects such as trees and images.

LZW ALGORITHM

Compressor algorithm:

w = NIL;
while (read a char c) do
    if (wc exists in dictionary) then
        w = wc;
    else
        add wc to the dictionary;
        output the code for w;
        w = c;
    endif
done
output the code for w;

Decompressor algorithm:

read a code k;
entry = dictionary entry for k;
output entry;
w = entry;
while (read a code k) do
    if (index k exists in dictionary) then
        entry = dictionary entry for k;
    else if (k == currSizeDict) then
        entry = w + w[0];
    else
        signal invalid code;
    endif
    output entry;
    add w + entry[0] to the dictionary;
    w = entry;
done
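
The two LZW procedures above can be sketched in Java as follows. This is only an illustration under stated assumptions: the dictionary is pre-loaded with the 256 single-character strings, input characters are limited to the range 0 to 255, codes are kept as a list of integers rather than packed into bits, and the dictionary is never reset. The class name LzwSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LzwSketch {

    /** Compressor: emits one dictionary code per longest known prefix. */
    static List<Integer> compress(String text) {
        Map<String, Integer> dictionary = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dictionary.put(String.valueOf((char) i), i);
        }
        List<Integer> output = new ArrayList<>();
        String w = "";
        for (char c : text.toCharArray()) {
            if (c > 255) {
                throw new IllegalArgumentException("sketch handles characters 0..255 only");
            }
            String wc = w + c;
            if (dictionary.containsKey(wc)) {
                w = wc;
            } else {
                output.add(dictionary.get(w));
                dictionary.put(wc, dictionary.size()); // next free code
                w = String.valueOf(c);
            }
        }
        if (!w.isEmpty()) {
            output.add(dictionary.get(w));
        }
        return output;
    }

    /** Decompressor: rebuilds the same dictionary while reading the codes. */
    static String decompress(List<Integer> codes) {
        if (codes.isEmpty()) {
            return "";
        }
        List<String> dictionary = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            dictionary.add(String.valueOf((char) i));
        }
        int first = codes.get(0);
        if (first < 0 || first >= dictionary.size()) {
            throw new IllegalArgumentException("invalid code " + first);
        }
        String w = dictionary.get(first);
        StringBuilder output = new StringBuilder(w);
        for (int i = 1; i < codes.size(); i++) {
            int k = codes.get(i);
            String entry;
            if (k >= 0 && k < dictionary.size()) {
                entry = dictionary.get(k);
            } else if (k == dictionary.size()) {
                entry = w + w.charAt(0);   // code was defined by the step that emitted it
            } else {
                throw new IllegalArgumentException("invalid code " + k);
            }
            output.append(entry);
            dictionary.add(w + entry.charAt(0));
            w = entry;
        }
        return output.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = compress("TOBEORNOTTOBEORTOBEORNOT");
        System.out.println(codes);
        System.out.println(decompress(codes));
    }
}
```

The branch for k equal to the current dictionary size covers inputs such as "ABABABA", where the compressor emits a code in the same step in which it creates it.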

DEFINITION OF THE PROBLEM

Problem Statement:
In today's world of computing, it is hardly possible to do without graphics, images and sound.

Just by looking at the applications around us, the Internet, development of Video CDs

(Compact Disks), Video Conferencing, and much more, all these applications use graphics

and sound intensively.

Many of us have surfed the Internet; have you ever become so frustrated waiting

for a graphics-intensive web page to open that you stopped the transfer? I bet you

have. Guess what would happen if those graphics were not compressed?

Uncompressed graphics, audio and video data consume a very large amount of physical

storage which, in the case of uncompressed video, even present CD technology is unable to

handle. Why is this so?

CASE 1

Take, for instance, the display of TV-quality full-motion video: how much

physical storage will be required? Szuprowics states that "TV-quality video requires 720

kilobytes per frame (kbpf) displayed at 30 frames per second (fps) to obtain a full-motion

effect, which means that one second of digitised video consumes approximately 22 MB

(megabytes) of storage. A standard CD-ROM disk with 648 MB capacity and data transfer

rate of 150 KBps could only provide a total of 30 seconds of video and would take 5 seconds

to display a single frame." Based on Szuprowics's statement we can see that this is clearly

unacceptable.
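
The figures in the quotation can be checked with a short calculation. The Java sketch below simply reproduces the arithmetic, assuming 1 MB = 1000 KB for rounding purposes; the class and constant names are hypothetical.

```java
public class VideoStorageSketch {

    // Figures taken from the quotation above.
    static final double KB_PER_FRAME = 720.0;
    static final double FRAMES_PER_SECOND = 30.0;
    static final double CD_CAPACITY_MB = 648.0;
    static final double CD_TRANSFER_KB_PER_SECOND = 150.0;

    // Assumption of this sketch: decimal megabytes.
    static final double KB_PER_MB = 1000.0;

    /** Storage consumed by one second of uncompressed video, in MB. */
    static double megabytesPerSecondOfVideo() {
        return KB_PER_FRAME * FRAMES_PER_SECOND / KB_PER_MB;
    }

    /** How many seconds of uncompressed video fit on one CD-ROM. */
    static double secondsOfVideoOnDisc() {
        return CD_CAPACITY_MB / megabytesPerSecondOfVideo();
    }

    /** Time needed to read a single frame at the CD-ROM transfer rate. */
    static double secondsToReadOneFrame() {
        return KB_PER_FRAME / CD_TRANSFER_KB_PER_SECOND;
    }

    public static void main(String[] args) {
        System.out.println("MB per second of video: " + megabytesPerSecondOfVideo());
        System.out.println("Seconds of video per disc: " + secondsOfVideoOnDisc());
        System.out.println("Seconds to read one frame: " + secondsToReadOneFrame());
    }
}
```

The results, 21.6 MB per second, 30 seconds per disc and 4.8 seconds per frame, agree with the rounded values of 22 MB, 30 seconds and 5 seconds in the quotation.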

Transmission of uncompressed graphics, audio and video is a problem too. Expensive cables

with high bandwidth are required to achieve satisfactory results, which is not feasible for the

general market.

CASE 2

Take, for example, the transmission of an uncompressed audio signal over the line for one

second:

Table is based on Steinmetz and Nahrstedt (1995)

From the table we can see that for better quality of sound transmitted over the channel, both

the bandwidth and the storage requirements increase, and the resulting size is not feasible at all.

Thus, to provide feasible and cost effective solutions, most multimedia systems use

compression techniques to handle graphics, audio and video data streams.

Therefore, in this paper I will address one specific standard of compression, JPEG. At

the same time, I will also go through the basic compression techniques that serve as the

building blocks for JPEG.

This paper focuses on three forms of JPEG image compression: 1) Baseline Lossy JPEG, 2)

Progressive JPEG and 3) Motion JPEG. The algorithm, characteristics and advantages of each will

be discussed.

I hope that by the end of the paper the reader will have gained more knowledge of JPEG and will understand

how it works, and not just know that it is another image compression standard.

SYSTEM ANALYSIS AND DESIGN

Analysis and design refers to the process of examining a business situation with the intent of

improving it through better procedures and methods.

The two main steps of development are:

Analysis

Design

ANALYSIS:

System analysis is conducted with the following objectives in mind:

Identify the user's needs.

Evaluate the system concept for feasibility.

Perform economic and technical analysis.

Allocate functions to hardware, software, people, and other system elements.

Establish cost and schedule constraints.

Create a system definition that forms the foundation for all subsequent engineering work.

Both hardware and software expertise are required to successfully attain the objectives listed

above.

DESIGN

The most creative and challenging phase of the system life cycle is system design. The term

design describes a final system and the process by which it is developed. It refers to the

technical specifications (analogous to the engineer's blueprints) that will be applied in

implementing the candidate system. It also includes the construction of programs and

program testing. The key question here is: How should the problem be solved? The major

steps in designing are:

The first step is to determine how the output is to be produced and in what format. Samples of

the output (and input) are also presented. Second, input data and master files (data base) have

to be designed to meet the requirements of the proposed output. The operational (processing)

phases are handled through program construction and testing, including a list of the programs

needed to meet the system's objectives and complete documentation. Finally, details related

to justification of the system and an estimate of the impact of the candidate system on the

user and the organization are documented and evaluated by management as a step towards

implementation.

The final report prior to the implementation phase includes procedural flowcharts, record

layouts, report layouts, and workable plans for implementing the candidate system.

Information on personnel, money, h/w, facilities and their estimated cost must also be

available. At this point, projected costs must be close to actual cost of implementation.

In some firms, separate groups of programmers do the programming, whereas other firms

employ analyst-programmers who do the analysis and design as well as code programs. For

this discussion, we assume that two separate persons carry out analysis and programming.

There are certain functions, though, that the analyst must perform while programs are being

written.

SYSTEM DESIGN:

Software design sits at the technical kernel of software engineering and is applied regardless

of the software process model that is used. Beginning once software requirements have been

analyzed and specified, software design is the first of the three technical activities (design,

code generation and test) that are required to build and verify the software. Each activity

transforms information in a manner that ultimately results in validated computer software.

The importance of software design can be stated with a single word: quality. Design is the place where quality is fostered in software engineering.

Design provides us with representations of software that can be assessed for quality. Design is

the only way that we can accurately translate a customer's requirements into a finished

software product or system. Software design serves as the foundation for all the software

engineering and software support steps that follow. Without design we risk building an

unstable system: one that will fail when small changes are made; one that may be difficult to

test; one whose quality cannot be assessed until late in the software process, when time is

short and many dollars have already been spent.

DESIGN OBJECTIVES:

Design phase of software development deals with transforming the customer requirements as

described in the SRS document into a form implementable using a programming language.

However, we can broadly classify various design activities into two important parts:

Preliminary (or high level) design

Detailed design

During high level design, different modules and the control relationships among them are

identified and interfaces among these modules are defined. The outcome of high level design

is called the Program Structure or Software Architecture. The structure chart is used to

represent the control hierarchy in a high level design.

During detailed design, the data structure and the algorithms used by different modules are

designed. The outcome of the detailed design is usually known as the Module Specification

document.

A good design should capture all the functionality of the system correctly. It should be easily

understandable and efficient, and it should be easily amenable to change, that is, easily

maintainable. Understandability of a design is a major factor, which is used to evaluate the

goodness of a design, since a design that is easily understandable is also easy to maintain and

change.

In order to enhance the understandability of a design, it should have the following features:

Use of consistent and meaningful names for various design components.

Use of cleanly decomposed set of modules.

Neat arrangement of modules in a hierarchy, that is, a tree-like diagram.

Modular design is one of the fundamental principles of a good design. Decomposition of a

problem into modules facilitates taking advantage of the divide and conquer principle: if

different modules are almost independent of each other, then each module can be understood

separately, eventually reducing the complexity greatly.

Clean decomposition of a design problem into modules means that the modules in a software

design should display high cohesion and low coupling.

Cohesion is a measure of the functional strength of a module.

Coupling of a module with another module is a measure of the degree of functional

interdependence or interaction between the two modules.

A module having high cohesion and low coupling is said to be functionally independent of

other modules. By the term functional independence we mean that a cohesive module

performs a single task or function.

A functionally independent module has minimal interaction with other modules. Functional

independence is a key to good design primarily due to the following reasons:

Functional independence reduces error propagation. An error existing in one module does not

directly affect other modules, and an error existing in other modules does not directly

affect this module.

Reuse of a module is possible because each module performs some well-defined and precise

function and the interface of the module with other modules is simple and minimal.

Complexity of the design is reduced because different modules can be understood in isolation,

as modules are more or less independent of each other.

DESIGN PRINCIPLES:

Top-Down and Bottom-Up Strategies

Modularity

Abstraction

Problem Partitioning and Hierarchy

TOP-DOWN AND BOTTOM-UP STRATEGIES:

A system consists of components, which have components of their own; indeed a system is a

hierarchy of components. The highest-level components correspond to the total system. To

design such hierarchies there are two possible approaches: top-down and bottom-up. The top-

down approach starts from the highest-level component of the hierarchy and proceeds

through to lower levels. By contrast, a bottom-up approach starts with the lowest-level

component of the hierarchy and proceeds through progressively higher levels to the top-level

component.

Top-down design methods often result in some form of stepwise refinement. Starting from

an abstract design, in each step the design is refined to a more concrete

level, until we reach a level where no more refinement is needed and the design can be

implemented directly. Bottom-up methods work with layers of abstraction. Starting from

the very bottom, operations that provide a layer of abstraction are implemented. The

operations of this layer are then used to implement more powerful operations and a still

higher layer of abstraction, until the stage is reached where the operations supported by the

layer are those desired by the system.

MODULARITY:

The real power of partitioning comes if a system is partitioned into modules so that the

modules are solvable and modifiable separately. It will be even better if the modules are also

separately compilable. A system is considered modular if it consists of discrete components

so that each component can be implemented separately, and a change to one component has

minimal impact on other components.

Modularity is clearly a desirable property in a system. Modularity helps in system

debugging (isolating the system problem to a component is easier if the system is modular), in

system repair (changing a part of the system is easy as it affects few other parts) and in system

building (a modular system can be easily built by putting its modules together).

ABSTRACTION :

Abstraction is a very powerful concept that is used in all engineering disciplines. It is a tool

that permits a designer to consider a component at an abstract level without worrying about

the details of the implementation of the component. Any component or system provides some

services to its environment. An abstraction of a component describes the external behavior of

that component without bothering with the internal details that produce the behavior.

Presumably, the abstract definition of a component is much simpler than the component

itself.

There are two common abstraction mechanisms for software systems: Functional abstraction

and Data abstraction.

In functional abstraction, a module is specified by the function it performs. For example, a

module to compute the log of a value can be abstractly represented by the function log.

Similarly, a module to sort an input array can be represented by the specification of sorting.

Functional abstraction is the basis of partitioning in function-oriented approaches. That is,

when the problem is being partitioned, the overall transformation function for the system is

partitioned into smaller functions that comprise the system function. The decomposition of

the system is in terms of functional modules.

The second unit for abstraction is data abstraction. Data abstraction forms the basis for

object-oriented design. In using this abstraction, a system is viewed as a set of objects

providing some services. Hence, the decomposition of the system is done with respect to the

objects the system contains.
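
The difference between the two mechanisms can be sketched in Java. In this illustration, which is an assumption and not taken from the project code, sorted stands for functional abstraction (the caller relies only on the specification "returns a sorted copy"), while the FrequencyTable interface stands for data abstraction (the caller sees the services offered, not the representation behind them). All names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class AbstractionSketch {

    // Functional abstraction: the module is known by the function it performs.
    // Callers need not know which sorting algorithm is used internally.
    static int[] sorted(int[] input) {
        int[] copy = Arrays.copyOf(input, input.length);
        Arrays.sort(copy);
        return copy;
    }

    // Data abstraction: an object is known by the services it provides.
    interface FrequencyTable {
        void add(char symbol);
        int count(char symbol);
    }

    // One possible representation; it can be replaced without affecting callers.
    static final class MapFrequencyTable implements FrequencyTable {
        private final Map<Character, Integer> counts = new HashMap<>();

        @Override
        public void add(char symbol) {
            counts.merge(symbol, 1, Integer::sum);
        }

        @Override
        public int count(char symbol) {
            return counts.getOrDefault(symbol, 0);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sorted(new int[] {3, 1, 2})));

        FrequencyTable table = new MapFrequencyTable();
        for (char c : "abracadabra".toCharArray()) {
            table.add(c);
        }
        System.out.println("count of 'a': " + table.count('a'));
    }
}
```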

Problem Partitioning and Hierarchy:

When solving a small problem, the entire problem can be tackled at once. For solving larger

problems, the basic principle is the time-tested principle of divide and conquer. Clearly,

dividing in such a manner that all the divisions have to be conquered together is not the intent

of this wisdom. This principle, if elaborated, would mean: divide into smaller pieces, so that

each piece can be conquered separately.

Problem partitioning, which is essential for solving a complex problem, leads to hierarchies

in the design. That is, the design produced by using problem partitioning can be represented

as a hierarchy of components. The relationship between the elements in this hierarchy can

vary depending on the method used. For example, the most common is the whole-part

relationship. In this the system consists of some parts, each part consists of subparts, and so

on. This relationship can be naturally represented as a hierarchical structure between various

system parts. In general hierarchical structure makes it much easier to comprehend a complex

system. Due to this, all design methodologies aim to produce a design that has nice

hierarchical structures.

STAGES IN A SYSTEMS LIFE CYCLE

Requirement Determination

A system is intended to meet the needs of an organization so as to save storage capacity. Thus

the first step in the design is to specify these needs or requirements, that is, to

determine the requirements to be met by the system. Meetings of prospective user

departments are held and, through discussions, priorities among various applications are

determined, subject to the constraints of available computer memory, bandwidth, time taken

for transferring and budget.

Requirement Specification

The top management of an organization first decides that a compression & decompression

system would be desirable to improve the operations of the organization. Once this basic

decision is taken, a system analyst is consulted. The first job of the system analyst is to

understand the existing system. During this stage he studies the various aspects of the

algorithms and data structures. Based on this he identifies what aspects of the operations of the

project need changes. The analyst discusses these functions with the users and determines the

areas where changes can be made effectively. The applications in which file transfer is

allowed are checked. It is important to get the users involved from the initial stages of the

development of an application.

Feasibility Analysis

Having drawn up the rough specification, the next step is to check whether it is feasible to

implement the system. A feasibility study takes into account various constraints within which

the system should be implemented and operated. The resources needed for implementation

such as computing equipment, manpower and cost are estimated, based on the specifications

of users' requirements. These estimates are compared with the available resources. A

comparison of the cost of the system and the benefits which will accrue is also made. This

document, known as the feasibility report, is given to the management of the organization.

Final Specifications

The developer of this s/w studies this feasibility report and suggests modifications in the

requirements, if any. Knowing the constraints on available resources, and the modified

requirements specified by the organization, the final specifications of the system to be

developed are drawn up by the system analyst. These specifications should be in a form

which can be easily understood by the users. The specifications state what the system would

achieve; they do not describe how the system would do it. These specifications are given back

to the users who study them, consult their colleagues and offer suggestions to the systems

analyst for appropriate changes. These changes are incorporated by the system analyst and a

new set of specifications is given back to the users. After discussions between the system

analyst and the users, the final specifications are drawn up and approved for

implementation. Along with this, criteria for system approval are specified, which will

normally include a system test plan.

Hardware Study

Based on the finalized specifications it is necessary to determine the configuration of

hardware and support software essential to execute the specified application.

System Design

The next step is to develop the logical design of the system. The inputs to the system design

phase are functional specifications of the system and details about the computer

configuration. During this phase the logic of the programs is designed, and program test

plans and implementation plan are drawn up. The system design should begin from the

objectives of the system.

System Implementation

The next phase is implementation of the system. In this phase all the programs are written,

user operational documents are written, users are trained, and the system is tested with operational

data.

System Evaluation

After the system has been in operation for a reasonable period, it is evaluated and a plan for

its improvement is drawn up. This is called the system life cycle. The shortcomings of a system

(namely, the difference between what a user expected from the system and what he actually got) are realized only after

a system is used for a reasonable time. Similarly, the shortcomings in this system are realized

only after it is implemented and used for some time.

System Modification

A computer-based system is a piece of software. It can be modified. Modifications will

definitely cost time and money. But users expect modifications to be made, as the name

software itself implies that it is soft and hence changeable.

Further, systems designed for use by clients cannot be static. These systems are intended for

real-world problems. The environment in which an activity is conducted never remains static.

New changes occur, and new, more efficient algorithms emerge as research continues.

Thus a system which cannot be modified to fulfill the changing requirements of an

organization is bad. A system should be designed for change. The strength of a good

computer-based system is that it is amenable to change. A good system designer is one who

can foresee what aspects of a system would change and would design the system in a flexible

way to easily accommodate changes.

SYSTEM PLANNING

To understand system development, we need to recognize that a candidate system has

a life cycle, just like a living system or a new product. System analysis and design are keyed to

the system planning. The analyst must progress from one stage to another methodically,

answering key questions and achieving results in each stage.

RECOGNITION OF NEED

One must know what the problem is before it can be solved. The basis for a

candidate system is recognition of a need for improving an information system or procedure.

The need leads to a preliminary survey or an initial investigation to determine whether an

alternative system can solve the problem. It entails looking into the duplication of effort,

bottlenecks, inefficient existing procedure, or whether parts of the existing system would be

candidates for computerization.

FEASIBILITY STUDY:

Many feasibility studies are disillusioning for both users and analysts. First, the study often

presupposes that when the feasibility document is being prepared, the analyst is in a position

to evaluate solutions. Second, most studies tend to overlook the confusion inherent in

system development: the constraints and the assumed attitudes. If the feasibility study is to serve as a

decision document, it must answer three key questions:

Is there a new and better way to do the job that will benefit the user?

What are the costs and savings of the alternative(s)?

What is recommended?

The most successful system projects are not necessarily the biggest or most visible in a

business but rather those that truly meet user expectations. More projects fail because of inflated

expectations than for any other reason.

Feasibility study is broadly divided into three parts:

Economic feasibility

Technical feasibility

Operational feasibility

1. ECONOMIC FEASIBILITY:

Economic analysis is the most frequently used method for evaluating the effectiveness of a system. It determines the benefits and savings that are

expected from the system and compares them with costs. If benefits outweigh costs then the

decision is made to design and implement the system. Otherwise, further justification or

alteration in the proposed system will have to be made if it is to have a chance of being

approved. This is an ongoing effort that improves in accuracy at each phase of the system life

cycle.

So in our system we have considered these categories for the purpose of

cost/benefits analysis or economic feasibility.

1. Hardware Cost:

It relates to the actual purchase or lease of the computer and peripherals (for example, printer,

disk drive, tape unit). Determining the actual cost of the hardware is generally more difficult

when various users share the system than for a dedicated stand-alone system. In some cases,

the best way to control for this cost is to treat it as an operating cost.

In this system we are taking it as operating cost so as to minimize the cost of the initial

installation of the computer hardware.

2. Personnel Cost:

It includes EDP staff salaries and benefits (health insurance, vacation time, sick pay,

etc.) as well as pay for those involved in developing the system. Costs incurred during the

development of a system are one-time costs and are labeled development costs. Once the system is

installed, the costs of operating and maintaining the system become recurring costs.

Facility costs are expenses incurred in the preparation of the physical site where the

application or the computer will be in operation. This includes wiring, flooring, acoustics,

lighting and air conditioning. These costs are treated as one-time costs and are incorporated

into the overall cost estimate of the candidate system.

Our proposed system incurs only wiring costs, since nowadays all sites are already well

maintained with respect to flooring and lighting. Thus it would not incur extra expense.

Operating cost includes all costs associated with the day-to-day operation of the system; the

amount depends on the number of shifts, the nature of applications, and the caliber of the

operating staff. There are various ways of covering the operating costs. One approach is to

treat the operating cost as overhead. Another approach is to charge each authorized user

for the amount of processing they request from the system. The amount charged is based on

the computer time, staff time, and the volume of the output produced. In any case, some

accounting is necessary to determine how operating costs should be handled.

As our candidate system is not so big, we require only one server and a few terminals for

maintaining and processing data. Their costs can be easily determined at the

installation time of the proposed system. As a computer is also a machine, it is subject to

depreciation; by using any of the depreciation methods we can determine its annual cost after

deducting the depreciation cost.

Supply costs are variable costs that increase with increased use of paper, ribbons, disks, and the like. They

should be estimated and included in the overall cost of the system.

A system is also expected to provide benefits. The first task is to identify each benefit and

then assign a monetary value to it for cost/benefit analysis. Benefits may be tangible and

intangible, direct and indirect.

The two major benefits are improving performance and minimizing the cost of

processing. The performance category emphasizes improvements in the accuracy of or access

to information and easier access to the system by authorized users. Minimizing costs through

an efficient system (error control or reduction of staff) is a benefit that should be measured and

included in cost/benefit analysis.

The supply cost in our proposed system depends on the number of customers, so it is sometimes

higher and sometimes lower. It is not easy to estimate this cost; what we can do is

make a rough estimate and, when the system is installed at a client site,

compare this rough estimate with the actual expenses incurred.

2. TECHNICAL FEASIBILITY:

Technical feasibility centers on the existing computer system (hardware, software, etc.) and

to what extent it can support the proposed addition. For example, if the current computer is

operating at 80 percent capacity (an arbitrary ceiling), then running another application could

overload the system or require additional hardware. This involves financial consideration to

accommodate technical enhancements. If the budget is a serious constraint, then the project is

judged not feasible.

Presently at our client site all the work is done manually, so the question of overloading the system

or requiring additional hardware does not arise; thus our candidate system is

technically feasible.

3. OPERATIONAL FEASIBILITY:
People are inherently resistant to change, and computers have been known to facilitate change.

An estimate should be made of how strong a reaction the user staff is likely to have towards

the development of a computerized system. It is common knowledge that computer

installations have something to do with turnover, transfers, retraining, and changes in

employee job status. Therefore, it is understandable that the introduction of a candidate

system requires special efforts to educate, sell, and train the staff on new ways of conducting

business.

There is no doubt that people are inherently resistant to change, and computers

have been known to facilitate change. In today's world nearly all work is computerized,

and people generally benefit from computerization. As far as our system is concerned, it is

only going to benefit the staff of the clinic in their daily routine work. There is no danger of

anyone losing a job or not getting proper attention after the installation of our proposed

system. Thus our system is operationally feasible also.

REQUIREMENT ANALYSIS

Analysis is a detailed study of the various operations performed by a system and

their relationship within and outside the system. One aspect of analysis is defining the

boundaries of the system and determining whether or not a candidate system should consider

other related systems. During analysis, data are collected on the available files, decision

points, and transactions handled by the present system.

Dataflow diagrams, interviews, on-site observations, and questionnaires are

examples of the tools used. The interview is a commonly used tool in analysis. It requires special skills and

sensitivity to the subjects being interviewed. Bias in data collection and interpretation can be

a problem; training, experience, and common sense are required for collection of the

information needed to do the analysis.

Once analysis is completed, the next step is to decide how the problem might be

solved. Thus, in system design, we move from the logical to the physical aspects of the

System Planning.

HARDWARE & SOFTWARE REQUIREMENTS

HARDWARE SPECIFICATIONS:

Processor: Pentium I/II/III or higher

RAM: 128 MB or higher

Monitor: 15 inch (digital) with 800 x 600 support

Keyboard: 101-key keyboard

Mouse: 2-button, serial or PS/2

Tools / Platform Language Used:

Language: Java

OS: Any OS such as Windows XP/98/NT/Vista

PROJECT DESCRIPTION

What is Huffman Algorithm:

Huffman coding is an algorithm presented by David Huffman in 1952. It assigns each symbol a

code made up of a whole number of bits. Among prefix codes with integer-length codewords,

Huffman coding is the best option because it is optimal: no other such code has a shorter average length.

Huffman coding can be used, for example, to compress the bytes output by lzp. First we have to

know the probabilities of those bytes; a qsm model can be used for that. Based on the

probabilities the algorithm builds the codes, which can then be output. Decoding is more or less the

reverse process: based on the probabilities and the coded data, it outputs the decoded byte.

To make the codes the algorithm uses a binary tree. It stores the symbols and

their probabilities there. The position of a symbol depends on its probability. The algorithm then assigns a

code based on the symbol's position in the tree. The codes have the prefix property and are

instantaneously decodable, so they are well suited for compression and decompression.

The Huffman compression algorithm assumes data files consist of some byte values that

occur more frequently than other byte values in the same file. This is very true for text files

and most raw gif images, as well as EXE and COM file code segments.

By analyzing the file, the algorithm builds a "Frequency Table" with one entry for each byte value.

From the frequency table the algorithm then builds the "Huffman Tree".

The purpose of the tree is to associate each byte value with a bit string of variable

length. The more frequently used characters get shorter bit strings, while the less frequent

characters get longer bit strings. In this way the data file may be compressed.
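
The frequency-table step can be sketched in Java as follows. This is a minimal illustration and not the project's actual code: the class name FrequencyTableSketch and the sample string are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
final class FrequencyTableSketch {
    /** Counts how often each of the 256 possible byte values occurs in the data. */
    static int[] countFrequencies(byte[] data) {
        int[] freq = new int[256];
        for (byte b : data) {
            freq[b & 0xFF]++; // mask so that negative Java bytes index 128..255
        }
        return freq;
    }

    public static void main(String[] args) {
        byte[] sample = "abracadabra".getBytes(StandardCharsets.US_ASCII);
        int[] freq = countFrequencies(sample);
        // Print only the byte values that actually occur.
        for (int value = 0; value < freq.length; value++) {
            if (freq[value] > 0) {
                System.out.println((char) value + " : " + freq[value]);
            }
        }
    }
}
```

An array of 256 counters is enough because the table is indexed by byte value, whatever the type of file being compressed.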

To compress the file, the Huffman algorithm reads the file a second time, converting each

byte value into the bit string assigned to it by the Huffman Tree and then writing the bit string

to a new file. The decompression routine reverses the process by reading in the stored

frequency table (presumably stored in the compressed file as a header) that was used in

compressing the file. With the frequency table the decompressor can then re-build the

Huffman Tree, and from that, translate all the bit strings stored in the compressed file back to

their original byte values.

Huffman Encoding :

Huffman encoding works by substituting more efficient codes for data and the codes are then

stored as a conversion table and passed to the decoder before the decoding process takes

place. This approach was first introduced by David Huffman in 1952 for text files and has

spawned many variations. Even CCITT (International Telegraph and Telephone Consultative

Committee) one-dimensional encoding, used for bilevel (black and white) image data

telecommunications, is based on Huffman encoding.

Algorithm :

Basically in Huffman Encoding each unique value is assigned a binary code, with codes

varying in length. Shorter codes are then used for more frequently used values. These codes

are then stored into a conversion table and passed to the decoder before any decoding is done.

So how does the encoder start assigning codes to the values?

Let's imagine that there is this data stream that is going to be encoded by Huffman Encoding :

AAAABCDEEEFFGGGH

The frequencies of the unique values that appear are as follows:

A : 4, B : 1, C : 1, D : 1, E : 3, F : 2, G : 3, H : 1

Based on the frequency count the encoder can generate a statistical model reflecting the

probability that each value will appear in the data stream :

A : 0.25, B : 0.0625, C : 0.0625, D : 0.0625, E : 0.1875, F : 0.125, G : 0.1875, H : 0.0625

From the statistical model the encoder can build a minimum-length code for each value and store it in the

conversion table. The algorithm pairs up the 2 values with the least probability; in this case we

take B and C and combine their probabilities so that the pair is treated as one unique value. Along the

way each value B, C and even BC is assigned a 0 or 1 on its branch. This means that

0 and 1 will be the least significant bits of the codes for B and C respectively. From there the

algorithm searches the remaining values for another 2 values with the smallest probability

and repeats the whole process until the branches join up to form the structure of an upside-down

tree. The whole process is illustrated in the figures that follow.

[Figures: step-by-step construction of the Huffman tree for the example data stream]
The binary code for each unique value can then be read by following the path down from the top

of the upside-down tree (most significant bit) until we reach the unique value we want

(least significant bit). For example, suppose we want to find the code for B: follow the path

shown by the blue arrow on the diagram above to arrive at B. Beside each of

the branches we take there is a bit value; combining the values we come across

gives the code for B: 1000. The same approach is used to find the codes of all the

unique values, and the codes are then stored in the conversion table.
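
The pairing procedure described above can be sketched in Java with java.util.PriorityQueue. This is an illustrative sketch, not the project's Encoder: the class HuffmanTreeSketch and its methods are hypothetical names. Because several values share the same frequency, the tree this sketch builds may differ in shape from the one in the diagram, but every Huffman tree for these frequencies yields the same total encoded size.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical class, named for illustration only.
final class HuffmanTreeSketch {
    /** A tree node: leaves carry a symbol, internal nodes carry two children. */
    static final class Node {
        final char symbol;      // meaningful only for leaves
        final int weight;       // symbol frequency, or sum of the children's weights
        final Node left, right; // both null for a leaf

        Node(char symbol, int weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /** Counts the occurrences of each character in the data stream. */
    static Map<Character, Integer> countSymbols(String data) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : data.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    /** Repeatedly merges the two lightest trees until a single tree remains. */
    static Node buildTree(Map<Character, Integer> freq) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node first = queue.poll();
            Node second = queue.poll();
            queue.add(new Node('\0', first.weight + second.weight, first, second));
        }
        return queue.poll();
    }

    /** Sums frequency times code length (leaf depth) over all leaves. */
    static int totalBits(Node node, int depth) {
        if (node.isLeaf()) {
            return node.weight * depth;
        }
        return totalBits(node.left, depth + 1) + totalBits(node.right, depth + 1);
    }

    public static void main(String[] args) {
        Node root = buildTree(countSymbols("AAAABCDEEEFFGGGH"));
        System.out.println("Huffman size: " + totalBits(root, 0) + " bits"); // 45 bits
        System.out.println("Fixed 8-bit size: " + 16 * 8 + " bits");         // 128 bits
    }
}
```

For the example stream the encoded data takes 45 bits instead of 128, before counting the cost of storing the conversion table.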

Code Construction:

To assign codes you need only a single pass over the symbols, but before doing that you need

to calculate where the codes for each codelength start. To do so consider the following: The

longest code is all zeros and each code differs from the previous one by 1 (the codes are stored such that

the last bit of the code is in the least significant bit of a byte/word).

In the example this means:

Codes with length 4 start at 0000.

Codes with length 3 start at (0000+4*1)>>1 = 010. There are 4 codes with length

4 (that is where the 4 comes from), so the next length 4 code would start at 0100. But

since it shall be a length 3 code we remove the last 0 (if we ever remove a 1 there is a

bug in the codelengths).

Codes with length 2 start at (010+2*1)>>1 = 10.

Codes with length 1 start at (10+2*1)>>1 = 10.

Codes with length 0 start at (10+0*1)>>1 = 1. If the start value for

codelength 0 is anything other than 1, there is a bug in the codelengths!

Then visit each symbol in alphabetical sequence (to ensure the second condition) and assign

the startvalue for the codelength of that symbol as code to that symbol. After that increment

the startvalue for that codelength by 1.
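
The start-value rule and the single assignment pass can be sketched in Java as below. The class CanonicalCodeSketch is a hypothetical name, and the assignment of code lengths to particular symbols in main (A and G with length 2, E and F with length 3, the rest with length 4) is an assumption chosen to match the counts in the example: four codes of length 4, two of length 3, and two of length 2.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, named for illustration only.
final class CanonicalCodeSketch {
    /**
     * count[len] is the number of symbols with code length len (index 0 unused).
     * Returns start[len], the first code value for each length, using
     * start[len - 1] = (start[len] + count[len]) >> 1 with start[maxLen] = 0.
     */
    static int[] startCodes(int[] count) {
        int maxLen = count.length - 1;
        int[] start = new int[maxLen + 1];
        for (int len = maxLen; len >= 1; len--) {
            int next = start[len] + count[len];
            if ((next & 1) != 0) { // we would be removing a 1 bit
                throw new IllegalArgumentException("bug in the code lengths at length " + len);
            }
            start[len - 1] = next >> 1;
        }
        if (start[0] != 1) {
            throw new IllegalArgumentException("bug in the code lengths");
        }
        return start;
    }

    /** Renders the lowest 'length' bits of value, most significant bit first. */
    static String toBits(int value, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = length - 1; i >= 0; i--) {
            sb.append((value >> i) & 1);
        }
        return sb.toString();
    }

    /** Visits the symbols in alphabetical order and hands out consecutive codes. */
    static Map<Character, String> assignCodes(Map<Character, Integer> lengths) {
        int maxLen = 0;
        for (int len : lengths.values()) {
            maxLen = Math.max(maxLen, len);
        }
        int[] count = new int[maxLen + 1];
        for (int len : lengths.values()) {
            count[len]++;
        }
        int[] next = startCodes(count);
        Map<Character, Integer> sorted = new TreeMap<>(lengths);
        Map<Character, String> codes = new TreeMap<>();
        for (Map.Entry<Character, Integer> e : sorted.entrySet()) {
            int len = e.getValue();
            codes.put(e.getKey(), toBits(next[len]++, len));
        }
        return codes;
    }

    public static void main(String[] args) {
        Map<Character, Integer> lengths = new TreeMap<>();
        lengths.put('A', 2); lengths.put('G', 2);
        lengths.put('E', 3); lengths.put('F', 3);
        lengths.put('B', 4); lengths.put('C', 4);
        lengths.put('D', 4); lengths.put('H', 4);
        System.out.println(assignCodes(lengths));
    }
}
```

With these lengths the sketch assigns 10 and 11 to A and G, 010 and 011 to E and F, and 0000 to 0011 to B, C, D and H, so only the code lengths need to be stored to rebuild the codes.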

Maximum Length of a Huffman Code:

Apart from the ceil(log2(alphabetsize)) boundary for the nonzero bits in this particular

canonical huffman code it is useful to know the maximum length a huffman code can reach.

In fact there are two limits which must both be fulfilled.

No Huffman code can be longer than alphabetsize-1. Proof: a Huffman tree with N leaves has N-1

internal nodes, so no path from the root to a leaf can pass through more than N-1 branches.

The maximum length of the code also depends on the number of samples you use to derive

your statistics from; the sequence is as follows (the samples include the fake samples to give

each symbol a nonzero probability!):

The Compression or Huffing Program:

To compress a file (sequence of characters) you need a table of bit encodings, e.g., an ASCII

table, or a table giving a sequence of bits that's used to encode each character. This table is

constructed from a coding tree using root-to-leaf paths to generate the bit sequence that

encodes each character.

Assuming you can write a specific number of bits at a time to a file, a compressed file is

made using the following top-level steps. These steps will be developed further into sub-

steps, and you'll eventually implement a program based on these ideas and sub-steps.

Build a table of per-character encodings. The table may be given to you, e.g., an ASCII table,

or you may build the table from a Huffman coding tree.

Read the file to be compressed (the plain file) and process one character at a time. To process

each character find the bit sequence that encodes the character using the table built in the

previous step and write this bit sequence to the compressed file.
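
Java streams write whole bytes, so writing "a specific number of bits at a time" needs a small buffer that packs bits into bytes. The sketch below shows one way to do this; the class BitWriterSketch, its methods, and the three-symbol table in main are assumptions made for illustration, not the project's code.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, named for illustration only.
final class BitWriterSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int buffer = 0;   // bits collected so far, oldest bit highest
    private int bitCount = 0; // number of valid bits in buffer (0..7)

    /** Appends one bit and flushes a byte whenever eight bits are collected. */
    void writeBit(int bit) {
        buffer = (buffer << 1) | (bit & 1);
        bitCount++;
        if (bitCount == 8) {
            out.write(buffer);
            buffer = 0;
            bitCount = 0;
        }
    }

    /** Appends a bit sequence given as a string of '0' and '1' characters. */
    void writeBits(String bits) {
        for (int i = 0; i < bits.length(); i++) {
            writeBit(bits.charAt(i) == '1' ? 1 : 0);
        }
    }

    /** Pads the last partial byte with zero bits and returns all bytes. */
    byte[] finish() {
        if (bitCount > 0) {
            out.write(buffer << (8 - bitCount));
            buffer = 0;
            bitCount = 0;
        }
        return out.toByteArray();
    }

    /** Looks up each character in the table and writes its bit sequence. */
    static byte[] encode(String text, Map<Character, String> table) {
        BitWriterSketch writer = new BitWriterSketch();
        for (char c : text.toCharArray()) {
            writer.writeBits(table.get(c));
        }
        return writer.finish();
    }

    public static void main(String[] args) {
        Map<Character, String> table = new HashMap<>();
        table.put('A', "0");
        table.put('B', "10");
        table.put('C', "11");
        byte[] packed = encode("ABCA", table); // bits 0 10 11 0, then two padding zeros
        System.out.println(Integer.toHexString(packed[0] & 0xFF)); // prints 58
    }
}
```

Because the final byte is padded with zero bits, a decoder needs some way to know where the real data ends, for example a stored symbol count.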

Building the Table for Compression:

To build a table of optimal per-character bit sequences you'll need to build a Huffman coding

tree using the greedy Huffman algorithm. The table is generated by following every root-to-

leaf path and recording the left/right 0/1 edges followed. These paths make the optimal

encoding bit sequences for each character.

There are three steps in creating the table:

1 Count the number of times every character occurs. Use these counts to create an initial

forest of one-node trees. Each node has a character and a weight equal to the number of times

the character occurs.

2 Use the greedy Huffman algorithm to build a single tree. The final tree will be used in the

next step.

3 Follow every root-to-leaf path creating a table of bit sequence encodings for every

character/leaf.
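
Step 3 can be sketched as a recursive walk that records 0 for every left edge and 1 for every right edge. The class CodeTableSketch and the small hand-built tree are assumptions made for illustration; in a full program the tree would come from step 2.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, named for illustration only.
final class CodeTableSketch {
    static final class Node {
        final char symbol;      // meaningful only for leaves
        final Node left, right; // both null for a leaf

        Node(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /** Follows every root-to-leaf path, appending 0 for left and 1 for right. */
    static void collect(Node node, String path, Map<Character, String> table) {
        if (node.isLeaf()) {
            table.put(node.symbol, path);
            return;
        }
        collect(node.left, path + "0", table);
        collect(node.right, path + "1", table);
    }

    static Map<Character, String> buildTable(Node root) {
        Map<Character, String> table = new TreeMap<>();
        collect(root, "", table);
        return table;
    }

    /** A hand-built tree: A is the left child of the root, B and C hang below the right child. */
    static Node exampleTree() {
        return new Node(new Node('A'), new Node(new Node('B'), new Node('C')));
    }

    public static void main(String[] args) {
        System.out.println(buildTable(exampleTree())); // prints {A=0, B=10, C=11}
    }
}
```

A file containing a single distinct character would give a one-node tree and an empty code here, so a real implementation has to treat that case specially.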

Header Information:

You must store some initial information in the compressed file that will be used by the

uncompression/unhuffing program. Basically you must store the tree used to compress the

original file. This tree is used by the uncompression program.

There are several alternatives for storing the tree. Some are outlined here; you may explore

others as part of the specifications of your assignment.

Store the character counts at the beginning of the file. You can store counts for every

character, or counts for the non-zero characters. If you do the latter, you must include

some method for indicating the character, e.g., store character/count pairs.

You could use a "standard" character frequency, e.g., for any English language text

you could assume weights/frequencies for every character and use these in

constructing the tree for both compression and uncompression.

You can store the tree at the beginning of the file. One method for doing this is to do a

pre-order traversal, writing each node visited. You must differentiate leaf nodes from

internal/non-leaf nodes. One way to do this is write a single bit for each node, say 1

for leaf and 0 for non-leaf. For leaf nodes, you will also need to write the character

stored. For non-leaf nodes there's no information that needs to be written, just the bit

that indicates there's an internal node.
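
The pre-order scheme in the last alternative can be sketched as follows. For readability the sketch writes the header as a string of '0' and '1' characters rather than packed bits, and it assumes each leaf character fits in 8 bits; the class TreeHeaderSketch and its methods are hypothetical names.

```java
// Hypothetical class, named for illustration only.
final class TreeHeaderSketch {
    static final class Node {
        final char symbol;      // meaningful only for leaves
        final Node left, right; // both null for a leaf

        Node(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /** Pre-order traversal: 1 plus 8 symbol bits for a leaf, 0 for an internal node. */
    static void write(Node node, StringBuilder bits) {
        if (node.isLeaf()) {
            bits.append('1');
            for (int i = 7; i >= 0; i--) {
                bits.append((node.symbol >> i) & 1);
            }
        } else {
            bits.append('0');
            write(node.left, bits);
            write(node.right, bits);
        }
    }

    /** Rebuilds the tree; pos[0] is the read position and advances as bits are consumed. */
    static Node read(String bits, int[] pos) {
        if (bits.charAt(pos[0]++) == '1') {
            int value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 1) | (bits.charAt(pos[0]++) - '0');
            }
            return new Node((char) value);
        }
        Node left = read(bits, pos);
        Node right = read(bits, pos);
        return new Node(left, right);
    }

    public static void main(String[] args) {
        Node tree = new Node(new Node('A'), new Node(new Node('B'), new Node('C')));
        StringBuilder header = new StringBuilder();
        write(tree, header);
        System.out.println(header.length() + " header bits"); // prints 29 header bits
        Node copy = read(header.toString(), new int[] {0});
        System.out.println(copy.right.left.symbol); // prints B
    }
}
```

A tree with n leaves costs n-1 bits for the internal nodes plus 9 bits per leaf under this scheme, which is usually much smaller than storing a count for every possible character.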

Decompressing:

Decompression involves re-building the Huffman tree from a stored frequency table (again,

presumably in the header of the compressed file), and converting its bit streams into

characters. You read the file a bit at a time. Beginning at the root node in the Huffman Tree

and depending on the value of the bit, you take the right or left branch of the tree and then

return to read another bit. When the node you select is a leaf (it has no right and left child

nodes) you write its character value to the decompressed file and go back to the root node for

the next bit.
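
The decoding loop can be sketched as below. The sketch takes 0 to mean the left branch and 1 the right branch, reads the bits from a string for readability, and stops after a given number of symbols so that padding bits are ignored; the class HuffmanDecodeSketch, the symbolCount parameter, and the hand-built tree are assumptions made for illustration.

```java
// Hypothetical class, named for illustration only.
final class HuffmanDecodeSketch {
    static final class Node {
        final char symbol;      // meaningful only for leaves
        final Node left, right; // both null for a leaf

        Node(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /**
     * Walks the tree one bit at a time: 0 goes left, 1 goes right. On reaching a
     * leaf the symbol is emitted and the walk restarts at the root. Assumes the
     * tree has at least two leaves, so the root is never itself a leaf.
     */
    static String decode(String bits, Node root, int symbolCount) {
        StringBuilder out = new StringBuilder();
        Node node = root;
        for (int i = 0; i < bits.length() && out.length() < symbolCount; i++) {
            node = (bits.charAt(i) == '0') ? node.left : node.right;
            if (node.isLeaf()) {
                out.append(node.symbol);
                node = root;
            }
        }
        return out.toString();
    }

    static Node exampleTree() {
        return new Node(new Node('A'), new Node(new Node('B'), new Node('C')));
    }

    public static void main(String[] args) {
        // 0 | 10 | 11 | 0 followed by two padding zeros
        System.out.println(decode("01011000", exampleTree(), 4)); // prints ABCA
    }
}
```

No separator is needed between codes because of the prefix property: reaching a leaf is an unambiguous signal that one code has ended.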

Transmission and storage of Huffman-encoded Data:

If your system is continually dealing with data in which the symbols have similar frequencies

of occurrence, then both encoders and decoders can use a standard encoding table/decoding

tree. However, even text data from various sources will have quite different characteristics.

For example, ordinary English text will generally have 'e' near the root of the tree, with

short encodings for 'a' and 't', whereas C programs would generally have ';' near the root, with

short encodings for other punctuation marks such as '(' and ')' (depending on the number and

length of comments!). If the data has variable frequencies, then, for optimal encoding, we

have to generate an encoding tree for each data set and store or transmit the encoding with the

data. The extra cost of transmitting the encoding tree means that we will not gain an overall

benefit unless the data stream to be encoded is quite long - so that the savings through

compression more than compensate for the cost of transmitting the encoding tree as well.

WORKING OF PROJECT:

MODULE & THEIR DESCRIPTION :-

The project contains the following modules:

Huffman Zip

Encoder

Decoder

Table

DLNode

Priority Queue

Huffman Node

Huffman Zip is the main module; it uses an applet to provide the user interface. Encoder is

the module for compressing the file. It implements the Huffman algorithm for compressing

text and image files. It first calculates the frequencies of all the occurring symbols. Then, on the

basis of these frequencies, it generates the priority queue. This priority queue is used for

finding the symbols with the lowest frequencies. The two symbols with the lowest frequencies are

deleted from the queue and a new symbol is added to the queue with a frequency equal to the

sum of the frequencies of these two symbols. Meanwhile we build a tree in which the two

deleted nodes are the children and the new node added to the queue is their parent. Finally we traverse the

tree from the root node to the leaf nodes, assigning 0 to each left child and 1 to each right

child. In this way we assign a code to every symbol in the file. These are binary codes; we then

group the binary codes, calculate the equivalent integers, and store them in the output

file, which is the compressed file.

Decoder works in the reverse order of the encoder. It reads the input from the compressed file

and converts it into the equivalent binary code. Its other input is the binary tree generated

in the encoding process, and on the basis of these data it regenerates the original file. This

project is based on lossless compression.

Table is used for storing the code of each symbol. Priority Queue takes as input the symbols and

their related frequencies, and on the basis of these frequencies it assigns a priority to each

symbol. Huffman Node is used for creating the binary tree. It takes as input two symbols from the

priority queue and creates two nodes by comparing the frequencies of these two symbols. It

places the symbol with the lower frequency to the left and the symbol with the higher frequency to the

right; it then deletes these two symbols from the priority queue and inserts a new symbol with a

frequency equal to the sum of the frequencies of the two deleted symbols. It also generates a

parent node for the two nodes and assigns it a frequency equal to the sum of the frequencies of the two

leaf nodes.

DATA FLOW DIAGRAM

When solving a small problem, the entire problem can be tackled at once. For solving larger

problems, the basic principle is the time-tested principle of divide and conquer. Clearly,

dividing in such a manner that all the divisions have to be conquered together is not the intent

of this wisdom. This principle, if elaborated, would mean divide into smaller pieces, so that

each piece can be conquered separately.

Problem partitioning, which is essential for solving a complex problem, leads to hierarchies

in the design. That is, the design produced by using problem partitioning can be represented

as a hierarchy of components. The relationship between the elements in this hierarchy can

vary depending on the method used. For example, the most common is the whole-part

relationship. In this the system consists of some parts, each part consists of subparts, and so

on. This relationship can be naturally represented as a hierarchical structure between various

system parts. In general hierarchical structure makes it much easier to comprehend a complex

system. Due to this, all design methodologies aim to produce a design that has nice

hierarchical structures.

The DFD was first designed by Larry Constantine as a way of expressing system

requirements in a graphical form; this led to a modular design.

A DFD, also known as a bubble chart, has the purpose of clarifying system requirements and

identifying major transformations that will become programs in system design. So it is the

starting point of the design phase that functionally decomposes the requirement specifications

down to the lowest level of detail. A DFD consists of a series of bubbles joined by lines that

represent data flows in the system.

DFD SYMBOLS

In the DFD, there are four symbols.

1 A square defines a source (originator) or destination of system data.

2 An arrow identifies data flow: data in motion. It is a pipeline through which information

flows.

3 A circle or a bubble (some people use an oval bubble) represents a process that

transforms incoming data flow(s) into outgoing data flow(s).

4 An open rectangle is a data store: data at rest, or a temporary repository of data.

SYMBOL              MEANING

Square              Source or destination of data

Arrow               Data flow

Circle or bubble    Process that transforms data flow

Open rectangle      Data store

CONSTRUCTING DFD

Several rules of thumb are used in drawing DFDs:

1 Processes should be named and numbered for easy reference. Each name should be

representative of the process.

2 The direction of flow is from top to bottom and from left to right. Data traditionally flow

from the source (upper left corner) to the destination (lower right corner), although they may

flow back to a source. One way to indicate this is to draw a long flow line back to the source.

An alternative way is to repeat the source symbol as a destination. Since it is used more than

once in the DFD, it is marked with a short diagonal in the lower right corner.

3 When a process is exploded into lower-level details, they are numbered.

4 The names of data sources and destinations are written in capital letters. Process and data

flow names have the first letter of each word capitalized.

HOW DETAILED SHOULD A DFD BE?

The DFD is designed to aid communication. If it contains dozens of processes and data stores

it gets too unwieldy. The rule of thumb is to explode the DFD to a functional level, so that the

next sublevel does not exceed 10 processes. Beyond that, it is best to take each function

separately and expand it to show the explosion of the single process. If a user wants to know

what happens within a given process, then the detailed explosion of that process may be

shown.

A DFD typically shows the minimum contents of data elements that flow in and out.

A leveled set has a starting DFD, which is a very abstract representation of the system,

identifying the major inputs and outputs and the major processes in the system. Then each

process is refined and a DFD is drawn for the process. In other words, a bubble in a DFD is

expanded into a DFD during refinement. For the hierarchy to be consistent, it is important

that the net inputs and outputs of the DFD for a process are the same as the inputs and outputs

of that process in the higher-level DFD. This refinement stops when each bubble can be easily identified or understood. It should be

pointed out that during refinement, though the net input and output are preserved, a

refinement of the data might also occur. That is, a unit of data may be broken into its

components for processing when the detailed DFD for a process is being drawn. So, as the

processes are decomposed, data decomposition also occurs.

The DFD methodology is quite effective, especially when the required design is unclear and the

analyst needs a notational language for communication. The DFD is easy to understand after a brief orientation.

The main problem, however, is the large number of iterations that are often required to arrive

at the most accurate and complete solution.

DATA FLOW DIAGRAM

The DFD helps in understanding the functioning of the program and the modules used in the code. It describes

the flow and storage of the data: which variables are given as input, how data flows through the program, and

the final output. Here we reference some DFDs which help in understanding the

program.

[Figure: data flow diagrams of the project. Recoverable labels: Priority queue, Huffman Node, Table, Updation of priority queue, Code generator]
Print Layouts

IMPLEMENTATION:

The implementation phase is less creative than system design. It is primarily concerned with

user training, site preparation, and file conversion. When the candidate system is linked to

terminals or remote sites, the telecommunication network and tests of the network along with

the system are also included under implementation.

During the implementation phase, the system actually takes physical shape.

As in the other two stages, the analyst, his or her associates, and the user perform many tasks,

including:

Writing, testing, debugging and documenting systems.

Converting data from the old to the new system.

Training the systems users.

Completing system documentation.

Evaluating the final system to make sure that it is fulfilling original need and that it

began operation on time and within budget.

The analyst's involvement in each of these activities varies from organization to organization.

For small organizations, specialists may work on different phases and tasks, such as

training, ordering equipment, converting data from old methods to the new or certifying the

correctness of the system.

The implementation phase ends with an evaluation of the system after placing it into operation

for a period of time. By then, most program errors will have shown up and most costs will

have become clear. The system audit is a last check or review of a system

to ensure that it meets design criteria. Evaluation forms the feedback part of the cycle that

keeps implementation going as long as the system continues operation.

Ordering and installing any new hardware required by the system.

Developing operating procedures for the computer center staff.

Establishing a maintenance procedure to repair and enhance the system.

During the final testing user acceptance is tested followed by user training. Depending on the

nature of the system, extensive user training may be required. Conversion usually takes place

at about the same time the user is being trained or later.

In the extreme, the programmer is falsely viewed as someone who ought to be isolated from

other aspects of system development. Programming is itself design work, however. The initial

parameters of the candidate system should be modified as a result of programming efforts.

Programming provides a reality test for the assumptions made by the analyst; it is therefore

a mistake to exclude programmers from the initial system design.

System testing checks the readiness and accuracy of the system to access, update, and retrieve

data from new files. Once the program becomes available, test data are read into the computer

and processed against the files provided for testing. In most conversions a parallel run is

conducted, where the new system runs simultaneously with the old system. This method, though

costly, provides added assurance against errors in the candidate system.

TEST PLAN

A test plan is a service delivery agreement. It is quality assurance's way of communicating

to the developer, the client, and the rest of the team what can be expected.

The key points of a test plan are:

Introduction:
Summarizes key features and expectations of software along with testing approach.

Scope:
It includes a description of test types.

Risks and assumptions:


This part should define a risk to the testing phase, such as criteria that could suspend

testing.

Testing schedules and cycles:


States when testing will be completed and the number of expected cycles.

Test resources:
Specifies testers and bug fixers.

Some special terms in Testing Fundamentals

Error:

The term Error is used in two different ways. It refers to difference between the

actual output of the software and the correct output. In this interpretation, error is an essential

measure of the difference between actual and ideal output. Error is also used to refer to human action

that results in software containing a defect or fault.

Fault:
Fault is a condition that causes a system to fail in performing its required function. A

fault is a basic reason for software malfunction and is synonymous with the commonly used

term 'Bug'.

Failure:
Failure is the inability of a system or component to perform a required function

according to its specifications. A software failure occurs if the behavior of the software is

different from the specified behavior. Failure may be caused by functional or

performance reasons.

Some of the commonly used Strategies for Testing are as follows:-

Unit Testing

Module testing

Integration testing

System testing
Acceptance testing
Unit Testing :
The term 'Unit Testing' comprises the set of tests performed by an

individual programmer prior to the integration of the unit into a larger system. The situation

is illustrated as follows:

Coding & Debugging -> Unit Testing -> Integration Testing

A program unit is usually small enough, so the programmer who developed it can

test it in great detail, and certainly in greater detail than will be possible when the unit

is integrated into an evolving software product. In unit testing, the programs are tested

separately, independent of each other. Since the check is done at the program level, it is

also called Program Testing.

Module Testing :

A module encapsulates related components, so it can be tested without other system modules.

Subsystem testing :

Subsystems may be independently designed and implemented. Common

problems such as sub-system interface mistakes can be checked, and testing can concentrate on them

in this phase.

There are four categories of tests that a programmer will typically perform on a program

unit:

Functional Tests

Performance Test

Stress Test

Structure Test

Functional Test :

Functional test cases involve exercising the code with nominal input values for which

expected results are known, as well as boundary values (minimum values, maximum

values, and values on and just outside the functional boundaries) and special values.

Performance Test :

Performance testing determines the amount of execution time spent in various

parts of the unit, program throughput, response time, and device utilization by the

program unit. A certain amount of performance tuning may be done during testing;

however, caution must be exercised to avoid expending too much effort on fine tuning

of a program unit that contributes little to the overall performance of the entire system.

Performance testing is most productive at the subsystem and system levels.

Stress Test :

Stress tests are those tests designed to intentionally break the unit. A great deal can be

learned about the strengths and limitations of a program by examining the manner in which

a program unit breaks.

Structure Test :

Structure tests are concerned with exercising the internal logic of a program and traversing

particular execution paths. Some authors refer collectively to functional, performance, and

stress testing as black box testing, while structure testing is referred to as white box or

glass box testing. The major activities in structural testing are deciding which path to

exercise, deriving test data to exercise those paths, determining the test coverage criterion to

be used and executing the test cases on some modules and subsystems. This mix

alleviates many of the problems encountered in pure top-down testing and retains the

advantages of top-down integration at the subsystem and system level.

Automated tools used in integration testing include module drivers, test data generators,

environment simulators, and a management facility to allow easy configuration and

reconfiguration of system elements. Automated module drivers permit specification of

test cases (both input and expected results) in a descriptive language. The driver tool

then calls the routine using specified test cases, compares actual with the expected results,

and reports discrepancies.

Some module drivers also provide program stubs for top-down testing. Test cases

are written for the stub, and when the stub is invoked by the routine being tested, the

drivers examine the input parameters to the stub and return the corresponding outputs to

the routine. Automated test drivers include AUT, MTS, TEST MASTER and TPL.

Test data generators are of two varieties; those that generate files of random

data values according to some predefined format, and those that generate test data for

particular execution paths. In the latter category, symbolic executors such as ATTEST can

sometimes be used to derive a set of test data that will force program execution to follow a

particular control path.

Environment simulators are sometimes used during integration and

acceptance testing to simulate the operating environment in which the software will

function. Simulators are used in situations in which operation of the actual environment

is impractical. Examples of simulators are PRIM (GAL75) for emulating machines

that do not exist, and the Saturn Flight Program Simulators for simulating live flight tests

cases, and measuring the coverage achieved when the test cases are exercised.

System Testing

System testing involves two kinds of activities:

Integration testing

Acceptance testing

Strategies for integrating software components into a functioning product include the

bottom-up strategy, the top-down strategy, and the sandwich strategy. Careful planning and

scheduling are required to ensure that modules will be available for integration into

the evolving software product when needed. The integration strategy dictates the order in

which modules must be available, and thus exerts a strong influence on the order in

which modules are written, debugged, and unit tested.

Acceptance testing involves planning & execution of functional tests, performance

tests, and stress tests to verify that the implemented system satisfies its requirements.

Acceptance tests are typically performed by quality assurance and/or customer

organizations.

CONCLUSIONS

Data compression is a topic of much importance and many applications. Methods of data

compression have been studied for almost four decades. This paper has provided an overview

of data compression methods of general utility. The algorithms have been evaluated in terms

of the amount of compression they provide, algorithm efficiency, and susceptibility to error.

While algorithm efficiency and susceptibility to error are relatively independent of the

characteristics of the source ensemble, the amount of compression achieved depends upon the

characteristics of the source to a great extent.

Semantic dependent data compression techniques are special-purpose methods designed to

exploit local redundancy or context information. A semantic dependent scheme can usually

be viewed as a special case of one or more general-purpose algorithms. It should also be

noted that Huffman coding and decoding is a general-purpose technique

which exploits the unequal frequencies of source symbols, a type of statistical redundancy.

Susceptibility to error is the main drawback of each of the algorithms presented here.

Although channel errors are more devastating to adaptive algorithms than to static ones, it is

possible for an error to propagate without limit even in the static case. Methods of limiting

the effect of an error on the effectiveness of a data compression algorithm should be

investigated.
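
The effect of a channel error on a static prefix code can be seen in a small experiment. The sketch below is illustrative only: the three-symbol code table and the class name BitErrorDemo are assumptions made for this example and are not part of the project's Encoder or Decoder. It flips one bit of an encoded message and decodes the result.

```java
import java.util.HashMap;
import java.util.Map;

final class BitErrorDemo {
    // Hypothetical static prefix code, chosen only for illustration.
    private static final Map<Character, String> CODE = new HashMap<>();
    private static final Map<String, Character> SYMBOL = new HashMap<>();
    static {
        CODE.put('a', "0");
        CODE.put('b', "10");
        CODE.put('c', "11");
        for (Map.Entry<Character, String> e : CODE.entrySet()) {
            SYMBOL.put(e.getValue(), e.getKey());
        }
    }

    // Concatenate the codeword of every symbol in the text.
    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char ch : text.toCharArray()) {
            bits.append(CODE.get(ch));
        }
        return bits.toString();
    }

    // Read bits until they match a codeword, emit the symbol, then start again.
    static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character symbol = SYMBOL.get(current.toString());
            if (symbol != null) {
                out.append(symbol);
                current.setLength(0);
            }
        }
        return out.toString();
    }

    // Simulate a channel error by inverting the bit at the given position.
    static String flipBit(String bits, int index) {
        char[] chars = bits.toCharArray();
        chars[index] = (chars[index] == '0') ? '1' : '0';
        return new String(chars);
    }

    public static void main(String[] args) {
        String text = "bcbcbcbc";
        String damaged = flipBit(encode(text), 0);
        System.out.println("sent:     " + text);
        System.out.println("received: " + decode(damaged));
    }
}
```

With this code table the damaged stream decodes to nine symbols instead of eight, so every symbol after the error is displaced even though only one bit changed.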

FUTURE ENHANCEMENT & NEW DIRECTIONS

NEW DIRECTIONS:

Data compression is still very much an active research area. This section suggests

possibilities for further study.

The earlier discussion illustrates the susceptibility to error of the codes presented in this report.

Strategies for increasing the reliability of these codes while incurring only a moderate loss of

efficiency would be of great value. This area appears to be largely unexplored. Possible

approaches include embedding the entire ensemble in an error-correcting code or reserving

one or more codewords to act as error flags. For Huffman encoding & decoding it may be

necessary for receiver and sender to verify the current code mapping.
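
One simple way for sender and receiver to verify the current code mapping is to compare a checksum of the frequency table from which both sides build the Huffman tree. The following is a minimal sketch under that assumption; the class name CodeMappingCheck and the idea of sending the checksum alongside the table are hypothetical additions, not part of the project as submitted. It uses the standard java.util.zip.CRC32 class.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class CodeMappingCheck {
    // CRC-32 over the frequency table, each count serialized as a 4-byte integer.
    static long checksum(int[] freq) {
        ByteBuffer buffer = ByteBuffer.allocate(freq.length * Integer.BYTES);
        for (int f : freq) {
            buffer.putInt(f);
        }
        CRC32 crc = new CRC32();
        crc.update(buffer.array());
        return crc.getValue();
    }

    public static void main(String[] args) {
        int[] sent = new int[256];       // one counter per byte value, as in the Encoder
        sent['a'] = 5;
        sent['b'] = 2;
        long expected = checksum(sent);  // value the sender would transmit with the table

        int[] received = sent.clone();
        System.out.println("intact table matches:  " + (checksum(received) == expected));

        received['b'] ^= 1;              // simulate a single corrupted bit
        System.out.println("damaged table matches: " + (checksum(received) == expected));
    }
}
```

A checksum of this kind only detects a damaged table; it does not repair it, so the receiver would still have to request retransmission or fall back on an error-correcting code.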

Another important research topic is the development of theoretical models for data

compression which address the problem of local redundancy. Models based on Huffman

coding may be exploited to take advantage of interaction between groups of symbols.

Entropy tends to be overestimated when symbol interaction is not considered. Models which

exploit relationships between source messages may achieve better compression than

predicted by an entropy calculation based only upon symbol probabilities.
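
The overestimate can be made concrete by comparing an entropy computed from single-symbol probabilities with one that conditions each symbol on its predecessor. The sketch below is an illustration under simple assumptions (probabilities are estimated from the sample string itself, and the class and method names are hypothetical).

```java
import java.util.HashMap;
import java.util.Map;

final class EntropyDemo {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Entropy in bits per symbol using only single-symbol probabilities.
    static double order0(String s) {
        Map<Character, Integer> count = new HashMap<>();
        for (char c : s.toCharArray()) {
            count.merge(c, 1, Integer::sum);
        }
        double h = 0;
        for (int n : count.values()) {
            double p = (double) n / s.length();
            h -= p * log2(p);
        }
        return h;
    }

    // Conditional entropy of a symbol given the symbol before it,
    // estimated from the adjacent pairs in the string.
    static double order1(String s) {
        Map<String, Integer> pairCount = new HashMap<>();
        Map<Character, Integer> firstCount = new HashMap<>();
        int pairs = s.length() - 1;
        for (int i = 0; i < pairs; i++) {
            pairCount.merge(s.substring(i, i + 2), 1, Integer::sum);
            firstCount.merge(s.charAt(i), 1, Integer::sum);
        }
        double h = 0;
        for (Map.Entry<String, Integer> e : pairCount.entrySet()) {
            double pPair = (double) e.getValue() / pairs;
            double pGivenPrev = (double) e.getValue() / firstCount.get(e.getKey().charAt(0));
            h -= pPair * log2(pGivenPrev);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "abababab";
        System.out.println("order-0 entropy: " + order0(s) + " bits/symbol");
        System.out.println("order-1 entropy: " + order1(s) + " bits/symbol");
    }
}
```

For the string "abababab" the single-symbol estimate is one bit per symbol, while the conditional estimate is zero, because each symbol is fully determined by the one before it.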

SCOPE FOR FUTURE WORK:

Since this system has been developed using object-oriented programming, there is every

chance that its code can be reused in other environments, even on different platforms. Its

present features can also be enhanced by simple modifications to the code, so that it can be reused

as requirements change.

SCOPE OF FURTHER APPLICATION:

This application is easy to implement. Its components can be reused as and when required,

and it can be updated in later versions. New features can be added as and when they are

needed. There is flexibility in all the modules.

SOURCE CODE

HuffmanZip.java

import javax.swing.*;
import java.io.*;
import java.awt.*;
import java.awt.event.*;

public class HuffmanZip extends JFrame


{

private JProgressBar bar;


private JButton enc,dec,center;
private JLabel title;
private JFileChooser choose;
private File input1,input2;
private Encoder encoder;
private Decoder decoder;
private ImageIcon icon;

public HuffmanZip()
{

super("Zip utility V1.1");

// Container con=getContentPane();
Container c=getContentPane();
enc=new JButton("Encode");
dec=new JButton("Decode");
center=new JButton();
title=new JLabel(" Zip Utility V1.1 ");
choose=new JFileChooser();

icon=new ImageIcon("huff.jpg");
center.setIcon(icon);

enc.addActionListener(

new ActionListener()
{
public void actionPerformed(ActionEvent e)
{
int f=choose.showOpenDialog(HuffmanZip.this);
if (f==JFileChooser.APPROVE_OPTION)

{

input1=choose.getSelectedFile();
encoder=new Encoder(input1);

HuffmanZip.this.setTitle("Compressing.....");
encoder.encode();

JOptionPane.showMessageDialog(null,encoder.getSummary(),"Summary",JOptionPane.INFORMATION_MESSAGE);
HuffmanZip.this.setTitle("Zip utility v1.1");

}
}
}

);

dec.addActionListener(

new ActionListener()
{
public void actionPerformed(ActionEvent e)
{
int f=choose.showOpenDialog(HuffmanZip.this);
if (f==JFileChooser.APPROVE_OPTION)
{
input2=choose.getSelectedFile();
decoder=new Decoder(input2);
// set the title before the work starts, as in the Encode handler
HuffmanZip.this.setTitle("Decompressing.....");
decoder.decode();

JOptionPane.showMessageDialog(null,decoder.getSummary(),"Summary",JOptionPane.INFORMATION_MESSAGE);
HuffmanZip.this.setTitle("Zip utility v1.1");

}
}
}

);

//c.add(bar,BorderLayout.SOUTH);
c.add(dec,BorderLayout.EAST);
c.add(enc,BorderLayout.WEST);
c.add(center,BorderLayout.CENTER);
c.add(title,BorderLayout.NORTH);

setSize(250,80);
setVisible(true);
}

public static void main(String args[])


{
HuffmanZip g=new HuffmanZip();
g.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
}
}

Encoder.java
import java.io.*;

import javax.swing.*;

public class Encoder


{

private static String code[],summary="";


private int totalBytes=0;
private int count=0;
private File inputFile;
private File outputFile ;
private FileOutputStream C;
private ObjectOutputStream outF;
private BufferedOutputStream outf;
private FileInputStream in1;
private BufferedInputStream in;
private boolean done=false;

public Encoder(File inputFile)


{
this.inputFile=inputFile;
}
public void encode()
{

int freq[]=new int[256];

for(int i=0;i<256;i++)
{
freq[i]=0;
}

// File inputFile = new File(JOptionPane.showInputDialog("Enter the input file name"));
try
{
in1 = new FileInputStream(inputFile);
in=new BufferedInputStream(in1);
}
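catch(Exception eee)
{
// without an input stream nothing can be encoded
System.out.println("Cannot open input file: "+eee.getMessage());
return;
}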

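try
{
System.out.println(" "+in.available());
// available() is used as the file size; this assumes a local file read in one piece
totalBytes=in.available();
int mycount=0;

// remember the start of the file so it can be read a second time when writing
in.mark(totalBytes);

// first pass: count how often each byte value occurs
while (mycount<totalBytes)
{
int a=in.read();
if (a==-1)
break; // end of stream reached earlier than expected
mycount++;
freq[a]++;
}
totalBytes=mycount;
in.reset();
}
catch(IOException eofexc)
{
System.out.println("error");
}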

HuffmanNode tree=new HuffmanNode(),one,two;


PriorityQueue q=new PriorityQueue();

try
{

for(int j=0;j<256;j++)
{

// System.out.println("\n"+byteval[j]+" "+freq[j]+" prob "+probablity[j]+"int value"+toInt(byteval[j]));
if (freq[j]>0)
{
HuffmanNode t=new
HuffmanNode("dipu",freq[j],j,null,null,null);
q.insertM(t);
}

}

//create tree....................................

while (q.sizeQ()>1)
{
one=q.removeFirst();
two=q.removeFirst();
int f1=one.getFreq();
int f2=two.getFreq();
if (f1>f2)
{
HuffmanNode t=new HuffmanNode(null,
(f1+f2),0,two,one,null);
one.up=t;
two.up=t;
q.insertM(t);
}
else
{
HuffmanNode t=new HuffmanNode(null,
(f1+f2),0,one,two,null);
one.up=t;
two.up=t;
q.insertM(t);
}
} // end while: only the root is left in the queue

tree =q.removeFirst();

}
catch(Exception e)
{
System.out.println("Priority Queue error");
}
code=new String[256];
for(int i=0;i<256;i++)
code[i]="";

traverse(tree);

Table rec=new Table(totalBytes,inputFile.getName());


for(int i=0;i<256;i++)
{
rec.push(freq[i]);

if(freq[i]==0)
continue;
// System.out.println(""+i+" "+code[i]+" ");
}

// System.out.println("size of table"+rec.recSize());

//create tree ends...........................

// System.out.println("\n total= "+totalBytes+"\n probablity="+d);


int wrote=0,csize=0;
int recordLast=0;

try
{

outputFile = new File(inputFile.getName()+".hff");


C=new FileOutputStream(outputFile);
outF=new ObjectOutputStream(C);
outf=new BufferedOutputStream(C);
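outF.writeObject(rec);
// both streams wrap the same file, so flush the table before raw bytes go through outf
outF.flush();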
String outbyte="";

while (count<totalBytes)
{
outbyte+=code[in.read()];
count++;
if (outbyte.length()>=8)
{
int k=toInt(outbyte.substring(0,8));
csize++;
outf.write(k);
outbyte=outbyte.substring(8);
}
}

while(outbyte.length()>8)
{
csize++;
int k=toInt(outbyte.substring(0,8));

outf.write(k);
outbyte=outbyte.substring(8);
}
if((recordLast=outbyte.length())>0)
{
while(outbyte.length()<8)
outbyte+=0;
outf.write(toInt(outbyte));
csize++;
}
outf.write(recordLast);

outf.close();
}
catch(Exception re)
{
System.out.println("Error in writing....");
}

float ff=(float)csize/((float)totalBytes);
System.out.println("Compression "+recordLast+" ratio"+csize+" "+(ff*100)+" %");

summary+="File name : "+ inputFile.getName();


summary+="\n";

summary+="File size : "+totalBytes+" bytes.";


summary+="\n";

summary+="Compressed size : "+ csize+" bytes.";


summary+="\n";

summary+="Compression ratio: "+(ff*100)+" %";


summary+="\n";

done=true;
}

private void traverse(HuffmanNode n)


{

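if (n.lchild==null&&n.rchild==null)
{
if (n.up==null)
{
// the tree is a single leaf (only one distinct byte value): give it a one-bit code
code[n.getValue()]="0";
return;
}
// walk from the leaf up to the root, collecting the bits in reverse order
StringBuilder bits=new StringBuilder();
HuffmanNode m=n;
while (m.up!=null)
{
bits.append(m.up.lchild==m?'0':'1');
m=m.up;
}
// a StringBuilder has no fixed limit on the code length
code[n.getValue()]=bits.reverse().toString();
}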
// System.out.println("Debug3");
if(n.lchild!=null)
traverse(n.lchild);
if(n.rchild!=null)
traverse(n.rchild);
}

private String toBinary(int b)


{
int arr[]=new int[8];
String s="";
for(int i=0;i<8;i++)
{
arr[i]=b%2;
b=b/2;

}
for(int i=7;i>=0;i--)
{
s+=arr[i];
}
return s;
}

private int toInt(String b)


{

int output=0,wg=128;
for(int i=0;i<8;i++)
{
output+=wg*Integer.parseInt(""+b.charAt(i));
wg/=2;
}
return output;
}

public int lengthOftask()


{
return totalBytes;
}
public int getCurrent()
{
return count;
}
public String getSummary()
{
String temp=summary;
summary="";
return temp;
}
public boolean isDone()
{
return done;
}
}

Decoder.java

import java.io.*;
import javax.swing.*;

public class Decoder {

private int totalBytes=0,mycount=0;


private int freq[],arr=0;
private String summary="";
private File inputFile;
private Table table;

private FileInputStream in1;


private ObjectInputStream inF;
private BufferedInputStream in;

private File outputFile ;


private FileOutputStream outf;

public Decoder(File file)


{
inputFile=file;
}

public void decode()//throws Exception


{

freq=new int[256];
for(int i=0;i<256;i++)
{
freq[i]=0;
}

// File inputFile = new File(JOptionPane.showInputDialog("Enter the input File name"));

try
{
in1 = new FileInputStream(inputFile);
inF=new ObjectInputStream(in1);
in=new BufferedInputStream(in1);

// int arr=0;
table=(Table)(inF.readObject());

outputFile = new File(table.fileName());
outf=new FileOutputStream(outputFile);

summary+="File name : "+ table.fileName();


summary+="\n";
}
catch(Exception exc)
{
System.out.println("Error creating file");
JOptionPane.showMessageDialog(null,"Error"+"\nNot a valid < hff > format file.","Summary",JOptionPane.INFORMATION_MESSAGE);

System.exit(0);
}

HuffmanNode tree=new HuffmanNode(),one,two;


PriorityQueue q=new PriorityQueue();

try
{

//creating priority queue.................

for(int j=0;j<256;j++)
{
int r =table.pop();
// System.out.println("Size of table "+r+" "+j);

if (r>0)
{
HuffmanNode t=new
HuffmanNode("dipu",r,j,null,null,null);
q.insertM(t);
}
}

//create tree....................................

while (q.sizeQ()>1)
{
one=q.removeFirst();
two=q.removeFirst();
int f1=one.getFreq();

int f2=two.getFreq();
if (f1>f2)
{
HuffmanNode t=new HuffmanNode(null,
(f1+f2),0,two,one,null);
one.up=t;
two.up=t;
q.insertM(t);
}
else
{
HuffmanNode t=new HuffmanNode(null,
(f1+f2),0,one,two,null);
one.up=t;
two.up=t;
q.insertM(t);
}
} // end while: only the root is left in the queue

tree =q.removeFirst();

}
catch(Exception exc)
{
System.out.println("Priority queue exception");
}

String s="";

try
{
mycount=in.available();
while (totalBytes<mycount)
{
arr=in.read();
s+=toBinary(arr);
// decode whole symbols while plenty of bits are buffered
while (s.length()>32)
{
for(int a=0;a<32;a++)
{
int wr=getCode(tree,s.substring(0,a+1));
if(wr==-1)continue;
else
{
outf.write(wr);
s=s.substring(a+1);
break;
}
}
}
totalBytes++;
}
// the last byte read (arr) records how many bits of the byte before it are data
s=s.substring(0,(s.length()-8));
// arr==0 means the encoder wrote no padded byte, so nothing more is stripped
if(arr>0)
s=s.substring(0,(s.length()-8+arr));

while (s.length()>0)
{
boolean found=false;
for(int a=0;a<s.length();a++)
{
int wr=getCode(tree,s.substring(0,a+1));
if(wr==-1)continue;
outf.write(wr);
s=s.substring(a+1);
found=true;
break;
}
// stop if the remaining bits do not form a complete code
if(!found)break;
}

outf.close();

}
catch(IOException eofexc)
{
System.out.println("IO error");
}

summary+="Compressed size : "+ mycount+" bytes.";
summary+="\n";

summary+="Size after decompressed : "+table.originalSize()+" bytes.";
summary+="\n";
}

private int getCode(HuffmanNode node,String decode)


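{
// a tree that is a single leaf has no branches to follow
if (node.lchild==null&&node.rchild==null)
return node.getValue();

while (true)
{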
if (decode.charAt(0)=='0')
{
node=node.lchild;
}
else
{
node=node.rchild;
}
if (node.lchild==null&&node.rchild==null)
{
return node.getValue();
}
if(decode.length()==1)break;
decode=decode.substring(1);

}
return -1;
}

public String toBinary(int b)


{
int arr[]=new int[8];
String s="";
for(int i=0;i<8;i++)
{
arr[i]=b%2;
b=b/2;

}
for(int i=7;i>=0;i--)
{

s+=arr[i];
}
return s;
}
public int toInt(String b)
{
int output=0,wg=128;
for(int i=0;i<8;i++)
{
output+=wg*Integer.parseInt(""+b.charAt(i));
wg/=2;
}
return output;
}
public int getCurrent()
{
return totalBytes;
}
public int lengthOftask()
{
return mycount;
}
public String getSummary()
{
return summary;
}
}

DLNode.java
public class DLNode

{
private DLNode next,prev;
private HuffmanNode elem;

public DLNode()
{
next=null;
prev=null;
elem=null;
}
public DLNode(DLNode next,DLNode prev,HuffmanNode elem)
{
this.next=next;
this.prev=prev;
this.elem=elem;
}

public DLNode getNext()


{
return next;
}
public DLNode getPrev()
{
return prev;
}
public void setNext(DLNode n)
{
next=n;
}
public void setPrev(DLNode n)
{
prev=n;
}
public void setElement(HuffmanNode o)
{
elem=o;
}
public HuffmanNode getElement()
{
return elem;
}

}
HuffmanNode.java
import java.io.*;

public class HuffmanNode implements Serializable
{

public HuffmanNode rchild,lchild,up;


private String code;
private int freq;
private int value;
public HuffmanNode(String bstring,int freq,int value,HuffmanNode
lchild,HuffmanNode rchild,HuffmanNode up)
{
code=bstring;
this.freq=freq;
this.value=value;
this.lchild=lchild;
this.rchild=rchild;
this.up=up;
}
public HuffmanNode()
{
code="";
freq=0;
value=0;
lchild=null;
rchild=null;
}
public int getFreq()
{
return freq;
}
public int getValue()
{
return value;
}
public String getCode()
{
return code;
}

}
PriorityQueue.java
public class PriorityQueue

{

private DLNode head,tail;


private int size=0;
private int capacity;
private HuffmanNode obj[];
public PriorityQueue(int cap)
{
head=new DLNode();
tail=new DLNode();
head.setNext(tail);
tail.setPrev(head);
capacity=cap;
obj=new HuffmanNode[capacity];
}
public PriorityQueue()
{
head=new DLNode();
tail=new DLNode();
head.setNext(tail);
tail.setPrev(head);
capacity=1000;
obj=new HuffmanNode[capacity];
}
public void insertM(HuffmanNode o)throws Exception
{
if (size==capacity)
throw new Exception("Queue is full");

if (head.getNext()==tail)
{
DLNode d=new DLNode(tail,head,o);
head.setNext(d);
tail.setPrev(d);
}
else
{
DLNode n=head.getNext();
HuffmanNode CurrenMax=null;
int key=o.getFreq();
while (true)
{

if (n.getElement().getFreq()>key)
{

DLNode second=n.getPrev();

DLNode huf=new DLNode(n,second,o);


second.setNext(huf);
n.setPrev(huf);
break;
}
if (n.getNext()==tail)
{
DLNode huf=new DLNode(tail,n,o);
n.setNext(huf);
tail.setPrev(huf);
break;
}
n=n.getNext();
}
}
size++;
}

public HuffmanNode removeFirst() throws Exception


{

if(isEmpty())
throw new Exception("Queue is empty");

HuffmanNode o=head.getNext().getElement();
DLNode sec=head.getNext().getNext();
head.setNext(sec);
sec.setPrev(head);
size--;
return o;
}
public HuffmanNode removeLast() throws Exception
{
if(isEmpty())
throw new Exception("Queue is empty");
DLNode d=tail.getPrev();
HuffmanNode o=tail.getPrev().getElement();
tail.setPrev(d.getPrev());
d.getPrev().setNext(tail);
size--;
return o;
}

public boolean isEmpty()
{
if(size==0)return true;
return false;
}
public int sizeQ()
{
return size;
}
public HuffmanNode first()throws Exception
{
if(isEmpty())
throw new Exception("Queue is empty");
return head.getNext().getElement();
}

public HuffmanNode Last()throws Exception


{
if(isEmpty())
throw new Exception("Queue is empty");
return tail.getPrev().getElement();
}
}

Table.java

import java.io.*;

class Table implements Serializable


{
private String FileName;
private int fileSize,arr[],size=0,front=0;
public Table(int fileSize,String FileName)
{
arr=new int[256];
this.FileName=FileName;
this.fileSize=fileSize;
}
public void push(int c)
{
if(size>=256)
{
// the table holds exactly 256 counters; writing one more would overflow arr
System.out.println("Error in record");
return;
}
arr[size]=c;
size++;
}
public int originalSize()
{
return fileSize;
}
public int pop()
{
if(size<1)
{
// nothing left to read; callers treat 0 as "byte value not present"
System.out.println("Error in record");
return 0;
}
int rt=arr[front++];
size--;
return rt;
}

public String fileName()


{
return FileName;
}
public int recSize()
{
return size;
}
}


ENCLOSED:

Soft copy of the project in C.D.
