PERFORMANCE EVALUATION OF DATA COMPRESSION TECHNIQUES VERSUS DIFFERENT TYPES OF DATA

Doa'a Saad El-Shora
Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt

Nabil Aly Lashin
Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt

Ehab Rushdy Mohamed
Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt

Ibrahim Mahmoud El-Henawy
Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt
Abstract— Data compression plays an important role in the age of information technology and is now a very important part of everyday life. It has important applications in the areas of file storage and distributed systems. Because real-world files are usually quite redundant, compression can often reduce file sizes considerably, which in turn reduces the required storage size and transfer channel capacity. This paper surveys a variety of data compression techniques spanning almost fifty years of research and illustrates how the performance of data compression techniques varies when they are applied to different types of data. The data compression techniques Huffman, adaptive Huffman, arithmetic, LZ77, LZW, LZSS, LZHUF, LZARI, and PPM are tested against different types of data of different sizes. A framework for evaluating performance is constructed and applied to these data compression techniques.
I. INTRODUCTION
Data compression is the art or science of representing information in compact form [1]. This compact form is created by identifying and using structures that exist in the data. Data can be characters in text files, numbers that are samples of speech or image waveforms, or sequences of numbers that are generated by other processes. When considering whether the original source can be reconstructed exactly, there are two major families of compression techniques [1], [4]:

1. Lossless compression techniques.
2. Lossy compression techniques.

Figure 1. Lossless compression techniques
Figure 2. Lossy compression techniques

The development of data compression techniques for a variety of data can be divided into two phases. The first phase is usually referred to as modeling: information about any redundancy that exists in the data is extracted and described in the form of a model. The second phase is called coding, in which the difference between the data and the model is encoded, generally using a binary alphabet. Having a good model for the data is useful in estimating the entropy of the source and leads to more efficient compression techniques. There are several types of models:

1. Physical model.
2. Probability model.
3. Markov model.

A physical model is used when something is known about the physics of the data generation process, for example in speech-related applications. In many cases, however, the physics of data generation is simply too complicated to develop such a model. A probability model is the simplest statistical model for the source: it assumes that each letter generated by the source is independent of every other letter and that each occurs with the same probability. A Markov model is one of the most popular ways of representing dependence in the data; it is particularly useful in text compression, where the probability of the next letter is heavily influenced by the preceding letters.
 
II. MEASURE OF PERFORMANCE

A compression technique can be evaluated in a number of different ways:

- the complexity of the technique;
- the memory required to implement the technique;
- how fast the technique performs on a given machine;
- the amount of compression achieved; and
- how closely the reconstruction resembles the original.

In this work, the performance evaluation of data compression techniques concentrates on the last two criteria. A very logical way of measuring how well a compression technique compresses a given set of data is to look at the ratio of the number of bits required to represent the data before compression to the number of bits required to represent the data after compression. This ratio is called the compression ratio [4].
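As an illustration of this measure, the following minimal Python sketch computes the compression ratio from bit counts; the 4:1 example figure is hypothetical, not taken from the experiments in this paper.

```python
def compression_ratio(bits_before, bits_after):
    """Compression ratio: bits needed before compression / bits needed after."""
    return bits_before / bits_after

# Hypothetical example: 65,536 bits compressed down to 16,384 bits -> ratio 4.0 (4:1)
print(compression_ratio(65536, 16384))
```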
 
III. DATA COMPRESSION TECHNIQUES

Compression techniques can be divided into two fundamental and distinct categories. The first are called statistical compression techniques, as they are statistical in nature. The second are called dictionary techniques; they are currently in widespread use, largely because dictionary techniques are faster and achieve a greater degree of compression than statistical compression techniques [12], [13].
 
PPM, or prediction by partial matching, is an adaptive statistical modeling technique based on blending together different-length context models to predict the next character in the input sequence [14]. The scheme achieves greater compression than Ziv-Lempel (LZ) dictionary-based methods, which are more widely used because of their simplicity and faster execution speeds.
A. Statistical Techniques

Statistical compression techniques use the likelihood of a symbol recurring in order to reduce the number of bits needed to store the symbol.
1) Huffman Technique

A more sophisticated and efficient lossless compression technique is known as "Huffman Coding", in which the characters in a data file are converted to binary codes. These codes are prefix codes and are optimum for a given model (set of probabilities). Huffman compression is based on two observations regarding optimum prefix codes: symbols that occur more frequently (have a higher probability of occurrence) have shorter codewords than symbols that occur less frequently, and the two symbols that occur least frequently have codewords of the same length. The Huffman technique is obtained by adding a simple requirement to these two observations: the codewords corresponding to the two lowest-probability symbols differ only in the last bit. That is, if γ and δ are the two least probable symbols in an alphabet and the codeword for γ is m∗0, then the codeword for δ is m∗1, where m is a string of 1s and 0s and ∗ denotes concatenation [2], [3].
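As a rough illustration of these observations, the Python sketch below builds a prefix code by repeatedly merging the two least probable entries, so the two lowest-probability symbols end up with codewords that differ only in the last bit. The example alphabet is invented; this is not the implementation evaluated in this paper.

```python
import heapq

def huffman_codes(probabilities):
    """Build a Huffman prefix code from a {symbol: probability} dict."""
    # Heap entries: (group probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable group
        p2, _, codes2 = heapq.heappop(heap)   # second least probable group
        # The two merged groups get codewords differing only in the last (deepest) bit:
        # prepend '0' to every codeword in one group and '1' to every codeword in the other.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical alphabet: more probable symbols receive shorter codewords.
print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
```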
2) Adaptive Huffman Technique

Huffman coding requires knowledge of the probabilities of the source sequence. If this knowledge is not available, Huffman coding becomes a two-pass procedure: the statistics are collected in the first pass, and the source is encoded in the second pass. In order to convert this technique into a one-pass procedure, techniques for adaptively developing the Huffman code were devised based on the statistics of the symbols already encountered. Theoretically, to encode the (k+1)th symbol using the statistics of the first k symbols, the code would have to be recomputed with the Huffman coding procedure each time a symbol is transmitted. However, this is not a very practical approach because of the large amount of computation involved. Adaptive Huffman coding solves this problem [1]. In the adaptive Huffman coding procedure, neither the transmitter nor the receiver knows anything about the statistics of the source sequence at the start of transmission. The tree at both the transmitter and the receiver consists of a single node that corresponds to all symbols not yet transmitted and has a weight of zero. As transmission progresses, nodes corresponding to transmitted symbols are added to the tree, and the tree is reconfigured using an update procedure. Before transmission begins, a fixed code for each symbol is agreed upon between transmitter and receiver [1], [4].
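The following deliberately naive Python sketch (reusing huffman_codes from above, and assuming the alphabet is known in advance with every count initialized to one) only illustrates the one-pass idea: encoder and decoder can maintain the same running counts and derive the same code, so no statistics need to be transmitted. It is exactly the "recompute the code for every symbol" approach described above as impractical; real adaptive Huffman coders (FGK, Vitter) instead update the tree incrementally.

```python
from collections import Counter

def naive_adaptive_encode(text):
    """One-pass encoding with a Huffman code rebuilt from running counts."""
    counts = Counter({ch: 1 for ch in set(text)})   # assumed known alphabet, counts start at 1
    bits = []
    for ch in text:
        total = sum(counts.values())
        codes = huffman_codes({s: c / total for s, c in counts.items()})
        bits.append(codes[ch])   # encode with the statistics of the symbols seen so far
        counts[ch] += 1          # update the model after encoding the symbol
    return "".join(bits)
```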
3) Arithmetic Technique

It is more efficient to generate codewords for groups or sequences of symbols than to generate a separate codeword for each symbol in the sequence. However, the Huffman approach becomes impractical for long sequences: to obtain Huffman codes for particular sequences of length m, codewords must be generated for all possible sequences of length m, which causes exponential growth in the size of the codebook. It is desirable to assign a codeword to a particular sequence without having to generate codes for all sequences of that length. The arithmetic coding technique fulfills this requirement. In arithmetic coding, a unique identifier or tag is generated for the sequence to be encoded. This tag corresponds to a binary fraction, which becomes the binary code for the sequence [3], [4].
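A minimal Python sketch of tag generation follows. It uses floating point for clarity (so it only works for short sequences) and an assumed fixed probability model; practical arithmetic coders use integer arithmetic with interval rescaling.

```python
def arithmetic_tag(sequence, probabilities):
    """Narrow the interval [0, 1) symbol by symbol and return a tag inside it."""
    # Cumulative intervals, e.g. {'a': (0.0, 0.6), 'b': (0.6, 0.9), ...}
    intervals, start = {}, 0.0
    for sym, p in probabilities.items():
        intervals[sym] = (start, start + p)
        start += p

    low, high = 0.0, 1.0
    for sym in sequence:
        span = high - low
        sym_low, sym_high = intervals[sym]
        low, high = low + span * sym_low, low + span * sym_high
    return (low + high) / 2   # any value in [low, high) identifies the sequence

# Hypothetical model and sequence:
print(arithmetic_tag("aab", {"a": 0.6, "b": 0.3, "c": 0.1}))
```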
 
B. Dictionary Techniques

In many applications, the output of the source consists of recurring patterns. A classic example is a text source in which certain patterns or words recur frequently, while other patterns occur rarely or not at all. A very reasonable approach to encoding such sources is to keep a list, or dictionary, of frequently occurring patterns. When these patterns appear in the source output, they are encoded with a reference to the dictionary. If a pattern does not appear in the dictionary, it is encoded using some other, less efficient method. In effect, the input is divided into two classes: frequently occurring patterns and infrequently occurring patterns [9], [10].
 
 
1) LZ77 Technique

Lempel-Ziv [1977], or LZ77, is an adaptive dictionary-based compression technique. LZ77 exploits the fact that words and phrases within a text file are likely to be repeated. When there is repetition, a repeated phrase can be encoded as a pointer to an earlier occurrence, with the pointer accompanied by the number of characters to be matched. The encoder examines the input sequence through a sliding window. This window consists of two parts: a search buffer that contains a portion of the recently encoded sequence, and a look-ahead buffer that contains the next portion of the sequence to be encoded. In practice the two buffers can be quite large [15]. Each step is encoded as a triple (o, l, c), where o is the offset (the distance of the pointer from the look-ahead buffer), l is the length of the longest match, and c is the codeword corresponding to the symbol in the look-ahead buffer that follows the match. It is a very simple adaptive scheme that requires no prior knowledge of the source and makes no assumptions about the characteristics of the source [3], [4].
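The sketch below is a simplified Python illustration of the (o, l, c) triples described above. The buffer sizes are invented and the match search is a brute-force scan; practical LZ77 implementations use much larger buffers and faster search structures.

```python
def lz77_encode(data, search_size=32, lookahead_size=8):
    """Encode `data` (a string) as a list of (offset, length, next_char) triples."""
    i, triples = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search of the search buffer for the longest match.
        for j in range(max(0, i - search_size), i):
            length = 0
            while (length < lookahead_size - 1
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        # The symbol following the match is sent as the third element of the triple.
        triples.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return triples

# Hypothetical input with obvious repetition:
print(lz77_encode("abcabcabcx"))   # -> [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 6, 'x')]
```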
2) LZW Technique

LZW is a universal lossless data compression technique created by Abraham Lempel, Jacob Ziv, and Terry Welch [16], [17]. The technique is simple to implement and has the potential for very high throughput in hardware implementations [6]. LZW compression creates a table of strings commonly occurring in the data being compressed and replaces the actual data with references into the table. The table is formed during compression at the same time the data is encoded, and during decompression at the same time the data is decoded [9]. LZW removes the necessity of encoding the second element of the pair (i, c); that is, the encoder sends only the index into the dictionary. For this to work, the dictionary has to be primed with all the letters of the source alphabet. The technique is surprisingly simple: it replaces strings of characters with single codes and performs no analysis of the incoming text [5].
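A minimal Python sketch of the index-only output follows, with the dictionary primed with the 256 single-byte characters. The input string is only illustrative, and a real implementation would also bound the dictionary size and pack the indices into bit codes.

```python
def lzw_encode(text):
    """LZW: emit only dictionary indices; the dictionary is primed with single characters."""
    dictionary = {chr(c): c for c in range(256)}   # prime with the source alphabet
    next_code = 256
    w, output = "", []
    for ch in text:
        wc = w + ch
        if wc in dictionary:
            w = wc                          # keep extending the current phrase
        else:
            output.append(dictionary[w])    # emit the index of the longest known phrase
            dictionary[wc] = next_code      # add the new phrase to the table
            next_code += 1
            w = ch
    if w:
        output.append(dictionary[w])        # flush the final phrase
    return output

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))   # classic illustrative input
```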
3) LZSS Technique

This scheme was initiated by Ziv and Lempel [18], [19]; an implementation using a binary tree was proposed by Bell. The technique is quite simple. A ring buffer is kept, which initially contains "space" characters only. Several letters are read from the file into the buffer, which is then searched for the longest string that matches the letters just read; its length and position in the buffer are sent. If the buffer size is 4096 bytes, the position can be encoded in 12 bits; if the match length is represented in four bits, the <position, length> pair is two bytes long. If the longest match is no more than two characters, just one character is sent without encoding, and the process is restarted with the next letter. One extra bit must be sent each time to tell the decoder whether a <position, length> pair or the code of a character follows [4].
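The following simplified Python sketch shows the decision described above: emit a <position, length> pair only when the match is longer than two characters, otherwise emit the literal character. It produces tagged tokens instead of packed bits, and the window and length limits (4096 and 18) are only meant to mirror the figures in the text.

```python
def lzss_encode(data, window=4096, max_len=18, threshold=2):
    """Emit ('literal', char) or ('match', position, length) tokens.

    A real encoder would pack each token behind a one-bit flag telling the
    decoder which kind follows, and would use a ring buffer preloaded with
    spaces rather than a plain string.
    """
    i, tokens = 0, []
    while i < len(data):
        best_len, best_pos = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len > threshold:              # worth sending a <position, length> pair
            tokens.append(("match", best_pos, best_len))
            i += best_len
        else:                                 # match of 2 or fewer: send the character itself
            tokens.append(("literal", data[i]))
            i += 1
    return tokens
```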
4) LZARI Technique

In each step the LZSS technique sends either a character or a <position, length> pair. Among these, the character "e", for instance, appears more frequently than "x", and a <position, length> pair of length 3 may be more common than one of length 18. Thus, if the more frequent items are encoded in fewer bits and the less frequent ones in more bits, the total length of the encoded text is reduced. This observation suggests using arithmetic coding, preferably of the adaptive kind, along with LZSS [4], [7].
5) LZHUF Technique

LZHUF, the technique of Haruyasu Yoshizaki, replaces LZARI's adaptive arithmetic coding with adaptive Huffman coding. LZHUF encodes the most significant 6 bits of the position in its 4096-byte buffer by table lookup, so that more recent, and hence more probable, positions are coded in fewer bits, while the remaining 6 bits are sent verbatim. Because Huffman coding encodes each letter into a fixed number of bits, the table lookup can be implemented easily [7].
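As a small illustration of this position handling, the Python sketch below splits a 12-bit buffer position into the 6 most significant bits (to be coded by table lookup) and the 6 least significant bits (sent verbatim); the function name is ours, for illustration only.

```python
def split_position(pos):
    """Split a position in a 4096-byte buffer (0..4095) into two 6-bit fields."""
    assert 0 <= pos < 4096
    upper = pos >> 6      # 6 most significant bits: coded via the lookup table
    lower = pos & 0x3F    # 6 least significant bits: transmitted verbatim
    return upper, lower

print(split_position(1234))   # -> (19, 18), since 1234 == 19*64 + 18
```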
C. PPM Techniques

PPM, or prediction by partial matching, is an adaptive statistical modeling technique based on blending together different-length context models to predict the next character in the input sequence. A series of improvements, called PPMC, was described that is tuned to improve compression and increase execution speed; the exclusion principle is also used to improve performance. PPM relies on arithmetic coding to obtain very good compression performance. PPM combines several fixed-order context models to predict the next character in an input sequence. The prediction probabilities for each context in the model are calculated from frequency counts, which are updated adaptively, and the symbols that occur are encoded relative to their predicted distribution using arithmetic coding [10].
1) PPMC Technique

PPMC (prediction by partial matching without exclusion) assigns a probability to the escape character using what is called technique C, as follows: at any level, within the current context, let n_t be the total number of symbols seen previously and let n_d be the number of distinct symbols seen. Then the probability of the escape character is given by n_d / (n_d + n_t), and any character that has appeared in this context n_c times has probability n_c / (n_d + n_t). The intuitive explanation of this technique, based on experimental evidence, is that if many distinct symbols are encountered, the escape character will have a higher probability, but if these distinct symbols tend to appear many times each, the probability of the escape character decreases. The PPM technique using technique C for probability estimation is called the PPMC technique.
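A minimal Python sketch of technique C for a single context follows; the counts dictionary is a hypothetical context table, not data from this paper.

```python
def ppmc_probabilities(counts):
    """Technique C estimates for one context.

    `counts` maps each symbol already seen in this context to its count n_c.
    n_t is the total number of symbols seen, n_d the number of distinct ones.
    P(escape) = n_d / (n_d + n_t) and P(symbol) = n_c / (n_d + n_t).
    """
    n_t = sum(counts.values())
    n_d = len(counts)
    denom = n_d + n_t
    symbol_probs = {sym: c / denom for sym, c in counts.items()}
    escape_prob = n_d / denom
    return symbol_probs, escape_prob

# Hypothetical context in which 'e' was seen 3 times and 't' twice:
print(ppmc_probabilities({"e": 3, "t": 2}))   # escape probability = 2/7
```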
2) PPMC with Exclusion Technique

PPMC can be modified by using exclusion. This modification improves the compression ratio, but it is slower than plain PPMC. The exclusion principle states that: If a
