International Journal of Wisdom Based Computing, Vol. 1 (2), August 2011


Dictionary Based Preprocessing Methods in Text Compression - A Survey

Rexline S.J
PhD Scholar, Department of Computer Science Bharathiar University Coimbatore, Tamil Nadu, India

Robert L
Computer Science and Information System Department Community College in Al Qwaiya Shaqra University, KSA (Government Arts College, Coimbatore, India)

Abstract - In this paper, we analyze existing dictionary-based text transformation techniques used for lossless text compression, which can improve the compression performance of algorithms such as bzip2, one of the most popular block-sorting compressors. These preprocessing techniques can achieve some compression in the preprocessing stage itself while retaining enough context and redundancy for existing compression algorithms to produce better results. The standard, well-known preprocessing techniques are the Burrows-Wheeler Transform (BWT), Star Encoding, the Length Index Preserving Transformation (LIPT), StarNT, and the Word Replacement Transformation (WRT). In these preprocessing methods, the dictionary is built ahead of time as a static one and is shared by both the encoder and the decoder. The advantages of dictionary-based text preprocessing techniques are low memory consumption, simple concepts, high processing speed, and better compression rates. In this paper, we pair these preprocessing techniques with universal lossless compressors, draw conclusions from the comparison, and offer some suggestions based on the techniques mentioned above.

Keywords: Dictionary-based compression, word transformation, preprocessing, decoding, encoding.

I. INTRODUCTION

With the massive growth of the internet, digital storage systems, transmission of text files, and embedded systems, text compression has attracted much attention in recent years. Text compression algorithms reduce the redundancy in data representation, which saves storage and communication costs, so text compression remains an active research area aimed at improving methods and technologies. Although many methods already exist, they have not reached the theoretical compression limits, and better compression can always be expected as technology progresses. There are two main directions in text compression: the creation of new, specialized compression algorithms and the use of preprocessing techniques. Preprocessing techniques applied in front of existing text compressors contribute a better compression ratio. One should be aware that additional runtime memory and processing time are needed to support such methods; overall, however, compared with the gain from frequent use of the preprocessing algorithms, the memory overhead is negligibly small. The idea behind the text preprocessing approach is to make certain reversible changes to the original text file, shifting it into a form that is more redundant and offers better context to the compression algorithms. The transformed file can then be compressed by an existing compression algorithm in exactly the same way as the original, with better results. The transformation is reversible, so the decompressed file can be restored exactly to the original text without losing any information. The transformation times and runtime memory usage are small compared with those of the backend compression algorithms. The paper is organized as follows: Section II describes why preprocessing techniques are beneficial. Section III gives an overview of existing text preprocessing techniques: the Burrows-Wheeler Transform (BWT), the Star Transformation, the Length Index Preserving Transformation (LIPT), the StarNT (STARZIP) method, and finally the Word Replacement Transformation (WRT). Section IV presents experimental results measuring the performance of the preprocessing methods on the Calgary Corpus and the Large Corpus. Section V contains the concluding remarks.





II. WHY TEXT PREPROCESSING HELPS

Several observations direct our attention to preprocessing techniques. First, most English words (almost 80%) are longer than three characters, and roughly 1000 words cover most day-to-day usage. By replacing every frequent word with a code of at most three characters, we obtain a certain amount of precompression in the preprocessing step itself. Second, the transformed output must remain compressible by the backend compression algorithms: the transformation must be reversible to the original text without any loss, while providing favorable context to existing compressors such as bzip2 and PAQ6. Finally, the transformed codes can be treated as offsets of words in the transform dictionary; in other words, words are replaced with references to where each word is located in the dictionary. An alternative is to replace each dictionary word with some other short sequence of characters that gives better context to the backend text compression algorithms. These three ideas have motivated lossless text compression researchers to design well-tuned preprocessing-based compression schemes such as BWT, LIPT, Star Encoding, StarNT, and WRT.

III. EXISTING PREPROCESSING TECHNIQUES

A. Burrows-Wheeler Transformation

The compression algorithm bzip2 compresses the input text file using the Burrows-Wheeler Transform (BWT) block-sorting algorithm and Huffman coding. The BWT [5] is a block-sorting, lossless data compression algorithm that works by reversibly permuting a block of source data: the block is rearranged using a sorting algorithm, then piped through a Move-To-Front (MTF) stage, then a run-length encoding stage, and finally an entropy encoder (Huffman or arithmetic coding). The transform takes a string S of N characters, forms the N cyclic shifts of S, sorts them lexicographically, and takes the last character of each sorted rotation. A string L is formed from these characters, where the ith character of L is the last character of the ith sorted rotation. The algorithm also records the index I of the original string S in the sorted list of rotations. L and I together carry enough information to reconstruct the original string S when undoing the transformation for decompression. The BWT can thus be seen as a sequence of three stages: the initial sorting stage, which permutes the input text so that similar contexts are grouped together; the Move-To-Front stage, which converts the local symbol groups into a single global structure; and the final compression stage, which exploits the transformed data to produce efficient compressed output.

Before the BWT output is compressed, it is run through the Move-To-Front coder. This technique is ideal for sequences with the property that the occurrence of a character indicates it is likely to occur again soon. The sequence of characters is converted to a list of numbers as follows: a list of characters is maintained, and each character is represented by its position in the list; on encoding a character, it is moved to the front of the list. Thus smaller numbers are more likely to occur than larger ones. The MTF algorithm is simple, but it does a very good job of compressing streams that have been put through the BWT. The output of this stage is then passed to either a Huffman or an arithmetic coder. The choice of MTF coder is important and can affect the compression rate of the BWT algorithm. The sort order determines which contexts are close to each other in the output, so the ordering of the source alphabet can matter: according to Chapin [13], an ordering such as aeioubcdfghjklmnpqrstvwxyz can be used to improve the performance of BWT. If the file is large, the BWT algorithm splits the data into independent blocks of a predetermined size before compression, and the blocks are processed through the BWT individually. Since each block is independent, it is possible to run the BWT on multiple blocks simultaneously and achieve speedup on a parallel machine; the separate blocks are then concatenated to form the final compressed file. The sequential version of BWT processes the blocks in order, while a parallel implementation must keep track of block ordering and write the compressed blocks back to disk in the correct order. To achieve good compression, a block size of sufficient value must be chosen, at least 2 kilobytes; increasing the block size further improves the effectiveness of the algorithm, up to sizes of several megabytes.

B. Incremental Frequency Count

The Incremental Frequency Count (IFC) stage, a post-BWT stage [15], is combined with the run-length encoding (RLE), BWT, and entropy-coding stages of the algorithm in place of the Move-To-Front stage. In the MTF algorithm, each new character is moved to the front of the ordered list no matter how rarely it has appeared in the recent input. The IFC algorithm instead assigns a counter to each character and keeps the characters sorted in descending counter order. Whenever a character is read from the input stream, the position of its counter in the list is output, and the counter is incremented with a recalculated increment. Only that one counter needs to be repositioned in the list, which makes the process much faster than MTF. The counters are rescaled frequently to prevent overflows and to favor recent symbols. IFC gives high throughput, similar to the Move-To-Front stage, and at the same time good compression rates.

C. The Star Transformation

Star Encoding, introduced by Kruse and Mukherjee [2], is a lossless, reversible transformation applied to a source file prior to an existing compression algorithm. The transformation is designed to make the source file easier to compress. Star encoding works with a very large dictionary of words commonly expected in the input files. The dictionary must be prepared beforehand and made known to both the compressor and the decompressor. Every word in the dictionary has a star-encoded equivalent, in which as many letters as possible are replaced by the * character. The dictionary is partitioned into sub-dictionaries containing words of length n, 1 < n < 22, since the maximum length of an English word is taken to be 22 letters. The following coding scheme is used: the first word of length n is represented as a sequence of n * characters; the next 52 words are represented by a sequence of (n-1) * characters followed by a single letter [a-z, A-Z]; the next 52 words are represented the same way except that the letter is placed in the second position from the right; the procedure continues in this manner to obtain a total of 53n unique codes. In practice, the most frequently used words gain the highest percentage of * characters in their encodings. If carried out properly, the transformed file contains a large number of * characters, which makes it more compressible than the original plain text.
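To make subsections A and B concrete, here is a minimal sketch of the BWT forward and inverse transforms together with an MTF coder. It uses a sentinel character in place of the explicit index I, and a naive rotation sort rather than the suffix-array methods used by real implementations; the function names and structure are illustrative only.

```python
def bwt(s):
    """Forward BWT: sort all cyclic rotations, keep the last column."""
    assert "\0" not in s
    s += "\0"  # sentinel marks the original rotation (stands in for index I)
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(last):
    """Inverse BWT: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("\0"))[:-1]

def mtf_encode(data, alphabet):
    """Move-To-Front: emit each symbol's current list position, then
    move that symbol to the front, so repeated symbols map to small numbers."""
    symbols = list(alphabet)
    out = []
    for c in data:
        i = symbols.index(c)
        out.append(i)
        symbols.insert(0, symbols.pop(i))  # move accessed symbol to the front
    return out
```

Applying `mtf_encode` to the BWT output turns the grouped symbol runs into a skewed distribution of small integers, which is exactly what the final entropy coder exploits.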
D. Length Index Preserving Transformation

In LIPT [2], to create the LIPT dictionary DLIPT, the English dictionary D is sorted according to word length and partitioned into disjoint dictionaries Di, each containing words of length i, where i = 1, 2, ..., n. Each dictionary Di is then sorted in decreasing order of word frequency. Codewords are built from the letters (a-z, A-Z): at most three letters are used per codeword, and these letters increase redundancy and provide strong context to the existing compression algorithms. At most 52 length dictionaries can be created, according to the word lengths needed. During encoding, the symbol * marks a transformed word; it is followed by a letter (a-z, A-Z) denoting the word length and then by a codeword of at most three letters. The encoding steps are as follows:

Each word in the input is searched for in the dictionary D. If the input word is found in D, its length block and position within that block are identified, and the corresponding transformed word at the same position and length block in DLIPT is taken; this transformed word is the encoding of the input word. If the input word is not found in dictionary D, it is transferred as it is. Once the input text is transformed according to this procedure, the transformed text is fed to a compressor such as bzip2. The decoding steps are as follows: the received compressed text is first decompressed using the same compressor (e.g., bzip2), recovering the transformed LIPT text, and the reverse transformation is then applied. Words without * are non-transformed words and need no reverse transformation. In a transformed word, the length character gives the length block, the next (at most three) characters give the offset of the word within that block, and a capitalization mask may follow. The word is located in the original dictionary D, the transformed word is replaced with the corresponding English dictionary word, and the capitalization is then restored.

E. StarNT Transformation

In the dictionary-based multi-corpora text compression system of Weifeng Sun, Amar Mukherjee, and Nan Zhang [3], only the letters [a..z, A..Z] are used to represent codewords, in order to obtain better compression from the backend data compression algorithm. Under this constraint, each word in the dictionary is assigned a corresponding codeword. The first 26 words are assigned a, b, ..., z as their codewords; the next 26 words are assigned A, B, ..., Z. The 53rd word is assigned aa, the 54th ab and, following this order, ZZ is assigned to the 2756th word in the dictionary.
The 2757th word is assigned aaa, the 2758th word aab, and so on. Using this mapping, a total of 52 + 52*52 + 52*52*52 = 143,364 words can be included in the dictionary. The transformation uses ternary search trees, which give very fast encoding at a low memory overhead. The authors also introduced a capital-conversion technique that places an escape symbol and a flag at the end of the codeword. The scheme preserves context dependencies and yields a better compression ratio than the earlier techniques.

F. IDBE Transformation

In the Intelligent Dictionary Based Encoding method of V.K. Govindan and B.S. Shajee Mohan [12], the dictionary is produced from multiple source files given as input. Codewords are formed using the ASCII characters 33 to 250.
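The StarNT codeword assignment just described (a, ..., z, A, ..., Z, aa, ab, ...) is a bijective base-52 numbering and can be sketched as follows; the function name is ours, not from the paper.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def starnt_codeword(index):
    """Map a 0-based dictionary index to its StarNT codeword
    (bijective base-52 over [a-z, A-Z], at most three letters)."""
    code = ""
    while True:
        code = LETTERS[index % 52] + code  # least significant letter first
        index = index // 52 - 1            # bijective numbering: no zero digit
        if index < 0:
            return code
```

The 2756th word (index 2755) maps to ZZ and the 2757th (index 2756) to aaa, matching the enumeration above; in total 52 + 52^2 + 52^3 = 143,364 words fit in codewords of at most three letters.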



The first 218 words take the single ASCII characters 33 to 250 as their codes. The remaining words take, in order, permutations of two of the ASCII characters in the range 33 to 250; if more words are left over, permutations of three of the ASCII characters are used and finally, if required, permutations of four characters. The actual code is the length marker concatenated with the code from the table; the length marker serves as a guide while decoding and is represented by the ASCII characters 251 to 254, with 251 indicating a code of length 1, 252 length 2, and so on. A marker character (ASCII 255) is used to indicate the absence of a space. If a character of the text itself is one of the ASCII characters 251-255, it is written twice to indicate that it is part of the text and not a marker.

G. Two-Level Dictionary Based Text Transformation

Md. Ziaul Karim Zia, Dewan Md. Fayzur Rahman, and Chowdhury Mofizur Rahman [11] introduced a transformation technique called the Two-Level Text Compression Scheme, in which the codewords are of two types, of length 2 and 3, built from the ASCII characters 33 to 128. Words that are not present in the dictionaries are not changed into codewords and are written between the symbols < and >. Two-letter codewords start with a letter or keyboard symbol other than punctuation and the symbols $ # < >; three-letter codewords start with a digit. There is no codeword for punctuation: punctuation is never changed into a codeword, and no codeword starts with punctuation. Words beginning with a capital letter are changed to their lowercase equivalents, and to denote this change the flag $ is placed in front of the respective codeword; similarly, the flag # marks the conversion of a fully uppercase word to its lowercase form. If all letters of a word are lowercase, no flag ($ or #) precedes its codeword. The spaces between codewords are removed altogether to save space.
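Returning to the IDBE scheme of subsection F, its length-marked codeword construction can be sketched as follows. This is our interpretation of the description above; the exact byte layout in [12] may differ.

```python
def idbe_codeword(index):
    """Build an IDBE-style codeword for a 0-based dictionary index: a
    length-marker byte (251 = length 1, 252 = length 2, ...) followed by
    1 to 4 code bytes drawn from the ASCII range 33..250 (218 symbols)."""
    base = 218
    length, span = 1, base
    while index >= span:        # skip past all shorter codes
        index -= span
        length += 1
        span *= base
    code = []
    for _ in range(length):     # write the index in base 218, low digit first
        code.append(33 + index % base)
        index //= base
    code.reverse()
    return bytes([250 + length] + code)
```

The first 218 words receive one-byte codes, the next 218^2 two-byte codes, and so on, exactly mirroring the enumeration in the text.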
H. Word Replacement Transformation

The Word Replacement Transformation (WRT) [4, 6] is the most recent algorithm in this family. Grabowski's WRT uses only the ASCII characters 128 to 255 to represent codewords. It also adopts a technique invented by Taylor to reduce the effect of end-of-line (EOL) symbols, which hamper the context, since words are usually separated by spaces: EOL symbols are substituted by spaces, and their former positions are encoded separately. Words are replaced with references to their locations in the dictionary. Moreover, Grabowski suggests n-gram replacement, space stuffing between words, and capital conversion with the flag placed at the beginning of the codeword, instead of at the end as in StarNT, with the second letter of the word capitalized as well. These ideas give better context to the existing compressors.
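A hedged sketch of WRT-style word replacement with the capital-conversion flag at the front of the codeword. The dictionary, the '^' flag character, and the single-byte codewords here are illustrative assumptions; Grabowski's implementation additionally performs EOL coding, n-gram replacement, and space stuffing, which are omitted.

```python
def wrt_encode(text, dictionary, flag="^"):
    """Replace dictionary words with their codewords; a leading flag
    marks words whose first letter was capitalized (capital conversion)."""
    out = []
    for word in text.split(" "):
        key = word.lower()
        if key in dictionary:
            prefix = flag if word[:1].isupper() else ""
            out.append(prefix + dictionary[key])
        else:
            out.append(word)  # words not in the dictionary pass through unchanged
    return " ".join(out)
```

Placing the flag before the codeword (rather than after, as in StarNT) keeps the flag adjacent to the context that predicts capitalization, which is the design point the text attributes to WRT.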



IV. EXPERIMENTAL RESULTS

In general, text preprocessing algorithms give high compression gains on average for textual data, since they are designed specifically for it. If the preprocessing algorithms are applied to non-textual data, the results can lead to wrong conclusions; to obtain optimal compression results, it is better to apply text preprocessing algorithms only to text files. The files listed in Table I are the text files of the Calgary Corpus; the other files (geo, obj1, obj2, and pic) are non-textual. In this section we compare the performance of the existing preprocessing techniques using bzip2 and PAQ6 as the backend algorithms. Our measurements report compression results in terms of average BPC (bits per character). Table I shows that some precompression already takes place during the transformation phase itself, since the transformed text is smaller than the original text file. We first compare LIPT with StarNT: StarNT gives a better compression ratio and also runs faster than LIPT. The average encoding and decoding times of LIPT and StarNT are given in Table II. In StarNT, decoding is faster than encoding because decoding uses a hash function instead of the ternary search tree used for encoding. We also compared the compression of bzip2+StarNT and bzip2+LIPT with plain bzip2; the results are included in Table III and Figure 1. The BPC figures are rounded to two decimal places, and the percentage improvement factors are calculated from the rounded BPC values. The average compression using the bzip2 algorithm alone is 2.36 BPC, while bzip2 together with the LIPT technique achieves 2.06 BPC, a clear improvement.
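BPC is simply eight times the compressed size divided by the original size; a small helper (ours, not from the paper) reproduces the improvement figures quoted for Table III.

```python
def bpc(original_bytes, compressed_bytes):
    """Bits per character of the compressed output relative to the input."""
    return 8.0 * compressed_bytes / original_bytes

def improvement_percent(baseline_bpc, new_bpc):
    """Percentage BPC improvement of a preprocessed pipeline over a baseline."""
    return (baseline_bpc - new_bpc) / baseline_bpc * 100.0
```

For instance, `improvement_percent(2.36, 2.06)` gives about 12.7%, the gain of bzip2+LIPT over plain bzip2 on the Calgary text files.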
StarNT shows better compression than LIPT, with an average of 1.94 BPC. Although StarNT outperforms LIPT, the Word Replacement Transformation (WRT) gives better compression than StarNT. To compare StarNT with WRT, the compression algorithm PAQ6 is applied after both. Table IV shows the compression results obtained for StarNT and WRT with the bzip2 and PAQ6 compressors; WRT+PAQ6 gives the highest overall compression compared with all the other preprocessing techniques.



TABLE I : COMPARISON OF ORIGINAL FILE WITH WRT-TRANSFORMED FILE

File Name      File Size (KB)   WRT Transformed Size (KB)   Gain %
Bib.txt              109                  82                  25
Book1.txt            751                 477                  36
Book2.txt            597                 360                  37
News.txt             369                 270                  27
Paper1.txt            52                  31                  40
Paper2.txt            81                  48                  41
Progc.txt             39                  29                  26
Progl.txt             70                  54                  34
Progp.txt             49                  41                  16
Trans.txt             92                  81                  12
Bible.txt           3983                2543                  36
World192.txt        2416                1469                  39



Figure 1. Comparative compression ratio of bzip2, bzip2+LIPT, and bzip2+StarNT.

TABLE II : RUNTIME COMPARISON OF LIPT AND STARNT (IN SECONDS)

Corpus        LIPT Encoding   LIPT Decoding   StarNT Encoding   StarNT Decoding

As discussed earlier, hashing enhances the speed of the StarNT decoding process. WRT is based on a hashing technique, which speeds up both the encoding and the decoding process. Hashing is faster than a ternary search tree, but its drawback is a higher memory requirement. LIPT, StarNT, and WRT all replace words with their references in the dictionary. The IDBE and Two-Level dictionary-based schemes are preprocessing methods in which the words found in the dictionary are replaced with shorter codewords. Compared with Star Encoding, IDBE uses all the ASCII characters from 33 to 255 for codeword generation and gives good compression rates; the method is appreciated for its simplicity and clarity. Table V shows a comparative study of IDBE with Star Encoding and demonstrates that IDBE performs well. The Two-Level dictionary-based text compression scheme also gives good compression, about 75% (a compression ratio of 2.01 bits per input character).
TABLE IV : COMPARATIVE RESULTS OF STARNT WITH WRT (BPC)

Corpus           Bzip2+StarNT   Bzip2+WRT   StarNT+PAQ6   WRT+PAQ6
Calgary Corpus       2.297         2.225        1.828        1.772
Large Corpus         1.976         1.851        1.688        1.602

Calgary            1.66            1.45             0.42              0.18
Canterbury         5.70            5.56             1.26              0.85
Gutenberg          6.89            6.22             1.68              1.12
Average            3.75            3.58             0.89              0.54

TABLE III : COMPARATIVE COMPRESSION RESULTS OF STARNT WITH LIPT (BPC)

File Name   File Size (Bytes)   Bzip2   Bzip2+LIPT   Bzip2+StarNT
bib               111261         1.97      1.93          1.71
book1             768771         2.42      2.31          2.28
book2             610856         2.06      1.99          1.92
news              377109         2.52      2.45          2.29
paper1             53161         2.46      2.33          2.21
paper2             82199         2.44      2.26          2.14
progc              39611         2.53      2.44          2.32
progl              71646         1.74      1.66          1.58
progp              49379         1.74      1.72          1.69
trans              93695         1.53      1.47          1.22
Average BPC                      2.36      2.06          1.94


Figure 2. Compression results of StarNT with WRT

TABLE V : COMPARISON OF BWT, BWT+STAR ENCODING AND BWT+IDBE (BPC)

Corpus              BWT    BWT with * Encoding   BWT with IDBE
Calgary Corpus      2.78          2.52               2.39
Canterbury Corpus   2.38          2.26               2.08

Moreover, WRT works faster than StarNT. The author of StarNT used a ternary search tree for maintaining the dictionary.

The Incremental Frequency Count stage is compared to Move-To-Front in terms of compression rates and speeds on the Calgary and large Canterbury corpora.
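One step of the IFC stage being compared here (subsection B) can be sketched as follows. This is a simplification: a fixed increment is used, and the adaptive increment and periodic counter rescaling of [15] are omitted.

```python
def ifc_step(counters, order, ch, increment=32):
    """Emit the current rank of ch, bump its counter, and bubble only
    that one entry toward the front of the ranking list -- cheaper than
    MTF, which always moves the symbol all the way to the front."""
    rank = order.index(ch)          # output value for the entropy coder
    counters[ch] += increment
    pos = rank
    while pos > 0 and counters[order[pos]] > counters[order[pos - 1]]:
        order[pos], order[pos - 1] = order[pos - 1], order[pos]
        pos -= 1
    return rank
```

Because only the processed symbol's counter changes, at most one entry needs repositioning per input character, which is the source of IFC's speed advantage over MTF noted above.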



IFC (Incremental Frequency Count) gives high throughput, similar to the Move-To-Front (MTF) stage, and at the same time good compression rates, while MTF has the advantage of being simple and fast. Table VI gives the compression rates of gzip, MTF, and IFC for the large Canterbury Corpus, and Table VII displays the compression rates of BWT, MTF, and IFC on the files of the Calgary Corpus. For comparison, the same compression scheme is used with an IFC stage in place of MTF.

V. CONCLUSION

We compared the sizes of the original files with those of the compressed files produced using the LIPT, StarNT, and WRT preprocessing techniques. It is evident that bzip2+StarNT provides better compression performance than LIPT while maintaining convincing compression and decompression speeds. StarNT uses a ternary search tree to accelerate the encoding transformation and a hashing method to speed up decoding. The transform dictionary created by all the preprocessing methods is a static one of about 0.5 MB, shared by both the transform encoder and the transform decoder. StarNT works better than LIPT when applied with a backend compressor. Compared with StarNT, WRT performs well on larger text files, providing about 2% better compression on the Large Corpus.

TABLE VII : COMPARISON OF BWT, MTF AND IFC (BPC)

File Name     BWT     MTF      IFC
Bib.txt       2.02    1.912    1.887
Book1.txt     2.48    2.320    2.257
Book2.txt     2.10    1.981    1.941
News.txt      2.56    2.449    2.406
Paper1.txt    2.52    2.414    2.386
Paper2.txt    2.50    2.373    2.336
Progc.txt     2.54    2.454    2.429
Progl.txt     1.75    1.683    1.666
Progp.txt     1.74    1.665    1.662
Trans.txt     1.52    1.446    1.441
Average       2.173   2.0697   2.0411

REFERENCES

[1] R. Franceschini, H. Kruse, N. Zhang, R. Iqbal, and A. Mukherjee, Lossless, Reversible Transformations that Improve Text Compression Ratio, Project paper, University of Central Florida, USA, 2000.
[2] F. Awan and A. Mukherjee, LIPT: A Lossless Text Transform to Improve Compression, Proceedings of the International Conference on Information Technology: Coding and Computing, April 2001.
[3] W. Sun, A. Mukherjee, and N. Zhang, A Dictionary-based Multi-Corpora Text Compression System, Proceedings of the 2003 IEEE Data Compression Conference, March 2003.
[4] P. Skibinski, Sz. Grabowski, and S. Deorowicz, Revisiting dictionary-based compression, Software: Practice and Experience, vol. 35, no. 15, pp. 1455-1476, 2005.
[5] M. Burrows and D.J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center, Research Report 124, 1994.
[6] P. Skibinski, Two-level directory based compression, Technical report, Wroclaw, 30 November 2004.
[7] J. Abel and W. Teahan, Universal Text Preprocessing for Data Compression, IEEE Transactions on Computers, 54(5):497-507, 2005.
[8] H. Kruse and A. Mukherjee, Preprocessing Text to Improve Compression Ratios, Proceedings of the 1998 IEEE Data Compression Conference, Los Alamitos, California, 1998.
[9] R.Y.K. Isal, A. Moffat, and A.C.H. Ngai, Enhanced Word-Based Block-Sorting Text Compression, Proceedings of the 25th Australasian Computer Science Conference, Melbourne, January 2002, pp. 129-138.
[10] N. Horspool and G. Cormack, Constructing Word-Based Text Compression Algorithms, Proceedings of the 1992 IEEE Data Compression Conference, IEEE Computer Society Press, Los Alamitos, California, 1992, pp. 62-71.
[11] Md. Ziaul Karim Zia, Dewan Md. Fayzur Rahman, and Chowdhury Mofizur Rahman, Two-Level Dictionary-Based Text Compression Scheme, Proceedings of the 11th International Conference on Computer and Information Technology (ICCIT 2008).
[12] V.K. Govindan and B.S. Shajee Mohan, IDBE: An Intelligent Dictionary Based Encoding Algorithm for Text Data Compression for High Speed Data Transmission Over Internet.
[13] B. Chapin, Higher Compression from the Burrows-Wheeler Transform with New Algorithms for the List Update Problem, Ph.D. Dissertation, University of North Texas, 2001.
[14] Umesh S. Bhadade and A.I. Trivedi, Lossless Text Compression using Dictionaries, International Journal of Computer Applications, January 2011.
[15] J. Abel, Incremental Frequency Count: A post BWT-stage for the Burrows-Wheeler Compression Algorithm, Software: Practice and Experience, vol. 37, no. 3, pp. 247-265, March 2007.











Figure 3. Compression results.

TABLE VI : COMPARISON OF GZIP, MTF AND IFC (BPC)

File Name      GZIP    MTF     IFC
bible.txt      2.330   1.508   1.471
E.coli         2.244   1.989   1.973
world192.txt   2.337   1.333   1.309
Average        2.304   1.610   1.584