
International Journal of Advanced Computer Science, Vol. 1, No. 6, Pp. 215-219, Dec. 2011.

Text Compression Algorithm Using Bits for Character Representation


Chouvalit Khancome
Abstract
Full text compression is one of the most important topics in computer science. It requires both storing a large given text in the smallest possible space and an efficient algorithm for compressing the text. This article introduces a new algorithm for full text compression. The solution divides the source data into suitable blocks and assigns a position to every character in each block. These positions are then converted into bit form. Empirically, the new algorithm is efficient for source data that takes several bytes per character. The experimental results showed space savings ranging from a minimum of 11.59% to a maximum of 76.56%.

Manuscript
Received: 27 Sep. 2011
Revised: 24 Nov. 2011
Accepted: 15 Dec. 2011
Published: 15 Jan. 2012

Keywords
text compression, inverted index, bit-level compression, character based compression

1. Introduction
Full text compression reduces the space that the source data occupies when it is stored in a compressed file. It requires both an excellent data structure and an efficient algorithm to compress and decompress the source data [1]-[15].
Bit-level text compression has recently emerged as a popular principle for large-scale data compression. Its advantage is the ability to keep a large given text in fewer bits. Several algorithms ([4], [5], [6], and [7]) are efficient solutions based on the bit-level representation. They are alternatives to the keyword approaches of Huffman, Ziv-Lempel, and Factor (e.g., [1], [2], [3], and [11]). Even though the bit-level principle is more important and efficient than the traditional principles, new algorithms remain both a need and a challenge for algorithm developers.
This research article presents a new bit-level algorithm for full text compression. The solution provides a new data structure and new algorithms for compression and decompression. The approach uses bits to store the positions of the characters that occur in the source data. The main idea is that the source data is divided into several blocks called documents. All characters in each document are then given positions and represented in bit form. In theory, the new algorithm is suitable for full text compression, especially for targets that take several bytes per character. The experimental results, based on alphabet sizes of 1 to 26 and 160,000 characters, showed that the original sources were compressed by a minimum of 11.59% and a maximum of 76.56%. Additionally, the compression results were compared with free and trial versions of popular applications such as WinRar, WinZip, TurboZip, BitZipper, and Zipper X.
The rest of this paper is organized as follows. Section 2 shows the principle derivation and related works. Section 3 describes the compression structure. Section 4 presents the algorithms for compression and decompression. Section 5 shows the experimental results, and Section 6 is the conclusion.
This work was supported by Rajanagarindra Rajabhat University. Mr. Chouvalit Khancome is with the Department of Computer Science, Faculty of Science and Technology, Rajanagarindra Rajabhat University, 244 Marupong Road, Aumpher Meung, Chachoengsao Province, Thailand, 24000. E-mail: chouvalit.k@hotmail.com, sk-aran01@yahoo.com

2. Principle Derivation and Related Works


This section first shows how the original idea of the inverted index is adapted to form a new data structure. Sub-section B then reviews the related works on text compression algorithms.
A. The inverted index derivation
The idea of storing positions comes from the inverted index in [8], [9], and [10]. First, these sources build the structure by analyzing all keywords in all target documents. Second, the position of each keyword is recorded while the keywords are analyzed. The recorded positions are then stored as integers called posting lists. Finally, the posting lists are combined with other data structures such as the B+ tree, the suffix array, and the suffix tree.
In detail, the inverted lists construction represents the words of all target documents in the form <document ID, word:pos>, where document ID is the number identifying a document, word is a keyword from the vocabulary, and pos is the position at which the word occurs in that document. The documents are written as $D = \{D_1, D_2, D_3, \ldots, D_n\}$, where each $D_i$ ($1 \le i \le n$) contains various keywords at various positions. In an implementation, each document is identified by a unique number from $D_1$ to $D_n$, and every keyword w in a document is given its occurrence position. For instance, given the keywords wa:1, wb:2, wc:3, ..., the entry wa:1 means that the keyword wa appears at position 1.
Applying this idea, the source data (the target text) is divided into several documents, which play the role of the documents in the inverted lists. Every character in every document is then given a position and represented in an individual bit form. Example 1 shows how the characters are divided into blocks and assigned to documents.
Example 1. If the source data is T={aabaabcccaabbbaaabbbcccaaaababcccaaaabab}, then T can be divided into four documents whose characters are given the following positions.
T = {aabaabccca abbbaaabbb cccaaaabab cccaaaabab}
         D1         D2         D3         D4
Position: 1 2 3 4 5 6 7 8 9 10
D1 =      a a b a a b c c c a
D2 =      a b b b a a a b b b
D3 =      c c c a a a a b a b
D4 =      c c c a a a a b a b
In the final step, all positions are transformed into the bit form shown in Section 3.
B. The Related Works
Since the emergence of text compression, several principles have been used to compact the source data. The traditional approaches are the classic keyword-based method, the dictionary-based method, the bit-level method, and the arithmetic method. Among them, the keyword-based principle and the bit-level principle have been of great interest to researchers.
The classic algorithms of Huffman, Ziv-Lempel, and Factor use keywords to build the compressed files. Good reviews and experimental results of these algorithms can be found in [11]. However, according to [11], these algorithms saved only between 0.09% and 73.74% of the space. In the past few years, [1], [2], [3], [12], and [13] have emerged as new keyword-based algorithms.
Unfortunately, these algorithms still represent the source data by analyzing keywords, and finding a better method for storing those keywords remains a challenge.
At present, the bit-level principle is used for full text compression. It represents the source data in bit form, as in [4], [5], [6], [7], [14], and [15], which apply several bit-form techniques. The least efficient of these algorithms, [14], uses a Boolean function to map a set of groups to bits and saves only 10% of the space. A more efficient algorithm, used for compressing multimedia files, was shown in [15]; it saves at most about 20% of the space. It uses a technique called fixed-length Hamming (FLH) and can be regarded as an enhancement of the Huffman code, because it improves the way the Huffman code stores the source data. A good review can be found in [7].
A more efficient algorithm [4] is ACW(n) (Adaptive Character Word-length). ACW(n) converts the source into a binary sequence using a certain character-to-binary format. In detail, the algorithm takes the bits of the ASCII code and subdivides the bit sequence into words of n bits. It is normally used with d ≤ 256, where d is the size of the alphabet. The optimum values of n are n=9 and n=10.
The current algorithm is ACW(n,s) [7], an enhanced version of the algorithm in [4]. Al-Bahadili and Hussain added a new technique that uses a sub-sequence of bits, called the s value. The best result was obtained with n=14, which saved 80-97% of the space. Unfortunately, some cases performed poorly and saved only 2% or 50% of the space (shown in [7]).

3. Bits Form Data Structure


This section shows how positions are kept in bit form. The method consists of two phases: the inverted lists representation and the bit representation.
A. The inverted lists representation
All positions of each character in each document are collected and represented in the form character:<inverted lists form>, where the inverted lists form is written as <occurrence position:{corresponding documents}>. For instance, for the character a in Example 1, the document D1 is represented by a:<1:{1}>, <2:{1}>, <4:{1}>, <5:{1}>, and <10:{1}>. All documents in Example 1 can be represented in the same way, as shown in Example 2.
Example 2. The position form of T in Example 1 is written out in full in Table 1.
TABLE 1
BASIC IDEA TO INITIATE THE STRUCTURE

Character   Position form
a           <1:{1,2}>, <2:{1}>, <4:{1,3,4}>, <5:{1,2,3,4}>, <6:{2,3,4}>, <7:{2,3,4}>, <9:{3,4}>, <10:{1}>
b           <2:{2}>, <3:{1,2}>, <4:{2}>, <6:{1}>, <8:{2,3,4}>, <9:{2}>, <10:{2,3,4}>
c           <1:{3,4}>, <2:{3,4}>, <3:{3,4}>, <7:{1}>, <8:{1}>, <9:{1}>

Afterwards, all inverted lists are represented by the bit-forms which are shown in sub-section B.
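The division into documents and the position form can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `split_into_documents` and `position_form` are hypothetical, and positions and document numbers are counted from 1 as in Example 1.

```python
from collections import defaultdict


def split_into_documents(text, block_size):
    """Cut the source text into consecutive blocks ("documents")."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def position_form(documents):
    """Map each character to {position: [documents containing it there]}."""
    form = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(documents, start=1):
        for pos, ch in enumerate(doc, start=1):
            form[ch][pos].append(doc_id)
    # Convert to plain dicts with positions in ascending order.
    return {ch: dict(sorted(positions.items())) for ch, positions in form.items()}


# Source data of Example 1, divided into four documents of 10 characters.
T = "aabaabccca" "abbbaaabbb" "cccaaaabab" "cccaaaabab"
docs = split_into_documents(T, 10)
form = position_form(docs)

print(docs)       # ['aabaabccca', 'abbbaaabbb', 'cccaaaabab', 'cccaaaabab']
print(form["c"])  # {1: [3, 4], 2: [3, 4], 3: [3, 4], 7: [1], 8: [1], 9: [1]}
```

Each key of `form[ch]` corresponds to one entry <occurrence position:{corresponding documents}> for the character `ch`.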
International Journal Publishers Group (IJPG)

Chouvalit Khancome: Text Compression Algorithm Using Bits for Character Representation.


B. The bit-form representation
The bit form is character:<position:{D1...Dn}>. For instance, the entry a:<1:{1,2}> is represented as a:<0001:{1100}>. The bits 0001 encode the position 1 in binary, and the bits 1100 mark documents 1 and 2 as containing the character. The number of position bits depends on the length of a document D; for instance, 4 bits (e.g., 0001) cover all positions from 1 to 15. For the {corresponding documents} part, one bit is reserved for each document, so with four documents the bits 0000 stand for documents 1 to 4 from left to right. The positions of the characters in Example 2 are represented in bit form below.
Example 3. The bit form of Example 2.
Inverted lists form : Bit-level form
a : <1:{1,2}>     : <0001:{1100}>
    <2:{1}>       : <0010:{1000}>
    <4:{1,3,4}>   : <0100:{1011}>
    <5:{1,2,3,4}> : <0101:{1111}>
    <6:{2,3,4}>   : <0110:{0111}>
    <7:{2,3,4}>   : <0111:{0111}>
    <9:{3,4}>     : <1001:{0011}>
    <10:{1}>      : <1010:{1000}>
b : <2:{2}>       : <0010:{0100}>
    <3:{1,2}>     : <0011:{1100}>
    <4:{2}>       : <0100:{0100}>
    <6:{1}>       : <0110:{1000}>
    <8:{2,3,4}>   : <1000:{0111}>
    <9:{2}>       : <1001:{0100}>
    <10:{2,3,4}>  : <1010:{0111}>
c : <1:{3,4}>     : <0001:{0011}>
    <2:{3,4}>     : <0010:{0011}>
    <3:{3,4}>     : <0011:{0011}>
    <7:{1}>       : <0111:{1000}>
    <8:{1}>       : <1000:{1000}>
    <9:{1}>       : <1001:{1000}>
Each entry in Example 3 takes one byte (4 position bits plus 4 document bits), so a takes 8 bytes, b takes 7 bytes, and c takes 6 bytes. The total size is 8+7+6=21 bytes, against a source of 40 bytes in ASCII or 80 bytes in Unicode (taken here as 2 bytes per character). The saved space is therefore (40-21)/40*100=47.5% for ASCII or (80-21)/80*100=73.75% for Unicode. Because Unicode uses 2 bytes per character, its compression percentage is higher than that of ASCII. In general, the more bytes a character occupies, the greater the saving, so this approach is suitable for targets that take several bytes per character.
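The bit form and the saving percentages above can be checked with a short Python sketch. The helper names `bit_form` and `saving_percent` are hypothetical, and the widths of 4 position bits and 4 document bits are assumptions taken from the running example, not a fixed part of the method.

```python
def bit_form(position, doc_ids, position_bits=4, num_docs=4):
    """Concatenate the position bits and one presence bit per document."""
    pos_part = format(position, f"0{position_bits}b")
    # Leftmost document bit stands for document 1.
    doc_part = "".join("1" if d in doc_ids else "0" for d in range(1, num_docs + 1))
    return pos_part + doc_part


def saving_percent(original_bytes, compressed_bytes):
    """Percentage of space saved relative to the original size."""
    return 100 * (original_bytes - compressed_bytes) / original_bytes


entry = bit_form(1, {1, 2})    # the entry a:<1:{1,2}>
print(entry)                   # 00011100
print(int(entry, 2))           # 28, the entry stored as one byte

compressed = 8 + 7 + 6         # one byte per entry for a, b, and c
print(saving_percent(40, compressed))  # 47.5  (ASCII, 1 byte per character)
print(saving_percent(80, compressed))  # 73.75 (2 bytes per character)
```

With longer blocks or more documents the two parts no longer fit in a single byte, so the widths would have to be chosen from the block size and the number of blocks.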

4. Algorithms
In this section, the basic definitions and the details of the algorithms are given.
A. Basic Definitions
Definition 1. Given a text $T = t_1 t_2 t_3 \ldots t_n$ to be compressed, the number of distinct characters that appear in T is called the alphabet size and is denoted by $|\Sigma|$.
Definition 2. When the text T is divided into several blocks, the block size is denoted by Bs.
Definition 3. If T is compressed and stored, the file that stores the compressed T is called the compression file, denoted by CF.
Definition 4. If all items in CF are decompressed into a new file, that file is called the original file, denoted by OF.
Definition 5. Let NB be the number of blocks. The temporary space of size $|\Sigma| \times NB \times Bs$ is called the bit temporary space, denoted by TEMP.
Definition 6. Each block of TEMP consists of two parts: the number of the position and the occurrences of the character at that position. These parts are represented by groups of bits denoted by NP and OD, respectively.
Example 4. If a : <1:{1,2}> = <0001:{1100}>, then NP=0001 and OD=1100.
Definition 7. The composition of NP and OD is the bit form that is stored in the space, denoted by BF.
Example 5. Referring to Example 3, the entry 00011100 of a is represented in BF as 28.
B. Compressing and Decompressing Algorithms
The compressing algorithm takes the text T as input, initiates the temporary space to hold the bit representation, and prepares the space for storing the compressed file. It then reads the characters one by one and records the existence of each character by a 1 in the temporary space. The algorithm is shown below.

Algorithm 1: Compression Algorithm
Input: $T = t_1 t_2 t_3 \ldots t_n$
Output: Compressed file (CF)
1. Initiate TEMP with the size $|\Sigma| \times NB \times Bs$ (alphabet size, number of blocks, block size) and create CF for storing the file
2. Put the bits for the NP part in TEMP
3. For i=1 to NB Do
4.    For j=1 to Bs Do
5.       Read $t_{i,j}$ and convert it to its position in the alphabet
6.       Put a 1 bit at the alphabet position of $t_{i,j}$ and add it to its block in TEMP, in the OD part
7.    End For
8. End For
9. Compose NP and OD in TEMP into bytes
10. Write the bytes into CF
The compressing algorithm reads every character of the source data, from the first to the last, which takes $O(n)$ time, where n is the length of the source data. Placing the bits then requires access to the $|\Sigma|$ alphabet entries and to the NB blocks, where $|\Sigma|$ is the alphabet size and NB is the number of blocks; this step takes $O(|\Sigma| + NB)$ time. Therefore, the time complexity of compressing the source data is $O(n + |\Sigma| + NB)$.
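A simplified, in-memory round trip of the scheme can be sketched as follows. This is only an illustration under stated assumptions: it keeps, for each character and each position in a block, a bit mask over the blocks (the OD part), leaves the position number NP implicit in the list index, and does not pack the result into the bytes of CF. The names `compress`, `decompress`, and the dictionary keys are hypothetical and do not come from the paper.

```python
def compress(text, block_size):
    """Build one bit mask per (character, position); bit i marks block i."""
    alphabet = sorted(set(text))
    num_blocks = -(-len(text) // block_size)  # ceiling division
    rows = {ch: [0] * block_size for ch in alphabet}
    for i in range(num_blocks):
        block = text[i * block_size:(i + 1) * block_size]
        for j, ch in enumerate(block):
            # Leftmost bit of the mask stands for the first block.
            rows[ch][j] |= 1 << (num_blocks - 1 - i)
    return {"block_size": block_size, "num_blocks": num_blocks,
            "length": len(text), "rows": rows}


def decompress(packed):
    """Rebuild the text by placing each character wherever its bit is set."""
    bs, nb = packed["block_size"], packed["num_blocks"]
    out = [""] * packed["length"]
    for ch, masks in packed["rows"].items():
        for j, mask in enumerate(masks):
            for i in range(nb):
                if mask & (1 << (nb - 1 - i)):
                    out[i * bs + j] = ch
    return "".join(out)


T = "aabaabccca" "abbbaaabbb" "cccaaaabab" "cccaaaabab"
packed = compress(T, 10)
print(format(packed["rows"]["a"][0], "04b"))  # 1100: a is at position 1 in blocks 1 and 2
print(decompress(packed) == T)                # True

# Size of the bit temporary space: alphabet size * number of blocks * block size.
temp_bits = len(packed["rows"]) * packed["num_blocks"] * packed["block_size"]
print(temp_bits)                              # 120
```

The mask for a at position 1 equals the OD value 1100 of Example 4, and the three nested loops in `decompress` visit every bit of the temporary space once.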

For decompression, the temporary space is initiated as in the compression algorithm. The algorithm then reads the bits one by one and writes the corresponding characters into TEMP. After all bits of the CF file have been processed, the characters are written to the disk. The algorithm is shown in Algorithm 2.
Algorithm 2: Decompression Algorithm
Input: CF
Output: Original file (OF)
1. Initiate TEMP (as for compression) and the table of characters for the whole alphabet
2. While Not end of CF Do
3.    Read each byte in CF and convert it to the block of bits
4.    Analyze the bits to separate the NP part and the OD part
5.    Convert NP and OD to the characters
6.    Put all characters into TEMP
7. End While
8. Write TEMP to OF
The algorithm reads the bit sequence of each block and writes the characters to the temporary space, which holds b bits, where b is the length of a block. It repeats this reading for every character of the alphabet, which takes at most $O(|\Sigma|)$ time. Meanwhile, step 2 runs from the first byte to the last byte of the file, whose size is denoted by s. Therefore, the overall time complexity of decompressing the file is $O(b \times s \times |\Sigma|)$.

5. Experimental Results
The experiments were performed on a Dell Vostro 3400 notebook with an Intel Core i5 M560 at 2.67 GHz and 4 GB of RAM, running Windows 7 Professional (32-bit). The programs were written in NetBeans 6.9.1 with the Java compiler version 1.6 update 22. The test data is 160,000 bytes, the block size is 15, and the alphabet size ranges from 1 to 26 characters. Table 2 shows the saved space, and Table 3 compares it with popular applications such as WinRar, WinZip, TurboZip, BitZipper, and Zipper X (using the free and trial versions). The experimental results show that the maximum saved space is 76.56% and the minimum saved space is 11.59%.

TABLE 2
SHOWING THE SAVED SPACES WHEN USING 160,000 BYTES

Alphabet size   Used Space (Bytes)   Saving Space (Bytes)   % of Saving
1               37,500               122,500                76.56
2               70,252               89,748                 56.09
3               90,351               69,649                 43.53
4               102,494              57,506                 35.94
5               110,660              49,340                 30.83
6               116,443              43,557                 27.22
7               120,530              39,470                 24.66
8               124,112              35,888                 22.43
16              136,296              23,704                 14.81
26              141,449              18,551                 11.59

TABLE 3
COMPARING THE SAVED SPACES (%) WITH APPLICATIONS

Alphabet size   1       2       3       4       5       6        7        8         16        26
BitCA           76.56   56.09   43.53   35.94   30.83   27.22    24.66    22.43     14.81     11.59
WinRar          97.44   79.52   74.40   66.72   64.16   59.04    56.48    53.92     41.12     33.44
WinZip          97.77   78.44   69.66   63.88   59.55   56.33    53.55    51.00     40.22     37.45
TurboZip        97.44   79.52   71.84   66.72   61.60   59.04    56.48    53.92     41.12     36.00
BitZipper       98.17   78.88   70.00   64.22   60.00   56.77    54.00    51.44     41.55     34.55
Zipper X        98.17   78.88   67.55   32.00   -1.88   -50.00   -96.66   -142.22   -178.23   -189.33

6. Conclusion
A new full text compression algorithm is proposed. The approach uses a new bit-level data structure to hold the source data by dividing the given text into several blocks of characters. Each character is then represented by a 1 bit at its position in the corresponding alphabet entry. The approach takes $O(n + |\Sigma| + NB)$ time for text compression and $O(b \times s \times |\Sigma|)$ time for text decompression, where n is the length of the source data, $|\Sigma|$ is the alphabet size, NB is the number of blocks, b is the length of a block, and s is the size of the compressed file. The experimental results showed that 11.59-76.56% of the source data space could be saved. Furthermore, the new algorithm is efficient for source data that takes several bytes per character.


References
[1] A. Moffat & R.Y.K. Isal, "Word-based text compression using the Burrows-Wheeler transform," (2005) Information Processing and Management, vol. 4, no. 5, pp. 1175-1192.
[2] J. Adiego & P. de la Fuente, "On the use of words as source alphabet symbols in PPM," (2006) IEEE In Proceedings of Data Compression Conference, pp. 435.
[3] J. Lánský & M. Žemlička, "Text compression: Syllables," (2005) In Proceedings of the Dateso Workshop on Database, Texts, Specifications and Objects, pp. 32-45.
[4] H. Al-Bahadili & S.M. Hussain, "An adaptive character wordlength algorithm for data compression," (2008) Computers & Mathematics with Applications, vol. 55, no. 6, pp. 1250-1256.
[5] S. Nofal, "Bit-level text compression," (2007) In Proceedings of the International Conference on Digital Communications and Computer Applications, Jordan, 1, pp. 486-488.
[6] A. Rababa, "An Adaptive Bit-Level Text Compression Scheme Based on the HCDC Algorithm," (2008) M.Sc. dissertation, Amman Arab University for Graduate Studies, Jordan.
[7] H. Al-Bahadili & S.M. Hussain, "A Bit-level Text Compression Scheme Based on the ACW Algorithm," (2010) International Journal of Automation and Computing, vol. 7, no. 1, pp. 123-131.
[8] C. Monz & M. de Rijke, "Inverted Index Construction," (2006) Available: http://staff.science.uva.nl/~christof/courses/ir/transparencies/clean-w-05.pdf.
[9] O.R. Zaïane, "CMPUT 391: Inverted Index for Information Retrieval," (2001) University of Alberta. Available: http://www.cs.ualberta.ca/~zaiane/courses/cmput39-03.
[10] R. Baeza-Yates & B. Ribeiro-Neto, "Modern Information Retrieval," (1999) The ACM Press, A Division of the Association for Computing Machinery, pp. 191-227.
[11] M. Crochemore & W. Rytter, "Text Algorithms," (2010) Available: http://monge.univ-mlv.fr/~mac/REC/text-algorithms.pdf.
[12] R.Y.K. Isal & A. Moffat, "Word-Based Block-Sorting Text Compression," (2001) ACSC '01: Australasian conference on Computer science, IEEE Computer Society, 24, pp. 92-99.
[13] R.Y.K. Isal, A. Moffat, & A.C.H. Ngai, "Enhanced Word-Based Block-Sorting Text Compression," (2002) ACSC '02: Proceedings of the twenty-fifth Australasian conference on Computer science, Australian Computer Society, Australia, 25, vol. 4, pp. 129-137.
[14] G. Caire, S. Shamai, & S. Verdú, "Noiseless data compression with low density parity check codes," (2004) Advances in Network Information Theory, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, P. Gupta, G. Kramer, A.J. van Wijngaarden, Ed., vol. 66, pp. 263-284.
[15] A. Sharieh, "An enhancement of Huffman coding for the compression of multimedia file," (2004) Transactions of Engineering Computing and Technology, vol. 3, no. 1, pp. 303-305.