
Content-Based Textual Big Data Analysis and Compression

Fei Gao, Ananya Dutta, Jiangjiang Liu
Department of Computer Science, Lamar University
fgao@lamar.edu, adutta@lamar.edu, jliu@lamar.edu

ABSTRACT
With the growing enhancement of technology and the Internet, the number of people using the Internet is increasing daily. Users are engaged in web searching and in accessing different types of websites, such as social media and banking. As a result, a large volume of data is generated every day, and this data must be loaded for analysis. However, memory space and transmission time are the most important limiting factors in processing. In most cases, we only need to extract the important textual data from these vast raw datasets. In this work, we propose content-based compression (CBC) for textual data analysis on the basis of the Huffman code. The data is pre-analyzed to find very high frequency words, and a shorter symbol is then inserted to replace those words. This compression approach maintains the original format of the data so that the compressed data structure can be completely transparent to the Hadoop platform. The algorithm is evaluated on a set of real-world data sets (e.g. Amazon movie reviews, food reviews, etc.), and a 52.4% average data size reduction is obtained in the experiment. Though this gain may seem modest, it can supplement all other compression optimization techniques. Furthermore, the proposed technique can be effectively applied for big data optimization purposes.

CCS Concepts
• Information systems → Data management systems → Data structures → Data layout → Data compression

Keywords
Compression; text-based encoding; Huffman Tree Algorithm; Hadoop

1. INTRODUCTION
Compression can be viewed as an encoding technique in which the data contents are represented in a special way that satisfies the purpose of the encoding. Compression is a subdivision of information theory [1] focused on encoding and decoding in such a way as to obtain effective results. The basic characteristic of compression is that it takes sequences of characters or strings in any form, such as ASCII, and transforms the sequences into new characters or symbols, in bits, where the information remains the same but is reduced in length. The most important target areas for compression are data storage and transmission. This also holds for a personal computer system, where compression saves memory and reduces the load on input and output channels (e.g. the data bus, address bus, etc.).

1.1 Big Data & Compression
"Big Data" is a very significant term commonly used in the area of computer technology. The volume of data is growing rapidly, and it is equally valuable for business and for scientific research. This motivates us to innovate new types of observations, measurements, and storage mechanisms. By analyzing data, we can find the patterns behind it and then design different types of decision-making applications that can be used effectively in the business, healthcare, public service, and security sectors. However, constantly increasing data size puts constant pressure on analytical platforms designed for big data analyses, such as Hadoop and Spark, which work under the limits of computational power, storage capacity, etc. Therefore, if we can effectively implement a compression technique, we can enhance the efficiency of our analytical tools and algorithms by working directly on the compressed data.

1.2 Benefits of Data Compression
Compression speeds up data transfer across the network and to and from the disk. By compressing information, we can alleviate storage capacity limitations. Performance can also be enhanced by optimizing data compression [2]. Substantial energy and power savings can be obtained by compressing data, and cost is reduced by dealing with less data. The savings potential of data compression scales better, in terms of size, speed, and power consumption of the implementation that performs the compression and decompression, than uncompressed data processing. Hence, these overheads will continue to decrease over time.

1.3 Compression Issues
Several compression methods are currently available, such as 7zip, LZW, and Snappy. However, these algorithms squeeze the data contents at the block level and disregard the order of the information. Thus, the data format and structure are jumbled, and the compressed data loses its original layout. The popular 7zip, bzip, gzip, and zip compression methods use a dictionary for their compression. Therefore, even though the compression ratio of those methods is good for very large files, carrying a dictionary adds overhead that becomes noticeable when compressing a comparatively small file.
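To illustrate this small-file overhead, the short Java check below compresses a tiny input in memory; it is only an illustrative sketch using the JDK's built-in GZIPOutputStream (not the exact tools discussed above), and the class and method names are our own.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class SmallFileOverheadDemo {
    // Gzip a byte array in memory and return the compressed size.
    static int gzippedSize(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        }
        return buffer.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] tiny = "quick brown fox".getBytes();
        System.out.println("original: " + tiny.length
                + " bytes, gzipped: " + gzippedSize(tiny) + " bytes");
        // For such a tiny input, the fixed container and coding overhead
        // typically makes the "compressed" output larger than the original.
    }
}
```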

1.4 Our Contribution
The target of this work is to design a content-based compression method for big textual data analysis on a local system. We have extended the Huffman code and replace a word with a symbol after performing OR and bitwise left-shift operations on the Huffman-encoded string, while maintaining the original properties of the input file, such as data ordering, format, structure, and punctuation.

The principle behind our implementation is to translate the words into codes while keeping the special characters unchanged. If we can substitute each word with a shorter code and keep all other regular expressions unaffected, then the compressed data structure will be completely transparent to any analytic platform, such as Hadoop. Thus, any textual data analysis program, such as sentiment analysis, semantic analysis, or PageRank calculation, can be run on the compressed data as well. Our compression algorithm is a lossless data compression scheme and is case-insensitive with respect to strings.

The organization of this paper is as follows: Section 2 describes related work on data compression techniques. Section 3 presents the implementation of our proposed compression algorithm. Section 4 illustrates the experimental results obtained by applying our algorithm and analyzes those results. In Section 5 we delineate future work that can be done as an extension of the present work.

2. RELATED WORKS
In this section, we provide a brief overview of prior work on data compression techniques for big data analysis, covering both past and ongoing research in this field.

A recent work on content-based compression for big data proposed a new approach called Content-aware Partial Compression (CaPC), which uses a dictionary-based approach and consists of two independent parts [3]. Most compression methods destroy the original data structure and format, but CaPC maintains the original data structure even after compression. This guarantees that MapReduce programs can work with CaPC-compressed data directly, without decompression. Its experimental results show a 30% average data size reduction, and on a Hadoop cluster it improves performance by up to 32% for some I/O-intensive jobs.

A recent survey studied the compression algorithms implemented in the Hadoop package [4]. It gives a brief overview of the most commonly used compression methods, LZO, GZIP, BZIP2, LZ4, and Snappy, and compares them on the Hadoop platform for very large data sets.

A decision algorithm was developed to help MapReduce users identify when and where to use compression [5]. Their experimental results show that compression can improve performance and energy efficiency for MapReduce workloads, providing 35-60% energy savings for read-heavy jobs as well as jobs with highly compressible data; for some jobs, compression improves energy savings by up to 60%. Based on their measurements, they construct an algorithm that examines per-job data characteristics and I/O patterns and determines where and when to use compression. In addition, compression for I/O performance was studied and proposed to alleviate the I/O bottleneck for big data [7, 8].

3. CBC DESIGN AND IMPLEMENTATION
3.1 CBC Compression Criteria
We were motivated by the content-aware compression technique and focused on solving its generality issue [3]. Since preserving the original data format and structure are two of our objectives, meta-characters must not be encoded and should remain unchanged during compression.

3.2 Word Count Job
When we read an input file, we first calculate the frequency of every word. These frequencies are used as the weights for building a Huffman tree. Moreover, we have also designed partial compression, in which we establish a constraint based on these frequencies. The constraint in our partial compression is as follows: "Add up the frequencies of the words, calculate the percentages, and choose what percentage of high frequency words to select for partial compression."

3.2.1. Data Structure
Our main objective is to design an algorithm that can support both small and large data sets. Thus, we need a data structure that can support an unbounded number of entries. We have used the Java TreeMap data structure for this purpose. With a TreeMap, we can efficiently store data as key/value pairs; it allows rapid retrieval with O(log n) time complexity and, unlike other data structures, guarantees that the elements are kept in ascending key order.

3.2.2. Regular Expression
When we count frequencies for a given input file, we take only the words and keep the special characters in place, as in the original form. Therefore, occurrences of a word with and without adjacent special characters are counted as the same word, and we do not count the frequency of the special characters themselves.

3.2.3. Customized Sorting
After the TreeMap has been updated with the frequency counts of all words, we perform an additional sorting step to order the entries by ascending frequency.

3.3 Huffman Tree Encoding Mechanism
Our content-based compression is based on Huffman encoding, which assigns shorter variable-length codes to higher-frequency symbols [6].

3.3.1. Data Structure
We have used a customized linked list to support large data sets during compression. We access the first two elements of the linked list to make the first two leaf nodes of the tree, then reset the head of the list past those two elements, since they have already been processed. Every combined parent node is compared with the other elements in the linked list to find its position. The built Huffman tree is stored in a hash map together with the encoded words.

3.3.2. Partial and Full Encoding Methods
To implement the partial compression technique, we sum all the frequency counts and then calculate the percentages. We have taken three threshold values, i.e. 90%, 70%, and 50%. We send these threshold values as an encoding limit to the Huffman tree algorithm, which indicates that input files can be compressed up to 90%, 70%, or 50% of the high frequency words. Apart from this variant, we can also apply full Huffman encoding and compress an entire input file.
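As a concrete illustration of the word-count job described in Section 3.2, the following minimal Java sketch counts word frequencies in a TreeMap while leaving special characters out of the counts, and then orders the entries by ascending frequency as in Sections 3.2.1-3.2.3. The class name, tokenization rule, and stripping pattern are illustrative assumptions, not the exact code used in our experiments.

```java
import java.util.*;
import java.util.stream.*;

public class WordCountJob {
    // Count how often each word occurs; punctuation and other special
    // characters are stripped before counting, so "dog" and "dog," are
    // the same word (counting is case-insensitive, as in Section 1.4).
    public static TreeMap<String, Integer> countWords(List<String> lines) {
        TreeMap<String, Integer> freq = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                String word = token.replaceAll("[^A-Za-z0-9]", "").toLowerCase();
                if (word.isEmpty()) continue;          // token was only special characters
                freq.merge(word, 1, Integer::sum);     // O(log n) update in the TreeMap
            }
        }
        return freq;
    }

    // Customized sorting: re-order the (word, count) pairs by ascending
    // frequency, which is the order the Huffman tree builder expects.
    public static List<Map.Entry<String, Integer>> sortByFrequency(TreeMap<String, Integer> freq) {
        return freq.entrySet().stream()
                   .sorted(Map.Entry.comparingByValue())
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines =
            List.of("quick brown fox quick brown fox jumped over a lazy lazy dog.");
        sortByFrequency(countWords(lines))
            .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```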
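Building on the sorted frequency list, the sketch below shows one plausible reading of the partial-encoding constraint (keep the most frequent words until they cover the chosen share of all occurrences) and a Huffman tree construction that, as in Section 3.3.1, repeatedly merges the two lowest-frequency nodes of a frequency-ordered list and records the resulting codes in a hash map. Class and method names are illustrative, and a plain LinkedList stands in for our customized linked list.

```java
import java.util.*;

public class HuffmanEncoder {
    // A node of the Huffman tree: leaves hold a word, internal nodes hold
    // the combined frequency of their two children.
    static class Node {
        final String word; final int freq; final Node left, right;
        Node(String word, int freq) { this(word, freq, null, null); }
        Node(String word, int freq, Node l, Node r) {
            this.word = word; this.freq = freq; left = l; right = r;
        }
    }

    // Keep only the most frequent words whose counts add up to the chosen
    // share of all word occurrences (e.g. 0.9 for the 90% partial mode).
    static List<Map.Entry<String, Integer>> selectHighFrequency(
            List<Map.Entry<String, Integer>> ascending, double share) {
        long total = ascending.stream().mapToInt(Map.Entry::getValue).sum();
        List<Map.Entry<String, Integer>> selected = new ArrayList<>();
        long covered = 0;
        for (int i = ascending.size() - 1; i >= 0 && covered < share * total; i--) {
            selected.add(ascending.get(i));
            covered += ascending.get(i).getValue();
        }
        return selected;
    }

    // Standard Huffman construction over the selected words: repeatedly take
    // the two lowest-frequency nodes from the front of the sorted list,
    // merge them, and re-insert the merged node in frequency order.
    static Node buildTree(List<Map.Entry<String, Integer>> words) {
        LinkedList<Node> nodes = new LinkedList<>();
        words.stream()
             .sorted(Map.Entry.comparingByValue())
             .forEach(e -> nodes.add(new Node(e.getKey(), e.getValue())));
        while (nodes.size() > 1) {
            Node a = nodes.removeFirst(), b = nodes.removeFirst();
            Node parent = new Node(null, a.freq + b.freq, a, b);
            int pos = 0;
            while (pos < nodes.size() && nodes.get(pos).freq < parent.freq) pos++;
            nodes.add(pos, parent);
        }
        return nodes.isEmpty() ? null : nodes.getFirst();
    }

    // Walk the tree once and record each word's bit string in a map,
    // as in Section 3.3.1 ("stored in a hash map with encoded words").
    static void collectCodes(Node n, String code, Map<String, String> out) {
        if (n == null) return;
        if (n.word != null) { out.put(n.word, code.isEmpty() ? "0" : code); return; }
        collectCodes(n.left, code + "0", out);
        collectCodes(n.right, code + "1", out);
    }
}
```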

3.4 Partial Compression
3.4.1. Partial Compression
If we proceed with the following example, "quick brown fox quick brown fox jumped over a lazy lazy dog," then with 90% partial Huffman encoding we get the following encoded sequence: "100 brown 01 111 01 jumped 101 110 00 00 110". The compression then proceeds in the following steps:
• First, when we obtain an encoded word, we start counting. When no further encoded word follows, we combine the collected codes using OR and bitwise left-shift operations and pad with zeros to make the result 8 bits long, the word length required by the system. When we send it to the system, the system generates a character (symbol) that is stored in the output file.
• If we obtain a non-encoded word at any position in the sentence, we write it directly into the output file. In this case, the count is zero.
• If we obtain several consecutive encoded words, we use the OR and bitwise left-shift operations to concatenate their codes up to 8 bits in length. We use a count to keep a record of the number of encoded words, which is needed later during decompression. We then send the concatenated string to the system, which generates one symbol and stores it in the file. This is where the key gain of the compression comes from: "01 111 01" is a run of encoded strings for consecutive words, yet we obtain only one symbol from our compression technique. In the example, these codes cover the span "fox quick brown fox", which takes 18 characters * 8 bits/character = 144 bits to represent in plain text; through our compression it takes only 8 bits instead of 144 bits, a total improvement of ((144-8)/144) * 100 ≈ 94%.
• If we obtain only one encoded string at the end of the sentence, we follow the first step.

After performing the partial compression, we receive `brown ªjumpedü as the partially compressed string in the compressed file.

3.4.2. Full Compression
For full compression, if we proceed with the example "Hi how are you?", the fully encoded string obtained from Huffman encoding is "01 111 00 110?". We then send this encoded string for full compression, which proceeds as follows: the encoded string is divided into 8-bit strings using the OR and bitwise left-shift operations and sent to the system, which generates symbols to represent these strings. For the above sentence, the compressed code is "y€?". The original sentence takes 15 characters * 8 bits/character = 120 bits to represent, whereas with our compression it takes only 3 characters * 8 bits/character = 24 bits, a total improvement of ((120-24)/120) * 100 = 80%. We have decided to keep the special characters in the original input structure, so we do not encode or compress any special character.

3.5 Decompression Mechanism
For decompression, after we fetch a symbol from the compressed file, we send it to the system, which returns the bit representation of that symbol. During compression, we keep track of how many words are compressed into each symbol. According to this count, we divide the bits obtained from the system, and if there are excess zeros (padding), we truncate them using AND and bitwise left-shift operations. Thus, we recover the actual encoded string. We then send the encoded string to the Huffman tree structure, which traverses from the root to a leaf to find the actual word and returns it to the system. The decompressed text for "y€?" in the example above is "Hi how are you?".

We need to take special care when decompressing partially compressed files, because we must be able to identify whether a given string is compressed or not. We have therefore extended our decompression method as follows: when the decompression routine fetches a character, it also checks the corresponding value stored in a companion structure (an array list, linked list, or other storage data structure). For a non-encoded word, the stored value is zero. So whenever we fetch zero for a given string, we can decide that the string is not encoded and therefore does not need to be sent to the Huffman tree.

Consequently, the decompressed text for the partially compressed string `brown ªjumpedü from the example above is "quick brown fox quick brown fox jumped over a lazy lazy dog."
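The following minimal Java sketch illustrates the shift-and-OR packing of Sections 3.4 and 3.5: a run of concatenated Huffman codes is padded with zeros to the 8-bit system word length and stored as one symbol, and the same count-based bookkeeping recovers the bits and the words on decompression. It assumes the concatenated codes fit in a single 8-bit symbol, omits the handling of non-encoded words, and uses a reverse lookup of the code map in place of the tree traversal described in Section 3.5; the names are illustrative.

```java
import java.util.*;

public class SymbolPacker {
    // Pack the concatenated Huffman bits of one or more consecutive encoded
    // words into a single 8-bit symbol, shifting left and OR-ing each bit in,
    // then padding with trailing zeros up to the 8-bit word length.
    static byte packToSymbol(String bits) {           // bits.length() must be <= 8
        int value = 0;
        for (char c : bits.toCharArray()) value = (value << 1) | (c == '1' ? 1 : 0);
        value <<= (8 - bits.length());                // zero padding on the right
        return (byte) value;
    }

    // Recover the bit string of a symbol and drop the padding again, given how
    // many payload bits were stored (the per-symbol count kept in Section 3.5).
    static String unpackSymbol(byte symbol, int payloadBits) {
        StringBuilder bits = new StringBuilder();
        for (int i = 7; i >= 8 - payloadBits; i--) bits.append((symbol >> i) & 1);
        return bits.toString();
    }

    // Decode a padding-free bit string back into words by walking the prefix
    // codes; 'codes' maps each encoded word to its Huffman bit string.
    static List<String> decode(String bits, Map<String, String> codes) {
        Map<String, String> reverse = new HashMap<>();
        codes.forEach((word, code) -> reverse.put(code, word));
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : bits.toCharArray()) {
            current.append(c);
            if (reverse.containsKey(current.toString())) {
                words.add(reverse.get(current.toString()));
                current.setLength(0);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "01 111 01", the three consecutive codes from the worked example in
        // Section 3.4.1, packed into one symbol and unpacked again.
        byte symbol = packToSymbol("0111101");
        System.out.println(unpackSymbol(symbol, 7));   // prints 0111101
    }
}
```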

4. RESULTS ANALYSIS AND COMPARISON
We analyzed the efficiency of the different compression methods in several ways, measuring relative time complexity, space complexity, processing time (i.e. how fast the algorithm runs), how closely the reconstruction matches the original data, and the amount of compression gained. In this work, we are mainly concerned with the last two criteria, so let us take a brief look at what those are.

A very natural way of measuring how well a compression algorithm compresses a given set of data is to look at the compression ratio, which we calculate as:

Compression Ratio = Compressed File Size / Original File Size    (1)

The compression ratio therefore expresses how much compressed data is produced relative to the original file size; a lower compression ratio indicates higher performance (for example, compressing a 1000-byte file to 370 bytes gives a ratio of 0.37). In this section, we compare the compression ratio of our compression mechanism with that of others.

4.1 Results Analysis of CBC (Partial and Full) Compression Technique with Huffman Encoding
In this work, we have focused on both the partial and full compression techniques and their performance, and we have compared our partial and full compression performance with Huffman encoding on data files of different sizes. We categorized the data files into four groups: very small, small, medium, and large, with five file sizes in each group.

When we compared the compression ratio of our CBC partial and full compression with the Huffman encoding technique, we obtained the following results. Figures 1 and 2 show that, across five size ranges of very small data files, CBC partial and full compression display a lower compression ratio than the Huffman encoding technique. We then analyzed the CBC partial and full compression techniques against the Huffman encoding technique for small files, which we again categorized into five size ranges; this is discussed below.

CBC-90% and CBC-full compression demonstrate better performance than Huffman encoding. In addition, the CBC-70% and CBC-50% partial encoding methods give a faster execution speed than the existing Huffman encoding technique, as shown in Table 1.

Figure 1. Comparison of compression ratio for very small size data files (compression ratio of Huffman, CBC-50%, CBC-70%, CBC-90%, and CBC-Full for original file sizes of 150, 300, 500, 700, and 1000 bytes).

Figure 2. Comparison of compression ratio for small size data files.

Table 1. Average compression time and ratio

Compression Method   Average Compression Time (ms)   Average Compression Ratio
CBC (Full)           21.91                           0.37
CBC (90%)            18.16                           0.45
CBC (70%)            16.21                           0.63
CBC (50%)            13.85                           0.77
Huffman              48.9                            0.97

Therefore, we can summarize that our partial and full CBC compression techniques are better than the Huffman encoding technique in both compression ratio and execution time.

4.2 Results Analysis of CBC Full Compression Technique with Other Compression Techniques
As the previous analysis shows, although both the partial and full CBC compression methods improve performance, CBC full compression performs considerably better than CBC partial compression. Thus, we decided to compare the CBC full compression scheme with other popular compression schemes.

Figure 3. Comparison of compression ratio for medium size data files.

Figure 4. Comparison of compression ratio for large size data files.

We compared the CBC full compression technique with 7zip, Bzip2, Gzip, LZW, and Tar on five different sizes of data files. The results show that, although CBC full compression does not achieve as small a compression ratio as 7zip, Bzip2, and Gzip, it still gives a low compression ratio and better performance than Tar, and performs almost the same as the LZW compression scheme.

4.3 Results Analysis for Repeated Compression of Very Large Data Files
This research focuses on implementing an algorithm that reduces file sizes for data ranging from gigabytes to petabytes. We could not obtain an optimal compressed size by applying existing compression methods such as 7zip, bzip2, etc., and they do not perform well when applied repeatedly. For example, when we compress one file with the 7zip method twice, the compressed file size remains the same, and if we then apply bzip2 to that 7zip-compressed file, the size of the existing compressed file actually increases. We therefore experimented with whether file size can be reduced further by applying CBC full compression followed by other popular compression methods.

Figure 5 shows that executing CBC full compression together with another compression method gives a much better result than either performed alone.

Figure 5. Comparison of compression ratio for repeated compression.

4.4 Analysis of Supported Compression Methods in Hadoop with CBC
There are four compression methods available in Hadoop for working on vast data sets. Three of them give good compression, but beyond storage utilization they provide no processing-time or execution-speed benefit until the data is decompressed.

Table 2 shows the size reductions for the performance analysis done in Hadoop, including a 52.4% reduction for CBC.

Table 2. Compression result of Map-Reduce program in Hadoop

Compression Technique   File Size Reduction   Can Execute Map-Reduce Program on Hadoop
7zip                    77%                   No
Bzip2                   77.6%                 No
Gzip                    67.6%                 No
CaPC                    30%                   Yes
CBC                     52.4%                 Yes

5. CONCLUSION AND FUTURE WORKS
In our compression method, we targeted text data because of how frequently it is used in data analysis. Thus, we ran our experiments only on text-based, unstructured, real-world raw data sets and limited our work to a textual compression technique. In the future, we can extend our work to application-specific image, audio, or video data compression.

Generally, textual data does not contain many special characters that would require a large number of bits and therefore need to be compressed. We did not encode any specific special character in our compression method in order to preserve the format, so the data structure is not destroyed. However, we may utilize special characters for compression in future work on files such as XML. In the future, we can also focus on reducing compression and decompression time by optimizing our algorithm and program structure.

Until now, compression techniques have been widely and effectively used for memory utilization and for reducing transmission time, and many compression techniques give the best performance in both respects. However, compression and decompression time, and the subsequent analysis, add an overhead for big data files, which is a concern. In our study, we have designed a compression methodology that takes care of all three dimensions, memory space, transmission time, and analysis time, at the same time. Our proposed model is limited to compressing and analyzing text-based data, and it sacrifices compression computation time to achieve better compression performance and direct access to the compressed data. As each frequent word is replaced by a symbol in our compression, the code does not contain or match any internal information about that word; hence, it also supports high-level data abstraction. In conclusion, we obtained extremely good compression performance for small to medium sized files and moderate performance for large data files.

6. REFERENCES
[1] Lelewer, Debra A. and D. S. Hirschberg. 1987. "Data Compression." ACM Computing Surveys 19, 3 (September 1987): 261-296. doi:10.1145/45072.45074. http://dl.acm.org/citation.cfm?id=45074.
[2] Thirunavukarasu, B., V. M. Sudhahar, U. VasanthaKumar, T. Kalaikumaran, and S. Karthik. 2014. "Compressed Data Transmission Among Nodes in BigData." American Journal of Engineering Research (AJER) 3, 6 (2014): 209-212. e-ISSN: 2320-0847, p-ISSN: 2320-0936.
[3] Dong, Dapeng and J. Herbert. 2014. "Content-aware Partial Compression for Big Textual Data Analysis Acceleration." 2014 IEEE 6th International Conference on Cloud Computing Technology and Science, Singapore (2014): 320-325. doi:10.1109/CloudCom.2014.76.
[4] Lovalekar, Sampada. 2014. "A Survey on Compression Algorithms in Hadoop." International Journal on Recent and Innovation Trends in Computing and Communication 2, 3 (2014): 479-482. ISSN: 2321-8169.
[5] Chen, Yanpei, A. S. Ganapathi, and R. H. Katz. 2010. "To Compress or Not To Compress - Compute vs. IO Tradeoffs for MapReduce Energy Efficiency." Proceedings of the First ACM SIGCOMM Workshop on Green Networking (2010): 23-28. Technical Report No. UCB/EECS-2010-36. doi:10.1145/1851290.1851296.
[6] Huffman, David. 1952. "A Method for the Construction of Minimum-Redundancy Codes." Proceedings of the IRE 40, 9 (1952): 1098-1101. doi:10.1109/JRPROC.1952.273898.
[7] Xue, Zhenghua, J. Li, Y. Zhang, G. Shen, Q. Xu, and J. Shao. 2012. "Compression-Aware I/O Performance Analysis for Big Data Clustering." Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications - BigMine '12 (2012): 45-52.
[8] Zou, Hongbo, Y. Yu, W. Tang, and H. M. Chen. 2014. "Improving I/O Performance with Adaptive Data Compression for Big Data Applications." 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (2014): 1228-1237. doi:10.1109/IPDPSW.2014.138.

