
2013 10th Web Information System and Application Conference

Research on Improved Apriori Algorithm Based on Coding and MapReduce

Jian Guo
School of Computer and Information Technology
Liaoning Normal University
Dalian, China
e-mail: guojianyyc@hotmail.com

Yong-gong Ren
School of Computer and Information Technology
Liaoning Normal University
Dalian, China
e-mail: ryg@lnnu.edu.cn

978-1-4799-3219-1/13 $31.00; 978-0-7695-5134-0/13 $26.00 © 2013 IEEE
DOI 10.1109/WISA.2013.62

Abstract: Using the column-oriented database HBase, with the Hadoop distributed file system HDFS as the underlying storage system and the Map/Reduce programming model as the distributed data processing engine, this paper proposes an improved Apriori algorithm based on coding and Map/Reduce (CMR-Apriori). The algorithm processes data in a distributed cloud computing environment and is applied to a book sales system. The results show that the system provides fast analysis and low redundancy, and performs well in terms of interactivity, scalability and reliability.

Keywords-cloud computing; Hadoop; HBase; Apriori algorithm; book sales

I. INTRODUCTION

Online bookstore platforms have gradually become more intelligent and individualized, and book recommendation is one of their essential services. Association rule mining is an important method of data analysis in data mining. Its main idea is that if the values of two or more data items co-occur with high probability in the transaction data set, an association is believed to exist between those items and can be established among them [1-3]. According to statistics released in 2011, the total amount of global data doubles every two years and is expected to reach a staggering 35 trillion GB by 2020. To tackle the defects of the traditional Apriori algorithm for association rules, researchers have proposed many parallel association rule mining algorithms that effectively improve its efficiency [1-5]. However, several defects remain to be addressed. In this paper, we use the Map/Reduce framework and an encoding operation to address them. We propose the CMR-Apriori algorithm to identify the corresponding rules in the knowledge model quickly and accurately. The improved CMR-Apriori algorithm is applied to large amounts of data stored in a distributed HBase, which is the input source of Map/Reduce. The resulting book recommendation service runs, schedules and tracks tasks, provides association and query functions to clients, and tests data mining on Hadoop.

II. RELATED WORK

In recent years cloud computing technology [6-7] has become very popular. In fact, it is not something new but has developed from distributed computing, grid computing, parallel computing and virtualization technology. Cloud computing has received increasing attention as an emerging field. It delivers its services through a huge platform. These services include Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).

Figure 1. Cloud computing architecture. [Three service layers flanked by provisioning and security. SaaS: blog management system, content management system, ERP, Matlab, office software (Google Doc, MS Office), Web QQ, forum management system. PaaS: network computing software, parallel computing software, load balancing software, database, Google App Engine, MS Azure, Web service. IaaS: virtualization technology, operating system, virtual and physical machine management technology, storage devices, network equipment.]

A. Definition 1 Hadoop Platform

Hadoop [8-9] is an open-source platform modeled on the design of Google's GFS, and it is currently one of the most popular cloud computing platforms. A Linux shell, the SSH tool and several other Linux components are required to run it. The core of Hadoop comprises Map/Reduce and the Hadoop Distributed File System (HDFS), together with other subprojects such as Common and HBase. HDFS is the underlying layer and is responsible for storing the files of the entire Hadoop cluster. Application data and metadata are stored separately: the metadata are stored on the NameNode, which serves as the index server, and the application data are stored on the DataNodes. The default HDFS block size is 64 MB, and a file smaller than one block does not occupy the entire block. HDFS is suited to write-once, read-many access. It splits each file into blocks and copies these blocks to multiple hosts, which differs from the RAID architecture. Hadoop offers sound reliability, scalability, efficiency and high fault tolerance. Before using the Hadoop file system, formatting the file system
is required before Hadoop is started with the start-all.sh command. Hadoop is now widely used. For example, Baidu mines web data and analyses search logs with it, Taobao uses it to store and process e-commerce transactions, and Yahoo handles more than 5 PB of web data by operating more than 10,000 Hadoop virtual machines on 2,000 nodes.

Figure 2. HDFS architecture. [Diagram labels: Client, NameNode (metadata), DataNode, file access request, file storage location, control commands, status commands, file data block, copy, data block.]

B. Definition 2 Map/Reduce programming model

Map/Reduce is a parallel programming model and software architecture designed by Google [10-13]. It takes the map and reduce operations from functional programming languages. It lets programmers define data structures and process large-scale data sets in parallel even when they are not familiar with distributed parallel programming. A Map/Reduce job executes as follows:

1) Map stage: When the job begins to run, the user-specified map function converts each input record into intermediate key-value pairs. At the same time a partitioner (usually HashPartitioner) hashes each output key to determine which reduce task, and therefore which partition, receives the record.

2) Shuffle stage: The framework sorts the map output by key and delivers it to the reduce tasks, so that each reduce task receives its input in key order. This transfer is called the shuffle. The map-side output can be compressed, and it is transferred to the reduce side over HTTP, where memory is allocated to buffer it.

3) Reduce stage: After receiving the map-side data, the reduce function merges the values that share the same key and writes the resulting key-value pairs to HDFS as the final output. By default the output files in HDFS are named part-00000 to part-0000N. In short, Map/Reduce is “the decomposition of the task and the summary of the results”.

Figure 3. Map/Reduce task processing flow. [Input data, data division into data fragments, map: (key, value), shuffle: (key, value_list), reduce: (key, value), output data.]

C. Definition 3 HBase Storage System

HBase is the open-source database developed for Hadoop. It is column-oriented, distributed, sparse, sorted and multi-dimensional, and it is primarily used for real-time reads and writes and random access to very large tables. Similar to starting Hadoop, the start-hbase.sh command starts a standalone HBase instance (HBase uses the /tmp/hbase-$USERID directory by default), and the HBase shell command opens the shell used to manage that instance. HBase depends on the distributed lock service ZooKeeper, which mainly stores the starting location of the HBase data. ZooKeeper locates the HRegion server and finishes once the task is completed. It also controls storage access and records the column-family information of HBase tables.

III. IMPLEMENTATION OF THE CMR-APRIORI ALGORITHM UNDER CLOUD

The main task of association rule mining is finding frequent itemsets. The Apriori algorithm uses prior knowledge of the properties of frequent itemsets to enumerate all frequent itemsets in the data set through a layer-by-layer iterative procedure. The traditional Apriori algorithm needs to scan the database repeatedly and generates a large number of candidate itemsets. This not only greatly increases the I/O overhead but also places heavy demands on running time and main memory.

A cloud computing environment is distributed by nature and supports the parallel execution of algorithms, thereby improving mining efficiency. This paper presents a new distributed CMR-Apriori algorithm based on the Hadoop cluster framework in the cloud computing environment.

A. The basic idea of the algorithm

The CMR-Apriori algorithm builds on the traditional Apriori algorithm and combines it with a parallel implementation under the Map/Reduce programming model and an associated encoding operation. With two Map/Reduce passes, CMR-Apriori greatly reduces the running time while remaining efficient and accurate.

B. Description of the specific process of the algorithm

The implementation steps of the algorithm are as follows:

1) Obtain frequent 1-itemset: The data to be processed are copied to HBase. HBase already sorts the data when they are stored, and users can configure HBase storage according to their query requirements. This saves the sorting time at query time and improves efficiency. The HBase data are divided according to the user's query and transferred to different nodes. Each node runs Map and Reduce under the Hadoop cluster framework, so the frequent 1-itemsets are easily obtained.

2) Code the processed items: Scan the database and delete the items which do not meet the minimum support count
based on the user-defined support. Encode the remaining items according to the transaction records. Let the transaction set be $T = \{T_1, T_2, \ldots, T_m\}$ and the item set be $I = \{I_1, I_2, \ldots, I_n\}$. For any given transaction database $D$, define a mapping $f$ with $f(D) = (r_{ij})$, where $r_{ij}$ is defined as:

\[
r_{ij} = \begin{cases} 1, & I_j \in T_i \\ 0, & I_j \notin T_i \end{cases}
\]

with $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$.

The encoded feature subset is stored as a relational database named $DB'$, which is composed of tuples (TID, itemset). Each sample corresponds to a record in $DB'$, and each component of a sample corresponds to an attribute in $DB'$. There are $m$ transaction records and $n$ items. One scan yields the following database:

\[
\begin{array}{c|cccc}
\text{TID} & I_1 & I_2 & \cdots & I_n \\ \hline
T_1 & r_{11} & r_{12} & \cdots & r_{1n} \\
T_2 & r_{21} & r_{22} & \cdots & r_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
T_m & r_{m1} & r_{m2} & \cdots & r_{mn}
\end{array}
\]

Here $r_{ij}$ equals 1 or 0, which means that the $i$-th transaction contains or does not contain the $j$-th item, respectively.

Algorithm 1 Code the processed items

    TEMP = Count_User_Same_Data();
    // First obtain the number of the same data
    if (TEMP < minsupport)
        Delete_This_Columns();
    else
        Make_Code_Columns();
    // Judge the number of the same data:
    // if it is less than the minimum support, delete this data column directly;
    // otherwise encode the data column

This method has good parallelism and scalability. It overcomes the shortcoming of the Apriori algorithm, which needs to scan the database many times.

3) Map operations: Divide the encoded database into M data subsets, where M depends on the number of nodes in the Hadoop platform. Map scans each input purchase record and then cuts and divides it by columns on each node. Let Max be the largest column number, and let k (1 ≤ k ≤ Max) index the columns.

Algorithm 2 Map operations

    Map(Chart, I_{k-1})
    // Chart means transaction identifier
    // I_{k-1} means the corresponding value (I_{k-1} = 0, 1)
    {
        Scan(Hb);
        for each k from 1 to Max
            Dividedinto_Every_Col(Chart, I_{k-1});
    }
    // In the map function, scan the database first.
    // Then divide the line-items of the records from the first to the last

4) Reduce operations: Reduce obtains each item column from the Map output and then performs an "and" operation among the columns of the data subsets. The Reduce function counts the number of '1's to determine whether it is greater than or equal to the minimum support, and repeats until the pending itemset is empty. Let Max_col be the maximum column number, let p (1 ≤ p ≤ Max_col) index the columns and let q (1 ≤ q ≤ Max_row) index the rows. Also, let N denote the number of nodes.

Algorithm 3 Reduce operations

    Reduce(Sign, p)
    // Sign to mark
    // Ip-1 is a value corresponding to
    for each t from 1 to N
        for each p from 1 to Max_col
            for each q from 1 to Max_row
                I_{p-1} = GET_map_Context(Sign, q);
                // Get Map data
        count = (I_0 ∧ I_1 ∧ … ∧ I_{k-1});
        num_t = separate_1_num(count);
        // Count the number of "1" after the "And" operation
    for each t from 1 to N
        All_num += num_t;
    // Count the number of "1" in all nodes
    if (All_num ≥ min_sup)
        // Compare the number of "1" over the nodes with the given minimum support
        return L_{k-1};
    else
        Delete_this_Item();
        // Delete the items which do not meet the requirements
    Then L = L_1 ∪ L_2 ∪ L_3 ∪ … ∪ L_{p-1};

5) Calculate the degree of confidence, and ultimately obtain the association rules which meet the requirements.

IV. DESIGN OF BOOK SALES SYSTEM BASED ON CLOUD COMPUTING

The book sales system runs on the latest open-source Hadoop platform. Fig. 4 shows its overall system architecture. The functions of each layer are described below.
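As a compact illustration of steps 2) to 5) in Section III, the following Python sketch encodes each item as a bit column, lets each hypothetical node AND its local columns and count the 1 bits (the Map side), and sums the per-node counts against the minimum support (the Reduce side). It is a single-process simulation under stated assumptions, not the authors' Hadoop implementation: the function names, the round-robin split of transactions across nodes and the sample records are all illustrative.

```python
from functools import reduce
from itertools import combinations
from operator import and_


def encode_columns(transactions):
    """Encode transactions as one integer bitmap per item.

    Bit i of columns[item] is 1 when the i-th local transaction
    contains the item (the r_ij matrix, stored column by column).
    """
    columns = {}
    for row, items in enumerate(transactions):
        for item in items:
            columns[item] = columns.get(item, 0) | (1 << row)
    return columns


def map_count(columns, itemset):
    """Map side (one node): AND the item columns, count the 1 bits."""
    bits = reduce(and_, (columns.get(item, 0) for item in itemset))
    return bin(bits).count("1")


def reduce_support(encoded_parts, itemset):
    """Reduce side: add up the per-node counts for one itemset."""
    return sum(map_count(columns, itemset) for columns in encoded_parts)


def frequent_itemsets(transactions, min_support, nodes=2):
    """Level-wise search; returns {itemset tuple: support count}."""
    # Assumption: transactions are split round-robin over the nodes.
    parts = [transactions[i::nodes] for i in range(nodes)]
    encoded = [encode_columns(part) for part in parts]

    items = sorted({item for t in transactions for item in t})
    candidates = [(item,) for item in items]
    frequent = {}
    k = 1
    while candidates:
        level = {}
        for itemset in candidates:
            count = reduce_support(encoded, itemset)
            if count >= min_support:  # otherwise the itemset is deleted
                level[itemset] = count
        frequent.update(level)
        k += 1
        alive = sorted({item for itemset in level for item in itemset})
        # Keep a k-candidate only if all its (k-1)-subsets are frequent.
        candidates = [c for c in combinations(alive, k)
                      if all(s in level for s in combinations(c, k - 1))]
    return frequent


def association_rules(frequent, min_confidence):
    """Step 5: confidence(X -> Y) = support(X and Y) / support(X)."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                rhs = tuple(i for i in itemset if i not in lhs)
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, rhs, confidence))
    return rules


# Hypothetical purchase records (the book identifiers are made up).
records = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
itemsets = frequent_itemsets(records, min_support=3, nodes=2)
rules = association_rules(itemsets, min_confidence=0.7)
```

Because each node only counts bits in its own partition and the counts are then added, the result does not depend on how many nodes the transactions are split across, which is the property the parallel Reduce step relies on.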
Figure 4. Architecture of the book sales system. [Layers from top to bottom. Application service: book recommendation algorithm based on CMR-Apriori, tasks 1 to n. Data access: index (add, delete, query). Data cache: temporary cache data, frequently called cache. Parallel computing and scheduling: Map/Reduce distributed parallel programming model, compute nodes 1 to n. Data storage: open-source database HBase (ZooKeeper) on HDFS, storage nodes 1 to n. Arrow labels: task submission, returned results, data issued and reported, a large number of data blocks.]

1) Application service layer: Retrieves and recommends books based on users' purchase records. Once the user has given the support, it searches for and recommends strongly correlated books with the CMR-Apriori algorithm and outputs the search results.

2) Data access layer: Supports the retrieval of book information for the upper layer, including reading, storing, adding, deleting and other related operations on the characteristic data of book information.

3) Data caching layer: Holds the characteristic data of book information in the cache to reduce the load and improve the performance of data reading.

4) Parallel computing and scheduling layer: Runs the map and reduce processes over the large amount of data in the cluster.

5) Data storage layer: Reads and stores data quickly by exploiting the column-oriented storage of HBase, and builds the characteristic data of book information on HDFS.

V. IMPLEMENTATION AND ANALYSIS OF CMR-APRIORI ALGORITHM

The system runs on a Hadoop-0.20.2 cluster of five machines: the NameNode (JobTracker) machine da21 and the DataNode (TaskTracker) machines da1, da2, da22 and da23. The operating system is the open-source Ubuntu, the CPU is a Core dual-core processor, the memory capacity is 1 GB and the hard drive capacity is 250 GB. The data sets are simulated from information of the Liaoning Normal University Library (about 100,000 transaction records).

Figure 5. The project on book sales system.

As shown in Fig. 5, the system is developed in the Eclipse environment with the Hadoop plug-in, through which the HDFS file system is debugged. There are two projects. The "cloud" project contains the source of the cloud platform. Its JRE System Library is the Java runtime library that supports the Java Virtual Machine, and hadoop-0.20.2-core.jar and HBase-0.90.2.jar are imported into it. The "Engineering Liaoning Normal University Online Book sales system" project is the upper-layer implementation of the book sales system and mainly includes the HTTP-based web services, Java resource files and so on. The application interacts with users through JSP. The algorithm code runs on the server named da21, and users enter the system through port 8080 of da21. As shown in Fig. 6, the initialized data are displayed and, after the support is set, the frequent itemsets are returned.

Figure 6. System functions.

To extend the practical use of the system, a client for the cloud book sales system based on the official version of Android 2.3 is added, as shown in Fig. 7. Users can log in to the cloud servers and purchase books whenever they want.
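The application service layer described in Section IV recommends strongly correlated books from a user's purchase record. A minimal sketch of that lookup follows, assuming the mined rules are available as (antecedent, consequent, confidence) triples. The function name `recommend`, the rule format and the book identifiers are hypothetical and are not taken from the system itself.

```python
def recommend(purchased, rules, top_n=3):
    """Suggest up to top_n books for one purchase record.

    A rule fires when its whole antecedent is in the purchase record.
    Each suggested book keeps the highest confidence among the rules
    that propose it, and books already purchased are skipped.
    """
    purchased = set(purchased)
    best = {}
    for antecedent, consequent, confidence in rules:
        if set(antecedent) <= purchased:
            for book in consequent:
                if book not in purchased:
                    best[book] = max(confidence, best.get(book, 0.0))
    # Highest confidence first; ties are broken by book identifier.
    ranked = sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))
    return [book for book, _ in ranked[:top_n]]


# Hypothetical rules and purchase records.
sample_rules = [
    (("book_a",), ("book_b",), 0.8),
    (("book_a", "book_c"), ("book_d",), 0.9),
    (("book_e",), ("book_a",), 0.6),
]
suggestions = recommend({"book_a", "book_c"}, sample_rules)
```

Ranking by confidence is one simple design choice. A deployed service could equally rank by support or combine both measures.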
Figure 7. The client of cloud book sales system.

To demonstrate the efficiency and accuracy of the book sales system, two performance evaluations are provided as follows:

Evaluation 1: Compare the execution time of the original and the improved Apriori algorithm with the same number of transactions.

Figure 8. The comparison of three algorithms. [Plot: running time (s) versus transaction number (×1000) for the CMR-Apriori algorithm, the Apriori algorithm with parallel processing, and the traditional Apriori algorithm.]

Evaluation 2: Observe the performance of the CMR-Apriori algorithm when the number of calculation nodes increases (from 1 to 5).

Figure 9. Comparison of running time with different number of nodes. [Plot: running time (s) versus the number of nodes.]

Fig. 8 compares the execution time of the three algorithms (the CMR-Apriori algorithm, the Apriori algorithm with parallel processing, and the traditional Apriori algorithm) for different numbers of transaction records. The original Apriori algorithm performs worst, and the version with parallel processing shows a slight improvement in efficiency. The CMR-Apriori algorithm, however, significantly outperforms the others for the same number of processed transaction records.

Furthermore, Fig. 9 shows that the efficiency of parallel processing improves as the number of nodes in the cluster increases. Meanwhile, the slope of the curve becomes smaller and smaller. Hence, for the same number of transactions, the running time becomes stable once the number of nodes reaches a certain value.

VI. CONCLUSION

Hadoop, one of the most popular cloud computing platforms in recent years, is a hotspot in the IT field. This paper introduces background knowledge of cloud computing and Hadoop, and then analyses the traditional mining algorithm Apriori with the Map/Reduce programming framework and the open-source distributed database HBase. Further, we give details of the proposed CMR-Apriori algorithm and apply it to a book recommendation service. Finally, we provide performance evaluations which demonstrate that CMR-Apriori significantly outperforms the traditional Apriori association rule mining algorithm in the book recommendation service model. It is able to provide customers with a more convenient and efficient personalized service. Nevertheless, some deficiencies remain in our experiment, such as the handling of the NameNode single point of failure and the NameNode memory ceiling, which will be the focus of our future study. As we are confident in the prospects of Hadoop applications, more effort should be directed toward exploring existing resources to achieve continuous improvement.

ACKNOWLEDGMENT

The authors would like to thank the Science and Technology Plan Projects of Liaoning Province (Grant No. 2012232001), the Science and Technology Plan Projects of Dalian (Grant No. 2013A16GX116) and the Natural Science Foundation of Liaoning Province (Grant No. 201202119).

REFERENCES

[1] M. Riondato, J. A. DeBrabant, R. Fonseca, and E. Upfal, “PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce,” In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM), 2012, pp. 85-94.
[2] R. Agrawal and J. C. Shafer, “Parallel mining of association rules,” In: IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 962-969.
[3] T. Shintani and M. Kitsuregawa, “Hash based parallel algorithms for mining association rules,” In: Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems, 1996, pp. 19-31.
[4] K. W. Lin and D. J. Deng, “A novel parallel algorithm for frequent pattern mining with privacy preserved in cloud computing environments,” In: Int. J. Ad Hoc and Ubiquitous Computing, 2010, pp. 205-215.
[5] L. Li and M. Zhang, “The strategy of mining association rule based on cloud computing,” In: International Conference on Business Computing and Global Informatization (BCGIN), 2011, pp. 475-478.
[6] J. W. Huang, S. C. Lin, and M. S. Chen, “DPSP: distributed progressive sequential pattern mining on the cloud,” In: Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2010, pp. 27-34.
[7] Z. Wu, J. Cao, and C. Fang, “Data cloud for distributed data mining via pipelined MapReduce,” In: Proceedings of the 7th International Workshop on Agents and Data Mining Interaction (ADMI), 2011, pp. 316-330.
[8] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc, USA, Yahoo Press 2010.
[9] Y. Lai and S. ZhongZhi, “An efficient data mining framework on Hadoop using java persistence API,” In: IEEE 10th International Conference on Computer and Information Technology (CIT), 2010, pp. 203-209.
[10] H. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker, “Map-Reduce-Merge: Simplified relational data processing on large clusters,” In: Proceedings of SIGMOD, 2007, pp. 1029-1040.
[11] J. Dean and S. Ghemawat, “Map/Reduce: simplified data processing on large clusters,” In: Communications of the ACM, 2008, 51(1): 107-113.
[12] J. Dean and S. Ghemawat, “MapReduce: a flexible data processing tool,” In: Communications of the ACM, 2010, 53(1): 72-77.
[13] D. Wegener, M. Mock, D. Adranale, and S. Wrobel, “Toolkit-based high-performance data mining of large data on MapReduce clusters,” In: IEEE International Conference on Data Mining Workshops (ICDMW), 2009, pp. 296-301.
[14] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, “PFP: parallel FP-Growth for query recommendation,” In: Proceedings of the 2008 ACM Conference on Recommender Systems, 2008, pp. 107-114.
[15] Q. He, F. Zhuang, J. Li, and Z. Shi, “Parallel implementation of classification algorithms based on MapReduce,” In: Proceedings of the 5th International Conference on Rough Set and Knowledge Technology (RSKT), 2010, pp. 655-662.
[16] X. Qin, H. Wang, X. Du, and S. Wang, “Big data analysis: competition and symbiosis of RDBMS and MapReduce,” In: Journal of Software, 2012, 23(1): 32-45.