Advanced Data Mining and Applications: 10th International Conference, ADMA 2014, Guilin, China, December 19–21, 2014, Proceedings. 1st Edition. Xudong Luo
Xudong Luo
Jeffrey Xu Yu
Zhi Li (Eds.)
Lecture Notes in Artificial Intelligence (LNAI) 8933
Volume Editors
Xudong Luo
Sun Yat-sen University
Guangzhou, P.R. China
E-mail: luoxd3@mail.sysu.edu.cn
Jeffrey Xu Yu
Chinese University of Hong Kong
Hong Kong, SAR China
E-mail: yu@se.cuhk.edu.hk
Zhi Li
Guangxi Normal University
Guilin, P.R. China
E-mail: zhili@gxnu.edu.cn
This volume contains the papers presented at ADMA 2014: the 10th Interna-
tional Conference on Advanced Data Mining and Applications held during De-
cember 19–21, 2014, in Guilin.
There were 90 submissions. Each submission was reviewed by up to four Program Committee (PC) members and additional reviewers, 2.9 on average. The committee decided to accept 48 main conference papers plus ten workshop papers.
Since its start in 2005, ADMA has been an important forum on theoretical
and application aspects of data mining and knowledge discovery in the world.
The previous ADMA conferences were held in Wuhan (2005), Xi’an (2006),
Harbin (2007), Chengdu (2008), Beijing (2009 and 2011), Chongqing (2010),
Nanjing (2012), and Hangzhou (2013).
Program co-chairs Xudong Luo and Jeffrey Xu Yu put in a lot of effort
for the difficult and highly competitive selection of research papers. ADMA
2014 attracted 90 papers from 18 countries and regions. Honorary Chair Ruqian
Lu, local Organization Chair Zhi Li, and Steering Committee Chair Xue Li all
contributed significantly to the paper review process. We thank Hui Xiong and
Osmar R. Zaiane for their keynotes, which were the highlights of the conference.
Many people worked hard to make ADMA 2014 successful. The conference and banquet venues were managed by local Organization Chair Zhi Li; Publicity Chair Xianquan Zhang promoted the conference; and Treasurer Guoquan Chen played an important role in the financial management.
Unlike the previous nine ADMA conferences, we held two new activities. One was to grant the best paper award to three submissions to ADMA 2014. The other was to grant a 10-year best paper award to three papers published at previous ADMA conferences. These papers were selected by Steering Committee Chair Xue Li, PC Co-chairs Xudong Luo and Jeffrey Xu Yu, and the PC chairs of past conferences. We thank the Guangxi Key Lab of Multiple Data Mining and Safety at the College of Computer Science and Information Technology, Guangxi Normal University, for offering the honorarium for these best papers.
This year, in addition to the conference, we arranged for selected papers to be published in a special issue of the renowned international journal Knowledge-Based Systems, coordinated by PC Co-chairs Xudong Luo and Jeffrey Xu Yu and Steering Committee Chair Xue Li. The authors of the top 11 papers were invited to extend their contributions for possible publication in Knowledge-Based Systems.
Program Committee
Liang Chen Zhejiang University, China
Shu-Ching Chen Florida International University, USA
Zhaohong Deng Jiangnan University, China
Xiangjun Dong Shandong Polytechnic University, China
Stefano Ferilli University of Bari, Italy
Philippe Fournier-Viger Centre Universitaire De Moncton, Canada
Yanglan Gan Donghua University, China
Yang Gao Nanjing University, China
Gongde Guo Fujian Normal University, China
Manish Gupta University of Illinois at Urbana-Champaign, USA
Sung Ho Ha Kyungpook National University, Korea
Qing He University of Chinese Academy of Sciences, China
Wei Hu Nanjing University, China
Daisuke Ikeda Kyushu University, Japan
Xin Jin University of Illinois at Urbana-Champaign, USA
Daisuke Kawahara Kyoto University, Japan
Sanghyuk Lee Xi'an Jiaotong-Liverpool University, China
Gang Li Deakin University, Australia
Jinjiu Li University of Technology Sydney, Australia
Xuelong Li Chinese Academy of Sciences, China
Zhi Li Guangxi Normal University, China
Zhixin Li Guangxi Normal University, China
Bo Liu Guangdong University of Technology, China
Guohua Liu Yanshan University, China
Yubao Liu Sun Yat-sen University, China
Xudong Luo Sun Yat-sen University, China
Marco Maggini University of Siena, Italy
Tomonari Masada Nagasaki University, Japan
Toshiro Minami Kyushu University, Japan
Yasuhiko Morimoto Hiroshima University, Japan
Feiping Nie University of Texas at Arlington, USA
Shirui Pan University of Technology Sydney, Australia
Tieyun Qian Wuhan University, China
Faizah Shaari Polytechnic Sultan Salahuddin Abdul Aziz Shah, Malaysia
Additional Reviewers
Bu, Zhan; Gomariz Peñalver, Antonio; He, Qing; Hlosta, Martin; Kawakubo, Hideko; Li, Fangfang; Li, Xin; Noh, Yung-Kyun
Data Mining
A New Improved Apriori Algorithm Based on Compression Matrix ..... 1
Taoshen Li and Dan Luo

Recommend System

A Personalized Travel System Based on Crowdsourcing Model ..... 163
Yi Zhuang, Fei Zhuge, Dickson K.W. Chiu, Chunhua Ju, and Bo Jiang

Database

Mining Continuous Activity Patterns from Animal Trajectory Data ..... 239
Yuwei Wang, Ze Luo, Baoping Yan, John Takekawa, Diann Prosser, and Scott Newman

Dimensionality Reduction

A New Filter Approach Based on Generalized Data Field ..... 319
Long Zhao, Shuliang Wang, and Yi Lin

Map Matching for Taxi GPS Data with Extreme Learning Machine ..... 447
Huiqin Li and Gang Wu

Classification

Constructing Decision Trees for Unstructured Data ..... 475
Shucheng Gong and Hongyan Liu

Clustering Methods

A Huffman Tree-Based Algorithm for Clustering Documents ..... 630
Yaqiong Liu, Yuzhuo Wen, Dingrong Yuan, and Yuwei Cuan

Machine Learning

Discernibility Matrix Enriching and Boolean-And Algorithm for Attributes Reduction ..... 667
ZhangYan Xu, Ting Wang, JinHu Zhu, and XiaFei Zhang
Abstract. Existing Apriori algorithms based on matrices still suffer from two problems: the candidate itemsets are too large, and the matrix takes up too much memory. To solve these problems, an improved Apriori algorithm based on a compression matrix is proposed. The algorithm improves on its predecessors in four ways: (1) it reduces the number of matrix scans during compression by adding two arrays that record the counts of 1s in each row and column; (2) it minimizes the size of the matrix and improves space utilization by deleting, while compressing the matrix, the itemsets that cannot be joined and the infrequent itemsets; (3) it decreases errors in the mining result by changing the condition for deleting unnecessary transaction columns; (4) it reduces the number of iterations by changing the stopping condition of the program. Instance analysis and experimental results show that the proposed algorithm accurately and efficiently mines all frequent itemsets in a transaction database and improves the efficiency of mining association rules.
1 Introduction
X. Luo, J.X. Yu, and Z. Li (Eds.): ADMA 2014, LNAI 8933, pp. 1–15, 2014.
© Springer International Publishing Switzerland 2014
sets, and generates huge candidate itemsets. This requires heavy I/O load and wastes running time and memory. To address these problems with the Apriori algorithm, several improved algorithms have been proposed, such as the FP-Growth algorithm [2], the hash-based algorithm [3], the partitioning-based algorithm [4], the sampling-based algorithm [5], the dynamic itemset counting-based algorithm [6], and the scalable algorithm [7]. Among these improved algorithms, some are too complicated to be practical. Although some remedy the shortcomings of the Apriori algorithm, they may introduce other problems that increase running cost and reduce operating efficiency.
In [8], Huang et al. first proposed the Apriori algorithm based on a matrix, called the BitMatrix algorithm. The algorithm first converts the transaction database into a Boolean matrix, and then uses the "AND" operation to compute candidate support counts and find the frequent itemsets. It finds all frequent 1-itemsets in the first scan and needs to scan the database only once; all subsequent operations are carried out on the matrix. In recent years, many researchers have improved the matrix-based Apriori algorithm. Yuan and Huang [9] propose a matrix algorithm for efficiently generating large frequent candidate sets. Their algorithm first generates a matrix with entries 1 or 0 by passing over the database only once, and then obtains the frequent candidate sets from the resulting matrix. Zhang et al. [10] present a fast algorithm for mining association rules based on a Boolean matrix; it is based on the concept lattice and can certify all association rules with only one scan of the database. Wang and Li [11] put forward an improved Apriori algorithm based on the matrix. Their algorithm first uses the matrix to find the frequent k-itemsets with the most items, finds all frequent sub-itemsets of these frequent k-itemsets, and deletes those sub-itemsets from the candidate itemsets to avoid duplicated scans. It then finds the remaining frequent itemsets from frequent 2-itemsets up to frequent (k-1)-itemsets. For smaller supports this algorithm performs better, but its computation of the maximal frequent itemsets is too complex.
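The BitMatrix idea described above can be sketched briefly in Python. This is a minimal illustration, not the paper's implementation; the transactions and item names are made up for the example.

```python
# Sketch of the BitMatrix approach: convert a small transaction database
# into a Boolean matrix (rows = items, columns = transactions), then count
# an itemset's support with a column-wise AND. Data here is illustrative.

transactions = [
    {"I1", "I2"},
    {"I1", "I3"},
    {"I2", "I3"},
    {"I1", "I2", "I3"},
]
items = ["I1", "I2", "I3"]

# Entry is 1 if the item occurs in the transaction, 0 otherwise.
matrix = {it: [1 if it in t else 0 for t in transactions] for it in items}

def support(itemset):
    """Support count of an itemset via bitwise AND of its item rows."""
    combined = [1] * len(transactions)
    for it in itemset:
        combined = [a & b for a, b in zip(combined, matrix[it])]
    return sum(combined)

print(support({"I1", "I2"}))  # transactions 0 and 3 contain both -> 2
```

The database is scanned once to build the matrix; every later support count is a vector operation on the matrix rows.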
To adapt to different mining needs, researchers have further improved the matrix-based Apriori algorithm. In [12], a quantitative association rule mining algorithm based on a matrix is presented. It first transforms the quantitative database into a Boolean matrix, and then uses the Boolean "AND" operation on matrix vectors to generate frequent itemsets, thereby reducing the number of database scans. Khar et al. [13] put forward a Boolean matrix-based algorithm for mining multidimensional association rules from relational databases. The algorithm adopts the Boolean relational calculus to discover frequent predicate sets, and uses Boolean logical operations to generate the association rules. Luo et al. [14] present an improved Apriori algorithm based on a matrix. To reduce the algorithm's running time, it transforms the event database into a matrix database so as to obtain the matrix itemset of the maximum itemset; when finding the frequent (k+1)-itemsets from the frequent k-itemsets, only their matrix set needs to be examined. Krajca et al. [15] propose an algorithm for computing factorizations based on approximate Boolean matrix decomposition, which first uses frequent closed itemsets as candidates for factor-items and then computes the remaining factor-items by incrementally joining items. This algorithm can reduce the
2 Problem Description
Inference 3 [22]: Assume that each transaction, and the itemsets within each transaction, are sorted in dictionary order. If two frequent k-itemsets Ix and Iy cannot be joined, then Ix cannot be joined with any itemset after Iy.
Aiming at the existing problems of matrix-based Apriori algorithms, the improvements of the NCMA algorithm proposed in this paper are as follows.
(1) Improving the matrix storage method. To reduce the size of the matrix and the number of matrix scans, we add two arrays. The weight array w records the count of identical transactions; using w, each distinct transaction is stored in only one column of the matrix, with no duplicate columns. The array m records the number of 1s in each column. When compressing the matrix, the algorithm can obtain each transaction's length by scanning m, so it avoids scanning the entire matrix, reduces scanning time, and compresses the matrix quickly and effectively.
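The two bookkeeping arrays can be sketched as follows. This is an assumed illustration with made-up transactions; the names w and m follow the paper.

```python
# Sketch of improvement (1): collapse duplicate transactions into one
# column with a weight (array w), and record the number of 1s per column
# (array m). Transactions here are illustrative.

from collections import Counter

transactions = [
    frozenset({"I1", "I2"}),
    frozenset({"I1", "I2"}),   # duplicate of the first transaction
    frozenset({"I2", "I3"}),
]

counts = Counter(transactions)      # one entry per distinct transaction
columns = list(counts.keys())
w = [counts[c] for c in columns]    # weight array: multiplicity of each column
m = [len(c) for c in columns]       # m[j]: number of 1s (items) in column j

print(w)  # [2, 1] -> the duplicated transaction is stored once with weight 2
print(m)  # [2, 2]
```

Because m holds each column's 1-count, the compression step can inspect transaction lengths without rescanning the whole matrix.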
(2) Improving the matrix compression method. This improvement removes transactions that are useless for finding frequent itemsets, so the matrix is compressed further. The improved compression includes row (itemset) compression and column (transaction) compression. Row compression relies on Theorem 2 and Inference 3 and proceeds as follows: 1) scan the matrix; if an itemset cannot be joined with its neighboring itemsets, delete the row vector corresponding to that itemset; 2) update the values of array m accordingly; 3) combine the remaining row vectors into a new matrix in the original order. Column compression relies on Property 2 and Theorem 1 and proceeds as follows: 1) after row compression, scan array m; if a value is less than or equal to 1, delete the column vector corresponding to that value; 2) combine the remaining column vectors into a new matrix in the original order.
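The column-compression step can be sketched as below. This is an assumed minimal version: rows, weights, and the threshold follow the description above, not the paper's code.

```python
# Sketch of column (transaction) compression: after row compression,
# drop every column whose 1-count in m is <= 1, since such a transaction
# supports at most one surviving itemset and cannot contribute to a join.

def compress_columns(rows, w):
    """rows: list of 0/1 row vectors; w: column weights. Returns compressed copies."""
    n_cols = len(w)
    m = [sum(r[j] for r in rows) for j in range(n_cols)]  # 1s per column
    keep = [j for j in range(n_cols) if m[j] > 1]
    new_rows = [[r[j] for j in keep] for r in rows]
    new_w = [w[j] for j in keep]
    return new_rows, new_w

rows = [[1, 0, 1, 1],
        [0, 1, 0, 1]]
w = [1, 2, 1, 1]
new_rows, new_w = compress_columns(rows, w)
print(new_rows)  # only the last column has two 1s -> [[1], [1]]
print(new_w)     # [1]
```

In the full algorithm, m is maintained incrementally rather than recomputed, which is exactly what the array update in step 2) of row compression achieves.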
(3) Improving the support count calculation. With the improved method, the algorithm scans fewer matrix rows and finds the frequent itemsets more simply and quickly. To compute a support count, the "AND" operation is carried out on the row vectors corresponding to two joinable itemsets, each column is multiplied by the corresponding weight in array w, and the weighted sum is the support count of the joined itemset. If the support count of an itemset is greater than or equal to min_sup, the row vector corresponding to that itemset is saved; otherwise it is rejected. When an itemset cannot be joined with the next itemset, the join operation proceeds to the following itemsets, until all itemsets have been scanned.
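The weighted inner product described here is a one-liner; the sketch below uses illustrative row values and weights, not data from the paper.

```python
# Sketch of improvement (3): the support count of the join of two itemsets
# is the AND of their row vectors weighted by array w.

def joined_support(row_a, row_b, w):
    """AND the two rows column-wise, then sum the weights of matching columns."""
    return sum(a * b * wj for a, b, wj in zip(row_a, row_b, w))

row_i5 = [0, 0, 0, 1, 1]   # illustrative row vectors
row_i1 = [1, 0, 0, 1, 1]
w      = [1, 1, 2, 1, 2]   # illustrative column weights
print(joined_support(row_i5, row_i1, w))  # columns 4 and 5 match -> 1 + 2 = 3
```

Since duplicate transactions were collapsed into weighted columns, this single pass over one (shorter) row pair replaces multiple scans of the raw database.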
(4) Updating the end condition of the algorithm to reduce the number of iterations. This improvement relies on Inferences 1 and 2. According to Inference 2, when the number of frequent k-itemsets is less than k+1 (i.e., the matrix has fewer than k+1 rows), the algorithm does not need to search for frequent (k+1)-itemsets and can terminate.
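The termination test is a simple comparison; the sketch below encodes the condition from Inference 2 as stated above (the values in the examples mirror the worked example later in the paper).

```python
# Sketch of improvement (4): once fewer than k+1 frequent k-itemsets
# remain, no frequent (k+1)-itemset can exist, so the loop stops early.

def should_stop(num_frequent_k_itemsets, k):
    """A frequent (k+1)-itemset would need at least k+1 frequent k-subsets."""
    return num_frequent_k_itemsets < k + 1

print(should_stop(6, 3))  # 6 frequent 3-itemsets >= 4 -> continue (False)
print(should_stop(2, 4))  # 2 frequent 4-itemsets < 5 -> stop (True)
```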
    support_counts = Σq D[i][q] × w[q];
    if (support_counts ≥ min_support × the number of transactions)
        add li to L1;
} // Calculating frequent 1-itemsets
Output L1;
delete the row vectors corresponding to infrequent items in D;
D1 ← QuickSort(D); // quick-sort D in ascending order of support count
for (i = 0; i < M1; i++) { // M1 is the row number of D1
    for (j = i + 1; j < M1; j++) {
        for (q = 0; q < N1; q++) { // N1 is the column number of D1
            support_counts += D1[i][q] × D1[j][q] × w[q];
        }
        if (support_counts ≥ min_support × the number of transactions) {
            add this itemset to L2;
            accumulate the row vector of this itemset into array m by bits;
        }
    }
} // Calculating frequent 2-itemsets
Output L2;
D2 ← the row vectors corresponding to L2;
for (k = 2; |Lk| ≥ k + 1; k++) {
    for each itemset l in Lk {
        if (l cannot be joined with its adjacent itemsets) {
            subtract the value of the row vector of l from array m by bits;
            delete the row vector corresponding to l;
        }
    } // Row (itemset) compression
    arrange the remaining rows of Dk in order;
    for (i = 0; i < Nk; i++) { // Nk is the number of columns in Dk
        if (m[i] ≤ 1)
also contain the same items. The weights of the third and fifth columns of the matrix are therefore 2, and the weights of the remaining columns are 1. D and w are as follows:
(2) Determine the set of frequent 1-itemsets, L1. It consists of the candidate 1-itemsets satisfying minimum support. The transactions in D are scanned and the support count of each item is accumulated: support(I1) = 6, support(I2) = 8, support(I3) = 7, support(I4) = 4, support(I5) = 3. The saved row vectors are then arranged in ascending order of support count, i.e., {I5, I4, I1, I3, I2}. The initial matrix D1 generated by this recombination is as follows.
(3) Determine the set of frequent 2-itemsets, L2. Each row of D1 carries out the weighted vector inner product operation with every later row in order. For example, I5 ∧ I4 = 00000001, and support_count(I5I4) = 0×1 + 0×1 + 0×2 + 0×1 + 0×2 + 0×1 + 0×1 + 1×1 = 1. Because the support count of itemset {I5I4} is less than the minimum support count, the row vector corresponding to that itemset is removed. After all calculations, {I5I1, I5I3, I5I2, I4I1, I4I3, I4I2, I1I3, I1I2, I3I2} is the set of itemsets whose support counts are greater than or equal to the minimum support count. The row vectors corresponding to these itemsets are accumulated into array m by bits, and the itemsets corresponding to the saved row vectors form L2. After recombining D1 in itemset order, we obtain the new matrix D2 as follows:
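The inner product in step (3) can be checked numerically. The weights come from step (1), where columns 3 and 5 have weight 2 and the remaining columns weight 1; the matrix figures themselves are not reproduced here.

```python
# Verifying support_count(I5I4) from step (3): the AND of the I5 and I4
# rows is 00000001, and the column weights are those given in step (1).

w = [1, 1, 2, 1, 2, 1, 1, 1]           # columns 3 and 5 have weight 2
i5_and_i4 = [0, 0, 0, 0, 0, 0, 0, 1]   # I5 AND I4 = 00000001
support = sum(v * wj for v, wj in zip(i5_and_i4, w))
print(support)  # 1: below the minimum support count, so {I5I4} is pruned
```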
(4) Compress matrix D2. First, the transactions in D2 are scanned and the row vector corresponding to the itemset {I3I2}, which cannot be joined with its neighboring itemsets, is removed. Array m subtracts the vector value corresponding to {I3I2}. The remaining row vectors are recombined into a new D2 in the original order. Second, array m is scanned; the columns whose values are less than or equal to 1 are removed, namely columns 2, 3, and 5. The remaining column vectors are recombined into a new D2' in order. D2 and D2' are shown as follows.
(5) Determine the set of frequent 3-itemsets, L3. Array m is cleared. Each pair of joinable itemsets in D2' carries out the weighted vector inner product operation in order. For example, {I5I1} and {I5I3} may be joined into the itemset {I5I1I3}: I5 ∧ I1 ∧ I3 = I5I1 ∧ I5I3 = 00011, and support_count(I5I1I3) = 0×1 + 0×1 + 0×1 + 1×1 + 1×1 = 2. Because the support count of itemset {I5I1I3} equals the minimum support count, the row vector corresponding to this itemset is retained. Similarly, the row vectors corresponding to {I5I1I2, I5I3I2, I4I1I2, I4I3I2, I1I3I2} are also retained, and these row vectors are accumulated into array m by bits. The row vector corresponding to {I4I1I3} is removed. The itemsets corresponding to the saved row vectors form L3. After recombination in itemset order, we obtain the matrix D3 shown below. Since D3 has 6 rows, which is more than 4, the algorithm enters the next round of computing.
(6) Compress matrix D3. First, the transactions in D3 are scanned and the row vectors corresponding to {I5I3I2, I4I1I2, I4I3I2, I1I3I2}, which cannot be joined with their neighboring itemsets, are removed. When a row is removed, the values of array m are updated correspondingly. The remaining row vectors are recombined into a new D3 in order. Second, array m is scanned; the columns whose values are less than or equal to 1 are removed, namely columns 1, 2, and 3. The remaining column vectors are recombined into a new D3' in order. D3 and D3' are shown as follows.
(7) Determine the set of frequent 4-itemsets, L4. Array m is cleared again. {I5I1I3} and {I5I1I2} are joined into {I5I1I3I2}, and the weighted vector inner product operation is carried out: I5 ∧ I1 ∧ I3 ∧ I2 = I5I1I3 ∧ I5I1I2 = 11, and support_count(I5I1I3I2) = 1×1 + 1×1 = 2. Because the support count of itemset {I5I1I3I2} equals the minimum support count, the row vector corresponding to this itemset is retained, and it is accumulated into array m by bits. The itemsets corresponding to the saved row vectors form L4. After recombination in order, we obtain the matrix D4 shown below. Since D4 has 2 rows, which is fewer than 5, L5 need not be determined. At this point the maximal frequent itemsets are the frequent 4-itemsets, and the algorithm stops.
w=[1 1]
D4= I5I1I3I2 [1 1]
m=[1 1]
Before ordering L1, the generated L2 is {I1I2, I1I3, I1I4, I1I5, I2I3, I2I4, I2I5, I3I4, I3I5}, the corresponding L3 is {I1I2I3, I1I2I4, I1I2I5, I1I3I4, I1I3I5, I1I4I5, I2I3I4, I2I3I5, I2I4I5, I3I4I5}, and L4 is {I1I2I3I4, I1I2I3I5, I1I2I4I5, I2I3I4I5}. After ordering L1, L3 is {I5I1I3, I5I1I2, I5I3I2, I4I1I3, I4I1I2, I4I3I2, I1I3I2} and L4 is {I5I1I3I2}. Compared with the L3 and L4 generated before ordering, the numbers of frequent itemsets in L3 and L4 generated after ordering each decrease by 3, saving three vector inner product operations in each case. Clearly, after NCMA sorts the frequent 1-itemsets in ascending order, the number of candidate frequent 3-itemsets, and of later candidate frequent itemsets, generated by the algorithm decreases; the effect of sorting grows for the frequent itemsets generated later. Therefore, the NCMA algorithm greatly improves operating efficiency.
Fig. 1. (a) Comparison of the three algorithms; (b) comparison of CM_Apriori and NCMA
As can be seen from Figure 1, the running time of the traditional Apriori algorithm increases with the number of transactions, and is much larger than that of the two Apriori algorithms based on a compression matrix. Compared with the CM_Apriori algorithm, the running time of the NCMA algorithm is smaller, which shows that the NCMA algorithm performs well.
Table 3 and Figure 2 show the runtime comparison of the three algorithms under different supports on the same database. In the experiment, the number of transactions is 40, the average transaction length is 24, and the support is 30%. The efficiency of the algorithms is measured by running time (seconds).
Table 3. The results under different supports from the same database
Fig. 2. (a) Comparison of the three algorithms; (b) comparison of CM_Apriori and NCMA
Clearly, as the support increases, the running time of the Apriori algorithm decreases gradually. When the support is small, the running time of the two compression-matrix-based Apriori algorithms is far less than that of the traditional Apriori algorithm, and the running time of the NCMA algorithm is less than that of the CM_Apriori algorithm.
Table 4 and Figure 3 give the runtime comparison of the three algorithms on databases with different numbers of items. In the experiment, the number of transactions is 2000, the average transaction length proportion is 0.6, and the support is 45%. The efficiency of the algorithms is measured by running time (seconds).
Table 4. The results under different number of items’ databases
Fig. 3. (a) Comparison of the three algorithms; (b) comparison of CM_Apriori and NCMA
As can be seen from Figure 3, as the number of items increases, the running time of the traditional Apriori algorithm grows exponentially and is much larger than that of the two compression-matrix-based Apriori algorithms. As the number of items increases, the running time of the NCMA algorithm remains less than that of the CM_Apriori algorithm, which shows that the NCMA algorithm has good operating efficiency.
Table 5 and Figure 4 show the runtime comparison of the three algorithms on databases with different densities. In the experiment, the number of transactions is 1000, the number of items is 50, and the support is 45%. The efficiency of the algorithms is measured by running time (seconds).
Fig. 4. (a) Comparison of the three algorithms; (b) comparison of CM_Apriori and NCMA
As Figure 4 shows, the running time of the traditional Apriori algorithm increases exponentially with database density. When the database density exceeds 0.5, the running time of Apriori is much larger than that of the CM_Apriori and NCMA algorithms, and the running time of the NCMA algorithm is smaller than that of the CM_Apriori algorithm.
From the above experimental results, we can draw the following conclusions.
(1) When dealing with high-density, highly correlated data, the Apriori algorithm must scan the database many times and generates huge candidate itemsets, so its running time is higher than that of the compression-matrix-based Apriori algorithms. The NCMA algorithm scans the database only once, and the sorting of frequent 1-itemsets effectively reduces the number of candidate itemsets produced from the compression matrix. At the same time, the algorithm removes a large number of useless transactions by compressing the matrix, and can quickly compute support counts through the logical "AND" operation.
(2) For databases with more items and higher density, as the number of transactions increases and the support drops, the performance of the NCMA algorithm is much better than that of the CM_Apriori algorithm. Because the number of frequent itemsets mined from such databases is huge, the matrix grows with the number of rows. In this case, the compression steps of the NCMA algorithm show a clearer advantage: since the algorithm eliminates itemsets that cannot be joined when compressing the matrix, the compressed matrix to be scanned is smaller. In addition, because the end condition of the algorithm is changed, the number of iterations of the NCMA algorithm is less than or equal to that of the CM_Apriori algorithm. As a result, the running time of the NCMA algorithm is less than that of the CM_Apriori algorithm, and its running efficiency is also better.
5 Conclusion
The existing matrix-based Apriori algorithms and their improvements still have some problems: (1) the candidate itemsets are too large and the matrix takes up too much memory; (2) because only the transactions or the itemsets are compressed during compression, the compression matrix still stores many elements that are irrelevant to the generated frequent itemsets; (3) calculating the support counts requires repeatedly scanning the matrix, which increases the running time. To solve these problems, a new improved Apriori algorithm based on a compressed matrix, called NCMA, is proposed in this paper. The algorithm improves matrix storage, itemset sorting, matrix compression, support count computation, and the stopping condition. Instance analysis and experimental results show that the proposed NCMA algorithm reduces the number of matrix scans, greatly compresses the matrix, reduces the number of candidate itemsets, and raises the efficiency of mining frequent itemsets. It has better operating efficiency and scalability than existing compressed-matrix-based Apriori algorithms when mining databases with many items and high density. However, NCMA is a sequential algorithm and is not suited to handling huge amounts of data, because the matrix it generates can easily cause memory overflow. Future work is to design a parallel NCMA algorithm to mine large databases effectively.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207–216. ACM Press, Washington, D.C. (1993)
2. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1–12. ACM Press, Dallas (2000)
3. Park, J.S., Chen, M.S., Yu, P.S.: An effective hash-based algorithm for mining association rules. In: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pp. 175–186. ACM Press, San Jose (1995)
4. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. In: Proceedings of the 21st International Conference on Very Large Data Bases, pp. 432–444. ACM Press, New York (1995)