
An efficient method for pruning redundant negative and positive association rules
Xiangjun Dong∗, Feng Hao, Long Zhao, Tiantian Xu
School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Science), Jinan, China

Article history: Received 21 November 2017; Revised 29 August 2018; Accepted 3 September 2018; Available online xxx

Keywords: Correlation coefficient; Multiple minimum confidence; Negative association rules; Redundant negative association rules

Abstract: One of the most important problems in mining positive and negative association rules (PNARs) is that the number of mined PNARs is usually large, which makes it difficult for users to retrieve decision information. Methods that prune redundant positive association rules (PARs) do not work when negative association rules (NARs) are taken into consideration. In this paper, we first analyze which kinds of PNARs are redundant and then propose a novel method called LOGIC that uses logical reasoning to prune redundant PNARs. In addition, we combine the correlation coefficient and multiple minimum confidences (mc) to ensure that the mined PNARs are strongly correlated and that their number can be flexibly controlled. Experimental results show that our method can prune up to 81.6% of redundant PNARs. To the best of our knowledge, LOGIC is the first method to prune redundant PARs and NARs simultaneously.

© 2019 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail addresses: dxj@qlu.edu.cn (X. Dong), hf_mails@163.com (F. Hao), zxcvbnm9515@163.com (L. Zhao), xtt-ok@163.com (T. Xu).

1. Introduction

Association rules, which can discover the positive or negative relations between different items, have received a great deal of attention recently in many applications, such as web mining, recommender systems, and intrusion detection. An association rule is called a valid rule if its support is greater than a user-defined minimum support (ms) threshold and its confidence is greater than a user-defined minimum confidence (mc) threshold [1]. Generally, a rule of the form A⇒B, which indicates a positive relation between different items, is called a positive association rule (PAR), and rules of the other three forms A⇒¬B, ¬A⇒B and ¬A⇒¬B, which indicate negative relations between items, are called negative association rules (NARs) [2]. Many studies [3–9] have mentioned that NARs can provide more valuable information and sometimes play roles that PARs alone cannot in many areas. Hence, many researchers have focused on mining positive and negative association rules (PNARs) simultaneously [10–14].

Many problems occur when mining PNARs. A typical situation is that a large number of PNARs will be mined, which increases the difficulty for users to retrieve relevant decision information. In fact, this problem has been found and solved when researchers mined PARs only [15–23]. One typical solution is to increase the minimum confidence [12], but this loses a lot of important information. There are also some other proposed solutions that can not only reduce the number of PARs but also preserve the same information (see, for instance, [18,19,24–31]). The core idea of these methods is to prune redundant PARs that can be derived from other rules. For example, suppose A⇒BC is a valid association rule (A, B and C represent different itemsets in the database; A⇒BC indicates that if A occurs in a transaction, then B and C are also likely to occur in the same transaction). Then A⇒B and A⇒C can be derived from A⇒BC. Hence, A⇒B and A⇒C are redundant rules of A⇒BC and can be pruned from the final rule set. ADRR [29] is one such method and is the primary method used in the following parts of this article.

The methods introduced above work efficiently in pruning redundant PARs, but may not work well when NARs are taken into consideration. We explain the reason with an example.

Example 1. Given a database TD with 10 transactions as shown in Table 1, and ms=0.3, mc=0.3. According to the PNARs mining algorithm in [27], if itemsets A and B are positively correlated, only rules like A⇒B and ¬A⇒¬B are mined; if A and B are negatively correlated, only rules of the forms A⇒¬B and ¬A⇒B are mined. In this example, we can get that b⇒cd and b⇒d are PARs, but b⇒¬c is a NAR (the details can be seen in Section 5). According to ADRR, b⇒c and b⇒d are redundant rules of b⇒cd, which can be pruned. In other words, ADRR takes b⇒c as a valid PAR. However, b⇒c is not a valid PAR, while b⇒¬c is a valid NAR because b and c are negatively correlated. Therefore, ADRR is no longer capable of pruning redundant PNARs correctly when NARs are taken into consideration.


Table 1
Database TD.

Tid   Items       Tid   Items
T1    a,b,d       T6    b,d,f
T2    a,b,c,d     T7    a,e,f
T3    b,d         T8    c,f
T4    b,c,d,e     T9    b,c,f
T5    a,c,e       T10   a,b,c,d,f

Furthermore, the number of PNARs is theoretically about four times the number of PARs in simultaneous PNARs mining. This further increases the difficulty for users to select decision-making information from so many PNARs if no pruning methods are used. Unfortunately, no methods to prune redundant NARs have been found yet. Thus, how to prune redundant NARs and correct the PARs pruning methods becomes an urgent and challenging problem. The main challenge is to determine which kinds of PNARs are redundant. As we know, NARs are more complicated than PARs, and NARs are also implemented differently [2,11]. These factors unavoidably increase the difficulty of determining redundant NARs.

To address this critical challenge, we propose an efficient method called LOGIC to prune redundant PNARs. The main idea of LOGIC is as follows. Firstly, it analyzes the main methods of generating PNARs, which often use a metric to judge the correlation before generating PNARs. Secondly, it analyzes which kinds of rules are redundant with respect to each type of PNAR. Finally, it introduces logical reasoning to prune redundant PNARs.

In addition, the pruning method of increasing mc does not work well either when we mine PNARs simultaneously. Along with the increase of mc, among the remaining PNARs the number of rules like ¬A⇒¬B would be relatively large compared with that of ¬A⇒B. For example, the remaining PNARs contain more rules of type ¬A⇒¬B when the support of itemsets A and B is small (such as 0.1), but contain fewer rules of type ¬A⇒B when mc is more than 0.6 (more details can be found in [11]). This means that only increasing mc would lead to some useful rules being missed. This problem is caused by using a single confidence threshold for all types of PNARs, and it has been well solved by assigning four confidence thresholds to the four types of PNARs respectively in the PNARMC model [11]. In this way, the number of required PNARs can be controlled flexibly.

The PNARMC model, however, does not take the correlation coefficient into consideration. This generates a lot of weakly correlated PNARs, which are almost useless for decision-making but increase the difficulty of retrieving relevant decision information. The correlation coefficient measures the strength and direction of the linear relationship between a pair of variables and can thus precisely find strongly correlated rules [12,32–35]. In this paper, we propose a new model called PNARCMC by applying the correlation coefficient to the PNARMC model. The main idea of PNARCMC is to calculate the correlation coefficient of each rule and delete the weakly correlated rules before mining PNARs using the PNARMC model.

In this paper, we first apply the PNARCMC model to mine PNARs with strong correlation strength, and then apply LOGIC to prune the redundant ones. The main contributions are summarized as follows:

1) We propose a new model named PNARCMC by applying the correlation coefficient and PNARMC. PNARCMC can not only mine PNARs with strong correlation but also flexibly control the number of rules of various types.
2) We propose a novel method named LOGIC to prune redundant PNARs based on logical reasoning. To the best of our knowledge, LOGIC is the first method to prune both redundant PARs and NARs simultaneously.

The remainder of the paper is organized as follows. Section 2 summarizes related work on mining PNARs. Some basic concepts are given in Section 3. Section 4 discusses how to mine strongly correlated PNARs based on the correlation coefficient and PNARMC, and proposes the PNARCMC model. Section 5 gives the theorems of redundant PNARs and proposes the LOGIC method. Section 6 presents experimental results. Conclusions and future work are described in Section 7.

2. Related work

In this section we first discuss the related work on PNARs mining. The most popular model for mining association rules is the support-confidence framework proposed by Agrawal et al. [1]. They consider an association rule valid when its support is greater than a user-defined minimum support (ms) threshold and its confidence is greater than a user-defined minimum confidence (mc) threshold. However, they only consider PARs of the form A⇒B, not rules of the forms A⇒¬B, ¬A⇒B and ¬A⇒¬B. The latter three forms, which indicate negative relations between different items, are called NARs [2]. Although NARs can detect some problems that PARs cannot, contradictory information occurs when both forms of rules are mined simultaneously. In [33], Dong et al. proposed a new model for mining PNARs from frequent itemsets simultaneously. It deletes contradictory rules based on the correlation between itemsets, and in turn mines rules meeting the confidence threshold. Zhu and Xu proposed a new algorithm to mine PNARs in databases [34]. This algorithm uses the correlation coefficient to judge in which forms rules should be mined, and a pruning strategy to reduce the search space. Its correlation coefficient is defined as follows:

corrA,B = s(A ∪ B)/(s(A)s(B)), (1)

where A and B are both itemsets, and A ∩ B = ∅. It is different from the correlation coefficient calculation method (Eq. (14)) used in this paper. In [12], a novel algorithm named PNAR_MLMS was proposed to generate PNARs correctly from the frequent and infrequent itemsets. In [2], Wu et al. designed a new method for efficient PNARs mining in databases. They designed constraints for reducing the search space, and used the increasing degree of the conditional probability relative to the prior probability to estimate the confidence of PNARs. In [11], Dong et al. studied the relationships among the four confidences, and pointed out the shortcoming of the single-confidence model: along with the increase of mc, among the remaining PNARs the number of rules like ¬A⇒¬B would be relatively large compared with that of ¬A⇒B. As a result, they proposed a multi-confidence model named PNARMC to mine PNARs. The main idea of this model is to equip each form of rule with its own confidence threshold.

In addition to the above survey, we discuss the related work on pruning the number of association rules. All the above-mentioned methods can reduce the number of PNARs by increasing the mc, because they are all extensions of the support-confidence framework. However, this loses a lot of important information. So some other methods, which can not only reduce the number of rules but also preserve the same information, were proposed. In [36], Pham et al. proposed a technique for mining non-redundant sequential rules directly from a database. This method uses a dynamic bit vector data structure and adopts a prefix tree in the mining process. In [37], an efficient method for pruning redundant sequential rules from an attributed prefix-tree was proposed by Thi-Thiet Pham et al. In addition, this work also proposed a good pruning mechanism to reduce the search space and the execution time in the mining process.


In [38], an algorithm named TNS (Top-k Non-redundant Sequential Rules) was proposed to mine non-redundant sequential rules. It uses the same depth-first search procedure as TopSeqRules [39], which is the first algorithm to discover top-k sequential rules. The same idea (top-k) was also applied to mining association rules by Amardeep Kumar [40]. Experimental results show it works well for pruning redundant association rules and obtaining interesting rules.

In [41], a generic GNRR (Generate Non-Redundant Rules) algorithm was proposed to generate non-redundant association rules from a large number of frequent itemsets. It utilizes the redundancy relationships among rules to mine potential new rules in a certain order, in order to eliminate the redundancy among rules and reduce the number of rules. The generic basis for exact association rules and the informative basis for approximate association rules were defined in [26]. These bases are established using frequent closed itemsets and their generators. Their role is to minimize the number of association rules generated while maximizing the quantity and quality of the information conveyed. In [29], a definition of redundant association rules was proposed: let the association rules be X⇒Y and A⇒B; if the rule A⇒B can be established from the rule X⇒Y, then A⇒B is a redundant rule of X⇒Y. According to this definition, an association rule pruning algorithm, ADRR, was proposed. The algorithm considers that if A⇒BC is a valid association rule, then the rules A⇒B and A⇒C are redundant rules of A⇒BC, which can be pruned. However, all these methods consider redundant PARs only and pay little attention to NARs.

Moreover, as the amount of data and the data dimension increase, the efficiency of the above-mentioned algorithms gradually decreases. Using parallel computing [5,20,35], cloud computing [15,30,42,43], distributed computing [10,25] and other big-data technologies to improve the efficiency of association rule mining algorithms is a popular solution. In [44], Lan and Alaghband presented a novel parallel method named ShaFEM to mine association rules on multi-core shared memory machines. This method runs faster and consumes less memory than the state-of-the-art methods. Moreover, it can be used to implement the association rule mining component of database management systems. Djenouri et al. explored the combination of GPU and cluster-based parallel computing in association rule mining [45]. Four approaches, including BSOMW, MWBSO, BSOMW-SEGPU and MWBSO-MEGPU, were proposed. Experimental results reveal that MWBSO-MEGPU outperforms the other proposed approaches in terms of speedup. In [46], Giuseppe Agapito et al. proposed a novel parallel FP-Growth algorithm and a related software application to extract association rules from SNP microarrays. It can complete frequent itemset mining efficiently and thus enhance the overall performance of the sequential FP-Growth algorithm.

3. Basic concepts

Let I = {i1, i2, ..., in} be a set of n distinct items, and D be a database of transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction has a unique identity, recorded as Tid. For a given itemset A ⊆ I and a given transaction T, we say that T contains A if and only if A ⊆ T. The support of itemset A, denoted by s(A), is the fraction of transactions in D that contain A. Let A.count denote the number of transactions in D that contain itemset A, so s(A) = |T : A ⊆ T, T ⊆ D|/|D| = A.count/|D|. If s(A) ≥ ms, A is called a Frequent ItemSet (FIS); otherwise it is called an inFrequent ItemSet (inFIS), where ms is the minimum support given by the user or expert, representing the minimum threshold at which the user is interested in the itemset. An association rule is an implication of the form A⇒B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. A is called the antecedent of the rule and B is the consequent of the rule. The support of rule A⇒B is the ratio of the number of transactions in D containing both A and B to the number of all transactions, which is the probability P(A ∪ B), denoted by s(A⇒B); it is actually the support of the set A ∪ B, that is, s(A ∪ B). This paper makes no distinction between these two representations. That is,

s(A⇒B) = s(A ∪ B) = P(A ∪ B) = (A ∪ B).count/|D| = |T : A ∪ B ⊆ T, T ⊆ D|/|D|. (2)

The confidence of rule A⇒B is the ratio of the number of transactions containing both A and B to the number of transactions containing A, which is the conditional probability P(B | A), denoted by c(A⇒B). That is,

c(A⇒B) = P(B | A) = s(A ∪ B)/s(A) = (A ∪ B).count/A.count = |T : A ∪ B ⊆ T, T ⊆ D|/|T : A ⊆ T, T ⊆ D|. (3)

A valid association rule is one whose support is greater than the minimum support and whose confidence is greater than the minimum confidence. This can also be expressed by the following formulas:

s(A⇒B) ≥ ms. (4)

c(A⇒B) ≥ mc. (5)

According to [33], we can calculate the support and confidence of NARs using the following equations:

s(¬A) = 1 − s(A). (6)

s(A⇒¬B) = s(A) − s(A⇒B). (7)

s(¬A⇒B) = s(B) − s(A⇒B). (8)

s(¬A⇒¬B) = 1 − s(A) − s(B) + s(A⇒B). (9)

c(A⇒B) = s(A⇒B)/s(A). (10)

c(A⇒¬B) = (s(A) − s(A⇒B))/s(A). (11)

c(¬A⇒B) = (s(B) − s(A⇒B))/(1 − s(A)). (12)

c(¬A⇒¬B) = (1 − s(A) − s(B) + s(A⇒B))/(1 − s(A)). (13)
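To make Eqs. (6)–(13) concrete, the short Python sketch below derives the supports and confidences of all four rule forms from s(A), s(B) and s(A⇒B) alone. It is our own illustration; names such as pnar_measures are not from the paper.

    def pnar_measures(s_a, s_b, s_ab):
        """Support and confidence of the four rule forms, per Eqs. (6)-(13).
        s_a: s(A); s_b: s(B); s_ab: s(A=>B), i.e. s(A U B)."""
        support = {
            "A=>B":   s_ab,                        # s(A U B)
            "A=>-B":  s_a - s_ab,                  # Eq. (7)
            "-A=>B":  s_b - s_ab,                  # Eq. (8)
            "-A=>-B": 1 - s_a - s_b + s_ab,        # Eq. (9)
        }
        confidence = {
            "A=>B":   s_ab / s_a,                          # Eq. (10)
            "A=>-B":  (s_a - s_ab) / s_a,                  # Eq. (11)
            "-A=>B":  (s_b - s_ab) / (1 - s_a),            # Eq. (12)
            "-A=>-B": (1 - s_a - s_b + s_ab) / (1 - s_a),  # Eq. (13)
        }
        return support, confidence

    # Itemsets b and c from Table 1: s(b)=0.7, s(c)=0.6, s(b U c)=0.4
    sup, conf = pnar_measures(0.7, 0.6, 0.4)
    print(conf["A=>-B"])  # c(b => -c) = 0.3/0.7, about 0.43 (cf. Table 3)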
4. Mining PNARs

In this section, we study the relationships among the four confidences and the correlation coefficient, and then propose a model named PNARCMC which combines the correlation coefficient and PNARMC. The PNARCMC model can not only mine strongly correlated PNARs but also flexibly control the number of different types of rules.


4.1. Correlation coefficient

The correlation coefficient measures the degree of linear dependency between a pair of random variables. According to the discussion in [12], the correlation coefficient can be applied to association rule mining. For itemsets A and B, the correlation coefficient can be calculated by Eq. (14):

ρAB = (s(A ∪ B) − s(A)s(B))/√(s(A)(1 − s(A))s(B)(1 − s(B))), (14)

where s(∗) ≠ 0, 1.

The range of ρAB is between −1 and +1. The value of ρAB represents the correlation strength of A and B. If |ρAB| ≥ 0.5, the correlation strength of A and B is moderate; if 0.3 ≤ |ρAB| < 0.5, the strength is small; if 0.1 ≤ |ρAB| < 0.3, the strength is weak. If ρAB is between −0.1 and 0.1, A and B are considered uncorrelated. Therefore, it is important to set ρmin as the threshold of minimum correlation strength in order to prune rules with lower correlation. It is obvious that when rules having |ρAB| < ρmin are trimmed, only rules having |ρAB| ≥ ρmin are left. ρAB ≥ ρmin (0 ≤ ρmin ≤ 1) indicates ρ¬AB ≤ −ρmin, ρA¬B ≤ −ρmin and ρ¬A¬B ≥ ρmin [12]. It means that if ρAB ≥ ρmin, rules of the forms A⇒B and ¬A⇒¬B can be mined, and if ρAB ≤ −ρmin, rules of the forms A⇒¬B and ¬A⇒B can be mined.
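To illustrate Eq. (14) and the sign rule above, here is a minimal Python sketch (the function names are ours) that computes ρAB and reports which rule forms are candidates:

    import math

    def rho(s_a, s_b, s_ab):
        """Correlation coefficient of itemsets A and B, per Eq. (14).
        Requires 0 < s_a < 1 and 0 < s_b < 1."""
        return (s_ab - s_a * s_b) / math.sqrt(
            s_a * (1 - s_a) * s_b * (1 - s_b))

    def candidate_forms(s_a, s_b, s_ab, rho_min=0.1):
        r = rho(s_a, s_b, s_ab)
        if r >= rho_min:
            return ["A=>B", "-A=>-B"]   # positively correlated
        if r <= -rho_min:
            return ["A=>-B", "-A=>B"]   # negatively correlated
        return []                       # |rho| below rho_min: generate nothing

    # Itemsets b and d from Table 1: s(b)=0.7, s(d)=0.6, s(b U d)=0.6
    print(rho(0.7, 0.6, 0.6))              # about 0.80, strongly positive
    print(candidate_forms(0.7, 0.6, 0.6))  # ['A=>B', '-A=>-B'], cf. Table 3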

4.2. The relationships between four confidences and correlation coefficient

The necessity of mining PNARs with multiple confidence thresholds has been shown in [11] through the analysis of the relationships among the four different confidences. Now we take the correlation coefficient into account, and study the relationship between the four confidences and the correlation coefficient based on the method given in [11].

Inequalities (15)–(18) give the ranges of values of c(A⇒B), c(A⇒¬B), c(¬A⇒B) and c(¬A⇒¬B), which can be calculated according to Eqs. (6)–(13):

MAX{0, (s(A) + s(B) − 1)/s(A)} ≤ c(A⇒B) ≤ MIN{1, s(B)/s(A)}. (15)

MAX{0, (s(A) − s(B))/s(A)} ≤ c(A⇒¬B) ≤ MIN{1, (1 − s(B))/s(A)}. (16)

MAX{0, (s(B) − s(A))/(1 − s(A))} ≤ c(¬A⇒B) ≤ MIN{1, s(B)/(1 − s(A))}. (17)

MAX{0, (1 − s(A) − s(B))/(1 − s(A))} ≤ c(¬A⇒¬B) ≤ MIN{1, (1 − s(B))/(1 − s(A))}. (18)
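All four ranges follow from the fact that s(A ∪ B) can only vary within [MAX{0, s(A)+s(B)−1}, MIN{s(A), s(B)}]. The sketch below (our own helper, assuming 0 < s(A) < 1) derives (15)–(18) that way:

    def confidence_bounds(s_a, s_b):
        """Feasible ranges of the four confidences, per inequalities (15)-(18)."""
        lo = max(0.0, s_a + s_b - 1)   # minimum possible s(A U B)
        hi = min(s_a, s_b)             # maximum possible s(A U B)
        return {
            "c(A=>B)":   (lo / s_a, hi / s_a),
            "c(A=>-B)":  ((s_a - hi) / s_a, (s_a - lo) / s_a),
            "c(-A=>B)":  ((s_b - hi) / (1 - s_a), (s_b - lo) / (1 - s_a)),
            "c(-A=>-B)": ((1 - s_a - s_b + lo) / (1 - s_a),
                          (1 - s_a - s_b + hi) / (1 - s_a)),
        }

    # Case 5 below: s(A)=0.4, s(B)=0.6 gives c(-A=>B) in [0.33, 1]
    print(confidence_bounds(0.4, 0.6)["c(-A=>B)"])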
The relationships are discussed in the following five cases:

Case 1. Both s(A) and s(B) are very low (e.g. s(A)=s(B)=0.1);
Case 2. Both s(A) and s(B) are very high (e.g. s(A)=s(B)=0.9);
Case 3. s(A) is very high, but s(B) is very low (e.g. s(A)=0.9, s(B)=0.1);
Case 4. s(A) is very low, but s(B) is very high (e.g. s(A)=0.1, s(B)=0.9);
Case 5. Both s(A) and s(B) are moderate (e.g. s(A)=0.4, s(B)=0.6).

We can calculate the value ranges of c(A⇒B), c(A⇒¬B), c(¬A⇒B), c(¬A⇒¬B) and ρAB in each of the above five cases; the results can be found in Fig. 1.

Fig. 1. Relationships between four confidences and correlation coefficient.

As shown in Fig. 1, the graphs on the left show the results with the correlation coefficient taken into account, while the graphs on the right display the results when only the values of the rules' confidences are considered. The translucent gray background represents the area where the value of the correlation coefficient ranges from −0.1 to 0.1. According to the discussion above, in such areas only the rules of forms ¬A⇒B and A⇒¬B on the left and the rules of forms A⇒B and ¬A⇒¬B on the right can be mined. Thus, the results indicate that the minimum correlation coefficient can effectively prune the weakly correlated PNARs.

4.3. PNARCMC model

We now give the definition of PNARs in the PNARCMC model.

Definition 1. Let I be a set of items, TD be a database, A, B ⊆ I and A ∩ B = ∅, where 0 < s(A) < 1 and 0 < s(B) < 1. FIS is generated by a frequent itemset generation algorithm. Furthermore, ρmin (0 ≤ ρmin ≤ 1) is the minimum correlation strength, and mc_11, mc_10, mc_01 and mc_00 represent the minimum confidence thresholds of A⇒B, A⇒¬B, ¬A⇒B and ¬A⇒¬B respectively. If |ρAB| < ρmin, no rules will be generated; otherwise,

1) If ρAB ≥ ρmin and c(A⇒B) ≥ mc_11, A⇒B is considered a PAR;
2) If ρAB ≤ −ρmin and c(A⇒¬B) ≥ mc_10, A⇒¬B is considered a NAR;
3) If ρAB ≤ −ρmin and c(¬A⇒B) ≥ mc_01, ¬A⇒B is considered a NAR;
4) If ρAB ≥ ρmin and c(¬A⇒¬B) ≥ mc_00, ¬A⇒¬B is considered a NAR.

In the above definition, ρAB ≥ ρmin or ρAB ≤ −ρmin ensures that the association rule is a strongly correlated association rule; c(∗) ≥ mc_# ensures that each type of rule meets its own minimum confidence threshold.

According to Definition 1, the PNARCMC model can be coded as shown in Table 2.

Table 2
The pseudo code of PNARCMC.

Algorithm: PNARCMC.
Input: mc_11, mc_10, mc_01, mc_00, ρmin and FIS;
Output: PAR, NAR;
(1)  PAR = ∅; NAR = ∅;
(2)  for (any itemset X in FIS) {
(3)    for (any itemsets A ∪ B = X and A ∩ B = ∅) {
(4)      calculate ρAB with Eq. (14);
(5)      if (ρAB ≥ ρmin) {
(6)        if (c(A⇒B) ≥ mc_11) {
(7)          PAR = PAR ∪ {A⇒B}; }
(8)        if (c(¬A⇒¬B) ≥ mc_00) {
(9)          NAR = NAR ∪ {¬A⇒¬B}; }}
(10)     if (ρAB ≤ −ρmin) {
(11)       if (c(A⇒¬B) ≥ mc_10) {
(12)         NAR = NAR ∪ {A⇒¬B}; }
(13)       if (c(¬A⇒B) ≥ mc_01) {
(14)         NAR = NAR ∪ {¬A⇒B}; }}}}
(15) return PAR and NAR.

Suppose the frequent itemsets are mined by any existing frequent itemset mining algorithm (e.g. Apriori or FP-Growth) and saved in the set FIS.

Line (1) initializes both PAR and NAR to be empty sets. Lines (2) to (14) generate all PARs and NARs from FIS. In Line (4), ρAB is calculated by Eq. (14). If ρAB ≥ ρmin and the rule's confidence is greater than its own minimum confidence threshold, the algorithm generates rules like A⇒B and ¬A⇒¬B (Lines (5) to (9)); if ρAB ≤ −ρmin and the rule's confidence is greater than its own minimum confidence threshold, the algorithm generates rules like A⇒¬B and ¬A⇒B (Lines (10) to (14)). Line (15) returns the results and ends the whole algorithm.


5. Pruning redundant PNARs

In this section we first give and prove the theorems of the four types of redundant association rules. Then we propose a method named LOGIC to prune the redundant PNARs mined by PNARCMC.

5.1. Redundant PNARs

In Section 1, we showed that the ADRR method does not work for pruning PARs when NARs are taken into consideration. The reason is that the correlation between two items is not considered when studying redundant PNARs. In order to search for redundant PNARs, two steps are required. The first step is to generate each type of PNARs using a metric judging the correlation. The second step is to analyze which kinds of rules are redundant with respect to each type of PNARs. Now we give and prove the theorem for each type of redundant association rules.

Theorem 1. Suppose A, B ⊆ I, A ∩ B = ∅, and B′ ⊂ B, and A⇒B is a valid positive association rule. If ρAB′ ≥ ρmin, then A⇒B′ is also a valid positive association rule, and is a redundant rule of A⇒B.

Proof. According to Definition 1, A⇒B′ is not a valid positive association rule unless both conditions ρAB′ ≥ ρmin and c(A⇒B′) ≥ mc are satisfied. Now inequality ρAB′ ≥ ρmin holds, and we only need to prove that c(A⇒B′) ≥ mc is established. In other words, we only need to prove that c(A⇒B′) ≥ c(A⇒B), because c(A⇒B) ≥ mc.

c(A⇒B′) − c(A⇒B) = s(A ∪ B′)/s(A) − s(A ∪ B)/s(A) = (s(A ∪ B′) − s(A ∪ B))/s(A). Subsequently, the problem is converted to investigating the difference between s(A ∪ B′) and s(A ∪ B).

Let As denote the set of transactions that contain itemset A, with |As| (the size of set As) the number of transactions in As. Similarly, let B′s and Bs denote the sets of transactions that contain itemsets B′ and B, with |B′s| and |Bs| the corresponding numbers of transactions. D (the complete set) represents the collection of all transactions in the database, and |D| is the total number of transactions. The corresponding conversion is described as follows:

s(A ∪ B′) = (A ∪ B′).count/|D| = |As ∩ B′s|/|D|. (19)

s(A ∪ B) = (A ∪ B).count/|D| = |As ∩ Bs|/|D|. (20)

Eq. (19) minus Eq. (20) is:

s(A ∪ B′) − s(A ∪ B) = (|As ∩ B′s| − |As ∩ Bs|)/|D|. (21)

If B′ ⊂ B, then Bs ⊂ B′s, and |B′s| ≥ |Bs| holds. Subsequently, we obtain:

|As ∩ B′s| ≥ |As ∩ Bs|. (22)

Bringing inequality (22) into (21), we get s(A ∪ B′) ≥ s(A ∪ B). Thus, c(A⇒B′) ≥ c(A⇒B). This result explains that A⇒B′ can be derived from A⇒B, so that A⇒B′ is a redundant rule of A⇒B. Proof is completed. □

Theorem 2. Suppose A, B ⊆ I, A ∩ B = ∅, and B′ ⊂ B, and A⇒¬B′ is a valid negative association rule. If ρAB ≤ −ρmin, then A⇒¬B is also a valid negative association rule, and is a redundant rule of A⇒¬B′.

Proof. According to Definition 1, A⇒¬B is not a valid negative association rule unless both conditions ρAB ≤ −ρmin and c(A⇒¬B) ≥ mc are satisfied. Now inequality ρAB ≤ −ρmin holds, so we only need to prove that c(A⇒¬B) ≥ mc is satisfied. In other words, we only need to prove that c(A⇒¬B) ≥ c(A⇒¬B′), due to c(A⇒¬B′) ≥ mc.

c(A⇒¬B) − c(A⇒¬B′) = (s(A) − s(A ∪ B))/s(A) − (s(A) − s(A ∪ B′))/s(A) = (s(A ∪ B′) − s(A ∪ B))/s(A).

Subsequently, the problem is converted to investigating the difference between s(A ∪ B′) and s(A ∪ B). In Theorem 1, we have proved that s(A ∪ B′) ≥ s(A ∪ B). Thus, we can get c(A⇒¬B) ≥ c(A⇒¬B′). This means that A⇒¬B can be derived from A⇒¬B′, and therefore A⇒¬B is a redundant rule of A⇒¬B′. Proof is completed. □

Theorem 2 states that for any valid rule of the form A⇒¬B′, the rule A⇒¬B for any superset B of B′ is a negative association rule and is a redundant rule of A⇒¬B′.

Theorem 3. Suppose A, B ⊆ I, A ∩ B = ∅, and B′ ⊂ B, and ¬A⇒B is a valid negative association rule. If ρAB′ ≤ −ρmin, then ¬A⇒B′ is also a valid negative association rule, and is a redundant rule of ¬A⇒B.

Proof. According to Definition 1, ¬A⇒B′ is not a valid negative association rule unless both conditions ρAB′ ≤ −ρmin and c(¬A⇒B′) ≥ mc are satisfied. Now inequality ρAB′ ≤ −ρmin holds, so we only need to prove that c(¬A⇒B′) ≥ mc is satisfied. In other words, we only need to prove that c(¬A⇒B′) ≥ c(¬A⇒B), because c(¬A⇒B) ≥ mc.

c(¬A⇒B′) − c(¬A⇒B) = s(¬A ∪ B′)/s(¬A) − s(¬A ∪ B)/s(¬A) = (s(¬A ∪ B′) − s(¬A ∪ B))/s(¬A).

Subsequently, the problem is converted to investigating the difference between s(¬A ∪ B′) and s(¬A ∪ B). Set theory can help us solve this problem. As before, As denotes the set of transactions containing itemset A, and ¬As is the complementary set of As; B′s and Bs denote the sets of transactions containing B′ and B, with ¬Bs the complementary set of Bs; D is the complete set of all transactions in the database, and |D| is the total number of transactions. The corresponding conversion is as follows:

s(¬A ∪ B′) = (¬A ∪ B′).count/|D| = |¬As ∩ B′s|/|D|. (23)

s(¬A ∪ B) = (¬A ∪ B).count/|D| = |¬As ∩ Bs|/|D|. (24)

Eq. (23) minus Eq. (24) is:

s(¬A ∪ B′) − s(¬A ∪ B) = (|¬As ∩ B′s| − |¬As ∩ Bs|)/|D|. (25)

If B′ ⊂ B, then Bs ⊂ B′s, so |B′s| ≥ |Bs| holds. Subsequently, we obtain:

|¬As ∩ B′s| ≥ |¬As ∩ Bs|. (26)

Bringing inequality (26) into (25), we get s(¬A ∪ B′) ≥ s(¬A ∪ B). Thus, c(¬A⇒B′) ≥ c(¬A⇒B); in other words, ¬A⇒B′ can be derived from ¬A⇒B, and is a redundant rule of ¬A⇒B. Proof is completed. □

Theorem 4. Suppose A, B ⊆ I, A ∩ B = ∅, and B′ ⊂ B, and ¬A⇒¬B′ is a valid negative association rule. If ρAB ≥ ρmin, then ¬A⇒¬B is also a valid negative association rule, and is a redundant rule of ¬A⇒¬B′.

Proof. According to Definition 1, ¬A⇒¬B is not a valid negative association rule unless both conditions ρAB ≥ ρmin and c(¬A⇒¬B) ≥ mc are satisfied. Now inequality ρAB ≥ ρmin holds, so we only need to prove that c(¬A⇒¬B) ≥ mc is established. In other words, we only need to prove that c(¬A⇒¬B) ≥ c(¬A⇒¬B′), because c(¬A⇒¬B′) ≥ mc.

c(¬A⇒¬B) − c(¬A⇒¬B′) = s(¬A ∪ ¬B)/s(¬A) − s(¬A ∪ ¬B′)/s(¬A) = (s(¬A ∪ ¬B) − s(¬A ∪ ¬B′))/s(¬A). (27)

Because Eqs. (28) and (29) hold:

s(¬A ∪ ¬B) = s(¬A) − s(¬A ∪ B), (28)

s(¬A ∪ ¬B′) = s(¬A) − s(¬A ∪ B′), (29)

Eq. (27) can be rewritten as:

c(¬A⇒¬B) − c(¬A⇒¬B′) = (s(¬A ∪ B′) − s(¬A ∪ B))/s(¬A). (30)

In Theorem 3, we have proved that s(¬A ∪ B′) ≥ s(¬A ∪ B). Thus, we can get c(¬A⇒¬B) ≥ c(¬A⇒¬B′). In other words, ¬A⇒¬B can be derived from ¬A⇒¬B′, and is a redundant rule of ¬A⇒¬B′. Proof is completed. □

Theorems 1 to 4 characterize the redundant PNARs, and we can prune the redundant rules based on these four theorems.
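As a quick numeric sanity check of Theorem 1 against Table 1 (our own arithmetic, not from the paper), with s(b)=0.7, s(b ∪ d)=0.6 and s(b ∪ cd)=0.3:

    s_b, s_bd, s_bcd = 0.7, 0.6, 0.3   # supports from Table 1

    c_b_cd = s_bcd / s_b    # c(b => cd), about 0.43
    c_b_d  = s_bd / s_b     # c(b => d),  about 0.86
    assert c_b_d >= c_b_cd  # as Theorem 1 predicts: b => d is valid
                            # whenever b => cd is, and is its redundant rule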


Table 5
The summary of datasets.

Datasets   Number of items   Maximum number of items   Average items per transaction   Number of transactions
Mushroom   23                17                        13                              8124
Nursery    28                9                         9                               12960
Chess      76                37                        37                              3196
DS1        100               72                        47                              256
DS2        80                70                        47                              2000
DS3        100               76                        34                              3430
DS4        80                28                        40                              1500
DS5        60                27                        45                              1000

Fig. 2. Experimental results of PNARCMC on different datasets (number of rules of each form A⇒B, A⇒¬B, ¬A⇒B, ¬A⇒¬B vs. minimum confidence).


5.2. An example

In this section, we show how to determine the four types of redundant association rules through the following example.

Example 2. Consider the transaction database TD in Table 1, with ms=0.3, mc_11=mc_10=mc_01=mc_00=0.3, and ρmin=0. We can discover the PNARs based on the PNARCMC model. All itemsets and rules, together with their specific information such as support, confidence and the sign of the correlation coefficient, are shown in Table 3. Now, we discuss the following four cases separately.

Table 3
PNARs generated by PNARCMC.

Items   PNAR         c(∗)   s(X)   s(Y)   s(X ∪ Y)   ρAB
ab      a⇒¬b         0.4    0.5    0.7    0.3        <0
ab      b⇒¬a         0.57   0.7    0.5    0.3        <0
ab      ¬b⇒a         0.67   0.7    0.5    0.3        <0
ab      ¬a⇒b         0.8    0.5    0.7    0.3        <0
ac      a∪c          -      0.5    0.6    0.3        =0
ad      a∪d          -      0.5    0.6    0.3        =0
bc      b⇒¬c         0.43   0.7    0.6    0.4        <0
bc      c⇒¬b         0.33   0.6    0.7    0.4        <0
bc      ¬b⇒c         0.67   0.7    0.6    0.4        <0
bc      ¬c⇒b         0.75   0.6    0.7    0.4        <0
bd      d⇒b          1      0.6    0.7    0.6        >0
bd      b⇒d          0.86   0.7    0.6    0.6        >0
bd      ¬d⇒¬b        0.75   0.6    0.7    0.6        >0
bd      ¬b⇒¬d        1      0.7    0.6    0.6        >0
bf      b⇒¬f         0.57   0.7    0.5    0.3        <0
bf      f⇒¬b         0.4    0.5    0.7    0.3        <0
bf      ¬f⇒b         0.8    0.5    0.7    0.3        <0
bf      ¬b⇒f         0.67   0.7    0.5    0.3        <0
cd      c⇒¬d         0.5    0.6    0.6    0.3        <0
cd      d⇒¬c         0.5    0.6    0.6    0.3        <0
cd      ¬d⇒c         0.75   0.6    0.6    0.3        <0
cd      ¬c⇒d         0.75   0.6    0.6    0.3        <0
cf      c∪f          -      0.6    0.5    0.3        =0
abd     (ad)⇒b       1      0.3    0.7    0.3        >0
abd     b⇒(ad)       0.43   0.7    0.3    0.3        >0
abd     ¬(ad)⇒¬b     0.43   0.3    0.7    0.3        >0
abd     ¬b⇒¬(ad)     1      0.7    0.3    0.3        >0
abd     (ab)⇒d       1      0.3    0.6    0.3        >0
abd     d⇒(ab)       0.5    0.6    0.3    0.3        >0
abd     ¬(ab)⇒¬d     0.57   0.3    0.6    0.3        >0
abd     ¬d⇒¬(ab)     1      0.6    0.3    0.3        >0
abd     a∪(bd)       -      0.5    0.6    0.3        =0
bcd     (bc)⇒d       0.75   0.4    0.6    0.3        >0
bcd     d⇒(bc)       0.5    0.6    0.4    0.3        >0
bcd     ¬d⇒¬(bc)     0.75   0.6    0.4    0.3        >0
bcd     ¬(bc)⇒¬d     0.5    0.4    0.6    0.3        >0
bcd     (cd)⇒b       1      0.3    0.7    0.3        >0
bcd     b⇒(cd)       0.43   0.7    0.3    0.3        >0
bcd     ¬b⇒¬(cd)     1      0.7    0.3    0.3        >0
bcd     ¬(cd)⇒¬b     0.43   0.3    0.7    0.3        >0
bcd     (bd)⇒¬c      0.5    0.6    0.6    0.3        <0
bcd     c⇒¬(bd)      0.5    0.6    0.6    0.3        <0
bcd     ¬(bd)⇒c      0.75   0.6    0.6    0.3        <0
bcd     ¬c⇒(bd)      0.75   0.6    0.6    0.3        <0

Case 1. Redundant rules of type A⇒B.
From Table 3, we can see that ρb,cd > ρmin and ρb,d > ρmin, so b⇒cd and b⇒d are positive association rules. According to Theorem 1, because d ⊂ cd, b⇒d is a redundant rule of b⇒cd and can be deleted. However, there is a negative correlation between itemsets b and c because ρb,c < −ρmin, so b⇒c is not a valid association rule.

Case 2. Redundant rules of type A⇒¬B.
According to Theorem 2, ρc,b < −ρmin, ρc,bd < −ρmin and b ⊂ bd, so c⇒¬(bd) is a valid negative association rule and is a redundant rule of c⇒¬b. Hence, c⇒¬(bd) can be deleted.

Case 3. Redundant rules of type ¬A⇒B.
According to Theorem 3, ρc,b < −ρmin, ρc,d < −ρmin, ρc,bd < −ρmin and b, d ⊂ bd, so ¬c⇒b and ¬c⇒d are valid negative association rules and are redundant rules of ¬c⇒(bd). Thus, ¬c⇒b and ¬c⇒d can be deleted.

Case 4. Redundant rules of type ¬A⇒¬B.
According to Theorem 4, ρb,d > ρmin, ρb,ad > ρmin, ρb,cd > ρmin and d ⊂ ad, d ⊂ cd, so ¬b⇒¬(ad) and ¬b⇒¬(cd) are valid negative association rules and are redundant rules of ¬b⇒¬d. Hence, ¬b⇒¬(ad) and ¬b⇒¬(cd) can be deleted.

5.3. LOGIC Algorithm

According to Theorems 1–4, the pseudo code of LOGIC for pruning redundant PNARs is given in Table 4. We suppose the frequent itemsets are mined by any existing frequent itemset mining algorithm, such as Apriori or FP-Growth, and saved in the set FIS.

Table 4
The pseudo code of LOGIC.

Algorithm: LOGIC.
Input: mc_11, mc_10, mc_01, mc_00, ρmin and FIS;
Output: NPAR: set of non-redundant PARs;
        NNAR: set of non-redundant NARs;
(1)  Call Algorithm PNARCMC;
(2)  for (any rule Y in PAR ∪ NAR) {
(3)    if (Y is a rule of type A⇒B) {
(4)      for (any B′ ⊂ B) {
(5)        if (ρAB′ ≥ ρmin) {
(6)          delete rule A⇒B′ from PAR; }}}
(7)    if (Y is a rule of type ¬A⇒¬B′) {
(8)      for (any B ⊃ B′) {
(9)        if (ρAB ≥ ρmin) {
(10)         delete rule ¬A⇒¬B from NAR; }}}
(11)   if (Y is a rule of type A⇒¬B′) {
(12)     for (any B ⊃ B′) {
(13)       if (ρAB ≤ −ρmin) {
(14)         delete rule A⇒¬B from NAR; }}}
(15)   if (Y is a rule of type ¬A⇒B) {
(16)     for (any B′ ⊂ B) {
(17)       if (ρAB′ ≤ −ρmin) {
(18)         delete rule ¬A⇒B′ from NAR; }}}}
(19)  NPAR = PAR and NNAR = NAR;
(20)  Return NPAR and NNAR.
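As a companion to Table 4, here is a brief Python sketch of the pruning pass (our own rendering; rules are assumed to be stored as (antecedent, consequent, form) triples, and rho(A, B) is a hypothetical helper that looks up supports and applies Eq. (14)):

    from itertools import combinations

    def logic_prune(pars, nars, rho, rho_min):
        """Prune redundant PNARs per Theorems 1-4 (sketch of Table 4)."""
        def proper_subsets(b):
            return [frozenset(c) for r in range(1, len(b))
                    for c in combinations(sorted(b), r)]

        npar, nnar = set(pars), set(nars)
        for (a, b, form) in pars | nars:
            if form == 'A=>B':        # Theorem 1: A=>B' (B' in B) is redundant
                for b2 in proper_subsets(b):
                    if rho(a, b2) >= rho_min:
                        npar.discard((a, b2, 'A=>B'))
            elif form == '-A=>B':     # Theorem 3: -A=>B' (B' in B) is redundant
                for b2 in proper_subsets(b):
                    if rho(a, b2) <= -rho_min:
                        nnar.discard((a, b2, '-A=>B'))
            elif form in ('A=>-B', '-A=>-B'):
                # Theorems 2 and 4: a negative-consequent rule is redundant if
                # a valid rule of the same form exists on a subset consequent
                if any((a, b2, form) in nars for b2 in proper_subsets(b)):
                    nnar.discard((a, b, form))
        return npar, nnar

On Example 2, this pass removes b⇒d, c⇒¬(bd), ¬c⇒b, ¬c⇒d, ¬b⇒¬(ad) and ¬b⇒¬(cd), exactly the rules identified in Cases 1–4.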
specific experimental parameters are described in Table 6.
6. Experiment and results

6.1. Datasets

Our programs are implemented in Java on a 64-bit Windows 7 Professional PC with 8 Intel Core i7 CPUs at 3.40 GHz and 8 GB memory.

Eight databases are used in this experiment. The mushroom, nursery and chess databases are real databases taken from the UCI datasets (http://archive.ics.uci.edu/ml/index.php). The other datasets (DS1, DS2, DS3, DS4 and DS5) are synthetic datasets generated by the IBM data generator. More details on these databases, including the number of items and transactions, are shown in Table 5.

6.2. Experiment results

Now we demonstrate the effectiveness of our methods from the following three aspects.

1. The PNARCMC model can flexibly control the numbers of each type of rules.

In order to highlight the characteristics of multi-confidence, we only need to compare the PNARCMC model with a common single-confidence model. To keep the variable unique, mc is changed while the other variables, including ms and ρmin, are fixed. As shown in Fig. 2, the number of each of the four types of PNARs generated by the PNARCMC model varies with confidence. The abscissa of each graph represents the mc used by the PNARCMC model on that dataset; the first five abscissa values of each graph indicate that the mc values for the four types of PNARs are the same (mc_11 = mc_10 = mc_01 = mc_00 = mc). In contrast, the abscissas mc1 and mc2 represent that each type of association rule has its own mc (different values for mc_11, mc_10, mc_01 and mc_00). The specific experimental parameters are described in Table 6.

Table 6
The value of each parameter.

Datasets   ms      ρmin   mc1 (mc_11, mc_00, mc_10, mc_01)   mc2 (mc_11, mc_00, mc_10, mc_01)
Mushroom   0.37    0.1    0.9, 0.9, 0.5, 0.7                 0.7, 0.9, 0.3, 0.1
Nursery    0.023   0.1    0.5, 0.9, 0.9, 0.5                 0.3, 0.7, 0.7, 0.3
Chess      0.88    0.1    0.9, 0.7, 0.3, 0.5                 0.7, 0.5, 0.1, 0.3
DS1        0.15    0.1    0.9, 0.7, 0.3, 0.3                 0.9, 0.5, 0.1, 0.1
DS2        0.85    0.1    0.9, 0.7, 0.7, 0.7                 0.7, 0.3, 0.5, 0.5
DS3        0.3     0.1    0.6, 0.5, 0.3, 0.4                 0.5, 0.4, 0.2, 0.3
DS4        0.33    0.1    0.6, 0.5, 0.5, 0.5                 0.5, 0.4, 0.4, 0.4
DS5        0.5     0.1    0.9, 0.7, 0.5, 0.7                 0.9, 0.5, 0.3, 0.5

Considering Fig. 2, although the number of each type of rules decreases as mc increases, the extents of the decrease differ. If we use a higher mc, some valuable PARs would be missed; if we use a lower mc, the number of PNARs would be very large. However, if we set a different mc for each type of association rule, the above situations do not happen. The experimental results show that our model PNARCMC can flexibly control the numbers of each type of rules by the use of multi-confidence thresholds.


2. The PNARCMC model can prune PNARs with weak correlation.

Similar to the first experiment, the first step in this case is also to ensure the uniqueness of the variable; accordingly, ρmin is changed while mc (with value 0.1) is fixed. Fig. 3 shows the changes of the PNARs mined by the PNARCMC model with different ρmin (ρmin=0 means that the PNARs are not pruned by the correlation coefficient). The numbers of PARs and NARs decrease as ρmin increases, which indicates that the number of PNARs pruned by PNARCMC increases as ρmin increases. Hence, it proves that the PNARCMC model can effectively prune the PNARs with weak correlation.

Fig. 3. The number of PNARs changes with ρmin on different datasets (number of PARs and NARs vs. ρmin).

3. The LOGIC method can prune redundant PNARs effectively.

To the best of our knowledge, LOGIC is the first algorithm to prune redundant NARs. Therefore, there are no relevant algorithms to compare against in terms of the number of pruned association rules. We evaluate the effectiveness and efficiency of LOGIC by comparing the number of rules mined by PNARCMC with the number of rules mined by LOGIC.


Fig. 4. The comparison of the number of PARs (Num_PAR vs. Num_NPAR against minimum confidence).

Experimental results are shown in Figs. 4 and 5. We use Num_PAR to represent the number of PARs mined by PNARCMC, Num_NAR to represent the number of NARs mined by PNARCMC, Num_NPAR to represent the number of PARs mined by LOGIC, and Num_NNAR to represent the number of NARs mined by LOGIC. Fig. 4 is a comparison of the number of PARs and Fig. 5 is a comparison of the number of NARs. As we can see from these two figures, the number of association rules mined by LOGIC is far less than the number mined by PNARCMC, no matter whether the rule is a PAR or a NAR. In addition, Table 7 presents the pruning rate of PNARs based on LOGIC on different datasets. It can be seen that the LOGIC algorithm achieves a good pruning rate on all datasets. In particular, when the confidence is equal to 0.1, the pruning rate on the chess dataset is more than 81%. The experimental results thus show that LOGIC can prune redundant PNARs effectively.

6.3. Scalability test

LOGIC devotes itself to extracting non-redundant PNARs from frequent itemsets. Thus, its performance is affected by the size of the frequent itemsets.


Fig. 5. The comparison of the number of NARs (Num_NAR vs. Num_NNAR against minimum confidence).

Table 7
The pruning rate (%) of association rules on different datasets.

mc / Dataset   Mushroom   Nursery   Chess   DS1    DS2    DS3    DS4    DS5
0.1            81.5       51.0      81.6    77.4   73.2   70.4   55.3   74.2
0.2            81.4       52.9      80.5    76.4   72.6   70.0   54.6   74.3
0.3            81.0       54.2      80.0    76.2   71.8   67.6   53.1   73.1
0.4            80.7       54.3      79.8    75.8   71.0   65.8   53.1   73.1
0.5            79.7       54.1      79.5    75.8   70.2   62.2   51.0   73.1
0.6            78.9       56.7      79.2    75.5   68.3   58.2   45.7   71.8
0.7            77.2       58.6      79.0    75.6   68.8   54.7   44.8   71.0
0.8            75.9       58.1      78.8    75.5   68.8   45.4   44.3   62.3
0.9            72.6       58.9      78.2    71.2   63.2   43.6   43.2   52.3
Average        78.7       55.4      79.6    75.5   69.8   59.8   49.5   69.5
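The paper does not spell out how the pruning rate is computed; it appears to be the fraction of PNARCMC's rules that LOGIC removes. Under that reading (our assumption), it would be:

    def pruning_rate(num_par, num_nar, num_npar, num_nnar):
        """Percentage of mined PNARs pruned by LOGIC (assumed definition)."""
        mined = num_par + num_nar
        kept = num_npar + num_nnar
        return 100.0 * (mined - kept) / mined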


The scalability test is used to evaluate the performance of LOGIC on large sets of frequent itemsets. In order to avoid contingency in the experimental results, we run LOGIC 30 times independently on each of the datasets (Mushroom, Nursery, Chess, DS1, DS2, DS3, DS4 and DS5) and record the average running time. The results given in Fig. 6 show that the runtime of LOGIC grows roughly linearly with the number of frequent itemsets under different minimum confidences. The results of this scalability test show that LOGIC works particularly well on huge sets of frequent itemsets.

Fig. 6. Scalability test (runtime vs. number of frequent itemsets, mc = 0.1–0.5) on the different datasets.

7. Conclusions and future work

There are a large number of association rules when mining PARs and NARs simultaneously. A significant number of approaches have been proposed to prune redundant PARs. Those methods, however, do not work when we take NARs into consideration. In fact, the number of NARs is theoretically three times that of PARs. How to prune redundant NARs thus becomes an open and challenging problem. In order to solve this problem, in this paper we propose a novel method named LOGIC to prune redundant PNARs, which can not only reduce the number of PNARs but also preserve the same information.


We show that LOGIC can work out the redundant rules for each type of PNAR through corresponding pruning strategies. In addition, in order to ensure that the produced PNARs are strongly correlated and that their number can be controlled flexibly, we also propose a new model named PNARCMC that combines the correlation coefficient and multiple minimum confidences. Experimental results show that the proposed PNARCMC model can not only control the number of each type of rules, but also prune the weakly correlated rules. The results also show that LOGIC works well for pruning redundant PNARs.

Finally, there are two aspects to be noted. LOGIC works well for pruning association rules no matter whether their consequent set is a positive itemset or a negative itemset, but it does not address how to prune the antecedents of PNARs. Moreover, LOGIC does not work for pruning rules (e.g. ¬ab⇒c¬d) that contain both positive and negative itemsets in the antecedent or consequent. These two issues play important roles in many areas of data mining, including web mining and recommender systems, and will be the focus of our future work. In addition, in order to obtain more useful results efficiently on large-scale datasets with complex structures and high-dimensional features, we will research parallel algorithms on parallel platforms.

Declaration of interest

None.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (71271125) and the Natural Science Foundation of Shandong Province, China (ZR2018MF011).

References

[1] T. Imielinski, A. Swami, R. Agrawal, Mining association rules between sets of items in large databases, ACM SIGMOD Rec. 22 (2) (1993) 207–216.
[2] X. Wu, C. Zhang, S. Zhang, Efficient mining of both positive and negative association rules, ACM Trans. Inf. Syst. 22 (3) (2004) 381–405.
[3] A. Abdul-Wahabal-Opahi, B. Mohamad, An efficient algorithm to automated discovery of interesting positive and negative association rules, Int. J. Adv. Comput. Sci. Appl. 6 (6) (2015) 169–173.
[4] L. Cao, X. Dong, Z. Zheng, e-NSP: efficient negative sequential pattern mining, Artif. Intell. 235 (2016) 156–182.
[5] K. Li, X. Tang, B. Veeravalli, K. Li, Scheduling precedence constrained stochastic tasks on heterogeneous cluster systems, IEEE Trans. Comput. 64 (1) (2014) 191–204.
[6] X. Dong, Y. Gong, L. Cao, F-NSP+: a fast negative sequential patterns mining method with self-adaptive data storage, Pattern Recognit. 84 (2018) 13–27.
[7] X. Dong, S. Wang, H. Song, Approach for mining positive & negative association rules based on 2-level support, Comput. Eng. 31 (10) (2005) 16–18.
[8] M.D. Cock, C. Cornelis, E.E. Kerre, Elicitation of fuzzy association rules from positive and negative examples, Fuzzy Sets Syst. 149 (1) (2005) 73–85.
[9] M. Gan, M.Y. Zhang, S.W. Wang, One extended form for negative association rules and the corresponding mining algorithm, in: Proceedings of the International Conference on Machine Learning and Cybernetics, Vol. 3, 2005, pp. 1716–1721.
[10] X. Zhou, K. Li, Y. Zhou, K. Li, Adaptive processing for distributed skyline queries over uncertain data, IEEE Trans. Knowl. Data Eng. 28 (2) (2016) 371–384.
[11] X. Dong, F. Sun, X. Han, R. Hou, Study of positive and negative association rules based on multi-confidence and chi-squared test, Lect. Notes Comput. Sci. 4093 (2006) 100–109.
[12] X. Dong, Z. Niu, X. Shi, X. Zhang, D. Zhu, Mining both positive and negative association rules from frequent and infrequent itemsets, in: Proceedings of the ADMA, 2007, pp. 122–133.
[13] Y. Gong, T. Xu, X. Dong, G. Lv, e-NSPFI: efficient mining negative sequential pattern from both frequent and infrequent positive sequential patterns, Int. J. Pattern Recognit. Artif. Intell. 31 (2) (2017) 1750002.
[14] T. Xu, X. Dong, J. Xu, Y. Gong, E-MSNSP: efficient negative sequential patterns mining based on multiple minimum supports, Int. J. Pattern Recognit. Artif. Intell. 31 (2) (2017) 1213–1218.
[15] J. Chen, K. Li, Z. Tang, K. Bilal, S. Yu, C. Weng, K. Li, A parallel random forest algorithm for big data in a Spark cloud computing environment, IEEE Trans. Parallel Distrib. Syst. 28 (4) (2017) 919–933.
[16] M.J. Zaki, Mining non-redundant association rules, Data Min. Knowl. Discov. 9 (3) (2004) 223–248.
[17] M.J. Zaki, CHARM: an efficient algorithm for closed itemset mining, in: Proceedings of the SIAM International Conference on Data Mining, 2002, pp. 457–473.
[18] M.J. Zaki, B. Phoophakdee, MIRAGE: a framework for mining, exploring and visualizing minimal association rules, J. Account. Econ. 18 (3) (2003) 289–324.
[19] A. Boudane, S. Jabbour, L. Sais, Y. Salhi, Enumerating non-redundant association rules using satisfiability, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2017, pp. 824–836.
[20] K. Li, W. Yang, K. Li, Performance analysis and optimization for SpMV on GPU using probabilistic modeling, IEEE Trans. Parallel Distrib. Syst. 26 (1) (2015) 196–205.
[21] A. Boudane, S. Jabbour, L. Sais, Y. Salhi, A SAT-based approach for mining association rules, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2016, pp. 2472–2478.
[22] S. Jabbour, L. Sais, Y. Salhi, Decomposition Based SAT Encodings for Itemset Mining Problems, Springer International Publishing, 2015.
[23] X. Dong, C. Liu, T. Xu, D. Wang, Select actionable positive or negative sequential patterns, J. Intell. Fuzzy Syst. 29 (6) (2015) 2759–2767.
[24] M. Apelo, Mine rare and non-redundant quantitative association rules, Schizophr. Res. 3 (1) (2015) 66–67.
[25] C. Chen, K. Li, A. Ouyang, Z. Zeng, K. Li, GFlink: an in-memory computing architecture on heterogeneous CPU-GPU clusters for big data, IEEE Trans. Parallel Distrib. Syst. 29 (6) (2018) 1275–1288.
[26] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, L. Lakhal, Mining minimal non-redundant association rules using frequent closed itemsets, Lect. Notes Comput. Sci. 1861 (2000) 972–986.
[27] M. Kryszkiewicz, Representative association rules and minimum condition maximum consequence association rules, in: Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, PKDD '98, Nantes, France, September 23–26, 1998, pp. 361–369.
[28] A.N. Tran, V.D. Hai, T.C. Truong, B.H. Le, Efficient algorithms for mining frequent itemsets with constraint, in: Proceedings of the Third International Conference on Knowledge and Systems Engineering, 2011, pp. 19–25.
[29] S.Y. Wei, J.I. Gen-Lin, Q.U. Wei-Guang, Pruning and clustering discovered association rules, J. Chin. Comput. Syst. 27 (1) (2006) 110–113.
[30] K. Li, C. Liu, K. Li, A.Y. Zomaya, A framework of price bidding configurations for resource usage in cloud computing, IEEE Trans. Parallel Distrib. Syst. 27 (8) (2016) 2168–2181.
[31] P. Fournier-Viger, V.S. Tseng, Mining Top-K Non-redundant Association Rules, Springer Berlin Heidelberg, 2012.
[32] J. Cohen, Statistical power analysis for the behavioral sciences (revised edition), Technometrics 31 (4) (1989) 499–500.
[33] X.J. Dong, S.J. Wang, H.T. Song, Y.C. Lu, Study on negative association rules, J. Beijing Inst. Technol. 11 (2004) 978–981.
[34] H. Zhu, Z. Xu, An effective algorithm for mining positive and negative association rules, in: Proceedings of the International Conference on Computer Science and Software Engineering, 2008, pp. 455–458.
[35] C. Chen, K. Li, A. Ouyang, Z. Tang, K. Li, GPU-accelerated parallel hierarchical extreme learning machine on Flink for big data, IEEE Trans. Syst. Man Cybern. Syst. 47 (10) (2017) 2740–2753.
[36] T.T. Pham, J. Luo, T.P. Hong, B. Vo, An efficient method for mining non-redundant sequential rules using attributed prefix-trees, Eng. Appl. Artif. Intell. 32 (2014) 88–99.
[37] M.T. Tran, B. Le, B. Vo, T.P. Hong, Mining non-redundant sequential rules with dynamic bit vectors and pruning techniques, Appl. Intell. 45 (2) (2016) 1–10.
[38] P. Fournier-Viger, V.S. Tseng, TNS: mining top-k non-redundant sequential rules, in: Proceedings of the 28th Annual ACM Symposium on Applied Computing (SAC '13), ACM, New York, NY, USA, 2013, pp. 164–166.
[39] P. Fournier-Viger, V.S. Tseng, Mining top-k sequential rules, in: Proceedings of Advanced Data Mining and Applications - International Conference, ADMA, Beijing, China, 2011, pp. 180–194.
[40] A. Kumar, An efficient algorithm to mine non redundant top k association rules, Int. J. Emerg. Trends Sci. Technol. 3 (1) (2016) 3491–3500.
[41] G. Feng, X.J. Ying, Algorithm for generating non-redundant association rules, J. Shanghai Jiaotong Univ. 35 (2) (2001) 256–258.
[42] C. Liu, K. Li, K. Li, Minimal cost server configuration for meeting time-varying resource demands in cloud centers, IEEE Trans. Parallel Distrib. Syst. 29 (11) (2018) 2503–2513.
[43] C. Liu, K. Li, C. Xu, K. Li, Strategy configurations of multiple users competition for cloud service reservation, IEEE Trans. Parallel Distrib. Syst. 27 (2) (2016) 508–520.
[44] V. Lan, G. Alaghband, Novel parallel method for association rule mining on multi-core shared memory systems, Parallel Comput. 40 (10) (2014) 768–785.
[45] Y. Djenouri, D. Djenouri, Z. Habbas, Intelligent mapping between GPU and cluster computing for discovering big association rules, Appl. Soft Comput. 65 (2018) 387–399.
[46] G. Agapito, P.H. Guzzi, M. Cannataro, Parallel extraction of association rules from genomics data, Appl. Math. Comput. (2017) 1–13.


Xiangjun Dong was born in Weifang, Shandong, China in 1968. He received the M.E. and Ph.D. degrees in computer applications from Shandong Industrial University in 1999 and Beijing Institute of Technology in 2005, respectively. From 2007 to 2009, he held a postdoctoral position in the School of Management and Economics, Beijing Institute of Technology. From 2009 to 2010, he was a visiting professor at the University of Technology Sydney, Australia. He is currently a Professor and master's tutor in the School of Information, Qilu University of Technology in Jinan, China. He has presided over and participated in more than 10 vertical projects and many horizontal projects, including the National Natural Science Foundation of China (NSFC) and the Natural Science Foundation of Shandong Province. He is the author of more than 70 academic papers. His research interests include data mining, association rules, sequential pattern mining and negative sequential pattern mining. Dr. Dong is a director of the Shandong Computer Federation. He is a member of ACM and has served as a program committee member for PRICAI2012, PAKDD2011, PAKDD2009, AI2008, AI2009, AI2010, AI2011, AI2012, ADMA'09, ADMA08 and ADMA07, and as a Session Chair of ADMA08. He is a reviewer for IEEE Transactions on Knowledge and Data Engineering and Knowledge-Based Systems. He won the Shandong Provincial Department of Education outstanding scientific research award.

Feng Hao was born in Dezhou, Shandong, China in 1994. He received the B.E. degree in computer applications from Qilu University of Technology in 2016, and he is currently a postgraduate student in the computer applications department of the school. His research interests cover the fields of data mining, association rules, and positive and negative sequential pattern mining.

Long Zhao was born in Qiqihaer, Heilongjiang, China in 1984. He received the B.S. degree in computer science and technology from the Northeast Petroleum University in 2005, his M.S. degree in computer science and technology from Shandong Polytechnic University in 2009, and his Ph.D. degree in computer software and theory from Wuhan University in 2016. He is currently a lecturer in the School of Information, Qilu University of Technology. He is the author of one book and 12 articles. His research interests include image processing, machine learning, and knowledge discovery.

Tiantian Xu received her B.E. and M.E. degrees in computer applications from Qilu University of Technology in 2012 and 2015, respectively. She received her Ph.D. degree in software engineering from Ocean University of China in 2018. She is a lecturer in the School of Information, Qilu University of Technology (Shandong Academy of Sciences). Her research interests include pattern recognition, association rules, and sequential pattern mining. She has published research papers in international journals.
