You are on page 1of 6

Hiding Sensitive Predictive Association Rules

Shyue-Liang Wang, Ayat Jafari


Department of Computer Science
New York Institute of Technology
New York, New York USA
slwang@nyit.edu
Abstract - Privacy-preserving data mining [22], is a novel proposed for this type of output privacy
research direction in data mining and statistical databases, [1,6,7,8,9,17,18,20,23]. For example, perturbation,
where data mining algorithms are analyzed for the side blocking, aggregation or merging, swapping, and sampling
effects they incur in data privacy. For example, through are some alternation methods that have recently been
data mining, one is able to infer sensitive information, proposed. The second type of privacy is that the data is
including personal information or even patterns, from non- manipulated so that the mining result is not affected or
sensitive information or unclassified data. There have been minimally affected [10,11,12,13,21]. For example, the
two types of privacy concerning data mining. The first type cryptography-based techniques like secure multiparty
of privacy is that the data is altered so that the mining computation allow users access to only a subset of data
result will preserve certain privacy. The second type of while global data mining results can still be discovered.
privacy is that the data is manipulated so that the mining The reconstruction-based technique where the original
result is not affected or minimally affected. distribution of the data can be reconstructed from the
randomized data is another method for this type of input
privacy.
Given specific rules to be hidden, many data altering
techniques for hiding association, classification and
Given specific rules to be hidden, many data altering
clustering rules have been proposed. However, to specify
techniques for hiding association, classification and
hidden rules, entire data mining process needs to be
clustering rules have been proposed. However, to specify
executed. For some applications, we are only interested in
hidden rules, entire data mining process needs to be
hiding certain sensitive predicative rules that contain given
executed. For some applications, we are only interested in
items. In this work, we assume that only sensitive items
hiding certain sensitive predicative rules that contain given
are given and propose two algorithms, ISL (Increase
items. In this work, we assume that only sensitive items
Support of LHS) and DSR (Decrease Support of RHS), to
are given and propose two algorithms to modify data in
modify data in database so that sensitive predicative rules
database so that sensitive predicative rules containing
containing specified items on the left hand side of rule
specified items on the left hand side of rule cannot be
cannot be inferred through association rule mining.
inferred through association rule mining. The proposed
Examples illustrating the proposed algorithms are given.
algorithms are based on modifying or perturbing the
The characteristics of the algorithms are analyzed. The
database transactions so that the confidence of the
efficiency of the proposed approach is further compared
association rules can be reduced. Examples demonstrating
with Verykios etc [9,23] approach. It is shown that our
the proposed algorithms are shown. The characteristics of
approach required less number of databases scanning and
the proposed algorithms are analyzed. The efficiency of
prune more number of hidden rules. However, our
the proposed approach is further compared with Verykios
approach must hide all rules containing the hidden items
etc approach. It is shown that our approach required less
on the left hand side, where Verykios etc approach can
number of databases scanning and prune more number of
hide any specific rule.
hidden rules. However, our approach must hide all rules
containing the hidden items on the left hand side, where
Keywords: privacy preserving data mining, predictive
Verykios etc approach can hide any specific rule.
association rule.

1 Introduction The rest of the paper is organized as follows. Section


2 presents the statement of the problem and the notation
The concept of privacy preserving data mining has used in the paper. Section 3 presents the proposed
been recently proposed in response to the concerns of algorithms for hiding sensitive predictive association rules
preserving personal information from data mining that contain the specified items. Section 4 shows some
algorithms [3,4,5,15,16,22]. There have been two types of examples of the proposed algorithms. Section 5 analyses
privacy concerning data mining. The first type of privacy the characteristics of proposed algorithms and further
is that the data is altered so that the mining result will compare with Verykios etc approach. Concluding remarks
preserve certain privacy. Many techniques have been and future works are described in section 6.
2 Problem Statement the smallest rule set that makes the same prediction as the
association rule set by confidence priority.
2.1 Predicative Association Rules
The problem of mining association rules was 2.2 Problem Description
introduced in [2]. Let I = { i1 , i 2 , L , i m } be a set of The objective of data mining is to extract hidden or
literals, called items. Given a set of transactions D, where potentially unknown interesting rules or patterns from
databases. However, the objective of privacy preserving
each transaction T is a set of items such that T ⊆ I , an
data mining is to hide certain sensitive information so that
association rule is an expression X ⇒ Y where X ⊆ I, they cannot be discovered through data mining techniques
Y ⊆ I , and X ∩ Y = φ . The X and Y are called [1,4-12,16]. In this work, we assume that only sensitive
items are given and propose two algorithms to modify data
respectively the body (left hand side) and head (right hand
in database so that sensitive predictive association rules
side) of the rule. An example of such a rule is that 90% of
cannot be inferred through association rule mining. More
customers buy hamburgers also buy Coke. The 90% here is
specifically, given a transaction database D, a minimum
called the confidence of the rule, which means that 90% of
support, a minimum confidence and a set of sensitive items
transaction that contains X also contains Y. The confidence
X, the objective is to modify the database D such that no
| X ∪Y | predictive association rules containing X on the left hand
is calculated as . The support of the rule is the
|X| side will be discovered.
percentage of transactions that contain both X and Y, which
As an example, for a given database in Table 1, a
| X ∪Y | minimum support of 33%, a minimum confidence of 70%,
is calculated as , where N is the number of
N and a hidden item X = {C}, if transaction T5 is modified as
transactions in D. In other words, the confidence of a rule AC, then the following rules that contain item C on the left
measures the degree of the correlation between itemsets, hand side will be hidden: C=>B (50%, 60%), AC=>B (50%,
while the support of a rule measures the significance of the 60%), C=>AB(50%, 60%).
correlation between itemsets. The problem of mining
association rules is to find all rules that are greater than the The following notation will be used in the paper.
user-specified minimum support and minimum confidence. Each database transaction has three elements: T=<TID,
list_of_elements, size>. The TID is the unique identifier of
As an example, for a given database in Table 1, a the transaction T and list_of_elements is a list of all items
minimum support of 33% and a minimum confidence of in the database. However, each element has value 1 if the
70%, nine association rules can be found as follows: B=>A corresponding item is supported by the transaction and 0
(66%, 100%), C=>A (66%, 100%), B=>C (50%, 75%), otherwise. Size means the number of elements in the
C=>B (50%, 75%), AB=>C (50%, 75%), AC=>B (50%, list_of_elements having value 1. For example, if I =
75%), BC=>A(50%, 100%), C=>AB(50%, 75%), {A,B,C}, a transaction that has the items {A, C} will be
B=>AC(50%, 75%). represented as t = <T1,101,2>. In addition, a transaction t
supports an itemset I when the elements of
Table 1: Database D t.list_of_elements corresponding to items of I are all set to
TID Items 1. A transaction t partially supports an itemset I when the
T1 ABC elements are not all set to 1. For example, if I = {A,B,C} =
T2 ABC [111], p = <T1,[111],3>, and q = <T2,[001],1>, then we
T3 ABC would say that p supports I and q partially supports I.
T4 AB
T5 A 3 Proposed Algorithms
T6 AC
In order to hide an association rule, we can either
decrease its support or its confidence to be smaller than
However, mining association rules usually generates a pre-specified minimum support and minimum confidence.
large number of rules, most of which are unnecessary for To decrease the confidence of a rule, we can either (1)
the purpose of prediction. For example, given itemset for increase the support of X, i.e., the left hand side of the rule,
prediction P = {C}, the rule set that contains only two rules but not support of X ∪ Y, or (2) decrease the support of the
C=>A (66%, 100%), C=>B (50%, 75%), will generate the itemset X ∪ Y. For the second case, if we only decrease the
same predicted itemset Q = {A, B} as the nine association support of Y, the right hand side of the rule, it would reduce
rules found from Table 1. A predictive association rule set the confidence faster than simply reducing the support of X
(or informative rule set) [14] can be informally defined as ∪ Y. To decrease support of an item, we will modify one
item at a time in a selected transaction by changing from 1
to 0 and from 0 to 1 to increase the support. Output: a transformed database D’, where rules
containing X on LHS will be hidden
Based on these two strategies, we propose two data-
mining algorithms for hiding sensitive predictive 1. Find large 1-item sets from D;
association rules, namely Increase Support of LHS (ISL) 2. For each hidden item x ∈ X
and Decrease Support of RHS (DSR). The first algorithm 3. If x is not a large 1-itemset, then X := X -{x} ;
tries to increase the support of left hand side of the rule. 4. If H is empty, then EXIT;// no AR contains X in LHS
The second algorithm tries to decrease the support of the 5. Find large 2-itemsets from D;
right hand side of the rule. The details of the two 6. For each x ∈ X {
algorithms are described as follow. 7. For each large 2-itemset containing x {
8. Compute confidence of rule U, where U is a rule
Algorithm ISL like x → h ;
Input: (1) a source database D, 9. If confidence(U) < min_conf, then
(2) a min_support, 10. Go to next large 2-itemset ;
(3) a min_confidence, 11. Else { //Decrease Support of RHS
(4) a set of hidden items X 12. Find TR = { t in D | t fully support U} ;
Output: a transformed database D’, where rules 13. Sort TR in ascending order by the number of
containing X on LHS will be hidden Items;
14. While { confidence(U) >= min_conf and TR
1. Find large 1-item sets from D; is not empty) {
2. For each hidden item x ∈ X 15. Choose the first transaction t from TR;
3. If x is not a large 1-itemset, then X := X -{x} ; 16. Modify t so that h is not supported;
4. If H is empty, then EXIT;// no AR contains X in LHS 17. Compute support and confidence of U;
5. Find large 2-itemsets from D; 18. Remove and save the first transaction t
6. For each x ∈ X { from TR;
7. For each large 2-itemset containing x { 19. }; // end While
8. Compute confidence of rule U, where U is a rule 20. }; // end if
like x → h ; 21. If TR is empty, then {
9. If confidence(U) < min_conf, then 22. Can not hide x → h;
10. Go to next large 2-itemset; 23. Restore D;
11. Else { //Increase Support of LHS 24. Go to next large-2 itemset;
12. Find TL = { t in D | t does not support U} ; 25. } // end if TR is empty
13. Sort TL in ascending order by the number of 26. } //end of for each large 2-itemset
Items; 27. Remove x from X;
14. While { confidence(U) >= min_conf and TL is 28. } // end of for each x
not empty) { 29. Output updated D, as the transformed D’;
15. Choose the first transaction t from TL;
16. Modify t to support x, the LHS(U); 4 Examples
17. Compute support and confidence of U;
18. Remove and save the first transaction t from TL; This section shows four examples for demonstrating
19. }; // end While the two proposed algorithms in hiding sensitive predictive
20. }; // end if association rules in the association rule mining.
21. If TL is empty, then {
22. Can not hide x → h; For a given database in Table 1, a minimum support
23. Restore D; of 33% and a minimum confidence of 70%, the first two
24. Go to next large-2 itemset; examples hide the sensitive rules using the ISL algorithm.
25. } // end if TL is empty The difference of the two examples is that the order of
26. } //end of for each large 2-itemset hiding item is different. The first example hides item C and
27. Remove x from X; then item B. The second example hides item B and then
28. } // end of for each x item C. The result is given in section 4.1.
29. Output updated D, as the transformed D’;
The third and fourth examples hide the sensitive
Algorithm DSR predictive association rules using DSR algorithm. The
Input: (1) a source database D, difference is also the order of items to be hidden. The
(2) a min_support, result is given in section 4.2.
(3) a min_confidence,
(4) a set of hidden items X
Table 2: Database D using the specified notation 4.2 Examples Running DSR Algorithm
TID Items Size
T1 111 3 Example 3 Assuming that the min_supp=33% and
T2 111 3 min_conf=70%, the result of hiding item C and then item B
T3 111 3 using DSR algorithm is as follows. To hide item C, the
T4 110 2 rule C => A (60%, 100%), C => B (50%, 75%), B => C
T5 100 1 (50%, 75%), B => A (60%, 100%) will be hidden if
T6 101 2 transaction T6 is modified from 101 to 001, T1 is modified
from 111 to 011, T1 is modified from 011 to 001, and T4 is
modified from 110 to 010, using DSR. The new database
4.1 Examples Running ISL Algorithm D3 is shown in Table 5.

Example 1 Assuming that the min_supp = 33% and Table 5 Databases before and after hiding item C and item
min_conf = 70%, the result of hiding item C and then item B using DSR
B using ISL algorithm is as follows. To hide item C, the TID D D3
rule C => B (50%, 75%) will be hidden if transaction T5 is T1 111 001
modified from 100 to 101 using ISL (Increase Support of T2 111 111
LHS). The new database D1 is shown in Table 3. The rule T3 111 111
C => B will have support = 50% and confidence = 60%. T4 110 010
However, rules C => A, B => A, B => C cannot be hidden T5 100 100
by ISL algorithm. T6 101 001

Table 3 Databases before and after hiding item C and item Example 4 As in example 3, reversing the order of hiding
B using ISL items, the result of hiding item B and then item C using
TID D D1 DSR algorithm is as follows. To hide item B, the rule B
T1 111 111 => A (60%, 100%), B => C (50%, 75%), C => A (60%,
T2 111 111 100%), C => B (50%, 75%) will be hidden if transaction T4
T3 111 111 is modified from 110 to 010, T1 is modified from 111 to
T4 110 110 011, T1 is modified from 011 to 010, T6 is modified from
T5 100 101 101 to 001, using DSR. The new database D4 is shown in
Table 6.
T6 101 101
Table 6 Databases before and after hiding item B and item
C using DSR
Example 2 As in example 1, reversing the order of hiding TID D D4
items, the result of hiding item B and then item C using ISL T1 111 010
algorithm is as follows. To hide item B, the rule B => C T2 111 111
(50%, 75%) will be hidden if transaction T5 is modified T3 111 111
from 100 to 110 using ISL. The new database D2 is shown T4 110 010
in Table 4. The rule B => C will have support = 50% and T5 100 100
confidence = 60%. However, rules B => A, C => A, C => T6 101 001
B cannot be hidden by ISL algorithm.
One observation is that different sequences of hiding
One observation we can make is that different sequences of items will result in different transformed databases, i.e., D3
hiding items will result in different transformed databases, and D4 for DSR algorithm.
i.e., D1 and D2 for ISL algorithm.

Table 4 Databases before and after hiding item B and item 5 Analysis
C using ISL This section analyzes some of the characteristics of
TID D D2 the proposed algorithms and compares with the algorithms
T1 111 111 proposed in Verykios etc’s [9,23]. The first characteristic
T2 111 111 we observe is the item ordering effect. The transformed
T3 111 111 databases are different under different ordering of hiding
T4 110 110 items, even though the same set of sensitive items is
T5 100 110 specified. The second characteristic we observe is the
T6 101 101 algorithm effect. The transformed databases will be
different under different algorithm. These characteristics
are demonstrated in the four examples in section 4 and
summarized in Table 7. Databases D1 and D2 are resulting The fourth characteristic we analyze is efficiency
databases using ISL algorithm and D3 and D4 are resulting comparison of the ISL and DSR algorithms. One
databases using DSR algorithm. observation we conclude from the examples in section four
is that DSR algorithm seems to be more effective when the
Table 7 Databases before and after hiding items B and C support count of the hidden item is large. This is due to
using ISL and DSR when support of right hand side of the rule is large;
TID D D1 D2 D3 D4 increase support of left hand side usually does not reduce
T1 111 111 111 001 010 the confidence of the rule. However, decrease support of
T2 111 111 111 111 111 right hand side usually decreases the confidence of the rule.
T3 111 111 111 111 111
T4 110 110 110 010 010 6 Conclusions
T5 100 101 110 100 100
T6 101 101 101 001 001 In this work, we have studied the database privacy
problems caused by data mining technology and proposed
two naïve algorithms for hiding sensitive predictive
The third characteristic we analyze is the efficiency of
association rules in association rules mining. The proposed
the proposed algorithm compared with the Verykios etc
algorithms are based on modifying the database
algorithms. Even though it is the hidden rules, instead of
transactions so that the confidence of the association rules
hidden items, that are specified in [9,23], we compare the
can be reduced. Examples demonstrating the proposed
number of database scanning and the number of rules
algorithms are shown. The item ordering and algorithm
pruned between the two approaches. Table 8 summarizes
ordering characteristics of the proposed algorithms are
the results.
analyzed. The efficiency of the proposed approach is
further compared with Verykios etc’s [9,23]. It was shown
For ISL algorithm, the number of database scanning
that our approach required less number of database
comes from the calculation of large one itemsets, large two
scanning and prune more number of hidden rules.
itemsets, and transactions TL. The rules pruned are AC =>
However, our approach must hide all rules containing the
B and C => AB. For Verykios etc’s 1a algorithm, the
hidden items on the left hand side, where Verykios etc’s
number of database scanning comes from the calculation of
approach can hide any specified rule. Currently we are
large one itemsets, large two itemsets, large three itemsets,
performing numerical simulation on the time effects and
and partial support transactions T. No rules are pruned in
side effects (the number of lost rules and new rules due to
the Verykios etc’s algorithm. It can be seen that the ISL
alternation of the database) and will be included in this
algorithm requires less database scanning and prune more
work. In the future, we will examine and compare with
number of association rules. Similar results are obtained
other alternation techniques for hiding predictive
for comparing DSR algorithm and Verykios etc’s 1b
association rules based on current approach
algorithm.

Table 8 Database scans and rules pruned in hiding item C


using ISL References
DB Scans Rules [1] D. Agrawal and C. C. Aggarwal, “On the design and
Pruned quantification of privacy preserving data mining
ISL 3 2 algorithms”, In Proceedings of the 20th Symposium on
Dasseni’s 1a 4 0 Principles of Database Systems, Santa Barbara, California,
USA, May 2001.
One of the reasons that Verykios etc’s approach does
[2] R. Agrawal, T. Imielinski, and A. Swami, “Mining
not prune rules is that hidden rules are given in advance
Association Rules between Sets of Items in Large
and the algorithms try to hide every single rule without
Databases”, In Proceedings of ACM SIGMOD
checking to see if rules can be pruned after some
International Conference on Management of Data,
transactions have been changed.
Washington DC, May 1993.
However, our approach needs to hide all rules
[3] R. Agrawal and R. Srikant, ”Privacy preserving data
containing hidden items on the left hand side. But
mining”, In ACM SIGMOD Conference on Management of
Verykios etc’s approach can hide some of the rules
Data, pages 439–450, Dallas, Texas, May 2000.
containing hidden item on the left hand side. For example,
for hidden item C, Verykios etc’s approach can hide C =>
[4] Ljiljana Brankovic and Vladimir Estivill-Castro,
A, but show C => B, whereas our approach must hide both
“Privacy Issues in Knowledge Discovery and Data Mining”,
C => A and C => B.
Australian Institute of Computer Ethics Conference, July, [17] S. Oliveira, O. Zaiane, “Algorithms for Balancing
1999, Lilydale. Priacy and Knowledge Discovery in Association Rule
Mining”, Proceedings of 7th International Database
[5] C. Clifton and D. Marks, “Security and Privacy Engineering and Applications Symposium (IDEAS03),
Implications of Data Mining”, in SIGMOD Workshop on Hong Kong, July 2003.
Research Issues on Data Mining and knowledge Discovery,
1996. [18] S. Oliveira, O. Zaiane, “Protecting Sensitive
Knowledge by Data Sanitization”, Proceedings of IEEE
[6] C. Clifton, “Protecting Against Data Mining Through International Conference on Data Mining, November 2003.
Samples”, in Proceedings of the Thirteenth Annual IFIP
WG 11.3 Working Conference on Database Security, 1999. [19] S. J. Rizvi and J. R. Haritsa, “Privacy-preserving
association rule mining”, In Proc. of the 28th Int’l
[7] C. Clifton, “Using Sample Size to Limit Exposure to Conference on Very Large Databases, August 2002.
Data Mining”, Journal of Computer Security, 8(4), 2000.
[8] Chris Clifton, Murant Kantarcioglu, Xiaodong Lin and [20] Y. Saygin, V. Verykios, and C. Clifton, “Using
Michael Y. Zhu, “ Tools for Privacy Preserving Distributed Unknowns to Prevent Discovery of Association Rules”,
Data Mining”, SIGKDD Explorations, 4(2), 1-7, Dec. 2002. SIGMOD Record 30(4): 45-54, December 2001.

[9] E. Dasseni, V. Verykios, A. Elmagarmid and E. [21] J. Vaidya and C.W. Clifton. “Privacy preserving
Bertino, “Hiding Association Rules by Using Confidence association rule mining in vertically partitioned data”, In
and Support” in Proceedings of 4th Information Hiding Proc. of the 8th ACM SIGKDD Int’l Conference on
Workshop, 369-383, Pittsburgh, PA, 2001. Knowledge Discovery and Data Mining, Edmonton,
Canada, July 2002.
[10] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke,
“Privacy preserving mining of association rules”, In Proc. [22] V. Verykios, E. Bertino, I.G. Fovino, L.P. Provenza,
Of the 8th ACM SIGKDD Int’l Conference on Knowledge Y. Saygin, and Y. Theodoridis, “State-of-the-art in Privacy
Discovery and Data Mining, Edmonton, Canada, July 2002. Preserving Data Mining”, SIGMOD Record, Vol. 33, No. 1,
50-57, March 2004.
[11] Alexandre Evfimievski, “Randomization in Privacy
Preserving Data Mining”, SIGKDD Explorations, 4(2), [23] V. Verykios, A. Elmagarmid, E. Bertino, Y. Saygin,
Issue 2, 43-48, Dec. 2002. and E. Dasseni, “Association Rules Hiding”, IEEE
Transactions on Knowledge and Data Engineering, Vol. 16,
[12] Alexandre Evfimievski, Johannes Gehrke and No. 4, 434-447, April 2004.
Ramakrishnan Srikant, “Limiting Privacy Breaches in
Privacy Preserving Data Mining”, PODS 2003, June 9-12,
2003, San Diego, CA.

[13] M. Kantarcioglu and C. Clifton, “Privacy-preserving


distributed mining of association rules on horizontally
partitioned data”, In ACM SIGMOD Workshop on Research
Issues on Data Mining and Knowledge Discovery, June
2002.

[14] Jiuyong Li, Hong Shen and Rodney Topor, “Mining


the Smallest Association Rule Set for Predictions”,
Proceedings of the 2001 IEEE International Conference on
Data Mining, 361-368.

[15] Y. Lindell and B. Pinkas, “Privacy preserving data


mining”, In CRYPTO, pages 36–54, 2000.

[16] D. E. O’ Leary, “Knowledge Discovery as a Threat to


Database Security”, In G. Piatetsky-Shapiro and W. J.
Frawley, editors, Knowledge Discovery in Databases, 507-
516, AAAI Press/ MIT Press, Menlo Park, CA, 1991.

You might also like