
Knowl Inf Syst (2009) 20:263–299

DOI 10.1007/s10115-008-0178-7

REGULAR PAPER

Hiding sensitive knowledge without side effects

Aris Gkoulalas-Divanis · Vassilios S. Verykios

Received: 18 January 2008 / Revised: 16 September 2008 / Accepted: 27 September 2008 /


Published online: 14 November 2008
© Springer-Verlag London Limited 2008

Abstract Sensitive knowledge hiding in large transactional databases is one of the major
goals of privacy preserving data mining. However, it is only recently that researchers were
able to identify exact solutions for the hiding of knowledge, depicted in the form of sensitive
frequent itemsets and their related association rules. Exact solutions allow for the hiding of
vulnerable knowledge without any critical compromises, such as the hiding of nonsensitive
patterns or the accidental uncovering of infrequent itemsets, amongst the frequent ones, in
the sanitized outcome. In this paper, we highlight the process of border revision, which
plays a significant role towards the identification of exact hiding solutions, and we provide
efficient algorithms for the computation of the revised borders. Furthermore, we review two
algorithms that identify exact hiding solutions, and we extend the functionality of one of
them to effectively identify exact solutions for a wider range of problems (than its original
counterpart). Following that, we introduce a novel framework for decomposition and parallel
solving of hiding problems, which are handled by each of these approaches. This framework
substantially increases the size of the problems that both algorithms can handle
and significantly decreases their runtime. Through experimentation, we demonstrate the
effectiveness of these approaches toward providing high quality knowledge hiding solutions.

Keywords Data mining · Association rule hiding · Borders of frequent itemsets · Parallelization

A. Gkoulalas-Divanis (B) · V. S. Verykios
Department of Computer and Communication Engineering,
University of Thessaly, 382 21 Vólos, Greece
e-mail: arisgd@inf.uth.gr

V. S. Verykios
e-mail: verykios@inf.uth.gr

1 Introduction

Modern computers can collect, analyze, and store millions of records in huge transactional data warehouses. In many cases, the analysis of these piles of data may prove to

be beneficial for the data holder and, possibly, for a larger community of people. However,
there are circumstances under which this data contains sensitive information, such as references to individuals or to trade secrets, that must remain confidential. A typical example
involves large organizations, such as super-markets, that collect market basket data of their
customers’ purchases on a regular basis. These organizations may be willing to share their
collected information with other parties, such as advisory organizations, for mutual benefit.
However, in most cases, the participating organizations have to protect any business secrets that may allow a business competitor to gain an advantage over its peers. Thus, prior to releasing the data to others, the data holder wants to protect the sensitive knowledge that corresponds to his or her business secrets. The necessity of preserving the privacy of individuals and the need for protecting corporate knowledge have led researchers to propose a set of techniques that offer sensitive knowledge hiding.
In this work, we are interested in the hiding of sensitive knowledge, depicted in the form of frequent patterns that lead to the production of sensitive association rules, induced from a set of data records in a transactional database. Our goal is to create a new—hereon called sanitized—dataset which, when mined for frequent patterns using the same (or a higher) support threshold, protects the sensitive knowledge while leaving all the nonsensitive patterns intact. It is essential to mention that the mere removal of the sensitive attributes from the dataset, apart from causing unnecessary loss of nonsensitive frequent patterns, may also be insufficient to prevent leakage of sensitive knowledge. Indeed, association rule
mining [2,22,39] may reveal sensitive patterns that were unknown to the database owner
prior to its mining. Moreover, inference techniques may be used to uncover private data, as
discussed in [12,13,24]. To make matters worse, even nonsensitive information may lead to
the production of sensitive patterns and related association rules. As well stated in [31], “it
is not only the data, but the hidden knowledge in this data, that should be made secure.”
Frequent itemset hiding algorithms are fundamental in providing a solution to this problem.
They accomplish this by causing a small deviation to the original database, which blocks
the production of sensitive itemsets at a prespecified support threshold, usually set by the
owner of the data. What differentiates the quality of one approach from that of another is
the actual harm1 that is introduced into the dataset as a result of the hiding process. Ideally,
the hiding process should be accomplished in such a way that the nonsensitive knowledge
remains, to the highest possible degree, intact. This means that the application of frequent
itemset mining in the sanitized dataset (using the same or a higher support threshold as the
one used in the original dataset) should, ideally, reveal all the frequent patterns of the original
dataset excluding only the sensitive ones and their supersets. This is the notion of an exact
hiding solution.
The contributions of this paper are as follows. First, we present the border revision process,
introduced in [32], along with algorithms for the computation of the original and the revised
borders. Border revision plays a significant role in exact hiding methodologies, as it captures
the itemsets that need to remain frequent and those that need to become infrequent in the
sanitized database to eliminate the side effects of the hiding process. Second, we survey two
exact frequent itemset hiding approaches that appear in [14,15] by pinpointing the common
ground that exists behind their theory and by unifying them in a common mining framework.
In both approaches, the hiding process involves the construction of a Constraints Satisfaction Problem (CSP) that constrains the support of the itemsets in the revised borders and its
solution through Binary Integer Programming (BIP). Third, we extend one of the presented

1 We will later on present how the notion of “harm” can be quantified to allow us to compare different hiding solutions.


hiding approaches to identify exact solutions to an extended set of problem instances (than
its original counterpart). Fourth, we introduce a novel framework that is suitable for decom-
position and parallelization of the approaches in [14,15] and can be applied to improve the
scalability of these algorithms. The generality of the proposed framework allows it to efficiently handle any CSP that consists of linear constraints involving binary variables. Unlike
distributed approaches [37], where the search for the CSP solution is conducted in parallel by
multiple agents, our approach to decomposition and parallelization takes into consideration
the binary nature of the CSPs to achieve a direct decomposition. Together with existing work
for parallel mining of association rules, as in [1,7,16,28,38], our framework can be applied
to parallelize the most time-consuming steps of the exact hiding algorithms. Finally, we
compare the three exact hiding algorithms against two state-of-the-art heuristic approaches,
and we demonstrate their superiority toward providing hiding solutions of higher quality.
Moreover, through experimentation, we prove the effectiveness of the decomposition and
parallelization framework in drastically reducing the runtime of the exact hiding algorithms.
The rest of this paper is organized as follows. Section 2 provides an overview of the state-
of-the-art methodologies for frequent itemset and association rule hiding. In Sect. 3 we set
out the problem and provide the necessary background. Section 4 presents the border revision
process that is used for the identification of the exact hiding solutions. The two hiding schemes
are presented in brief in Sect. 5, while Sect. 6 introduces a two-phase iterative process that
extends the functionality of one of these approaches. A novel framework for decomposition
and parallelization of the exact knowledge hiding algorithms is presented in Sect. 7. Section 8
presents a model to quantify the privacy that is offered by the exact hiding approaches. Finally,
Sect. 9 contains the experimental evaluation and Sect. 10 concludes this work.

2 Related work

Clifton and his co-workers [8,9] are among the first to discuss the security and privacy
implications of data mining and propose data-obscuring strategies to prohibit inference and
discovery of sensitive knowledge.
Atallah et al. [4] propose a greedy algorithm for selecting items to sanitize among the
transactions supporting sensitive itemsets. Various extensions of this work have been proposed over the years, including those by Saygin et al. [31], Oliveira et al. [26], Bertino et al.
[6], and Kantarcioglu et al. [18].
Rizvi and Haritsa [30] present a data perturbation approach that is based on probabilistic
distortion of the data, as introduced in [3]. Their approach limits a set of privacy breaches of
[3] and provides privacy while retaining a high level of accuracy in the mining results. In the
pioneering work of [19], the utility of random-value distortion techniques is questioned. As
the authors indicate, the produced random matrices have predictable structures in the spectral
domain, and thus, special filtering techniques can be applied to the sanitized dataset to reveal
the original data. As an effect, this category of knowledge hiding approaches offers limited
privacy.
Verykios et al. [34] present a classification of privacy-preserving techniques along five
dimensions: data distribution, data modification, data mining algorithm, data or rule hiding,
and privacy preservation. The first dimension refers to the distribution of the data, which can
be either horizontal or vertical. In a horizontal distribution, the individual records reside in
different places, while in a vertical distribution, all the values for different attributes reside
in different places. A horizontal approach for association rule hiding is presented in [18] and
a vertical one in [33].


Oliveira and Zaïane [27] present a five-step, one-scan, heuristic approach for association
rule hiding. Regardless of the volume of the transactional database, their algorithm requires
only one pass over the whole dataset. Moreover, their approach allows for the use of a
disclosure threshold φ to be set for each sensitive association rule.
Menon et al. [23] present an integer programming approach that identifies the minimum
number of transactions that have to be sanitized to facilitate sensitive itemsets hiding. The
proposed algorithm produces a CSP consisting of linear constraints, where each constraint
refers to a sensitive itemset and restricts the amount of its supporting transactions that have to
be sanitized to be effectively hidden. After solving the CSP by using integer programming,
a heuristic is enforced that tries to identify the required number of transactions (as indicated
by the optimization criterion of the solved CSP) and to perform their sanitization. When
compared to exact approaches (e.g., [14,15]), the work of Menon et al. suffers from the
following shortcomings: (i) the sanitization of the dataset is performed heuristically and as
an effect this approach may fail to identify a hiding solution bearing no side effects even if one
exists, and (ii) the sanitization process does not guarantee minimal distortion of the original
dataset, since heuristic algorithms take locally best rather than globally best decisions. On
the contrary, exact approaches formulate the whole hiding process as a CSP, a fact which guarantees that the optimal hiding solution (both in terms of no side effects and of minimal distortion) will be attained, provided that such a solution exists.
A border-based approach for frequent itemset hiding is introduced in [32]. The proposed
algorithm is greedy in nature and focuses on preserving the quality of the border constructed
by the nonsensitive frequent itemsets in the itemsets’ lattice. The authors use the positive
border, after the removal of the sensitive itemsets, to keep track of the impact of altering
transactions in the database.
Moustakides and Verykios [25] propose two heuristic algorithms that rely on the max-
min criterion and the border theory of frequent itemsets, to hide the sensitive knowledge.
The proposed solution aims at minimizing the impact of the hiding process to the revised
positive border. By restricting the impact on the border, the authors efficiently select which
items to modify, while ensuring that the nonborder itemsets are protected from hiding. Other
interesting heuristic approaches to sensitive knowledge hiding can be found in [10,11,29,31,35,36].
The aforementioned approaches are greedy in nature to speed up the hiding process.
However, heuristics suffer from local minima issues and in many cases fail to identify optimal
solutions. Two exact approaches for frequent itemset hiding are presented in [14,15], and in
this work, we further extend their functionality. In both approaches, the authors use a border
revision process similar to [32] to identify candidate itemsets for sanitization. The hiding
process is then performed by formulating a CSP based on a small portion of itemsets and
by solving it using BIP. The solution is proved to lead to an exact hiding of the sensitive
patterns. Furthermore, a heuristic is applied in both cases to relax the problem and to identify
a good suboptimal solution, if the constructed CSP is infeasible.

3 Notation and problem formulation

This section provides the necessary background regarding sensitive itemset hiding and allows
us to proceed to our problem’s formulation.
Let I = {i1, i2, ..., iM} be a finite set of literals, called items, where M denotes the cardinality of the set. Any subset I ⊆ I is called an itemset over I. A transaction T over I is a pair T = (tid, I), where I is the itemset and tid is a unique identifier, used to distinguish among transactions that correspond to the same itemset. Furthermore, a transaction database


D over I is an N × M table consisting of N transactions over I carrying different identifiers, where entry Tnm = 1 if and only if the mth item appears in the nth transaction. Otherwise, Tnm = 0. A transaction T = (tid, J) supports an itemset I over I, if I ⊆ J. Given a set of
items S, let ℘ (S) denote the powerset of S, which is the set of all subsets of S.
Given an itemset I over I in D, we denote by sup(I, D) the number of transactions
T ∈ D that support I . Moreover, we define the frequency of an itemset I in a database D,
denoted as freq(I, D), to be the fraction of transactions in D that support I . An itemset I
is large or frequent in database D, if and only if its frequency in D is at least equal to a
minimum threshold minf. Equivalently, I is large in D, if and only if sup(I, D) ≥ msup,
where msup = minf · N . All itemsets with frequency less than minf are infrequent.
Let FD = {I ⊆ I : freq(I, D) ≥ minf} be the set of all frequent itemsets in D, and P = ℘(I) be the set of all patterns in the lattice of D. The positive and the negative borders of FD are defined as follows: Bd+(FD) = {I ∈ FD | for all J ∈ P with I ⊂ J we have that J ∉ FD} and Bd−(FD) = {I ∈ P − FD | for all J ⊂ I we have that J ∈ FD}.
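To make these definitions concrete, the following minimal Python sketch (ours, not part of the original formulation) computes the positive and the negative border of a small, downward-closed collection of frequent itemsets; the item names and the toy collection are illustrative only.

```python
from itertools import combinations

# Toy, downward-closed collection of frequent itemsets over I = {A, B, C} (illustrative only).
items = {"A", "B", "C"}
F = {frozenset(s) for s in [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}]}

def proper_subsets(itemset):
    return [frozenset(s) for k in range(1, len(itemset)) for s in combinations(itemset, k)]

# Bd+(F): frequent itemsets none of whose proper supersets are frequent.
positive_border = {I for I in F if not any(I < J for J in F)}

# Bd-(F): infrequent itemsets all of whose proper subsets are frequent.
all_itemsets = {frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)}
negative_border = {I for I in all_itemsets - F
                   if all(s in F for s in proper_subsets(I))}

print(sorted(map(sorted, positive_border)))   # [['A', 'B'], ['A', 'C']]
print(sorted(map(sorted, negative_border)))   # [['B', 'C']]
```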
Based on the notation presented above, we proceed to our problem’s statement. In what
follows, we assume that we are provided with a database DO , consisting of N transactions,
and a threshold minf set by the owner of the data. After performing frequent itemset mining
in DO with minf, we yield a set of frequent patterns, denoted as FDO , among which a subset
S contains patterns, which are considered to be sensitive from the owner’s perspective.
Given the set of sensitive itemsets S, we define the set Smin = {I ∈ S | for all J ⊂ I, J ∉ S} that contains all the minimal sensitive itemsets from S, and the set Smax = {I ∈ FDO | ∃J ∈ Smin, J ⊆ I} that contains all the itemsets of Smin along with their frequent
supersets. Our goal is to construct a new, sanitized database D, which achieves to protect the
sensitive itemsets from disclosure, while leaving intact the nonsensitive itemsets existing in
FDO . The hiding of a sensitive itemset corresponds to a lowering of its statistical significance,
depicted in terms of support, in the resulting database. To be more specific, what we want to
achieve is to minimally alter the original dataset in such a way that when the sanitized dataset D is mined, the frequent patterns that are discovered are exactly those in FD = FDO − Smax.
We call this set ideal as it pertains to an optimal hiding solution. To formulate this set, we use
Smax instead of S , since, due to the apriori principle [2], hiding of the itemsets in S has as a
consequence the hiding of all itemsets in Smax . When constructed, database D can be safely
released, since it protects the sensitive knowledge. The key requirement in the hiding methodologies is to construct D with minimum impact2 on DO. All the solutions that are presented in this
paper are based on this principle.
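As a small illustration of the sets Smin and Smax defined above, the following Python fragment (ours, with purely illustrative itemsets) derives both sets from a given FDO and S:

```python
def minimal_sensitive(S):
    """S_min: the sensitive itemsets none of whose proper subsets are sensitive."""
    S = {frozenset(s) for s in S}
    return {s for s in S if not any(t < s for t in S)}

def sensitive_and_supersets(F_DO, S_min):
    """S_max: the frequent itemsets that contain some itemset of S_min."""
    return {f for f in map(frozenset, F_DO) if any(j <= f for j in S_min)}

# Illustrative input only:
F_DO = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "B", "C"}]
S = [{"A", "B"}, {"A", "B", "C"}]
S_min = minimal_sensitive(S)
S_max = sensitive_and_supersets(F_DO, S_min)
print(sorted(map(sorted, S_min)), sorted(map(sorted, S_max)))
# [['A', 'B']] [['A', 'B'], ['A', 'B', 'C']]
```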

4 The border revision process

Borders allow for a condensed representation of the itemsets' lattice, identifying the key itemsets which separate all the frequent patterns from their infrequent counterparts. The process of border revision was introduced in [32] to facilitate minimum harm in the hiding
of frequent itemsets. Figure 1 allows us to demonstrate how this process works. In this
figure, near each itemset we depict its support in the original (Fig. 1i) and the revised (Fig.
1ii) database. As one can observe, there are four possible scenarios involving the status of an
itemset I prior to and after the application of border revision:

2 The impact is minimum when, from the frequent itemsets in DO, only the itemsets in Smax become infrequent in the released database D.


Fig. 1 An itemset lattice demonstrating (i) the original border and the sensitive itemsets, and (ii) the revised
border

C1 Itemset I was frequent in DO and remains frequent in D.
C2 Itemset I was infrequent in DO and remains infrequent in D.
C3 Itemset I was frequent in DO and became infrequent in D.
C4 Itemset I was infrequent in DO and became frequent in D.

The original border (presented in Fig. 1i) corresponds to the hyperplane that partitions
the universe of itemsets into two groups: the frequent itemsets of FDO (depicted on the
left of the original borderline) and their infrequent counterparts in P − FDO (shown on the
right of the borderline). The hiding process is defined as the revision of the original border
so that the revised border excludes from the frequent itemsets the sensitive ones (shown in
boldface) and their supersets, i.e., all the itemsets in Smax (shown inside a square). Given
the four possible scenarios for an itemset I prior to and after the application of border revision, we deduce that in the ideal scenario where no nonsensitive itemsets are harmed, C2 should always hold, while C4 must never hold. On the contrary, C1 must hold for all the itemsets in FDO − Smax, while C3 must hold for all the itemsets in Smax. Thus, the hiding of the sensitive
itemsets corresponds to a movement of the original borderline in the lattice to a new position
that adheres to the ideal set FD. What is then needed is a way to modify the transactions of the original database to support the revised border.

To capture the ideal borderline, which is the revised border for the original database based on the problem at hand, one can use border theory and identify the key itemsets which separate all frequent patterns from their infrequent counterparts. As proved in [14,15],
these key itemsets correspond to the union of the revised positive border Bd+ (FD ) and the
revised negative border Bd− (FD ).3 Since in all the algorithms presented in the following
sections of this paper we treat these two sets differently, we provide separate algorithms
for computing the positive and the negative borders, for capturing both the original and the
revised borderline.

3 In Fig. 1, the itemsets that belong to the positive border (original or revised) are double underlined, while
those of the corresponding negative border are single underlined.


Algorithm 1 Computation of the large itemsets and the original negative border.
1: procedure NB-Apriori(DO, msup)
2:   F1 ← GetLargeItems(DO, msup)
3:   for k = 2; Fk−1 ≠ ∅; k++ do
4:     Ck ← AprioriGen(Fk−1, msup)
5:     for each transaction t ∈ DO do          ▷ for all transactions in DO
6:       Ct ← subset(Ck, t)                    ▷ get candidate subsets of t
7:       for each candidate c ∈ Ct do
8:         c.count++
9:     Fk ← {c ∈ Ck | c.count ≥ msup}
10:    Bd−(FDO) ← Bd−(FDO) ∪ {c ∈ Ck | c.count < msup}
11: procedure GetLargeItems(DO, msup)
12:   for Ti ∈ DO do                           ▷ traverse all transactions
13:     for x ∈ Ti do                          ▷ traverse all items in transaction
14:       x.count++
15:   for each item x do
16:     if x.count ≥ msup then                 ▷ item x is frequent
17:       x ∈ F1
18:     else
19:       x ∈ Bd−(FDO)                         ▷ add item to negative border

Algorithm 1 provides a straightforward way to compute the negative border. It achieves this by incorporating the computation into the Apriori algorithm [2]. The new Apriori algorithm has extra code in the NB-Apriori and the GetLargeItems procedures to compute the border. On the other hand, the Apriori-Gen procedure is the same as in the original version of Apriori. The proposed method for the computation of the negative border is based on Apriori's candidate generation scheme, which uses the frequent (k − 1)-itemsets to produce candidate large k-itemsets. Thus, it identifies the first infrequent candidates in the lattice, all of whose subsets are found to be frequent. In this algorithm, we use Fn to denote the large n-itemsets that belong to DO.
Having identified the original negative border, the next step is to compute the original
positive border Bd+(FDO). Algorithm 2 presents a level-wise (in the size of large itemsets)
approach to achieve this computation. Assume FDO is the set of frequent itemsets identified
using Apriori. For each itemset in FDO , we associate a counter, initialized to zero. The
algorithm first sorts these itemsets in decreasing length, and then for all itemsets of the
same length, say k, it identifies their (k − 1) large subsets and increases their counters by
one. The value of k iterates from the length of the largest identified frequent itemset down
to 1. Finally, the algorithm performs one pass through all the counters and collects the large itemsets having a value of zero in the associated counter. These constitute the positive
border Bd+ (FDO ). Because of its very nature, this algorithm is suitable for computing both
the original and the revised positive borders. For this reason, we use the notation Bd+ (F )
to abstract the reference to the exact border that is computed (namely the original or the
revised).

Algorithm 2 Computation of the positive border (original and revised) Bd+(F).
1: procedure PB-Computation(F)
2:   count{0 ... |F|} ← 0                      ▷ initialize counters
3:   Fsort ← reverse-sort(F)
4:   for each k-itemset f ∈ Fsort do
5:     for all (k − 1)-itemsets q ∈ Fsort do
6:       if q ⊂ f then
7:         q.count++
8:   for each f ∈ Fsort do
9:     if f.count = 0 then
10:      f ∈ Bd+(F)                            ▷ add itemset to Bd+(F)
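The counter-based procedure of Algorithm 2 can be rendered in Python roughly as follows (a sketch of ours, under our own naming; any collection of large itemsets is assumed as input):

```python
def positive_border(frequent):
    """Counter-based computation of Bd+(F), in the spirit of Algorithm 2."""
    F = sorted({frozenset(f) for f in frequent}, key=len, reverse=True)
    count = {f: 0 for f in F}                 # one counter per large itemset
    for f in F:                               # longest itemsets first
        for q in F:
            if len(q) == len(f) - 1 and q < f:
                count[q] += 1                 # q has a frequent proper superset
    return {f for f in F if count[f] == 0}    # untouched counters form Bd+(F)

# Illustrative usage:
print(sorted(map(sorted, positive_border([{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}]))))
# [['A', 'B'], ['A', 'C']]
```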


A way of computing the negative border of the exact solution, in which FD = FDO − Smax, is presented in Algorithm 3. In this algorithm, we move top–down in the lattice to identify infrequent itemsets all of whose proper subsets are frequent in FD. First, we examine all one-item itemsets. If any of these itemsets is infrequent, it should be included in the negative border of FD. Then, we examine all two-item itemsets by properly joining (symbol ⋈ denotes a join) the frequent one-item itemsets. Again, if the produced two-item itemset does not exist in FD, we include it in Bd−(FD). To examine k-itemsets (where k > 2), we first construct them by properly joining frequent (k − 1)-itemsets (as in Apriori) and then check to see if the produced itemset is large in FD. If it is reported as infrequent, then we examine all its (k − 1) proper subsets. If none of these is infrequent, then the itemset belongs to the negative border; so, we include it in Bd−(FD).

Algorithm 3 Computation of the revised negative border Bd−(FD) of D.
1: procedure INB-Computation(F)
2:   for k = 1; Fk ≠ ∅; k++ do
3:     if k = 1 then
4:       for each item x ∈ F1 do
5:         if x ∉ FD then                      ▷ x is infrequent
6:           x ∈ Bd−(FD)
7:     else if k = 2 then
8:       for x ∈ F1 do
9:         for y ∈ F1 do
10:          if (x < y) ∧ (x ⋈ y ∉ F2) then
11:            (x ⋈ y) ∈ Bd−(FD)
12:    else                                    ▷ for k > 2
13:      for x ∈ Fk−1 do
14:        for y ∈ Fk−1 do
15:          if (x1 = y1) ∧ ... ∧ (xk−1 < yk−1) then
16:            z ← x ⋈ y                       ▷ z is the join of x and y
17:            if z ∉ FD ∧ (∄ rk−1 ⊂ z : rk−1 ∉ FD) then
18:              z ∈ Bd−(FD)
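A compact Python sketch of the same level-wise idea (ours, not the authors' code) is shown below; on the data of the example in Sect. 5 it reproduces Bd−(FD) = {AB, BC, BD}:

```python
from itertools import combinations

def revised_negative_border(F_D, items):
    """Level-wise computation of Bd-(F_D), in the spirit of Algorithm 3.
    F_D is assumed to be downward closed (an 'ideal' frequent-itemset set)."""
    border = set()
    level = [frozenset([x]) for x in sorted(items)]
    border |= {I for I in level if I not in F_D}           # infrequent single items
    frequent_prev, k = [I for I in level if I in F_D], 2
    while frequent_prev:
        frequent_next = set()
        for x, y in combinations(frequent_prev, 2):
            z = x | y
            if len(z) != k:
                continue                                    # not a valid (k-1)-join
            if z in F_D:
                frequent_next.add(z)
            elif all(frozenset(s) in F_D for s in combinations(z, k - 1)):
                border.add(z)                               # minimal infrequent itemset
        frequent_prev, k = list(frequent_next), k + 1
    return border

F_D = {frozenset(s) for s in [{"A"}, {"B"}, {"C"}, {"D"}, {"A", "C"},
                              {"A", "D"}, {"C", "D"}, {"A", "C", "D"}]}
print(sorted(map(sorted, revised_negative_border(F_D, {"A", "B", "C", "D"}))))
# [['A', 'B'], ['B', 'C'], ['B', 'D']]
```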

Algorithm 4 Hiding of all the sensitive itemsets and their supersets.
1: procedure HideSS(FD, S)
2:   for each s ∈ S do                         ▷ for all sensitive itemsets
3:     for each f ∈ FD do                      ▷ for all large itemsets
4:       if s ⊆ f then                         ▷ large itemset is sensitive
5:         FD ← FD − f                         ▷ remove itemset f

Finally, Algorithm 4 presents the hiding process in which we identify FD by removing from FDO all the sensitive itemsets and their supersets. To do so, we iterate over all the sensitive itemsets in S and the large itemsets in FDO, and we identify all those large itemsets that are supersets of a sensitive one. We remove these from the list of large itemsets, thus constructing a new set FD with the remaining large itemsets.
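In Python, the whole of Algorithm 4 reduces to a single filtering step; the sketch below (ours) uses the itemsets of the example that follows in Sect. 5:

```python
def hide_sensitive_supersets(F_DO, S):
    """Algorithm 4 (HideSS) as a filter: drop every large itemset that is a
    superset of some sensitive itemset, yielding the ideal set F_D."""
    F_DO = {frozenset(f) for f in F_DO}
    S = {frozenset(s) for s in S}
    return {f for f in F_DO if not any(s <= f for s in S)}

# Hiding S = {AB} from the frequent itemsets of the example in Sect. 5:
F_DO = [{"A"}, {"B"}, {"C"}, {"D"}, {"A", "B"}, {"A", "C"},
        {"A", "D"}, {"C", "D"}, {"A", "C", "D"}]
print(sorted(map(sorted, hide_sensitive_supersets(F_DO, [{"A", "B"}]))))
# [['A'], ['A', 'C'], ['A', 'C', 'D'], ['A', 'D'], ['B'], ['C'], ['C', 'D'], ['D']]
```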

5 The two hiding schemes: inline versus hybrid

Having presented the process that enables us to identify the exact solution, in this section we
briefly review two exact hiding schemes that try to enforce this solution when constructing
the sanitized version D of DO . Both methodologies exploit cover relations governing the
various itemsets in the lattice of DO to identify a small subset of itemsets whose status


maximize  Σ_{unm ∈ U} unm

subject to  Σ_{Tn ∈ D{X}} Π_{Im ∈ X} unm < msup,  ∀X ∈ S
            Σ_{Tn ∈ D{R}} Π_{Im ∈ R} unm ≥ msup,  ∀R ∈ V

Fig. 2 The Constraints Satisfaction Problem for the inline approach

(frequent vs. infrequent) needs to be controlled. This will ensure the minimum impact in the
sanitized outcome D. By properly controlling the status of this small set of itemsets, one can
guarantee that the status of all the itemsets in the lattice of D pertains to an exact solution.
The two approaches differ in the itemsets they choose to control and the way they sanitize
database DO .
Specifically, the first approach (hereon called inline) aims at the exclusion of the least
number of items participating in transactions in DO , to facilitate sensitive knowledge hiding.
Through a set of theorems, we [14] prove that the exact solution can be found if the status of
the itemsets in set

C = {I ∈ Bd+(FD) : I ∩ IS ≠ ∅} ∪ S    (1)

is properly controlled. Assuming that V = {I ∈ Bd+(FD) : I ∩ IS ≠ ∅}, and thus C = V ∪ S, the authors formulate the CSP depicted in Fig. 2, where D{I} denotes the set of supporting transactions for an itemset I and unm corresponds to the mth item of the nth transaction while in the sanitization process (i.e., when its status is under control).
On the other hand, the second approach (hereon called hybrid) hides the sensitive itemsets
by creating a database extension (denoted as DX ) to the original database DO and controlling
the values of the various items in the transactions of the extension to address the minimum
impact principle. Several issues need to be properly considered regarding the database extension, such as (i) the size of the extension, (ii) the minimum set of itemsets that need to be
controlled to provide an exact solution, (iii) the validity of the transactions participating in
the extension, and (iv) issues regarding the minimization of the complexity and the size of
the produced CSP for an efficient solution through BIP.
Regarding the size of the extension, we [15] compute the absolute lower bound Q on
the number of transactions Tq in DX , by identifying the minimum number of transactions
that are needed to hide the sensitive itemset from S having the highest support. However,
as we demonstrate in [15], there are cases in which this lower bound is insufficient to
allow for an exact solution. For this reason, a threshold, known as Safety Margin (SM), is
applied that adds SM transactions to the ones specified by the lower bound, to ensure that
the produced extension DX contains a sufficiently large number of transactions to allow for
the identification of an exact solution. Since this approach may lead to an unnecessarily large
number of transactions, the authors propose a methodology that at a later point checks what
number of the transactions in the extension were actually useful and removes the portion of
the extension that is unnecessary, thus reducing the size of database D.
Regarding the set of itemsets that need to be controlled, a similar approach to the one of
the inline algorithm is enforced. Through a set of theorems, it is proved that the set of itemsets whose status has to be controlled in D (through the extension DX) is equivalent to

C = Bd+(FD) ∪ Bd−(FD)    (2)


Fig. 3 The CSP of the hybrid approach that ensures the validity of transactions

minimize  Σ_{q ∈ [1, Q+SM], m ∈ [1, M]} uqm

subject to  Σ_{q=1}^{Q+SM} Π_{im ∈ I} uqm < thr,  ∀I ∈ Bd−(FD)
            Σ_{q=1}^{Q+SM} Π_{im ∈ I} uqm ≥ thr,  ∀I ∈ Bd+(FD)
            Σ_{im ∈ I} uqm ≥ 1,  ∀Tq ∈ DX

where thr = minf · (N + Q + SM) − sup(I, DO)

As one can notice, the size of C is larger in the case of the hybrid approach, when compared to
the one of the inline approach. This is due to the fact that in the inline approach the sanitized
database has the same size as the original one, a property that allows for further pruning of
the participating itemsets in C .
To ensure the validity of the transactions in the database extension, the authors propose
two approaches: (i) the incorporation of the check for validity conformance in the formulated
CSP (through a set of inequalities that need to hold, each one ensuring the validity of one
transaction in the extended part), and (ii) the exchange of the empty or null transactions of
DX , after the application of the hiding process, with transactions from the revised positive
border. Figure 3 presents the first approach, where the formulated CSP hides the sensitive
knowledge from DO (first type of constraints), while ensuring that all nonsensitive itemsets
remain frequent (second type of constraints) and that the transactions in the sanitized outcome
D are valid (third type of constraints).
Both the inline and the hybrid approaches make use of a Constraints Degree Reduction
(CDR) algorithm to minimize the complexity of the produced CSP. The CDR algorithm
relies on the binary nature of the variables involved in the CSP to alleviate it from high
degree constraints. Such constraints are typically produced due to the product of the involved
binary variables. After the application of the CDR algorithm (presented in Fig. 4 for the case
of the hybrid approach), the resulting CSP consists only of linear constraints.

Replace each constraint of the form
    Σ_{Tq ∈ DX} Ψs ◦ thr,   where Ψs = Π_{im ∈ I} uqm and ◦ is the constraint's relational operator (< or ≥),
with the linear constraints
    c1: Ψs ≤ uq1
    c2: Ψs ≤ uq2
    ...
    cZ: Ψs ≤ uqm
    Ψs ≥ uq1 + uq2 + ... + uqm − |Z| + 1
and
    Σ_s Ψs ◦ thr,   where Ψs ∈ {0, 1}

Fig. 4 The Constraints Degree Reduction approach for the hybrid algorithm

Each of these constraints is an inequality involving the summation of a set of binary variables and, possibly,
some integer values. It is important to mention that the CDR algorithm guarantees that the
solution set of the newly formulated CSP, after the application of CDR, is identical to the one
that would be produced by solving the initial CSP. Moreover, to reduce the size of the CSP
after the application of the CDR algorithm, the authors make use of two pruning strategies.
The first strategy is responsible for identifying and removing any constraints of the CSP that are tautologies. There are several types of tautologies that can typically be met in a CSP, such as
u11 ≥ 0 or u11 + u21 < 2.1. The second strategy partitions the remaining inequalities into overlapping sets and maintains only the most specific inequality from each set. For example, given the constraints C1: u11 + u21 < 1.3 and C2: u11 + u21 < 0.9, only C2 has to be kept, since its solution set is a subset of the solution set of C1. In practice, both pruning strategies are shown to alleviate the BIP solver from a lot of unnecessary work and to speed up the hiding process.
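A minimal sketch (ours, not the authors' code) of the linearization idea behind CDR follows: every product of binary variables is replaced by a fresh binary variable Ψ that is tied to its factors through the linear inequalities of Fig. 4. The usage example applies it to the AB constraint that will appear in the worked example of Fig. 5.

```python
def linearize_product_constraint(products, op, thr):
    """CDR sketch (cf. Fig. 4): rewrite  sum_s prod(u in products[s])  op  thr
    as linear constraints over fresh binary variables psi_s.
    products: list of lists of binary variable names; op: '<' or '>='."""
    constraints, psi_vars = [], []
    for s, term in enumerate(products):
        psi = f"psi{s}"
        psi_vars.append(psi)
        for u in term:                               # psi_s <= u for every factor
            constraints.append(f"{psi} <= {u}")
        constraints.append(                          # psi_s >= sum(term) - |term| + 1
            f"{psi} >= {' + '.join(term)} - {len(term) - 1}")
    constraints.append(f"{' + '.join(psi_vars)} {op} {thr}")
    return constraints

# The AB constraint of Fig. 5:  u51*u52 + u81*u82 < 1.6
for c in linearize_product_constraint([["u51", "u52"], ["u81", "u82"]], "<", 1.6):
    print(c)
```

For any feasible binary assignment the auxiliary Ψ variables take exactly the value of the corresponding product, so the solution set of the rewritten CSP coincides with that of the original one, which is the guarantee stated above.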
Finally, in the case that an exact solution cannot be identified for the problem at hand,
a heuristic is applied by both the inline and the hybrid approaches that selectively removes
inequalities from the produced CSP to come up with a CSP that is feasible. The removal
of the inequalities from the CSP is performed in such a way that the sensitive knowledge
is guaranteed to be protected in the sanitized database D. This means that no constraint
involving itemsets from set S will be removed from the CSP. However, database D is bound
to lose some nonsensitive frequent itemsets of DO. Therefore, the goal of the relaxation
strategy of Algorithm 5 is to ensure that the least number of nonsensitive itemsets are lost. The
removal of inequalities is performed from set V in the case of the inline approach and from
the revised positive border Bd+ (FD  ) in the case of the hybrid approach. As demonstrated

in Algorithm 5 (referring to the hybrid approach), emphasis is given to the removal of constraints involving maximal-size and minimum-support itemsets from DO. The removal
of inequalities is performed in a “batch” fashion, where all inequalities in the same “batch”
are removed from the CSP.

Algorithm 5 Relaxation procedure in V = Bd+(FD).
1: procedure SelectRemove(Constraints CR, V, DO)
2:   CRmaxlen ← argmax_i {|Ri|}                 ▷ constraint CRi ↔ itemset Vi = Ri
3:   crmsup ← min_{CRmaxlen, i} (sup(Ri ∈ V, DO))
4:   for each c ∈ CRmaxlen do
5:     if sup(Ri, DO) = crmsup then
6:       Remove(c)                              ▷ remove constraint
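The selection step of Algorithm 5 amounts to a simple filter; the sketch below (ours, with illustrative support values) picks the constraints whose itemsets have maximal length and, among those, minimum support in DO:

```python
def select_constraints_to_remove(constraints):
    """Relaxation heuristic of Algorithm 5 (sketch): return the 'batch' of
    constraints to drop. Each constraint is a dict holding its 'itemset' and
    the itemset's 'support' in D_O."""
    max_len = max(len(c["itemset"]) for c in constraints)
    longest = [c for c in constraints if len(c["itemset"]) == max_len]
    min_sup = min(c["support"] for c in longest)
    return [c for c in longest if c["support"] == min_sup]

# Illustrative constraint set (the supports are made up for the illustration):
V = [{"itemset": frozenset("ACE"), "support": 2},
     {"itemset": frozenset("ADE"), "support": 3},
     {"itemset": frozenset("B"),   "support": 4}]
print([sorted(c["itemset"]) for c in select_constraints_to_remove(V)])
# [['A', 'C', 'E']]  -- the maximal-length, minimum-support itemset
```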

An example will allow us to demonstrate some of the key aspects of these methodologies. Consider the database DO of Table 1. Using a minimum frequency threshold
minf = 0.2, we have that the frequent itemsets of DO are FDO = {A, B, C, D, AB, AC,
AD, CD, ACD}. Suppose that we want to hide the sensitive itemset S = {AB}. To do so, we need to compute the ideal set FD. Given that Smax = {AB}, since no other frequent itemset that is a proper superset of S exists in DO, we have that the ideal set is FD = FDO − Smax = {A, B, C, D, AC, AD, CD, ACD} and thus Bd+(FD) = {B, ACD} and Bd−(FD) = {AB, BC, BD}. By using the inline approach stated in our previous work

[14], we proceed to create the intermediate form of DO shown in Table 2. Specifically, for
every transaction of DO that supports the sensitive itemset {AB}, we replace the “1”s in the
columns of items A and B with unique binary variables. Then, based on the intermediate
form of DO , we produce the corresponding CSP that will allow us to assign the best values


Table 1 Original database DO used in the example

A B C D
1 0 1 0
1 0 1 1
0 0 1 1
0 1 0 0
1 1 1 1
0 0 0 1
0 0 1 0
1 1 0 0

Table 2 Intermediate form of DO using the inline approach

A    B    C  D
1    0    1  0
1    0    1  1
0    0    1  1
0    1    0  0
u51  u52  1  1
0    0    0  1
0    0    1  0
u81  u82  0  0

Fig. 5 The CSP of the inline approach when applied on DO

maximize (u51 + u52 + u81 + u82)

subject to  AB:  u51·u52 + u81·u82 < 1.6
            B:   1 + u52 + u82 ≥ 1.6
            ACD: 1 + u51 ≥ 1.6

Table 3 The three exact solutions using the inline approach

Solution  u51  u52  u81  u82
l1        1    0    1    1
l2        1    1    0    1
l3        1    1    1    0

to the inserted binary variables. The CSP for this example is shown in Fig. 5. The solution
of this CSP yields three exact hiding solutions (having a distance of 3), presented in Table 3.
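Because only four binary variables are involved, the CSP of Fig. 5 can also be verified by exhaustive enumeration; the sketch below (ours, not the BIP solver used in [14]) recovers exactly the three solutions of Table 3:

```python
from itertools import product

best, solutions = -1, []
for u51, u52, u81, u82 in product([0, 1], repeat=4):
    feasible = (u51 * u52 + u81 * u82 < 1.6      # AB must become infrequent
                and 1 + u52 + u82 >= 1.6         # B must remain frequent
                and 1 + u51 >= 1.6)              # ACD must remain frequent
    if not feasible:
        continue
    value = u51 + u52 + u81 + u82                # maximize the retained "1"s
    if value > best:
        best, solutions = value, []
    if value == best:
        solutions.append((u51, u52, u81, u82))

print(best, solutions)
# 3 [(1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 0)]
```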
Now suppose that we want to hide the same sensitive knowledge by using the hybrid
approach. The support of the sensitive itemset {AB} in DO is equal to 2, since the itemset
is supported by the fifth and the eighth transactions. To effectively hide this itemset, we
need to augment the original database by a number of transactions (that will not support
{AB}) such that its support in the new database will drop below the minimum support
threshold. If we add one such transaction, the support of the sensitive itemset will drop to 2/9 ≈ 0.22, if we add two transactions it will drop to 2/10 = 0.2, whereas if we add three transactions it will become 2/11 ≈ 0.18 < minf. Thus, the absolutely necessary extension


Table 4 Intermediate form of DO using the hybrid approach

A     B     C     D
1     0     1     0
1     0     1     1
0     0     1     1
0     1     0     0
1     1     1     1
0     0     0     1
0     0     1     0
1     1     0     0
u91   u92   u93   u94
u101  u102  u103  u104
u111  u112  u113  u114

minimize  Σ_{q ∈ [9,11], m ∈ [1,4]} uqm

subject to  AB:  u91·u92 + u101·u102 + u111·u112 < 0.2
            BC:  u92·u93 + u102·u103 + u112·u113 < 1.2
            BD:  u92·u94 + u102·u104 + u112·u114 < 1.2
            B:   u92 + u102 + u112 ≥ −0.8
            ACD: u91·u93·u94 + u101·u103·u104 + u111·u113·u114 ≥ 0.2

Fig. 6 The CSP of the hybrid approach when applied on DO

consists of Q = 3 transactions. Assume that we do not use a safety margin, i.e., SM = 0.


Then, we produce the intermediate form of DO, shown in Table 4, simply by adding the required number of transactions and inserting binary variables for each of the M = 4 items.
The hybrid hiding algorithm produces the CSP of Fig. 6, where the first three constraints
involve the itemsets of the revised negative border and the last two constraints the itemsets of
the revised positive border. The solution of the formulated CSP yields three exact solutions:
l1 : {u 91 = u 93 = u 94 = 1, the rest 0}, l2 : {u 101 = u 103 = u 104 = 1, the rest 0},
and l3 : {u 111 = u 113 = u 114 = 1, the rest 0}. Moreover, as one can observe, the fourth
inequality of this CSP is a tautology. Generally speaking, an inequality regarding an itemset
I ∈ Bd+(FD) is a tautology if sup(I, DO) ≥ minf · |D|, where |D| = N + Q + SM. This is true since, even if I is unsupported in DX (i.e., sup(I, DX) = 0), its support in DO suffices to
allow it to remain frequent in D. Inequalities of this form are commonly met among the CSPs
that are formulated by the hybrid algorithm, and for this reason, the two pruning strategies
explained earlier alleviate the solver from unnecessary work.
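The hybrid CSP of Fig. 6 is also small enough to check by enumeration. The sketch below (ours) confirms both the three exact solutions reported above and the fact that the B inequality is a tautology, since it holds for every binary assignment:

```python
from itertools import product

names = [f"u{q}{m}" for q in (9, 10, 11) for m in (1, 2, 3, 4)]
feasible = []
for bits in product([0, 1], repeat=12):
    v = dict(zip(names, bits))
    ok = (v["u91"]*v["u92"] + v["u101"]*v["u102"] + v["u111"]*v["u112"] < 0.2      # AB
          and v["u92"]*v["u93"] + v["u102"]*v["u103"] + v["u112"]*v["u113"] < 1.2  # BC
          and v["u92"]*v["u94"] + v["u102"]*v["u104"] + v["u112"]*v["u114"] < 1.2  # BD
          and v["u92"] + v["u102"] + v["u112"] >= -0.8                             # B (tautology)
          and v["u91"]*v["u93"]*v["u94"] + v["u101"]*v["u103"]*v["u104"]
              + v["u111"]*v["u113"]*v["u114"] >= 0.2)                              # ACD
    if ok:
        feasible.append((sum(bits), v))

best = min(cost for cost, _ in feasible)
for sol in sorted(sorted(k for k, x in v.items() if x) for cost, v in feasible if cost == best):
    print(sol)
# ['u101', 'u103', 'u104']
# ['u111', 'u113', 'u114']
# ['u91', 'u93', 'u94']
```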

6 A two-phase iterative process for exact knowledge hiding

The functionality of the inline algorithm can be extended to allow for the identification of
exact solutions for a wider range of problem instances. We define a problem instance as the
set of (i) the supplied original dataset DO , (ii) the minimum frequency threshold minf, and


(iii) the sensitive itemsets S that need to be protected. Because the inline algorithm allows only supported items in DO to become unsupported in D, we argue that there are several problem instances for which, although an exact solution exists, the inline approach is incapable of identifying it. A proof of this statement can be found in Sect. 9, as well as in the experimental evaluation of the hybrid algorithm [15].
In what follows, we propose a two-phase iterative process that improves the functionality
of the inline approach. Then, we provide a specific problem instance for which the inline
algorithm fails to provide an exact hiding solution and demonstrate how the new approach
achieves to identify it.

6.1 Theory behind the two-phase iterative approach

The proposed approach consists of two phases that iterate until either (i) an exact solution of
the given problem instance is identified or (ii) a prespecified number of subsequent iterations have taken place. This threshold on the number of iterations is called the limiting factor and must remain low enough to allow for an efficient solution. The first phase of the algorithm utilizes the inline approach
to hide the sensitive knowledge. If it succeeds, then the process is terminated and database
D is returned. This phase causes the retreat of the positive border of DO in the lattice, thus
excluding from FD the sensitive itemsets and their supersets. If the first phase is unable to
identify an exact solution, the algorithm proceeds to the second phase, which implements the
dual counterpart of the inline algorithm. Specifically, from the produced infeasible CSP of
the first phase, the algorithm proceeds to remove selected inequalities as in Algorithm 5 but
in a “one-by-one” rather than in a “batch” fashion,4 until the underlying problem becomes
solvable. The second phase results in the expansion of the positive border, and the iteration
of the two phases aims at the approximation (to the highest possible extent) of the ideal borderline in the sanitized database.
Let H denote the set of itemsets for which the corresponding inequalities were removed
from the CSP to allow for a feasible solution. Obviously, H ⊆ Bd+ (FD  ). Since the produced

CSP is now feasible, the algorithm proceeds to identify the solution and compose the corresponding sanitized database D. This database is bound to suffer from the side effect of hidden
(nonsensitive) frequent patterns. The purpose of the second phase is to try to make these lost (due to the side effects) itemsets of set H frequent again by increasing their support in D. However, the support increment should be accomplished in a careful manner to ensure that both the sensitive itemsets and all the itemsets that do not participate in FD remain
infrequent.
Let DH denote this intermediate (for the purposes of the second phase) view of the sanitized
database D. In this database, we want the itemsets of H to become frequent and the outcome
database to adhere as much as possible to the properties of the ideal set FD  . As one can

observe, the second phase of the approach has as a consequence the movement of the revised
positive border downwards in the lattice to include an extended set of itemsets, making them frequent. As a result, this two-phase iterative process, when viewed vertically in the
itemset lattice, resembles the oscillation of a pendulum, where one stage follows the other
with the hope that the computed revised positive border will at some point converge to the
optimal one.
We now proceed to analyze the mechanism that takes place as part of the second phase of
this “oscillation.” Since the goal of this phase is to make the itemsets of H frequent in DH,
the modification of this database will be based only on item inclusions on a selected subset of

4 The selection among inequalities corresponding to the same “batch” is performed arbitrarily.


Fig. 7 The formulation of the CSP for the second stage of the two-phase iterative approach

minimize  Σ_{unm ∈ U} unm

subject to  Σ_{Tn ∈ DH{X}} Π_{Im ∈ X} unm < msup,  ∀X ∈ Bd−(FD)
            Σ_{Tn ∈ DH{R}} Π_{Im ∈ R} unm ≥ msup,  ∀R ∈ H

transactions. Following the inline approach, the candidate items for inclusion are only those
that appear among the itemsets in H, i.e., those in the universe IH. All these items will be substituted in the transactions of DH, where they are unsupported, by the corresponding unm
variables. This will produce an intermediate form of database DH . Following that, a CSP is
constructed in which the itemsets that are controlled belong to

C = H ∪ Bd−(FD)    (3)

Finally, the optimization criterion is altered to denote that a minimization, rather than a
maximization, of the binary variables u nm will lead to the best solution. Figure 7 presents the
form of the CSP created in the second phase of this two-phase iterative hiding process. As
mentioned earlier, these two phases can be executed in an iterative fashion until convergence
to the exact solution is achieved or the prespecified number of oscillations (the limiting factor) has taken place.

6.2 Removal of constraints from the infeasible CSP

An aspect of the two-phase iterative approach that requires further investigation regards
the constraints selection and removal process that turns an infeasible CSP into its feasible
counterpart. In the workings of the proposed approach, we consider the eviction process of
Algorithm 5 in an attempt to maximize the probability of yielding a feasible CSP after the
removal of only a small number of constraints (inequalities). This is achieved by removing the strictest constraints, i.e., those that involve the maximum number of binary variables (equivalently, the maximum-length itemsets from DO). To ensure that the removal of these constraints
will cause minimal loss of nonsensitive knowledge to the database, we select among them
the ones that involve itemsets of low support in DO , as these itemsets would be the first to be
hidden if the support threshold was increased. Although we consider this to be a reasonable
heuristic, it may not always lead to the optimal selection, i.e., the one where the minimum
number of constraints are selected for eviction and their removal from the CSP causes the
least distortion to the database. Thus, in what follows, we discuss the properties of a mechanism that can be used for the identification of the best possible set of inequalities to formulate
the constraints of the feasible CSP.
Let a constraint set be a set of inequalities corresponding to the itemsets of the positive
border, as taken from the original (infeasible) CSP. We argue that there exists a “1–1” correspondence between constraint sets and itemsets, where each item is mapped to a constraint (and vice versa). A frequent itemset is equivalent to a feasible constraint set, i.e., a set of constraints that has a solution satisfying all the constraints in the set. The downward closure property of the frequent itemsets also holds in the case of constraint sets, since all the
subsets (taken by removing one or more inequalities) of a feasible constraint set are also
feasible constraint sets and all the supersets of an infeasible constraint set are also infeasible
constraint sets. Furthermore, a maximal feasible constraint set (in relation to a maximal
frequent itemset) is a feasible constraint set where all its supersets are infeasible constraint
sets.
The constraints removal process, explained in Sect. 6.1, can be considered as the identification of a maximal feasible constraint set among the inequalities of the revised positive


Table 5 The original database DO

A B C D E
1 0 1 0 0
1 0 1 1 1
0 0 1 1 0
0 1 0 0 1
1 0 1 1 1
0 0 0 1 1
0 0 1 0 0
1 1 0 0 0
1 0 1 0 0
0 0 1 1 0

Table 6 The intermediate database of DO

A  B  C     D     E
1  0  1     0     0
1  0  u23   u24   1
0  0  u33   u34   0
0  1  0     0     1
1  0  u53   u54   1
0  0  0     1     1
0  0  1     0     0
1  1  0     0     0
1  0  1     0     0
0  0  u103  u104  0

border. These constraints will be the ones that will participate in the (feasible) CSP. Because
of the correspondence that exists between itemsets and constraint sets, we argue that one
could use techniques that are currently applied on frequent itemset mining algorithms (such
as pruning approaches) to efficiently identify the maximal constraint sets and possibly further
select one among them (if we consider the existence of some metric of quality among the
different constraint sets). However, the decision of whether a constraint set is feasible or
not is a computationally demanding process, and for this reason, we consider the proposed
heuristic to provide a more efficient (although in some cases suboptimal) alternative.

6.3 An example of the two-phase iterative process

Consider the original database DO depicted in Table 5. When mining this database for frequent itemsets, using minf = 0.2, the following set of patterns is returned:

FDO = {A, B, C, D, E, AC, AD, AE, CD, CE, DE, ACD, ACE, ADE, CDE, ACDE}

Suppose that the sensitive knowledge corresponds to the frequent itemset S = {CD}, which needs to be hidden in D. Given S, we compute Smax = {CD, ACD, CDE, ACDE}, which corresponds to the sensitive itemsets and their frequent supersets. The revised positive border will then contain the itemsets of Bd+(FD) = {B, ACE, ADE}. The intermediate form of

database DO (generated using the inline approach) is shown in Table 6. From this database,


Table 7 The database DH

A B C D E
1 0 1 0 0
1 0 1 1 1
0 0 1 0 0
0 1 0 0 1
1 0 0 1 1
0 0 0 1 1
0 0 1 0 0
1 1 0 0 0
1 0 1 0 0
0 0 1 0 0

using set V = {I ∈ Bd+(FD) : I ∩ IS ≠ ∅} = {ACE, ADE}, we produce the following set of inequalities for C = V ∪ S = {ACE, ADE, CD} that are incorporated in the CSP:

u23 + u53 ≥ 2    (4)
u24 + u54 ≥ 2    (5)
u23·u24 + u33·u34 + u53·u54 + u103·u104 < 2    (6)

The first two inequalities correspond to itemsets ACE and ADE, which must remain
frequent in D, while the last one reflects the status of the sensitive itemset CD, which must
become infrequent in the sanitized outcome. It is easy to prove that if these inequalities are
incorporated in the CSP of Fig. 2, then the produced CSP is unsolvable. Thus, we apply
Algorithm 5 to alleviate the CSP from the inequalities of set V = {ACE, ADE}. However, as
mentioned earlier, we do not use the “batch” mode of operation of this algorithm (which
would remove both itemsets in V ), but instead, we use the “one-by-one” mode of operation
that selects to remove one among the inequalities returned in the same “batch.” As already
stated, at this point, the selection among the inequalities of the same batch is arbitrary.
Suppose we select to remove from the CSP the inequality that corresponds to itemset ACE
of the revised positive border. The resulting CSP is then solvable. Among the alternative
possible solutions, assume that we select the one having u34 = u53 = u104 = 0 and u23 = u24 = u33 = u54 = u103 = 1. Since the criterion function of the CSP requires
the maximization of the number of binary variables that are set to “1”, we attain a solution
that has the minimum possible variables set to “0”. This solution (having a distance of 3)
is depicted in Table 7; since it is not exact, we name our database DH and proceed to the
second phase of the approach.
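The infeasibility claim, and the effect of removing the ACE inequality, can be checked mechanically; the following enumeration sketch (ours) tests every binary assignment of the eight variables against inequalities (4)–(6):

```python
from itertools import product

vars_ = ["u23", "u24", "u33", "u34", "u53", "u54", "u103", "u104"]
constraints = {
    "ACE": lambda v: v["u23"] + v["u53"] >= 2,                              # Eq. (4)
    "ADE": lambda v: v["u24"] + v["u54"] >= 2,                              # Eq. (5)
    "CD":  lambda v: (v["u23"]*v["u24"] + v["u33"]*v["u34"]
                      + v["u53"]*v["u54"] + v["u103"]*v["u104"]) < 2,       # Eq. (6)
}

def feasible(active):
    return any(all(constraints[c](dict(zip(vars_, bits))) for c in active)
               for bits in product([0, 1], repeat=len(vars_)))

print(feasible({"ACE", "ADE", "CD"}))   # False: the full CSP has no solution
print(feasible({"ADE", "CD"}))          # True: dropping the ACE inequality makes it solvable
```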
As we can easily observe, in database DH of Table 7, itemset H = {AC E} is hidden, since
its support has dropped to one. In an attempt to make this itemset frequent again, we
create an intermediate form of database DH in which in all transactions that do not support
items A, C, or E, we substitute the corresponding zero entries with binary variables u nm .
This leads to the database depicted in Table 8.
As a next step, we create the corresponding inequalities for the itemsets of set C , by taking
into account the itemsets of sets H and Bd−(FD) = {AB, BC, BD, BE, CD}, which are not foreign to IH = {A, C, E}. Thus, we have the following set of produced inequalities:
not foreign to I H = {A, C, E}. Thus, we have the following set of produced inequalities:


Table 8 Intermediate form of database DH

A     B  C    D  E
1     0  1    0  u15
1     0  1    1  1
u31   0  1    0  u35
u41   1  u43  0  1
1     0  u53  1  1
u61   0  u63  1  1
u71   0  1    0  u75
1     1  u83  0  u85
1     0  1    0  u95
u101  0  1    0  u105

Table 9 Database D produced by the two-phase iterative approach

A B C D E
1 0 1 0 1
1 0 1 1 1
0 0 1 0 0
0 1 0 0 1
1 0 0 1 1
0 0 0 1 1
0 0 1 0 0
1 1 0 0 0
1 0 1 0 0
0 0 1 0 0

ACE: u15 + u31·u35 + u41·u43 + u53 + u61·u63 + u71·u75 + u83·u85 + u95 + u101·u105 ≥ 1    (7)
AB:  u41 < 1 ⇒ u41 = 0    (8)
BC:  u43 + u83 < 2 ⇒ u43 = 0 ∨ u83 = 0    (9)
BE:  u85 < 1 ⇒ u85 = 0    (10)
CD:  u53 + u63 < 1 ⇒ u53 = u63 = 0    (11)

Notice that no inequality is produced from itemset BD, since it is foreign to set IH. Moreover, it is easy to prove that this set of inequalities is solvable, yielding two exact solutions, each of which requires only one binary variable to become “1”: either u15 or u95. These two solutions are actually the same, since the involved transactions (first or ninth) do not differ from one another and both solutions apply the same modification to either of them. Table 9 presents the solution where u15 = 1.
Figure 8 shows the original borderline along with the borderlines of the databases produced as an output of the two phases of the algorithm. Notice that the output of the second phase in the first iteration of the algorithm yields an exact solution. Thus, with a limiting factor of just one iteration, the two-phase iterative algorithm has provided an exact solution that was missed by the inline algorithm.

Fig. 8 The two phases of iteration for the considered example, depicting the original borderline, the revised borderline after Phase 1, and the revised borderline after Phase 2 in the itemset lattice

7 A framework for exact knowledge hiding through parallelization

Performing knowledge hiding by using the inline, the hybrid, or the two-phase iterative
approach allows for the identification of exact solutions in most problem instances. However,
the cost of identifying an exact solution is high due to the solving of the involved CSPs.
The CDR approach of Fig. 4 somewhat alleviates this problem by eliminating high degree
constraints that participate in the CSP. However, the algorithms still suffer from very large
problem sizes that may render the whole hiding process computationally intractable.
In this section, we propose a framework for decomposition and parallel solving that can be
applied as part of the sanitization process of any of the exact hiding algorithms presented so far.
Our proposed framework operates in three phases, namely (i) the structural decomposition
phase, (ii) the decomposition of large individual components phase, and (iii) the parallel
solving of the produced CSPs. In what follows, we present the details that involve each phase
of the framework.

7.1 Phase I: structural decomposition of the CSP

The number of constraints in a CSP can be very large depending on the database properties,
the minimum support threshold used, and the number of sensitive itemsets. Moreover, the
fact that various initial constraints may incorporate products of u nm variables, thus having


Fig. 9 Decomposing large CSPs to smaller ones: the original CSP over variables U1...U1000 is split into independent components (e.g., over U1...U100, U101...U500, U501...U700, ..., U950...U1000)

Table 10 Database DO used in the example of structural decomposition

A B C D
1 1 0 0
1 1 0 0
1 0 0 0
1 1 0 0
0 1 0 1
1 0 1 1
0 0 1 1
0 0 1 1
1 0 0 0
0 0 0 1

a need to be replaced by numerous linear inequalities (using the CDR approach), makes the
whole BIP problem tougher to solve. There is, however, a nice property in the CSPs that we
can use to our benefit, that is, decomposition.
Based on the divide and conquer paradigm, a decomposition approach allows us to divide
a large problem into numerous smaller ones, solve these new subproblems independently,
and combine the partial solutions to attain the exact same overall solution. The property of the
CSPs, which allows us to consider such a strategy, lies behind the optimization criterion that
is used. Indeed, one can easily notice that the criterion of maximizing (equiv. minimizing) the
summation of the binary u nm variables is satisfied when as many u nm variables as possible are
set to one (equiv. zero). This can be established independently, provided that the constraints that participate in the CSP allow for an appropriate decomposition. The approach we follow for the initial decomposition of the CSP is similar to the decomposition structure identification algorithm presented in [23], although applied at a “constraints” rather than a “transactions” level. As demonstrated in Fig. 9, the output of structural decomposition, when applied to the original CSP, is a set of smaller CSPs that can be solved independently. An example will allow us to better demonstrate how this process works.
Consider database DO of Table 10. By performing frequent itemset mining in DO
using frequency threshold minf = 0.3, we compute the following set of large itemsets:


Table 11 Intermediate form of DO

  A  B    C    D
  1  u12  0    0
  1  u22  0    0
  1  0    0    0
  1  u42  0    0
  0  u52  0    1
  1  0    u63  u64
  0  0    u73  u74
  0  0    u83  u84
  1  0    0    0
  0  0    0    1

Fig. 10 The Constraints Satisfaction Problem formulation of the example:

  maximize (u12 + u22 + u42 + u52 + u63 + u64 + u73 + u74 + u83 + u84)

  subject to
    u12 + u22 + u42 + u52 < 3
    u63 · u64 + u73 · u74 + u83 · u84 < 3
    u63 + u73 + u83 ≥ 3
    u64 + u74 + u84 ≥ 1

FDO = {A, B, C, D, AB, CD}. Suppose that we want to hide the sensitive itemsets in S = {B, CD}, using for instance the inline approach.5 Then, we have that:

  Smax = {B, AB, CD}                (12)
  Bd+(F′D) = {A, C, D}              (13)
  V = {C, D} ⊂ Bd+(F′D)             (14)
The intermediate form of this CSP is shown in Table 11 and its BIP formulation in Fig. 10.
Table 12 presents the various constraints cr along with the variables that they control. As we
can observe, we can cluster the various constraints into disjoint sets based on the variables
that they involve. In our example, we can identify two such clusters of constraints, namely
M1 = {c1 } and M2 = {c2 , c3 , c4 }. Notice that none of the variables in each cluster of
constraints is contained in any other cluster. Thus, instead of solving the entire problem of Fig.
10, we can solve the two subproblems shown in Fig. 11, yielding, when combined, the same
solution as the one of the initial CSP: u12 = u22 = u63 = u64 = u73 = u74 = u83 = 1 and u42 = u52 = u84 = 0.
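To make the clustering step concrete, the following is a minimal sketch of structural decomposition, expressed in Python purely for illustration (the actual implementation of Sect. 9 is in Perl and C): constraints are grouped into independent clusters through a union-find structure over the unm variables they share. The function and variable names are illustrative, and the example constraint sets mirror those of Fig. 10.

  # A sketch of structural decomposition: constraints sharing no u_nm variables
  # end up in different clusters, and each cluster forms an independent CSP.
  def structural_decomposition(constraints):
      """constraints: dict mapping a constraint id to the set of variables it uses.
      Returns a list of clusters, each a set of constraint ids."""
      parent = {}

      def find(x):
          while parent[x] != x:
              parent[x] = parent[parent[x]]   # path halving
              x = parent[x]
          return x

      def union(a, b):
          ra, rb = find(a), find(b)
          if ra != rb:
              parent[ra] = rb

      # one union-find node per variable
      for variables in constraints.values():
          for v in variables:
              parent.setdefault(v, v)
      # variables appearing in the same constraint belong to the same cluster
      for variables in constraints.values():
          first, *rest = list(variables)
          for v in rest:
              union(first, v)
      # group constraints by the root of any one of their variables
      clusters = {}
      for cid, variables in constraints.items():
          root = find(next(iter(variables)))
          clusters.setdefault(root, set()).add(cid)
      return list(clusters.values())

  # Constraints of Fig. 10: c1 involves the B-related variables, c2-c4 the CD-related ones.
  example = {
      "c1": {"u12", "u22", "u42", "u52"},
      "c2": {"u63", "u64", "u73", "u74", "u83", "u84"},
      "c3": {"u63", "u73", "u83"},
      "c4": {"u64", "u74", "u84"},
  }
  print(structural_decomposition(example))   # two clusters: {c1} and {c2, c3, c4}

Each returned cluster corresponds to one of the two subproblems of Fig. 11 and can be handed to the BIP solver on its own.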

7.2 Phase II: decomposition of large independent components

The structural decomposition of the original CSP allows one to divide the original large
problem into a number of smaller subproblems, which can be solved independently, thus
5 We need to mention that it is of no importance which methodology will be used to produce the CSP, apart
from the obvious fact that some methodologies may produce CSPs that are better decomposable than those
constructed by other approaches. However, the structure of the CSP also depends on the problem instance,
and thus it is difficult to know in advance which algorithm is bound to produce a better decomposable CSP.


Table 12 Constraints matrix for the CSP

        c1   c2   c3   c4
  u12   X
  u22   X
  u42   X
  u52   X
  u63        X    X
  u64        X         X
  u73        X    X
  u74        X         X
  u83        X    X
  u84        X         X

Fig. 11 The equivalent CSPs for the given example:

  maximize (u12 + u22 + u42 + u52)
  subject to u12 + u22 + u42 + u52 < 3
  where {u12, u22, u42, u52} ∈ {0, 1}

and

  maximize (u63 + u64 + u73 + u74 + u83 + u84)
  subject to
    u63 · u64 + u73 · u74 + u83 · u84 < 3
    u63 + u73 + u83 ≥ 3
    u64 + u74 + u84 ≥ 1
  where {u63, u64, u73, u74, u83, u84} ∈ {0, 1}

greatly reducing the runtime needed to attain the overall solution. However, as can be noticed, both (i) the number of subproblems and (ii) the size of each subproblem are entirely dependent on the underlying CSP and the structure of the constraints matrix. This means
that there exist problem instances, which are not decomposable, and other instances, which
experience a notable imbalance in the size of the produced components. Thus, in what follows,
we present two methodologies, which allow us to decompose large individual components
that are nonseparable through the structural decomposition approach. In both methods, our
goal is to minimize the number of variables that are shared among the newly produced
components, which are now dependent. What allows us to proceed in such a decomposition
is the binary nature of the variables involved in the CSPs, a fact that we can use to our benefit
to minimize the different problem instances that need to be solved to produce the overall
solution of the initial problem.

7.2.1 Method I: decomposition using articulation points

To further decompose an independent component, we need to identify the least amount of u nm


variables, which, when discarded from the various inequalities of this CSP, produce a CSP
that is structurally decomposable. To find these u nm s we proceed as follows. First, we create
an undirected graph G (V, E) in which each vertex v ∈ V corresponds to a u nm variable, and
each edge e ∈ E connects vertexes that participate in the same constraint. Graph G can be


built in linear time and provides us with an easy way to model the network of constraints and
involved variables in our input CSP. Since we assume that our input CSP is not structurally
decomposable, graph G will be connected.
After creating the constraints graph G , we identify all its articulation points (a.k.a. cut-
vertexes). The rationale here is that removal of a cut-vertex will disconnect graph G , and the
best cut-vertex u nm will be the one that leads to the largest number of connected components
in G . Each of these components will then itself constitute a new subproblem to be solved
independently from the others. To identify the best articulation point, we proceed as follows.
As is already known, a fast way to compute the articulation points of a graph is to traverse
it by using DFS. By adding a counter to a table of vertexes each time we visit a node, we
can easily keep track of the number of components that were identified so far. At the end
of the algorithm, along with the identified articulation points, we can have knowledge of
the number of components by which each of these articulation points decomposes the initial
graph. This operation can proceed in linear time O(V + E).
After identifying the best articulation point, our next step is to remove the corresponding
u nm variable from graph G . Then, each component of the resulting graph corresponds to a new
subproblem (i.e., a new CSP) that can be derived in linear time and solved independently. To
provide the same solution as the original CSP, the solutions of the various created subproblems
need to be cross-examined, a procedure that is further explained in Sect. 7.3.
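A minimal sketch of this step is given below, in Python for illustration. It uses the standard DFS low-link computation to detect articulation points and, for each, records how many DFS subtrees its removal would split off, so that the cut-vertex yielding the largest number of components can be selected; the function name and graph representation are illustrative only.

  # A sketch of cut-vertex selection: classic DFS-based articulation point detection,
  # extended to count how many components the removal of each cut-vertex would create.
  import sys
  from collections import defaultdict

  def best_articulation_point(adj):
      """adj: dict mapping each vertex to the set of its neighbours (connected, undirected graph).
      Returns the cut-vertex whose removal yields the most components, or None if none exists."""
      sys.setrecursionlimit(max(10000, 2 * len(adj)))
      disc, low = {}, {}
      split_count = defaultdict(int)   # split_count[u] + 1 = components after removing u
      timer = [0]

      def dfs(u, parent):
          disc[u] = low[u] = timer[0]; timer[0] += 1
          children = 0
          for v in adj[u]:
              if v == parent:
                  continue
              if v in disc:                     # back edge
                  low[u] = min(low[u], disc[v])
              else:
                  children += 1
                  dfs(v, u)
                  low[u] = min(low[u], low[v])
                  # the subtree rooted at v cannot reach above u without u itself
                  if parent is not None and low[v] >= disc[u]:
                      split_count[u] += 1
          if parent is None and children > 1:   # DFS root with several subtrees
              split_count[u] = children - 1

      dfs(next(iter(adj)), None)
      if not split_count:
          return None                           # no single cut-vertex exists
      return max(split_count, key=split_count.get)

For the graph of Fig. 12, such a selection would return the cut-vertex h, whose removal produces three components.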
A final step to be addressed involves the situation in which no single cut-vertex can
be identified in the graph. If such a case appears, we choose to proceed heuristically to keep the runtime of the algorithm low. Our empirical approach is based on the premise that nodes having high degrees in graph G are more likely than others to correspond to cut-
vertexes. For this reason, we choose to compute the degree of each vertex u ∈ V in graph
G(V, E) and identify the one having the maximum degree. Let v = arg max_{u∈V} degree(u) be the vertex whose degree is the maximum among all vertexes in the graph. Then,
among all neighbors of v, we identify the one having the maximum degree and proceed to
remove both vertexes from the graph. As a final step, we use DFS to traverse the resultant
graph to examine if it is disconnected. The runtime of this approach is linear in the number of
vertexes and edges of graph G . If the resultant graph remains connected, we choose to leave
the original CSP as is and make no further attempt to decompose it. Figure 12 demonstrates an
example of decomposition using articulation points. In this graph, we denote as “cut-vertex”
the vertex, which, when removed, leads to a disconnected graph having the maximum number
of connected components (here 3).
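The heuristic fallback described above can be sketched in the same spirit: pick the maximum-degree vertex and its maximum-degree neighbour, remove both, and check with a traversal whether the remaining graph is disconnected; if it is not, the CSP is left as is. The Python code below is illustrative only.

  # A sketch of the degree-based heuristic used when no single cut-vertex exists.
  def heuristic_split(adj):
      """adj: dict mapping each vertex to the set of its neighbours (connected, undirected graph).
      Returns the pair of removed vertexes if their removal disconnects the graph, else None."""
      v = max(adj, key=lambda u: len(adj[u]))        # vertex of maximum degree
      w = max(adj[v], key=lambda u: len(adj[u]))     # its maximum-degree neighbour
      removed = {v, w}
      remaining = [u for u in adj if u not in removed]
      if not remaining:
          return None
      # depth-first traversal of the remaining vertexes
      seen, stack = {remaining[0]}, [remaining[0]]
      while stack:
          u = stack.pop()
          for x in adj[u]:
              if x not in removed and x not in seen:
                  seen.add(x)
                  stack.append(x)
      # if every remaining vertex was reached, the graph is still connected
      return removed if len(seen) < len(remaining) else None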

7.2.2 Method II: decomposition using weighted graph partitioning

One of the primary disadvantages of decomposition using articulation points is the fact that
we have limited control over (i) the number of components in which our initial CSP will
eventually split, and (ii) the size of each of these components. This fact may lead to low
CPU utilization in a parallel solving environment. For this reason, we present an alternative
decomposition strategy, which can break the original problem into as many subproblems
as we can concurrently solve, based on our underlying system architecture. The problem
formulation is once more tightly related to the graph modeling paradigm, but instead of
using articulation points, we rely on graph partitioning algorithms to provide us with the
optimal split.
By assigning each unm variable of the initial CSP to a vertex in our undirected graph, and each constraint c to a number of edges ec forming a clique in the graph (denoting the dependence of the unm variables involved), we proceed to construct a weighted version of the graph G presented in the previous section.

(Figure: a constraints graph with vertexes a–l in which removing the cut-vertex h disconnects the graph into three components.)

Fig. 12 An example of decomposition using articulation points

This weighted graph, hereon denoted as GW, has two types of weights: one associated with each vertex u ∈ VW, and one associated with each edge e ∈ EW. The weight of each vertex corresponds to the number of constraints in which it
participates in the CSP formulation. On the other hand, the weight of each edge in the graph
denotes the number of constraints in which the two vertexes (it connects) appear together.
Using a weighted graph partitioning algorithm, such as the one provided by METIS [20],
one can decompose the graph into as many parts as the number of available processors that
can be used to concurrently solve them. The rationale behind the applied weighted scheme
is to ensure that the connectivity between vertexes belonging to different parts is minimal.
Figure 13 demonstrates a three-way decomposition of the original CSP, using weighted graph
partitioning.
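A minimal sketch of the construction of GW follows (Python, for illustration): vertex weights count the constraints in which each variable participates, and edge weights count the constraints in which two variables co-occur. The resulting weight maps could then be converted to the input format of a k-way partitioner such as METIS [20]; the constraint sets shown are those of Fig. 10.

  # A sketch of building the weighted constraints graph G^W of this section.
  from itertools import combinations
  from collections import Counter

  def build_weighted_graph(constraints):
      """constraints: iterable of variable sets, one set per constraint.
      Returns (vertex_weights, edge_weights) as Counters."""
      vertex_w, edge_w = Counter(), Counter()
      for variables in constraints:
          for v in variables:
              vertex_w[v] += 1                          # number of constraints containing v
          for a, b in combinations(sorted(variables), 2):
              edge_w[(a, b)] += 1                       # clique over the constraint's variables
      return vertex_w, edge_w

  vw, ew = build_weighted_graph([
      {"u12", "u22", "u42", "u52"},
      {"u63", "u64", "u73", "u74", "u83", "u84"},
      {"u63", "u73", "u83"},
      {"u64", "u74", "u84"},
  ])
  print(vw["u63"], ew[("u63", "u73")])   # u63 appears in 2 constraints; u63 and u73 co-occur in 2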

7.3 Phase III: parallel solving of the produced CSPs

Breaking a dependent CSP into a number of components (using one of the strategies mentioned earlier) is a procedure that should occur only if the CSP's size is large enough to be worth the cost of the decomposition. For this reason, it is necessary to define a function FS to calculate
the size of a CSP and a threshold above which the CSP should be decomposed. We choose
function FS to be a weighted sum of the number of unm variables involved in the CSP and the number of associated constraints C. The weights are problem-dependent. Thus,

  FS = w1 · |unm| + w2 · |C|        (15)

Our problem-solving strategy proceeds as follows. First, we apply structural decomposition on the original CSP, and we distribute each component to an available processor. These
components can be solved independently of each other. The final solution (i.e., the value
of the objective for the original CSP) will equal the sum of the values of the individual
objectives; thus, the master node that holds the original CSP should wait to accumulate the
solutions returned by the servicing nodes.


(Figure: a weighted constraints graph over vertexes a–f, with vertex weights wa–wf and edge weights on the connecting edges, partitioned into three components.)

Fig. 13 An example of a three-way decomposition using weighted graph partitioning

Each node in our system is allowed to choose other nodes and assign them some computation. Whenever a node assigns jobs to other nodes, it waits to receive the results of the processing and then applies the needed function to create the overall outcome (as if it had solved the entire job itself). After receiving an independent component, each processor applies the function FS to its assigned CSP and decides whether it is essential to proceed with further decomposition. If this is the case, then it decomposes the CSP using one of
the two schemes presented earlier (i.e., decomposition using articulation points or weighted
graph partitioning) and assigns the newly created CSPs, each to an available processor. A
mechanism that keeps track of the jobs distribution to processors and their status (i.e., idle
vs. occupied) is applied to allow for the best possible CPU utilization. The same procedure
continues until all constructed CSPs are below the user-defined size threshold, and therefore
do not need to be further decomposed.
At any given time, the processors contain a number of independent components and a
number of mutually dependent components. In the case of the independent components,
as mentioned earlier, the value of the objective function for the overall CSP is attained by
summing up the values of the individual objectives. However, in the case of dependent CSPs,
the situation is more complex. To handle such circumstances, we refer to a variable unm that appears in two or more dependent CSPs as a border variable. Such a variable was either the best articulation point selected by the first strategy or a vertex that lay at the boundary of two different components, identified using the graph partitioning algorithm. Border variables need to be checked for all possible values they can attain to provide us with the exact same solution as the one of solving the original CSP. Suppose that, for a given decomposition instance, there exist p such border variables. Then, we have 2^p possible value assignments.


(Figure: the three components C1, C2, and C3 are solved once with the cut-vertex h set to 0 and once with h set to 1, producing objectives O1 and O2; the overall objective is max(O1, O2).)

Fig. 14 An example of parallel solving after decomposition

For each possible assignment, we solve the corresponding CSPs, whose objectives and constraints contain, apart from the unm variables of the nonborder cases, the values of the currently tested assignment for the p border variables. After solving the CSPs for each assignment, we proceed to sum up the resulting objective values. As one can observe, the final solution corresponds to the maximum value among the different summations produced by the possible assignments.
To make matters clearer, assume that at some point in time a processor receives a CSP that
needs to be solved and finds that its size is greater than the minimum size threshold; thus, the
CSP has to be decomposed. Suppose that applying the first strategy yields a decomposition
involving two border variables u 1 and u 2 and leads to two components C A and C B . Each
component is then assigned to a processor to solve it; all possible assignments should be
tested. Thus, we have four different problem instances, namely C00, C01, C10, C11, where Cxy denotes the problem instance in which u1 = x and u2 = y; the assignments of the remaining variables are left to the solver. Given two processors, the first one needs to solve these four
instances for C A , whereas the second one needs to solve them for C B . Depending on the
load-balancing strategy, each processor is capable of sending part of the computation (i.e.,
some instances) to other processors to fine-grain the whole process. Now suppose that the
objective values for C A,00 and C B,00 were found. The objective value for problem instance
C00 will then be the summation of these two objectives. To calculate the overall objective
value and identify the solution of the initial CSP, we need to identify the maximum among the
objective values of all problem instances. An example of parallel solving after the application
of a decomposition strategy is depicted in Fig. 14, where one can notice that the solution of
the initial CSP is provided by examining, for all involved CSPs, the two potential values of
the selected cut-vertex h (i.e., solving each CSP for h = 0 and h = 1). The overall objective
corresponds to the maximum of the two objectives, an argument that is justified by the binary
nature of variable h.
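A minimal sketch of this cross-examination follows (Python, for illustration). The helper solve_component stands in for an actual call to the BIP solver on one sub-CSP with the border variables fixed, and is assumed rather than defined here.

  # A sketch of the cross-examination of border variables: try every 0/1 assignment,
  # solve each dependent component under that assignment, sum the component objectives,
  # and keep the assignment with the largest summation.
  from itertools import product

  def solve_with_border_variables(components, border_vars, solve_component):
      """components: list of sub-CSP descriptions (opaque to this sketch).
      border_vars: names of the p variables shared among the components.
      solve_component(csp, fixed): assumed helper returning the optimal objective
      of one sub-CSP with the border variables fixed as in 'fixed'."""
      best_value, best_assignment = float("-inf"), None
      for bits in product((0, 1), repeat=len(border_vars)):
          fixed = dict(zip(border_vars, bits))
          total = sum(solve_component(csp, fixed) for csp in components)
          if total > best_value:
              best_value, best_assignment = total, fixed
      return best_value, best_assignment

For the example of Fig. 14, components would hold C1, C2, and C3, border_vars would be ["h"], and the returned value would equal max(O1, O2).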

8 Quantifying the privacy offered by the exact hiding algorithms

The hiding algorithms presented in Sects. 5 and 6 are based on the principle of minimum
harm, which requires the minimum amount of modifications to be made to the original


Fig. 15 A layered approach to quantifying the privacy offered by the hiding algorithms

database to facilitate sensitive knowledge hiding. As an effect, in most cases (depending on the problem at hand), the sensitive itemsets are expected to be positioned just below
the revised borderline in the computed sanitized database D. However, the selection of the
minimum support threshold based on which the hiding is performed can lead to radically
different solutions, some of which are bound to be superior to others in terms of offered
privacy. In what follows, we present a layered approach that enables the owner of the data to
quantify the privacy that is offered on a given database by a hiding algorithm. Assuming that
an adversary has no knowledge regarding which of the infrequent itemsets are the sensitive ones, this approach can compute the disclosure probability for the hidden sensitive knowledge,
given the sanitized database and a minimum support threshold.
Figure 15(i) demonstrates the proposed layered approach as applied on a sanitized database
D. The support axis is partitioned into two regions with respect to the minimum support
threshold msup that is used for the mining of the frequent itemsets in D. In the upper region,
Layer 0 contains all the frequent itemsets found in D after the application of a frequent
itemset mining algorithm like Apriori. The value of MSF indicates the maximum support
of a frequent itemset in D. The region starting just below msup contains all the infrequent
itemsets, including the sensitive ones, provided that they were appropriately covered up by
the applied hiding algorithm. In what follows, we proceed to further partition this region into
three layers:

Layer 1 This layer spans from the infrequent itemsets having a maximum support (MSI) to
the sensitive itemsets with a maximum support (MSS), excluding the latter ones.
It models the “gap” that may exist below the borderline, either due to the use of
a margin of safety to better protect the sensitive knowledge (as is the typical case
in various hiding approaches, e.g., [31]) or due to the properties of the database
and the sensitive itemsets that were selected to be hidden. We consider this layer
to contain ψ itemsets.
Layer 2 This layer spans from the sensitive itemsets having a maximum support (MSS)
to the sensitive itemsets with the minimum support (mSS) inclusive. It contains
all the sensitive knowledge that the owner wishes to protect, possibly along with some nonsensitive infrequent itemsets. We consider this layer to contain s itemsets, out of which S are sensitive.

Fig. 16 The modified CSP for the inline algorithm that guarantees increased safety for the hiding of sensitive knowledge:

  maximize Σ_{unm ∈ U} unm

  subject to
    Σ_{Tn ∈ D{X}} Π_{Im ∈ X} unm < msup − x,   ∀X ∈ S
    Σ_{Tn ∈ D{R}} Π_{Im ∈ R} unm ≥ msup,       ∀R ∈ V

Layer 3 This layer collects the rest of the infrequent itemsets, starting from the one having
the maximum support just below mSS and ending at the infrequent itemset with
the minimum support (mSI) inclusive. This layer is assumed to contain r itemsets.
Given the layered partitioning of the itemsets in D with respect to their support values, we
argue that the quality of a hiding algorithm depends on the position of the various infrequent
itemsets in the layers 1–3. Specifically, let x denote the distance (from msup) below the
borderline where an adversary tries to locate the sensitive knowledge (e.g., by mining database
D using support threshold msup − x). Then, estimator Ẽ provides the mean probability of
sensitive knowledge disclosure, defined as follows:

  Ẽ = 0,                                                                    if x ∈ [0 … ψ]

  Ẽ = [ S · (msup − x)/(ψ + MSS − mSS + 1) ] / [ ψ + s · (msup − x)/(ψ + MSS − mSS + 1) ],
                                                                            if x ∈ (ψ … ψ + MSS − mSS + 1]

  Ẽ = S / [ ψ + s + r · (msup − x)/(msup − mSI + 1) ],                      if x ∈ (ψ + MSS − mSS + 1 … msup − mSI + 1]      (16)

By computing Ẽ for the sanitized database D, the owner of DO can gain an in-depth understanding regarding the degree of protection that is offered on the sensitive knowledge in D.
Furthermore, he or she may decide on how much lower (with respect to the support) should
the sensitive itemsets be located in D, such that they are adequately covered up. As a result, a
hiding methodology can be applied to the original database DO to produce a sanitized version
D that meets the newly imposed privacy requirements. Given the presented exact approaches
to sensitive knowledge hiding, such a methodology can be implemented as follows:
1. The database owner uses the probability distribution Ẽ to compute the value of x that
guarantees maximum safety of the sensitive knowledge.
2. An exact knowledge hiding approach is selected, and extra constraints are added to the formulated CSP to ensure that the support of the sensitive knowledge in the generated sanitized database drops below msup − x.
For example, in the case of the inline approach, the CSP of Fig. 16 guarantees the holding
of these requirements. Another possibility is to apply a postprocessing algorithm that will
increase the support of the infrequent itemsets of Layer 3 in the sanitized database D, such that
they move to Layer 2 (thus increasing the concentration of itemsets in the layer that contains the sensitive ones). On the negative side, it is important to mention that all these methodologies for increasing the safety of the sensitive knowledge have as an effect a decrease in the quality of the sanitized database with respect to its original counterpart. This brings up one of the most often discussed topics in knowledge hiding: the hiding quality versus the usability offered by the hiding algorithm.


Table 13 The sanitized version of the database presented in Table 11

  A  B  C  D
  1  1  0  0
  1  1  0  0
  1  0  0  0
  1  0  0  0
  0  0  0  1
  1  0  1  1
  0  0  1  1
  0  0  1  0
  1  0  0  0
  0  0  0  1

Figure 15(ii) demonstrates the proposed approach for the database of Table 13. As
expected, due to the minimum harm that is introduced by the exact hiding algorithms, both
sensitive itemsets B and C D are located just under the borderline. In this example, the size
of Layer 1 is zero (i.e., ψ = 0). Based on the estimator Ẽ, we compute that the probability that an adversary identifies the sensitive knowledge equals 2/3 when using x = 1 (equivalently, when mining the database using msup = 2). Since this value is high, the owner of the data could either (a) use the CSP formulation of Fig. 16 to constrain the support of the sensitive itemsets to at most 1, or (b) apply a methodology that increases the support of some of the itemsets in Layer 3 so that they move to Layer 2 (i.e., have a support of 2). Both approaches are
expected to introduce extra distortion to database DO but will also provide a better protection
of the sensitive knowledge.
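Under the piecewise form of Eq. (16) given above (which is a best-effort reading of the original layout), the estimator can be sketched as follows in Python; with the values of this example (ψ = 0, msup = 3, MSS = mSS = 2, s = 3, S = 2) and x = 1 it returns 2/3. The Layer 3 parameters passed in the call are not used by the middle branch and are shown only to complete the signature.

  # A sketch of the disclosure estimator of Eq. (16); the branch boundaries
  # follow the reconstruction above and should be treated as indicative.
  def disclosure_probability(x, msup, psi, MSS, mSS, mSI, s, S, r):
      if x <= psi:                                   # Layer 1 only: nothing sensitive is reachable
          return 0.0
      if x <= psi + MSS - mSS + 1:                   # the adversary reaches into Layer 2
          frac = (msup - x) / (psi + MSS - mSS + 1)
          return (S * frac) / (psi + s * frac)
      # the adversary reaches into Layer 3
      return S / (psi + s + r * (msup - x) / (msup - mSI + 1))

  # Example of this section: psi = 0, msup = 3, MSS = mSS = 2, s = 3, S = 2, x = 1.
  print(disclosure_probability(x=1, msup=3, psi=0, MSS=2, mSS=2, mSI=1, s=3, S=2, r=3))  # 0.666...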

9 Computational experiments and results

In this section, we provide the results of a set of experiments that we conducted to test our
proposed schemes on real-world data. In what follows, we present the datasets we used and
the different parameters involved in the testing process, and we provide experimental results
involving (i) the inline approach, (ii) the hybrid approach, and (iii) the two-phase iterative
approach. Finally, we provide a set of experiments involving the first step of the decomposition
process, namely the structural decomposition, where we demonstrate the significant gain in
the runtime of the hiding algorithms.

9.1 The experimental setup

The proposed algorithms were tested on three real-world datasets using different parameters, such as the minimum support threshold and the number/size of sensitive itemsets to hide. All
these datasets are publicly available through the FIMI repository located at
http://fimi.cs.helsinki.fi/.
Datasets BMS-WebView-1 and BMS-WebView-2 both contain click stream data from
the Blue Martini Software and were used for the KDD Cup 2000 [21]. The mushroom
dataset was prepared by Roberto Bayardo (University of California, Irvine) [5]. These datasets
demonstrate varying characteristics in terms of the number of transactions and items and the
average transaction lengths. Table 14 summarizes them.


Table 14 Characteristics of the used real-world datasets

  Dataset    N       M      Average transaction length
  BMS-1      59,602  497    2.50
  BMS-2      77,512  3,340  5.60
  Mushroom   8,124   119    23.00

Fig. 17 The Constraint Satisfaction Problem formulation used for experimentation:

  minimize Σ_{q ∈ [1, Q+SM], m ∈ [1, M]} uqm

  subject to
    Σ_{q=1}^{Q+SM} Π_{im ∈ I} uqm < thr,   ∀I ∈ Bd−(F′D)
    Σ_{q=1}^{Q+SM} Π_{im ∈ I} uqm ≥ thr,   ∀I ∈ Bd+(F′D)

  where thr = minf · (N + Q + SM) − sup(I, DO)

The primary bottleneck that we experienced in most of our experiments was the time
taken to run the frequent itemset mining algorithm, as well as the time needed to solve
the formulated CSPs through the application of BIP. Moreover, in all tested settings, the
thresholds of minimum support were properly selected to ensure an adequate number of frequent itemsets, and the sensitive itemsets to be hidden were selected randomly among the frequent ones. We conducted several experiments trying to hide up to 20 sensitive 10-itemsets. Our source code was implemented in Perl and C, and all the experiments were run on a PC running Linux on an Intel Pentium D 3.2 GHz processor equipped with 4 GB of
main memory. All integer programs were solved using ILOG CPLEX 9.0 [17].
In all conducted experiments involving the hybrid approach, we formulated the CSP based
on Fig. 17 and used a Safety Margin of 10. As one can notice, this problem formulation is
similar to the one presented in Fig. 3, with the sole difference that the validity of the produced
transactions, as part of the extension DX of DO , is not controlled at the CSP level but instead
is left to be examined at a later point.
CPLEX provides us with the option of presolving the binary integer program, a very useful
feature that allows the reduction of the BIP’s size, the improvement of its numeric properties
(for example, by removing some inactive constraints or by fixing some nonbasic variables),
and also enables us to detect infeasibility early in the BIP’s solution. We used these beneficial
properties of presolving to allow for early actions when solving the CSPs.
To evaluate the two-phase iterative approach, we used a limiting factor of  = 5 to
control the number of iterations that the algorithm is allowed to execute. Moreover, due to
the resemblance of the two-phase iterative approach to the inline algorithm, in what follows,
we compare the two methodologies and derive a set of observations regarding their behavior.
The last set of conducted experiments involves the structural decomposition approach.
This approach can be applied to the constraints matrix of the CSP to identify groups of
constraints–variables that can be isolated. Thus, it leads to a group of CSPs that each can be
solved independently of the others. Through a set of experiments, we prove that the benefit of
this decomposition strategy can be substantial under certain circumstances. Furthermore, we
show that the measured loss in the runtime of the algorithm, in the case when decomposition
is infeasible, is acceptable.


(Figure: side effects per hiding scenario for the BBA, MaxMin2, Inline, and Hybrid algorithms; left panel: BMS-WebView-1 and BMS-WebView-2, right panel: Mushroom.)

Fig. 18 The quality of the produced hiding solutions in the three datasets

(Figure: runtime in seconds per hiding scenario for the BBA, MaxMin2, Inline, and Hybrid algorithms; left panel: BMS-WebView-1 and BMS-WebView-2, right panel: Mushroom.)

Fig. 19 The scalability of the hiding algorithms measured in the three datasets

9.2 Evaluating the three hiding approaches

We begin the experimental evaluation by testing the inline against the hybrid approach for
all three datasets. In particular, we test the quality of the two approaches along two primary
directions: (i) the side effects that are introduced to the original database due to the hiding
process and (ii) the scalability of the hiding algorithms.
Let notation a × b denote the hiding of a itemsets of length b. Figure 18 provides a
comparison of the hiding solutions identified by the inline and the hybrid algorithms against
the Border-Based Approach (BBA) of [32] and the Max-Min-2 approach of [25], along
the first dimension. As one can notice, the hybrid algorithm consistently outperforms the
three other schemes, with the inline approach being the second best. In most tested cases,
the heuristics failed to identify a solution bearing minimum side effects, while the inline
approach demonstrated on several occasions that an exact solution could not be attained
without extending the dataset.
Figure 19 provides a comparison of the four algorithms along the second dimension. As
one can observe, the scalability of the heuristic approaches is better when compared to the
exact approaches, since the runtime cost of the latter ones is primarily determined by the
time required by the BIP solver to solve the CSP. However, as shown in Fig. 18, the extra
runtime cost of solving the CSPs of the exact methodologies is worthwhile, since the quality
of the attained solutions is bound to be high.
Figure 20 presents the distance [14] (i.e., number of item changes) between the original
and the sanitized database that is required by each algorithm to facilitate knowledge hiding.
Since the two heuristic approaches and the inline algorithm operate in a similar fashion (i.e.,


(Figure: distance, i.e., number of item changes, per hiding scenario for the BBA, MaxMin2, and Inline algorithms on the Mushroom dataset.)

Fig. 20 The distance of the inline approach against those of MaxMin2 and BBA

by selecting transactions of the original database and excluding some items), it makes sense
to compare them in terms of the produced distances. From the comparison, it is evident that
the inline approach manages to minimize the number of item modifications, a result that can
be attributed to the optimization criterion of the generated CSPs. On the contrary, the hybrid
algorithm does not alter the original dataset but instead uses a database extension to (i) leave
unsupported the sensitive itemsets so as to be hidden in D, and (ii) adequately support the
itemsets of the revised positive border to remain frequent in D. For this reason, we argue
that the item changes (0s → 1s) that are introduced by the hybrid algorithm in DX should
not be attributed to the hiding task of the algorithm, but rather to its power to preserve the
revised positive border and thus eliminate the side effects. This important difference between
the hybrid algorithm and the other three approaches complicates their comparison in terms of
item modifications. However, due to the common way that both the inline and the hybrid
approaches model their produced CSPs, we argue that the property of minimum distortion of
the original database is bound to hold for the hybrid algorithm. As a result, both the inline and
the hybrid approaches better preserve the knowledge of the database by causing minimum
distortion to hide the sensitive knowledge.
An interesting insight from the conducted experiments involving the hybrid approach is the
fact that a Safety Margin of 10 was proved to be satisfactory for an exact solution. Comparing
the hybrid and the inline algorithms, we noticed that the hybrid algorithm is capable of better
preserving the quality of the border and thus produces superior solutions. Indeed, the hybrid
algorithm was able to identify exact solutions in the vast majority of tested cases, while
hiding of the same knowledge using the inline algorithm led to suboptimal solutions in
several instances. A suboptimal solution means that the quality of the border is not preserved
to the maximum extent; thus, some nonsensitive frequent patterns are lost as a side effect
of the hiding process. On the other hand, the hybrid approach is proved to be less scalable
than the inline algorithm and the two heuristics, particularly due to the large number of the
binary variables and associated constraints that are involved in the hiding process. However,
it is important to mention that as long as the given hiding problem decomposes to a CSP
that remains computationally manageable and a sufficiently large safety margin is used, the
hybrid algorithm is bound to identify a solution that will bear the least amount of side effects
to the original database. Thus, contrary to state-of-the-art approaches, the power of the hybrid


(Figure: distance per hiding scenario (1x2 to 2x4) for the Inline and 2-Phase Iterative algorithms on BMS-1, BMS-2, and Mushroom; stars mark suboptimal hiding solutions.)

Fig. 21 The distance between the inline and the two-phase iterative algorithm

algorithm is that it guarantees the least amount of side effects to an extended set of hiding
problems, when compared to the inline approach.
The next set of experiments involves the comparison of the inline approach against the
two-phase iterative algorithm. After the execution of the first iteration of the algorithm, we
keep track of the attained solution and the respective impact on the dataset. Then, we allow
the algorithm to proceed to up to  subsequent iterations. If the algorithm fails to identify
an exact solution after the  runs, we refer to the stored solution of the first iteration of the
algorithm to attain the sanitized database D. This way, we both limit the runtime of the algorithm to a level at which it remains tractable (through the use of ) and
ensure that this algorithm will constantly outperform the inline scheme and provide superior
hiding solutions.
Figure 21 presents the performance comparison of the two algorithms. A star above a
column in this graph indicates a suboptimal hiding solution. Based on the attained results, we
can make the following observations. First, by construction, the two-phase iterative scheme
is consistently superior to the inline algorithm, since its worst performance equals the performance of the inline scheme. Second, as the experiments indicate, there are several settings in
which the two-phase iterative algorithm finds an exact hiding solution with a small increment
in the distance compared to that of the inline approach. Third, by construction, the two-phase
iterative algorithm can capture all the exact solutions that were also identified by the inline


(Figure: parallel runtime, TSIA + Tsolve, versus serial runtime Tserial in seconds per hiding scenario on BMS-1, BMS-2, and Mushroom.)

Fig. 22 Performance gain through parallel solving, when omitting the V portion of the produced CSP

approach. This fact makes the two-phase iterative algorithm superior when compared to the
inline approach.

9.3 Evaluating the structural decomposition approach

In this section, we evaluate the structural decomposition approach that was proposed as part
of the decomposition and parallelization framework of Sect. 7.1. In all experiments, we use
the inline algorithm to produce the original CSP that facilitates knowledge hiding.
To conduct the experiments, we assume that we have all the necessary resources to proceed
to a full-scale parallelization of the initial CSP. This means that if our original CSP can
potentially break into P independent parts, then we assume the existence of P available
processors that can run independently, each one solving one resultant CSP. Thus, the overall
runtime of the hiding algorithm will equal the summation of (i) the runtime of the serial
algorithm that produced the original CSP, (ii) the runtime of the Structure Identification
Algorithm (SIA) that decomposed the original CSP into numerous independent parts, (iii) the
time that is needed to communicate each of the resulting CSPs to an available processor, (iv)
the time needed to solve the largest of these CSPs, (v) the communication time needed to return
the attained solutions to the original processor (hereon called “master”) that held the whole
problem, and finally (vi) the time needed by the master processor to calculate the summation
of the objective values returned to compute the overall solution of the problem, that is:

Toverall = THA + TSIA + Tspread + Tsolve + Tgather + Taggregate (17)

In the following experiments, we capture the runtime of (ii) and (iv), namely TSIA and Tsolve,
since we consider both the communication overhead (Tspread + Tgather ) and the overall
solution calculation overhead (Taggregate ) to be negligible when compared to these run
times. Moreover, the runtime of (i) (i.e., THA ) does not change in the case of parallelization,
and therefore, its measurement in these experiments is of no importance. To allow us to
compute the benefit of parallelization, we include in the results the runtime Tserial of solving
the entire original CSP without prior decomposition.
In our first set of experiments (presented in Fig. 22), we wanted to ensure breaking of
the original CSP into a controllable number of components, and thus, we excluded from
the original CSP all the constraints involving itemsets from set V (see Fig. 2). Based on
this methodology, to break the original CSP into P parts, one needs to identify P mutually
exclusive (in the universe of items used) itemsets to hide. However, based on the number of
supporting transactions for each of these itemsets in DO, the size of each produced component may vary significantly. As one can observe in Fig. 22, the time that was needed for the execution of the SIA algorithm and the identification of the independent components is negligible compared to the time needed for solving the largest of the resultant CSPs.


(Figure: parallel runtime, TSIA + Tsolve, versus serial runtime Tserial in seconds per hiding scenario on BMS-1, BMS-2, and Mushroom, for the full CSP including the V constraints.)

Fig. 23 Performance gain through parallel solving of the entire CSP

Moreover, by comparing the time needed for the serial and that needed for the parallel solving
of the original CSP, one can notice how beneficial the decomposition strategy is in reducing
the runtime that is required by the hiding algorithm. For example, in the 2 × 2 hiding scenario
for BMS-1, serial solving of the CSP requires 218 s, while parallel solving requires 165 s.
This means that by solving the CSP in parallel using two processors, we reduce the solution
time by 53 s.
In our second set of experiments, shown in Fig. 23, we included the V part of the CSP,
produced by the inline algorithm. As one can observe, there are certain situations in which
the original CSP cannot be decomposed (Tsolve = 0). In such cases, one has to apply either
the decomposition approach using articulation points or the weighted graph partitioning
algorithm, to parallelize the hiding process.

10 Conclusions and future work

In this paper, we investigated the issue of exact knowledge hiding and proposed three schemes
that are suitable for identifying exact solutions of high quality. We then introduced a decom-
position and parallelization framework, which can be applied to all the three schemes and
which dramatically improves the runtime of the hiding algorithm. Our proposed frame-
work uses structural decomposition to partition the original CSP into numerous independent
components. Furthermore, it offers two novel approaches for further breaking of these com-
ponents into a set of dependent CSPs. By exploiting the features of the objective function,
we provided a way of joining the partial solutions of the CSPs and deriving the overall hiding
solution. Finally, through experimental evaluation on three real-world datasets, we were able
to demonstrate both the effectiveness of the proposed hiding schemes toward achieving high
quality solutions and the benefit of the structural decomposition process toward speeding up
the hiding process.
As future work, we intend to shed light on several issues regarding the hiding process and
its parallelization, and provide well-founded answers to questions such as the following: (i)
When is it reasonable to further break an independent component? (ii) How can one ensure
a load-balanced breaking of a large component? (iii) Which is the most prominent load-
balancing strategy? (iv) How can we further reduce the inequalities involved in the CSP and
still attain an exact solution (provided that one exists)?

Acknowledgments The authors would like to thank Dimitrios Syrivelis and Yannis Tsolakis for their
valuable comments and overall support in this work. Moreover, we would like to thank the anonymous
reviewers for their thoughtful comments and suggestions that improved the quality of this work.


References

1. Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng (TKDE)
8(1):962–969
2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Databases (VLDB), pp 487–499
3. Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: Proceedings of the 2000 ACM SIGMOD
International Conference on Management of Data, pp 439–450
4. Atallah M, Bertino E, Elmagarmid A, Ibrahim M, Verykios VS (1999) Disclosure limitation of sensitive
rules. In: Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX),
pp 45–52
5. Bayardo R (1998) Efficiently mining long patterns from databases. In: Proceedings of the 1998 ACM
SIGMOD International Conference on Management of Data
6. Bertino E, Fovino IN, Provenza LP (2005) A framework for evaluating privacy preserving data mining
algorithms. Data Mining Knowl Discov (DMKD) 11(2):121–154
7. Cheung D, Xiao Y (1998) Effect of data skewness in parallel mining of association rules. In: Proceedings
of the 2nd Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data
Mining (PAKDD), pp 48–60
8. Clifton C, Kantarcioğlu M, Vaidya J (2002) Defining privacy for data mining. National Science Foundation
Workshop on Next Generation Data Mining (WNGDM), pp 126–133
9. Clifton C, Marks D (1996) Security and privacy implications of data mining. In: Proceedings of the 1996
ACM SIGMOD International Conference on Management of Data, pp 15–19
10. Dasseni E, Verykios VS, Elmagarmid AK, Bertino E (2001) Hiding association rules by using confidence
and support. In: Proceedings of the 4th International Workshop on Information Hiding, pp 369–383
11. Evfimievski A, Srikant R, Agrawal R, Gehrke J (2002) Privacy preserving mining of association rules. In:
Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp 343–364
12. Farkas C, Jajodia S (2002) The inference problem: a survey. ACM SIGKDD Exploration Newsl 4(2):6–11
13. Fienberg S, Slavkovic A (2005) Preserving the confidentiality of categorical statistical data bases when
releasing information for association rules. Data Mining Knowl Discov (DMKD) 11(2):155–180
14. Gkoulalas-Divanis A, Verykios VS (2006) An integer programming approach for frequent itemset hiding.
In: Proceedings of the 2006 ACM Conference on Information and Knowledge Management (CIKM)
15. Gkoulalas-Divanis A, Verykios VS (2007) A hybrid approach to frequent itemset hiding. In: Proceedings
of the 2007 IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp 297–304
16. Han E-H, Karypis G, Kumar V (1997) Scalable parallel data mining for association rules. In: Proceedings
of the 1997 ACM SIGMOD International Conference on Management of Data, pp 277–288
17. ILOG CPLEX 9.0 User’s Manual (2003) ILOG Inc, Gentilly, France
18. Kantarcioğlu M, Clifton C (2004) Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Trans Knowl Data Eng (TKDE) 16(9):1026–1037
19. Kargupta H, Datta S, Wang Q, Sivakumar K (2005) Random-data perturbation techniques and privacy-
preserving data mining. Knowl Inform Syst (KAIS) 7(4):387–414
20. Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs.
SIAM J Sci Comput 20(1):359–392
21. Kohavi R, Brodley C, Frasca B, Mason L, Zheng Z (2000) KDD-Cup 2000 organizers’ report: Peeling
the onion. SIGKDD Explorations 2(2): 86–98. http://www.ecn.purdue.edu/KDDCUP
22. Lee G, Lee K, Chen A (2001) Efficient graph-based algorithms for discovering and maintaining association rules in large databases. Knowl Inform Syst (KAIS) 3(3):338–355
23. Menon S, Sarkar S, Mukherjee S (2005) Maximizing accuracy of shared databases when concealing
sensitive patterns. Inform Syst Res 16(3):256–270
24. Morgenstern M (1988) Controlling logical inference in multilevel database and knowledge-base systems.
In: Proceedings of the 1988 IEEE Symposium on Security and Privacy, pp 245–255
25. Moustakides G, Verykios VS (2006) A max-min approach for hiding frequent itemsets. In: Proceedings
of the 6th IEEE International Conference on Data Mining (ICDM), pp 502–506
26. Oliveira SRM, Zaïane OR (2002) Privacy preserving frequent itemset mining. In: Proceedings of the 2002
IEEE International Conference on Privacy, Security and Data Mining (CRPITS), pp 43–54
27. Oliveira SRM, Zaïane OR (2003) Protecting sensitive knowledge by data sanitization. In: Proceedings of
the Third IEEE International Conference on Data Mining (ICDM), pp 211–218
28. Parthasarathy S, Zaki M, Ogihara M, Li W (2001) Parallel data mining for association rules on shared-
memory systems. Knowl Inform Syst (KAIS) 3(1):1–29


29. Pontikakis E, Theodoridis Y, Tsitsonis A, Chang L, Verykios VS (2004) A quantitative and qualitative
analysis of blocking in association rule hiding. In: Proceedings of the 2004 ACM Workshop on Privacy
in the Electronic Society (WPES), pp 29–30
30. Rizvi S, Haritsa JR (2002) Maintaining data privacy in association rule mining. In: Proceedings of the
28th International Conference on Very Large Databases (VLDB)
31. Saygin Y, Verykios VS, Clifton C (2001) Using unknowns to prevent discovery of association rules. ACM
SIGMOD Record 30(4):45–54
32. Sun X, Yu PS (2005) A border-based approach for hiding sensitive frequent itemsets. In: Proceedings of
the Fifth IEEE International Conference on Data Mining (ICDM), pp 426–433
33. Vaidya J, Clifton C (2002) Privacy preserving association rule mining in vertically partitioned data. In:
Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp 639–644
34. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004a) State-of-the-art in
privacy preserving data mining. ACM SIGMOD Record 33(1):50–57
35. Verykios VS, Elmagarmid AK, Bertino E, Saygin Y, Dasseni E (2004b) Association rule hiding. IEEE
Trans Knowl Data Eng (TKDE) 16(4):434–447
36. Xu S, Zhang J, Han D, Wang J (2006) Singular value decomposition based data distortion strategy for
privacy protection. Knowl Inform Syst (KAIS) 10(3):383–397
37. Yokoo M, Durfee E, Ishida T, Kuwabara K (1998) The distributed constraint satisfaction problem: formalization and algorithms. IEEE Trans Knowl Data Eng (TKDE) 10(5):673–685
38. Zaïane OR, El-Hajj M, Lu P (2001) Fast parallel association rule mining without candidacy generation.
In: Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM), pp 665–668
39. Zou Q, Chu W, Johnson D, Chiu H (2002) A pattern decomposition algorithm for data mining of frequent
patterns. Knowl Inform Syst (KAIS) 4(4):466–482

Author Biographies

Aris Gkoulalas-Divanis received the Diploma degree in Computer Science from the University of Ioannina, Greece, in 2003, and the MS degree from the University of Minnesota in 2005. Since 2005, he is a Ph.D. student in the Department of Computer and Communication Engineering at the University of Thessaly, Volos, Greece. Aris Gkoulalas-Divanis has served as a research assistant in both the Department of Computer Science and Engineering (2003–2005), University of Minnesota, and in the School of Informatics (2006), University of Manchester. Since 2001, he has participated in several IST/EU projects including Childcare, Citation, and GeoPKDD. His current research interests are in the fields of privacy in location-based services and privacy preserving data mining, with focus on methodologies for association rule hiding and trajectory hiding. He is a student member of the IEEE and the ACM.

Vassilios S. Verykios received the Diploma degree in Computer Engineering from the University of Patras, Greece, in 1992, and the MS and Ph.D. degrees from Purdue University in 1997 and 1999, respectively. In 1999, he joined the Faculty of Information Systems in the College of Information Science and Technology at Drexel University, Pennsylvania, as a tenure track assistant professor. Since 2005, he is an assistant professor in the Department of Computer and Communication Engineering at the University of Thessaly, Volos, Greece. His main research interests include knowledge-based systems, privacy and security in advanced database systems, data mining, data reconciliation, parallel computing, and performance evaluation of large-scale parallel systems. Dr. Verykios has published and presented over 50 papers in major refereed journals and in the proceedings of international conferences and workshops, and he has served in the program committees of several international scientific events. He is a member of IEEE, ACM, and UPE.
