
FMC: An Approach for Privacy Preserving OLAP

Ming Hua, Shouzhi Zhang, Wei Wang, Haofeng Zhou, Baile Shi
Fudan University, China {minghua, shouzhi_zhang, weiwang1, haofzhou, bshi}@fudan.edu.cn

Abstract. Preserving private information while still supporting thorough analysis is a significant issue in OLAP systems. One of the challenges is to prevent sensitive values from being inferred from more aggregated, non-sensitive data. This paper presents a novel algorithm, FMC, that eliminates the inference problem by hiding additional data besides the sensitive information itself, and proves that this additional information is both necessary and sufficient. The approach therefore provides users with as much information as possible while preserving security. The strategy does not affect the online performance of the OLAP system. Systematic analysis and experimental comparison are provided to show the effectiveness and feasibility of FMC.

1 Introduction

Online analytic processing (OLAP) is an important infrastructure for advanced data analysis and knowledge discovery. While most of the previous studies on OLAP focus on OLAP models, data cube and data warehouse construction, maintenance and compression, as well as efficient query answering methods, it is critical to investigate the problem of privacy preserving in OLAP query answering.

Example 1 (Motivation). Consider a table about the patient cases in some hospitals as shown in Table 1.

Table 1. A table about the patient cases

Hospital   Disease        Number of cases
Forest     Lung cancer    16
Forest     Diabetes       63
Memorial   Diabetes       87
Memorial   Heart attack   32

[Figure 1: lattice of the data cube. Top cell: <*,*> = 198. Middle level: <f,*> = 79, <m,*> = 119, <*,l> = 16, <*,d> = 150, <*,h> = 32. Bottom level (hidden values shown as #): <f,l>, <f,d>, <m,d>, <m,h>.]

Fig. 1. The data cubes based on Table 1

Suppose the hospitals do not want to make the number of cases of individual diseases public, but agree to share the total number of all cases in a hospital or the total number of cases of a certain disease in all hospitals. That is, in the data cube based on Table 1, the values of cells <f,l>, <f,d>, <m,d> and <m,h> should be hidden from users (as shown in Figure 1, where <f,l> stands for the cell <Forest, Lung cancer>, and similarly for the other cells).

A simple and direct security policy is to decline all access to the sensitive cells. However, such a policy is insufficient to preserve privacy. Since only part of the measure is hidden, the structure of the cube can be found out from the remaining columns of the fact table, so the sensitive values can be revealed through other unprotected cells. For example, the value of <f,l> is exactly the same as that of <*,l>, since <*,l> aggregates only this record. Moreover, subtracting the value of <f,l> from that of <f,*> discloses the value of <f,d>.

Now the problem becomes: can we devise a better security policy so that privacy is strictly preserved? Moreover, we want such a policy to hide as little information as possible. We call it the privacy preserving OLAP problem.

In this paper, we tackle the problem by hiding a minimal set of unprotected cells involved in determining the value of confidential cells, so that the precondition of information leakage no longer holds. For example, if we hide the cells <*,l> and <*,h> in Figure 1, the values of the sensitive cells <f,l>, <f,d>, <m,d> and <m,h> can no longer be obtained by accessing only the remaining unprotected cells.

Compared to the privacy control problems in statistical databases and data mining, there are several new challenges for the privacy preserving OLAP problem, and we make the following contributions.

1) Sensitive data items can be distributed at different granularity levels in OLAP. We propose a general model and solution that can handle this case.
2) It is crucial for OLAP systems to provide users with as much information as possible while protecting the sensitive data. We prove that our algorithm hides only the necessary data.
3) OLAP applications usually require short response time. We eliminate the inference before users interact with the system, so the algorithm does not affect the online performance of the OLAP system.

The rest of the paper is organized as follows. In Section 2, we formulate the problem of privacy preserving OLAP. Section 3 provides an overview of the solution. The key techniques are discussed in Section 4 and Section 5. Extensive experimental results are reported in Section 6. Finally, we draw the conclusion in Section 7.
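The inference described in Example 1 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout and the function name `infer_hidden_cells` are assumptions made here, and the numbers are the published aggregates of Figure 1.

```python
# Published (non-sensitive) aggregates from Fig. 1; the detail cells are hidden.
available = {
    ("f", "*"): 79,
    ("m", "*"): 119,
    ("*", "l"): 16,
    ("*", "d"): 150,
    ("*", "h"): 32,
    ("*", "*"): 198,
}


def infer_hidden_cells(cube):
    """Recover the four hidden detail cells using only published aggregates."""
    inferred = {}
    # <*,l> aggregates a single record, so it equals <f,l>.
    inferred[("f", "l")] = cube[("*", "l")]
    # <f,*> = <f,l> + <f,d>, so <f,d> follows by subtraction.
    inferred[("f", "d")] = cube[("f", "*")] - inferred[("f", "l")]
    # <*,h> aggregates a single record, so it equals <m,h>.
    inferred[("m", "h")] = cube[("*", "h")]
    # <m,*> = <m,d> + <m,h>, so <m,d> follows by subtraction.
    inferred[("m", "d")] = cube[("m", "*")] - inferred[("m", "h")]
    return inferred


hidden = infer_hidden_cells(available)
print(hidden)
```

All four values of Table 1 are recovered, which is why declining direct access alone does not protect them.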

1.1 Related Work

Inference control methods in statistical databases are classified into two categories [1]. Restriction-based techniques include auditing all queries [2], suppressing sensitive data [3], and so on. Perturbation-based techniques include adding noise to the source data or to the outputs to affect the precision of detail data [4].

Inference control for OLAP systems has received less attention. However, Lingyu Wang et al. have systematically studied this problem: [1] derives sufficient conditions for non-compromisability in sum-only data cubes; [5] discusses the inference problem caused by multi-dimensional range queries; [6] proposes a method to eliminate both unauthorized accesses and malicious inferences.

2 Problem Definition

A data cube consists of a set of dimensions and measures with aggregate functions defined on it. In this paper, we mainly focus on the SUM function. Each node of the data cube is called a cuboid, and a tuple in the cuboid is called a cell. Two cuboids $C_1$ and $C_2$ follow the partial order (i.e., $C_1 \preceq C_2$) iff on each dimension, either they share the same attribute, or $C_2$ has a higher-level attribute in the dimension hierarchy. In this case, we say $C_2$ is an ancestor of $C_1$, and $C_1$ is a descendant of $C_2$. $C_2$ is a father of $C_1$, and correspondingly $C_1$ is a son of $C_2$, if $C_1 \prec C_2$ and there is no cuboid $C$ such that $C_1 \prec C$ and $C \prec C_2$. These definitions apply to cells as well. In Example 1, cuboid <Hospital, Disease> $\preceq$ <Hospital, *>, and cell <f,l> $\preceq$ <f,*>.

Because of the multi-dimensional data model, access control in OLAP systems is defined on cuboids and cells. We define the confidential information as a forbidden set of the form $\{c_1, \ldots, c_m\}$, where $c_i$ is a cell of the data cube. We assume that the forbidden set includes all the confidential cells and their descendants, since a confidential cell could also be computed by simply aggregating all its descendants. All the cells not included in the forbidden set compose the available set, which is accessible to users. For example, the available set in Example 1 includes all the cells except <f,l>, <f,d>, <m,d> and <m,h>.

However, we have shown in Example 1 that some confidential information (such as <f,l> and <f,d>) can be obtained by combining the cells in the available set. We define the available set together with all the information derived from it as the available set closure.

Definition 1 [Available Set Closure]. Given an available set $A$, the Available Set Closure $C(A)$ is defined as:
1. If cell $c \in A$, then $c \in C(A)$;
2. If cell $c \in C(A)$, then $k \cdot c \in C(A)$, where $k$ is a real number;
3. If cells $c_1, c_2 \in C(A)$, then $c_1 + c_2 \in C(A)$.

When the available set closure and the forbidden set intersect, inference occurs. In this case, we also say that the forbidden set is compromised. The cells in the available set that cause the inference are called the source of the inference.

Definition 2 [Compromisability]. Given a data cube $L$ and a forbidden set $F$ in $L$, $F$ is compromised when $C(L - F) \cap F \neq \emptyset$.

To prevent the compromise, we hide some cells in the source, so that the sensitive cells can no longer be computed from the incomplete source. However, the hidden cells may themselves be inferred from cells at higher granularity levels; therefore, more cells should be hidden to protect them. Finally, we find a set of cells in addition to the forbidden set such that no cell outside them causes inference to the cells inside.
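Definitions 1 and 2 amount to a linear-span test: a forbidden cell is compromised when it can be written as a real linear combination of available cells. The following sketch makes that test concrete under one assumption introduced here for illustration, namely that each cell is represented by a 0/1 indicator vector over the base cells <f,l>, <f,d>, <m,d>, <m,h> of Example 1. The helper names `rank`, `in_closure` and `compromised` are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Exact rank of a list of row vectors (Gaussian elimination over fractions)."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * p for a, p in zip(m[i], m[r])]
        r += 1
    return r


def in_closure(available, target):
    """True if `target` is a linear combination of the vectors in `available`."""
    available = [list(v) for v in available]
    return rank(available + [list(target)]) == rank(available)


# Indicator vectors over the base cells (<f,l>, <f,d>, <m,d>, <m,h>).
cells = {
    "<f,*>": (1, 1, 0, 0), "<m,*>": (0, 0, 1, 1),
    "<*,l>": (1, 0, 0, 0), "<*,d>": (0, 1, 1, 0), "<*,h>": (0, 0, 0, 1),
    "<*,*>": (1, 1, 1, 1),
}
forbidden = {"<f,l>": (1, 0, 0, 0), "<f,d>": (0, 1, 0, 0),
             "<m,d>": (0, 0, 1, 0), "<m,h>": (0, 0, 0, 1)}


def compromised(available_names):
    """Names of forbidden cells that lie in the closure of the given available cells."""
    vectors = [cells[n] for n in available_names]
    return sorted(n for n, vec in forbidden.items() if in_closure(vectors, vec))


print(compromised(cells))                            # every aggregate is available
print(compromised(set(cells) - {"<*,l>", "<*,h>"}))  # <*,l> and <*,h> are hidden too
```

With every aggregate available, all four forbidden cells are reported as compromised; once <*,l> and <*,h> are withheld as well, none is.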

Definition 3 [Minimal Cover (MC)]. Given a data cube $L$ and a forbidden set $F$ in $L$, a set $S$ is defined as the Minimal Cover of $F$ (represented as $MC(F)$) if:
1. $S \subseteq L - F$;
2. $C(L - F - S) \cap (F + S) = \emptyset$;
3. $\forall S' \subset S$, $C(L - F - S') \cap (F + S') \neq \emptyset$.

The minimal cover is a subset of the available set. The second condition requires that, after the minimal cover is hidden, the remaining cells cause no inference to either the minimal cover or the forbidden set. The third condition states that no proper subset of the minimal cover satisfies the second condition, which guarantees that every cell in the minimal cover is indispensable for eliminating the inference.

Problem Statement. Given a data cube $L$ and a forbidden set $F$, the privacy preserving OLAP problem is to find a minimal cover $MC(F)$ of $F$, which prevents $F$ from being compromised while withholding as little information as possible.

3 Overview of Privacy Preserving OLAP Procedure

From the definitions, it is clear that the minimal cover should be free of inference to both the forbidden set and itself; otherwise, one can disclose sensitive information by first inferring the values of the minimal cover, and then getting to the forbidden set. A subset of the minimal cover that is only free of inference to the forbidden set is called the minimal partial cover. We take the following two steps: first find the minimal partial cover of the forbidden set, and then extend it to the minimal cover to preserve absolute security.

Step 1. Finding the minimal partial cover for the forbidden set. We find the minimal partial cover MPC of the forbidden set by linear system theory, such that hiding MPC eliminates all the inference directed at the forbidden set, but hiding only a proper subset of MPC does not.

Step 2. Extending the minimal partial cover to the minimal cover. We then take the MPC found in Step 1 as the new forbidden set, and repeat finding the minimal partial cover for the newly hidden cells until no more cells need to be hidden.

4 Finding Minimal Partial Cover

In this section, we discuss how to find the minimal partial cover for a forbidden set. First, we define the vector code to represent each cell in the cuboid as follows.

Definition 5 [Vector Code]. Given a cuboid $C$, the vector code $\vec{c}$ for a cell $c$ in $C$ or in $C$'s father cuboids is defined as $(a_1, \ldots, a_n)$, where $n$ is the number of cells in $C$, and

$$a_i = \begin{cases} 1 & \text{if } c \text{ is the } i\text{th cell in } C \ (c \in C) \\ 1 & \text{if } c \text{ aggregates the } i\text{th cell in } C \ (c \in Father(C)) \\ 0 & \text{otherwise} \end{cases}$$

For example, in the cuboid <Hospital, Disease> in Figure 1, the vector code of cell <f,*> is (1,1,0,0), and the vector code of <f,l> is (1,0,0,0). The cell corresponding to $\vec{c}$ can be inferred from $c_1, \ldots, c_n$ if the vector codes $\vec{c}_1, \ldots, \vec{c}_n$ can be linearly combined into the vector code $\vec{c}$. To determine whether this happens, we discuss the following three cases for the solutions of equation (1), where $x_1, \ldots, x_n$ are real numbers:

$$x_1 \vec{c}_1 + \cdots + x_n \vec{c}_n = [\vec{c}_1, \ldots, \vec{c}_n][x_1, \ldots, x_n]^T = \vec{c} \qquad (1)$$

Equation (1) has no solutions. The cell $c$ corresponding to $\vec{c}$ cannot be computed from any other cells, so no additional information needs to be hidden.

Equation (1) has only one non-zero solution. $\vec{c}$ can be computed with a certain combination of $\vec{c}_1, \ldots, \vec{c}_n$. If $x_i, \ldots, x_j$ are the non-zero components of the solution, then the corresponding cells $c_i, \ldots, c_j$ are indispensable for inferring $c$. Therefore, hiding just one of $c_i, \ldots, c_j$ prevents the inference.

Equation (1) has more than one non-zero solution. To eliminate all the inference, we need to hide one cell whose corresponding component of the solution $X$ is always non-zero. If there is no such cell, we need to find a set of cells at least one of which is used in each solution.
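As a small illustration of Definition 5, the following sketch computes vector codes for the cube of Example 1. The tuple representation of cells, with "*" standing for an aggregated dimension, and the names `covers` and `vector_code` are assumptions made for this example. The helper also accepts ancestors further up than a father cuboid, which the definition itself does not require.

```python
def covers(cell, base_cell):
    """True if `cell` equals or aggregates `base_cell` ("*" matches any value)."""
    return all(c == "*" or c == b for c, b in zip(cell, base_cell))


def vector_code(cell, cuboid_cells):
    """Vector code of `cell` relative to the ordered cells of a cuboid.

    Component i is 1 if `cell` is the i-th cell of the cuboid or aggregates it,
    and 0 otherwise.
    """
    return tuple(1 if covers(cell, b) else 0 for b in cuboid_cells)


# Cells of the cuboid <Hospital, Disease> in Figure 1, in a fixed order.
cuboid = [("f", "l"), ("f", "d"), ("m", "d"), ("m", "h")]

print(vector_code(("f", "*"), cuboid))  # (1, 1, 0, 0)
print(vector_code(("f", "l"), cuboid))  # (1, 0, 0, 0)
print(vector_code(("*", "d"), cuboid))  # (0, 1, 1, 0)
```

The first two results match the vector codes quoted in the text for <f,*> and <f,l>.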

4.1 An Example

Based on linear system theory [7], we develop a method to eliminate the inference to certain cells. The method is illustrated in the following example.

Example 2. We try to find the minimal partial cover for cell <f,d> in Example 1, and the security requirements are the same. Suppose $c_1$=<f,*>, $c_2$=<m,*>, $c_3$=<*,l>, $c_4$=<*,d>, and $c_5$=<*,h>. The corresponding vector codes are $\vec{c}_1, \ldots, \vec{c}_5$.

1. We construct the equation by making $A = [\vec{c}_1, \ldots, \vec{c}_5]$ and $b = \vec{c}$ (the vector code of <f,d>), with rows ordered <f,l>, <f,d>, <m,d>, <m,h>:

$$AX = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{pmatrix} X = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} \qquad (2)$$

2. The solution of equation (2) is $X = X_0 + kX_1$, where $X_0 = [1, 0, -1, 0, 0]^T$, $X_1 = [-1, -1, 1, 1, 1]^T$, and $k$ is a real number. If the $i$th component of $X$ is non-zero, then $c_i$ is used to compute <f,d>. For example, if we take $k = 0$, then $X = [1, 0, -1, 0, 0]^T$ (i.e., $\vec{c} = \vec{c}_1 - \vec{c}_3$), which is exactly the case depicted in Example 1.

3. We try to find a component of $X$ that is always non-zero, or a set of components at least one of which is non-zero in each $X$. If $k = 0$: $X = X_0$, and the first and third components are non-zero. If $k \neq 0$: by carefully choosing a value for $k$, the first or the third component can be zero, but the other components will never be zero. Hence, a cell in $\{c_1, c_3\}$ and another one in $\{c_2, c_4, c_5\}$ form the minimal partial cover of <f,d>. For example, if we hide $\{c_1, c_5\}$, <f,d> is no longer compromised.
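The numbers in Example 2 can be checked mechanically. The sketch below takes A, b, X0 and X1 from the example, with rows ordered <f,l>, <f,d>, <m,d>, <m,h> as implied by the vector codes. Grouping the components by the value of k at which they vanish is an illustrative device introduced here, and the function names are hypothetical.

```python
from fractions import Fraction

# Columns are the vector codes of c1=<f,*>, c2=<m,*>, c3=<*,l>, c4=<*,d>, c5=<*,h>.
A = [
    [1, 0, 1, 0, 0],  # row for <f,l>
    [1, 0, 0, 1, 0],  # row for <f,d>
    [0, 1, 0, 1, 0],  # row for <m,d>
    [0, 1, 0, 0, 1],  # row for <m,h>
]
b = [0, 1, 0, 0]            # vector code of <f,d>
X0 = [1, 0, -1, 0, 0]       # particular solution of AX = b
X1 = [-1, -1, 1, 1, 1]      # basic solution of AX = 0


def mat_vec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def solution(k):
    """The solution X0 + k*X1 for a given real k."""
    return [x0 + k * x1 for x0, x1 in zip(X0, X1)]


# Every member of the family solves equation (2).
for k in (0, 1, Fraction(5, 2)):
    assert mat_vec(A, solution(k)) == b

# Component i of X0 + k*X1 vanishes exactly at k = -X0[i] / X1[i]
# (X1 has no zero entry in this example, so the division is safe).
groups = {}
for i in range(len(X0)):
    groups.setdefault(Fraction(-X0[i], X1[i]), []).append(f"c{i + 1}")

print(groups)
```

The components fall into two groups: c1 and c3 vanish only at k = 1, while c2, c4 and c5 vanish only at k = 0. The solution at k = 1 therefore uses only cells from {c2, c4, c5} and the solution at k = 0 only cells from {c1, c3}, so blocking every solution takes one hidden cell from each group, in line with step 3 of the example.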

Input: The forbidden set $F$, and the cuboid $C$
Output: A minimal partial cover MPC of $F$
Method:
  construct the coefficient matrix $A = [\vec{c}_1 \ldots \vec{c}_n]$
  for each cell $c$ in $F$
    if $Ax = \vec{c}$ has solutions
      find the solutions $X$ of $Ax = \vec{c}$
      find the set of components $M_c$ at least one of which is non-zero in each $X$
  return $MPC = \bigcup_{c \in F} M_c$

Fig. 2. Algorithm 1 FMPC: finding a minimal partial cover

4.2 Algorithm

Now, let us generalize the algorithm for finding the minimal partial cover (Figure 2). Given a forbidden set $F$ in cuboid $C$, first construct the coefficient matrix $A$ using the unprotected cells in $C$ or $C$'s fathers. Then, for each cell $c$ in $F$, if $Ax = \vec{c}$ has solutions, find a set of components of the solution at least one of which is non-zero in every $X$. Here we use linear system theory [7] to find such cells. The solutions of $Ax = \vec{c}$ can be represented as $X = X_0 + [X_1, \ldots, X_r][k_1, \ldots, k_r]^T$, where $X_1, \ldots, X_r$ are the basic solutions of $Ax = 0$, and $X_0$ is a particular solution of $Ax = \vec{c}$. There are $r$ independent components in $X$, taking zero in $X_0$ and taking 1 respectively in each $X_i$ ($i = 1, \ldots, r$). For example, in Figure 3, the last three components are independent. Suppose $X_0[i]$ and $X_2[i]$ are the only non-zero entries among the $i$th components of $X_0$ to $X_3$, and $X_2[j]$ is the independent component taking 1 in $X_2$; then either $X[i]$ or $X[j]$ is used in $X$, and the corresponding cells are the minimal partial cover.

[Figure 3: the general solution $X = X_0 + k_1 X_1 + k_2 X_2 + k_3 X_3$ written out component by component, with the last three components marked as independent components.]

Fig. 3. An example of minimal partial cover

Lemma 1. Given $X = X_0 + [X_1, \ldots, X_{n-r}][k_1, \ldots, k_{n-r}]^T$, the $(r+1)$th to $n$th components of $X$ are the independent components. If $X_0[i] \neq 0$, and only $X_{d_1}[i], \ldots, X_{d_j}[i]$ among $X_1[i], \ldots, X_{n-r}[i]$ are non-zero ($d_1, \ldots, d_j \in \{1, \ldots, n-r\}$ and $i < r+1$), then:
1. At least one of the components $X[i], X[r+d_1], \ldots, X[r+d_j]$ of $X$ is non-zero.
2. The components in any proper subset of $X[i], X[r+d_1], \ldots, X[r+d_j]$ can all be zero in $X$.

Lemma 2. Algorithm 1 returns a minimal partial cover of the forbidden set FS.

(The proofs of Lemma 1 and Lemma 2 are not provided here due to the limit of space.)
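For small inputs, the output of Algorithm 1 can be cross-checked against an exhaustive search. The sketch below is not the FMPC procedure of Figure 2: it swaps in a brute-force search that tries hidden sets of increasing size and tests, by exact rank computation, whether any forbidden vector code is still a linear combination of the remaining ones. A smallest such set is also minimal in the sense that no proper subset works. All function names are hypothetical, and the running time is exponential in the number of available cells.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Exact rank of a list of row vectors (Gaussian elimination over fractions)."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * p for a, p in zip(m[i], m[r])]
        r += 1
    return r


def in_span(vectors, target):
    """True if `target` is a linear combination of `vectors`."""
    return rank(vectors + [target]) == rank(vectors)


def brute_force_mpc(available, forbidden):
    """Smallest set of available cells whose removal blocks every inference.

    `available` maps a cell name to its vector code; `forbidden` is a list of
    (non-zero) vector codes of the cells to protect.
    """
    names = sorted(available)
    for size in range(len(names) + 1):
        for hidden in combinations(names, size):
            rest = [list(available[n]) for n in names if n not in hidden]
            if not any(in_span(rest, list(f)) for f in forbidden):
                return set(hidden)


# Example 2: protect <f,d> given c1=<f,*>, c2=<m,*>, c3=<*,l>, c4=<*,d>, c5=<*,h>.
available = {
    "c1": (1, 1, 0, 0), "c2": (0, 0, 1, 1), "c3": (1, 0, 0, 0),
    "c4": (0, 1, 1, 0), "c5": (0, 0, 0, 1),
}
cover = brute_force_mpc(available, [(0, 1, 0, 0)])
print(cover)
```

For Example 2 the search returns a two-cell cover with one cell from {c1, c3} and one from {c2, c4, c5}, which agrees with the result derived in Section 4.1.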

5 Extending the Minimal Partial Cover to Minimal Cover

In this section, we employ a level-wise framework, together with some optimizing strategies, to extend the minimal partial cover to the minimal cover for each cuboid of the cube.

5.1 Two Optimizing Strategies

Eliminating Single-son Inference. A cell is called a single-son cell if it has only one child in its son cuboid. All the single-son fathers of the forbidden set are definitely sensitive, since each has the same value as its only son. In Example 1, if we hide the two single-son cells <*,l> and <*,h>, all inferences are eliminated. Thus, in our algorithm we first add all the single-son fathers of the sensitive cells to the minimal cover. This may both eliminate a large part of the inference and reduce the number of cells we must check for inference.

Finding Candidate Range. In Algorithm 1, we check all the fathers and unprotected siblings of the forbidden cells for inference. However, not all of them are dangerous.

Example 3. A two-dimensional cube is shown in Figure 4(a). The cell <a2,b1>, marked with * in the cuboid <A,B>, is sensitive. We construct the coefficient matrix A for cuboid <A,B> (as shown in Figure 4(b)). The column vectors of A correspond to the 8 father cells and the 5 unprotected cells in cuboid <A,B>. However, only the column vectors A[1], A[2], A[5], A[6], A[9] and A[10] can contribute to inferring the value of <a2,b1>, because the others have all zeros in the corresponding components. We call the sub-matrix formed by A[1], A[2], A[5], A[6], A[9], A[10] and their non-zero components the candidate range of the forbidden set (surrounded by a dashed line in Figure 4(b)). The candidate range can be found by first setting it to the father cells of the forbidden set, and then iteratively adding the cells that intersect with the candidate range.
[Figure 4(a): lattice of the two-dimensional data cube. Top cell: <*,*>. Middle level: <a1,*>, <a2,*>, <a3,*>, <a4,*>, <*,b1>, <*,b2>, <*,b3>, <*,b4>. Bottom level (cuboid <A,B>): <a1,b1>, <a1,b2>, <a2,b1>*, <a3,b3>, <a3,b4>, <a4,b3>.]

(a) A two-dimensional data cube

A (columns A[1] to A[13], one row per cell of cuboid <A,B>):

            1  2  3  4  5  6  7  8  9 10 11 12 13
<a1,b1>     1  0  0  0  1  0  0  0  1  0  0  0  0
<a1,b2>     1  0  0  0  0  1  0  0  0  1  0  0  0
<a2,b1>*    0  1  0  0  1  0  0  0  0  0  0  0  0
<a3,b3>     0  0  1  0  0  0  1  0  0  0  1  0  0
<a3,b4>     0  0  1  0  0  0  0  1  0  0  0  1  0
<a4,b3>     0  0  0  1  0  0  1  0  0  0  0  0  1

(b) The coefficient matrix for cuboid <A,B>

Fig. 4. A two-dimensional data cube
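Both optimizing strategies can be sketched on the cube of Example 3. The code below rests on assumptions made for illustration: cells are tuples with "*" for an aggregated dimension, and the columns A[1] to A[13] are ordered as the eight father cells followed by the five unprotected cells of <A,B>, which is the ordering that reproduces the column indices quoted in Example 3. The function names are hypothetical.

```python
def covers(cell, base_cell):
    """True if `cell` equals or aggregates `base_cell` ("*" matches any value)."""
    return all(c == "*" or c == b for c, b in zip(cell, base_cell))


def single_son_fathers(fathers, base, forbidden):
    """Father cells whose only child in the son cuboid is a forbidden cell."""
    result = []
    for father in fathers:
        children = [b for b in base if covers(father, b)]
        if len(children) == 1 and children[0] in forbidden:
            result.append(father)
    return result


def candidate_range(support, forbidden):
    """Grow the set of columns, and the rows they touch, reachable from `forbidden`.

    `support` maps each available cell (a column of A) to the set of base cells
    (rows of A) in which its column is non-zero.
    """
    rows, chosen = set(forbidden), set()
    changed = True
    while changed:
        changed = False
        for cell, covered in support.items():
            if cell not in chosen and covered & rows:
                chosen.add(cell)
                rows |= covered
                changed = True
    return chosen, rows


base = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
        ("a3", "b3"), ("a3", "b4"), ("a4", "b3")]
forbidden = {("a2", "b1")}
fathers = ([(a, "*") for a in ("a1", "a2", "a3", "a4")]
           + [("*", b) for b in ("b1", "b2", "b3", "b4")])
available = fathers + [c for c in base if c not in forbidden]  # columns A[1]..A[13]
support = {cell: {b for b in base if covers(cell, b)} for cell in available}

chosen, rows = candidate_range(support, forbidden)
columns = sorted(available.index(c) + 1 for c in chosen)

print(single_son_fathers(fathers, base, forbidden))
print(columns)
```

On this cube the only single-son father of the forbidden cell is <a2,*>, and the candidate range consists of columns 1, 2, 5, 6, 9 and 10 over the rows <a1,b1>, <a1,b2> and <a2,b1>, matching Example 3.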

5.2 Algorithm

We use a level-wise framework to extend the minimal partial cover to the minimal cover. As shown in Algorithm 2 in Figure 5, we first rank the cuboids in the cube according to the ascending order of the granularity level. Then, for each cuboid, we apply the two optimizing strategies, and invoke Algorithm 1 to find the minimal partial cover of the forbidden set in this cuboid. The returned minimal partial cover should be further checked for inference. This process is repeated until there is no new minimal partial cover in the current cuboid.

Input: The forbidden set FS
Output: A minimal cover MC of FS
Method:
  for each cuboid $C^*$ in the cube
    while $FS \cap C^* \neq \emptyset$
      add single-son fathers to MC
      find the candidate range CR for FS
      m = FMPC(FS, CR)   // m is the minimal partial cover of FS returned by FMPC
      $FS = FS - FS \cap C^*$   // inference to $FS \cap C^*$ has been eliminated
      $MC = MC \cup m$
      $FS = FS \cup m$   // the minimal partial cover should be protected
  return MC

Fig. 5. Algorithm 2 (FMC: a level-wise algorithm to find a minimal cover)

Theorem 1. Algorithm 2 returns a minimal cover of the forbidden set FS. (The proof of Theorem 1 is based on Lemma 1 and Lemma 2, and is not provided here due to the limit of space.)
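The level-wise loop can be sketched for a small cube without dimension hierarchies. This is a simplified illustration rather than the FMC implementation: the call to FMPC is replaced by an exhaustive search, both optimizing strategies are omitted, and the cube model (tuples with "*" for an aggregated dimension) and all function names are assumptions made here.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Exact rank of a list of row vectors (Gaussian elimination over fractions)."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * p for a, p in zip(m[i], m[r])]
        r += 1
    return r


def covers(cell, other):
    """True if `cell` equals or aggregates `other` ("*" matches any value)."""
    return all(c == "*" or c == o for c, o in zip(cell, other))


def stars(cell):
    """Which dimensions of `cell` are aggregated; identifies its cuboid."""
    return tuple(v == "*" for v in cell)


def partial_cover(available, targets, cuboid_cells):
    """Smallest set of available cells whose removal blocks every target (brute force)."""
    def code(cell):
        return [1 if covers(cell, b) else 0 for b in cuboid_cells]
    for size in range(len(available) + 1):
        for hidden in combinations(available, size):
            rest = [code(c) for c in available if c not in hidden]
            if all(rank(rest + [code(t)]) > rank(rest) for t in targets):
                return set(hidden)


def level_wise_cover(base_cells, forbidden):
    """Hide extra cells cuboid by cuboid, from the finest level upwards."""
    n = len(base_cells[0])
    cube = {tuple("*" if i in s else v for i, v in enumerate(b))
            for b in base_cells
            for k in range(n + 1) for s in combinations(range(n), k)}
    hidden, pending, cover = set(forbidden), set(forbidden), set()
    for cuboid in sorted({stars(c) for c in cube}, key=sum):
        cells = sorted(c for c in cube if stars(c) == cuboid)
        fathers = sorted(c for c in cube
                         if sum(stars(c)) == sum(cuboid) + 1
                         and all(f or not s for s, f in zip(cuboid, stars(c))))
        targets = pending & set(cells)
        while targets:
            available = [c for c in cells + fathers if c not in hidden]
            m = partial_cover(available, sorted(targets), cells)
            cover |= m
            hidden |= m
            pending = (pending | m) - targets
            targets = m & set(cells)  # newly hidden siblings are checked again
    return cover


base = [("f", "l"), ("f", "d"), ("m", "d"), ("m", "h")]
result = level_wise_cover(base, set(base))
print(result)
```

On the cube of Example 1 the loop hides exactly <*,l> and <*,h>, the two cells named in the introduction. Cells hidden in a father cuboid are queued and checked again when the loop reaches that cuboid, which mirrors the step of treating the returned partial cover as a new forbidden set.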

6 Experimental Results

Implementation. All experiments were conducted on a Pentium 4 2.80 GHz PC with 512 MB main memory, running Microsoft Windows XP Professional. The algorithm is implemented using Borland C++ Builder 6 with Microsoft SQL Server 2000.

Data Set. We used synthetic data sets and the real data set of the TPC-H benchmark for our experiments. For the synthetic data sets, we generated data from a Zipfian distribution (the generator is obtained via ftp.research.microsoft.com/users/viveknar/tpcdskew), and the skew of the data (z) was varied over 0, 1, 2 and 3. The sizes of the data sets vary from 20000 to 80000 cells, with 3 dimensions and 4 granularity levels in one dimension.

Comparison on Different Zipf Parameter. We apply FMC to the TPC-H benchmark and to the synthetic data sets with parameter z = 0, 1, 2 and 3. We randomly select 1% of the cells in two cuboids as the forbidden set, and compare the additional cells hidden by FMC and SeCube [6]. Figure 6(a) shows the results. When z = 0, the data is uniformly distributed, and fewer additional cells need to be hidden than in the skewed case. Because some values of the dimension appear less often in the skewed data sets, these sparse data are the main cause of inference.
[Figure 6: two panels comparing SeCube and FMC. (a) x-axis: Z factor of Zipfian distribution, plus TPCH; y-axis: number of additional hidden cells / number of all cells. (b) x-axis: size of forbidden set (2% to 10%); y-axis: number of additional hidden cells / number of all cells.]

(a) Compare on different zipf factors

(b) Compare on different forbidden set size

Fig. 6. Size of additional protected cells / size of cube

We also evaluate the effectiveness of the two optimizing strategies. Figure 7(a) shows the size of the candidate range: at most 50% of the cube needs to be checked for inference. Figure 7(b) shows the number of single-son inference cases. Since these account for a significant part of all inference cases, eliminating the single-son inference first contributes greatly to the approach.
[Figure 7: two panels. (a) x-axis: Z factor of Zipfian distribution, plus TPCH; y-axis: size of candidate range / size of the cube. (b) x-axis: Z factor of Zipfian distribution, plus TPCH; y-axis: number of single-son inference / all inferences.]

(a) Size of candidate range/size of cube

(b) single-son inference/all inference cases

Fig. 7. Experimental result of two optimizing strategies

[Figure 8: x-axis: size of forbidden set (2% to 10%); y-axis: size of candidate range.]

Fig. 8. Size of candidate range / size of cube

[Figure 9: x-axis: size of forbidden set (2% to 10%); y-axis: runtime of FMC in milliseconds.]

Fig. 9. Runtime of FMC

Comparison on Varied Forbidden Set. We set the Zipf parameter to z = 1, and change the size of the forbidden set. Figure 6(b) shows the number of additional cells hidden by SeCube [6] and FMC, where FMC hides fewer cells than SeCube in all cases. Figure 8 shows the candidate range for different forbidden set sizes. The size of the candidate range stays below 40% in all cases, which means that we only need to check 40% of the whole cube for inference. We also tested the runtime of FMC for different sizes of the forbidden set (Figure 9).

7 Conclusions

In this paper, we present an effective and efficient algorithm to address the privacy preserving OLAP problem. The main idea is to hide part of the data causing the inference, so that the sensitive information can no longer be computed. We guarantee that all the information we hide is necessary, and thus as much information as possible can be provided to users while the sensitive data is protected. All work is done before users interact with the system, and thus it does not affect the online performance of the OLAP system. Our algorithm is partially based on linear system theory, so its correctness can be strictly proved. Experimental results also demonstrate the effectiveness of the algorithm. Future work includes applying the method to other aggregation functions and improving the efficiency of the algorithm. We also plan to extend the work to solve the inference problem caused by involving two aggregation functions in one cube.

References
1. L. Wang, D. Wijesekera: Cardinality-based Inference Control in Sum-only Data Cubes. Proc. of the 7th European Symp. on Research in Computer Security, 2002.
2. F. Y. Chin, G. Ozsoyoglu: Auditing and inference control in statistical databases. IEEE Trans. on Software Eng., pp. 574-582 (Apr. 1982)
3. L. H. Cox: Suppression methodology and statistical disclosure control. Journal of the American Statistical Association, 75(370):377-385, 1980.
4. D. E. Denning: Secure statistical databases under random sample queries. ACM Trans. on Database Syst. Vol. 5(3), pp. 291-315 (Sept. 1980)
5. L. Wang, Y. Li, D. Wijesekera, S. Jajodia: Precisely Answering Multi-dimensional Range Queries without Privacy Breaches. ESORICS 2003: 100-115
6. L. Wang, S. Jajodia, D. Wijesekera: Securing OLAP data cubes against privacy breaches. Proc. IEEE Symp. on Security and Privacy, 2004, pages 161-175.
7. K. Nicholson: Elementary Linear Algebra. Second Edition, McGraw Hill, 2004.
