



Model Ranking the Classification Accuracy of Data Mining Algorithms for Anonymized Data
M.Sridhar, Dr. B.Raveendra Babu
Abstract - In recent years, the volume of data held in many systems has grown, and such data should be anonymized before release. To ensure privacy, many techniques have been proposed in the literature: privacy preserving data mining algorithms, secure multiparty computation, and information hiding. Available methods for anonymizing data include randomization, k-anonymity, l-diversity and t-closeness. This paper compares the classification accuracy and computational time of data mining algorithms on data with and without k-anonymization, in order to rank the algorithms. The classification accuracy is calculated using Naïve Bayes', Bagging, k-nearest Neighbor, PART, oneR, and J48.

Index terms - Classification accuracy, k-anonymity, privacy preserving data mining, k-nearest Neighbor.




1 INTRODUCTION

Privacy is an important concern for anyone who wants to make use of data involving individuals' sensitive information, especially now that data collection is becoming easier and sophisticated data mining techniques are becoming more and more efficient [1, 2]. Nowadays many organizations, such as hospitals and government agencies, need to release sensitive data like medical or census records for research and public benefit purposes [3]. Most privacy computation methods use some form of data transformation to ensure privacy preservation. Some commonly used anonymization techniques are given below:

The data perturbation method: The data is modified so that it no longer represents real individuals. Because the data does not reflect real values, an individual's privacy remains intact even if a data item is linked to that individual. A primary perturbation technique is data swapping: exchanging data values between records in ways that preserve certain statistics but destroy the real values [24]. Full swapping can prevent effective mining, but limited swapping has been found to have minimal impact on mining results. The key to privacy preservation is that one does not know which records are correct; one simply has to assume that the data does not contain real values. An alternative is randomization, wherein additional noise is added to hide the real values [4, 5, 6, 7]. Because the data no longer reflects real-world values, the individual's privacy is maintained. Techniques therefore derive aggregate distributions from the perturbed records, and data mining techniques are then developed to work with such aggregate distributions.

The k-anonymity and l-diversity models: The k-anonymity model was developed due to the possibility of

indirect record identification from public databases, where combinations of record attributes are used to identify individual records. In the k-anonymity method, the granularity of data representation is reduced with techniques such as generalization and suppression, so that any record maps onto at least k other records. In generalization, attribute values are generalized to reduce representation granularity; for example, a date of birth can be generalized to a year of birth in order to reduce the risk of identification. In suppression, the attribute value is removed entirely. These methods lower the risk of identification through public records, but they also reduce the accuracy of applications built on the transformed data. The l-diversity model addresses a weakness of the k-anonymity model: protecting identities to the level of k individuals is not the same as protecting the corresponding sensitive values when there is homogeneity of sensitive values inside a group. To handle this, intra-group diversity of sensitive values is promoted within the anonymization scheme [8]. The l-diversity technique thus maintains a minimum group size of k while also maintaining diversity of the sensitive attributes. The t-closeness model is an improvement on the l-diversity concept. In [9], a t-closeness model was proposed which requires that the distance between the distribution of a sensitive attribute inside an anonymized group and its global distribution differ by no more than a threshold t. The Earth Mover's distance metric is used to quantify the distance between the two distributions. The t-closeness approach is more effective than other privacy preserving data mining methods for numeric attributes.
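The additive-noise randomization described in the perturbation discussion above can be illustrated with a short sketch. This is a toy illustration only, not the method of any cited paper; the ages, noise level, and random seed are arbitrary assumptions:

```python
import random

def randomize(values, sigma=5.0, seed=7):
    """Additive-noise randomization: release value + noise instead of
    the real value, so no published record reflects a true value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, sigma) for v in values]

ages = [39, 50, 38, 53, 28, 37, 49, 52, 31, 42]
noisy = randomize(ages)

# Individual records are perturbed, but aggregates such as the mean
# are approximately preserved, which is what mining techniques that
# work on aggregate distributions rely on.
print(sum(ages) / len(ages), sum(noisy) / len(noisy))
```

Mining algorithms must then be adapted to work with such perturbed aggregates rather than with individual records.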

© 2012 Journal of Computing Press, NY, USA, ISSN 2151-9617



Distributed privacy preservation: In some cases, individual entities wish to derive aggregate results from data sets partitioned across those entities. The goal of distributed privacy-preserving data mining is to allow computation of aggregate statistics over an entire data set without compromising the privacy of any participant's individual data set. The participants may thus collaborate to obtain aggregate results without trusting each other with regard to the distribution of their own data sets. The partitioning can be horizontal, where records are distributed across multiple entities, or vertical, where attributes are distributed across multiple entities. While the individual entities may not share their entire data sets, they can consent to limited sharing through a variety of protocols. These methods ensure privacy for each individual entity while producing aggregate results over the entire data.

Downgrading application effectiveness: In some cases, even though the data itself may be unavailable, the output of an application such as association rule mining, classification or query processing may lead to privacy violations. This has resulted in research on downgrading application effectiveness by modifying either the data or the application. Examples of such techniques are association rule hiding [10], classifier downgrading [11], and query auditing [12]. Two approaches are used for association rule hiding. Distortion: the entry for a given transaction is changed to another value [13]. Blocking: the entry remains the same but is left incomplete [14]. Classifier downgrading modifies the data so that classification accuracy is reduced while data utility is retained for other applications. Query auditing denies one or more queries from a sequence; the queries to be denied are chosen so that the sensitivity of the underlying data is preserved. Data anonymization techniques for structured data, including tabular, graph and item set data, have been under investigation recently.
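As a concrete sketch of computing an aggregate without sharing raw values, the classic ring-based secure-sum protocol can be written as follows. This is a simplified illustration under an honest-but-curious assumption; the party values and modulus are hypothetical, and real protocols add further protections:

```python
import random

def secure_sum(party_values, modulus=1_000_000, seed=3):
    """Ring-based secure sum: party 0 adds a random mask to its value
    and passes the masked total around the ring; every other party adds
    its own value to what it receives; party 0 finally removes the mask.
    No party ever sees another party's raw value, only masked totals."""
    rng = random.Random(seed)
    mask = rng.randrange(modulus)
    running = (mask + party_values[0]) % modulus
    for v in party_values[1:]:
        running = (running + v) % modulus  # each party sees only a masked total
    return (running - mask) % modulus

# Three hypothetical data holders compute a joint count without
# revealing their individual counts to one another.
print(secure_sum([120, 45, 310]))  # 475
```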
These techniques publish detailed information, permitting ad hoc queries and analyses, while guaranteeing the privacy of sensitive information against a variety of attacks. This paper investigates the classification accuracy and computational time of data mining algorithms on data with and without k-anonymization, in order to rank how efficiently privacy preserving data mining can extract the desired patterns from the data sets. The classification accuracy is calculated using Naïve Bayes', Bagging, KNN, PART, oneR, and J48, and the proposed method is evaluated on the Adult data set. The paper is organized as follows: Section 2 reviews related work in the literature, Section 3 details the methodology, Section 4 gives the results and Section 5 concludes the paper.

2 RELATED WORK

Aggarwal, et al. [15] reviewed the state-of-the-art methods for privacy. The study discussed:
- Methods for randomization, k-anonymization, and distributed privacy-preserving data mining.
- Cases where the output of data mining applications needs to be sanitized for privacy-preservation purposes.
- Methods for distributed privacy-preserving mining to handle horizontally and vertically partitioned data.
- Issues in downgrading the effectiveness of data mining and data management applications.
- Limitations of the privacy problem.

Bayardo, et al. [16] proposed an optimization algorithm for k-anonymization. Finding optimal k-anonymizations is NP-hard, which poses significant computational challenges. The proposed method explores the space of possible anonymizations and develops strategies to reduce the computation. Census data was used to evaluate the proposed algorithm. Experiments show that the method finds optimal k-anonymizations over a wide range of k, and the effects of different coding approaches on anonymization quality and performance are studied.

LeFevre, et al. [17] proposed a suite of anonymization algorithms for producing an anonymous view based on a target class of workloads, consisting of several data mining tasks and selection predicates. The proposed algorithms preserve the utility of the data while creating an anonymous data snapshot. The following expressive workload characteristics were incorporated:
- Classification and regression, for predicting categorical and numeric attributes.
- Multiple target models, to predict multiple different attributes.
- Selection and projection, to warrant that only a subset of the data remains useful for a particular task.
The results show that the proposed method produces high-quality data for a variety of workloads.




Inan, et al. [18] proposed a new approach for building classifiers from anonymized data by modeling the anonymized data as uncertain data. The method does not assume a probability distribution over the data; instead, the necessary statistics are collected during anonymization and released together with the anonymized data. It was demonstrated that releasing these statistics does not violate anonymity. The aim was to compare the accuracy of (1) various classification models constructed over anonymized data sets transformed through various heuristics, and (2) the approach of modeling anonymized data as uncertain data using expected distance functions. The experiments investigate the relationship between accuracy and the anonymization parameters: the anonymity requirement k, the number of quasi-identifiers and, especially for distributed data mining, the number of data holders. Experiments were conducted with Support Vector Machine (SVM) classification and Instance Based learning (IB) methods, using the Java source code of the libSVM library and the IB1 instance-based classifier from Weka. DataFly was selected for anonymization as a widely known solution. Experiments spanning many alternatives, in both local and distributed data mining settings, revealed that the proposed method performed better than the heuristic approaches used to handle anonymized data.

Fung, et al. [19] presented Top-Down Specialization (TDS), a practical algorithm for determining a generalized version of the data that conceals sensitive information and remains useful for modeling classification. Data generalization is implemented by specializing, i.e. detailing, the information level in a top-down manner until a minimum privacy requirement would be violated. TDS generalizes the data by specializing it iteratively, starting from the most general state. At every step, a general (parent) value is specialized into a specific (child) value for a categorical attribute, or an interval is divided into two sub-intervals for a continuous attribute. This process is repeated until further specialization would violate the anonymity requirement. Top-down specialization handles categorical and continuous attributes efficiently. The approach exploits the fact that data contains redundant structures for classification: while generalization may eliminate some structures, others emerge to help. Experimental results show that classification quality can be preserved even for highly restrictive privacy requirements. TDS scales well to large data sets and complex anonymity requirements. This work has applicability in public and private sectors where information is shared for mutual benefit and productivity.

3 METHODOLOGY

3.1 Dataset
The 'Adult' dataset used for evaluation is obtained from the UCI Machine Learning Repository [20]. The dataset contains 48842 instances with both categorical and integer attributes, drawn from the 1994 Census information. The portion used here contains about 32,000 rows with four numerical columns. The columns and their ranges are: age {17 – 90}, fnlwgt {10000 – 1500000}, hrsweek {1 – 100} and edunum {1 – 16}. The age and native-country columns are anonymized using the principles of k-anonymization. Tables 1 and 2 show the original data and the modified attribute data. Both the original and the k-anonymized dataset are classified using 10-fold cross validation.
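The paper runs its classifiers in Weka. As a stdlib-only sketch of the 10-fold cross-validation protocol itself, the following uses a trivial majority-class stand-in for the real learners; the class labels and their 70/30 split are illustrative assumptions, not the paper's data:

```python
from collections import Counter

def ten_fold_accuracy(labels, k=10):
    """10-fold cross-validation protocol: split the data into k folds,
    'train' on k-1 folds and test on the held-out fold, then average.
    A trivial majority-class classifier (which ignores the features)
    stands in here for the Weka learners used in the paper."""
    n = len(labels)
    fold_accs = []
    for f in range(k):
        test_idx = set(range(f, n, k))  # every k-th record forms fold f
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        majority = Counter(train_y).most_common(1)[0][0]  # "training" step
        correct = sum(1 for i in test_idx if labels[i] == majority)
        fold_accs.append(correct / len(test_idx))
    return sum(fold_accs) / k

# A 70/30 class split, loosely like Adult's <=50K / >50K imbalance
labels = ["<=50K"] * 70 + [">50K"] * 30
print(round(ten_fold_accuracy(labels), 2))  # 0.7
```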
TABLE 1
THE ORIGINAL ATTRIBUTES OF ADULT DATASET

Age   Native-country   Class
39    United-States    <=50K
50    United-States    <=50K
38    United-States    <=50K
53    United-States    <=50K
28    Cuba             <=50K
37    United-States    <=50K
49    Jamaica          <=50K
52    United-States    >50K
31    United-States    >50K
42    United-States    >50K

TABLE 2
THE K-ANONYMOUS DATASET

Age           Native-country   Class
Adult         United-States    <=50K
middle aged   United-States    <=50K
Adult         United-States    <=50K
middle aged   United-States    <=50K
Adult         US-oth           <=50K
Adult         United-States    <=50K
middle aged   African          <=50K
middle aged   United-States    >50K
Adult         United-States    >50K
Adult         United-States    >50K
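The age generalization in Tables 1 and 2 can be mimicked with a small helper. The cut-off of 45 is an assumption inferred from the ten sample rows, not a rule stated in the paper, and the k-anonymity check is a generic sketch:

```python
from collections import Counter

def generalize_age(age):
    # Hypothetical hierarchy matching Table 2: exact ages are coarsened
    # into two bins (the cut-off is inferred from the sample rows).
    return "middle aged" if age >= 45 else "Adult"

def is_k_anonymous(quasi_values, k):
    # A release is k-anonymous if every quasi-identifier value
    # (or combination of values) occurs at least k times.
    return min(Counter(quasi_values).values()) >= k

ages = [39, 50, 38, 53, 28, 37, 49, 52, 31, 42]   # Table 1
generalized = [generalize_age(a) for a in ages]    # Table 2
print(generalized.count("Adult"), generalized.count("middle aged"))  # 6 4
print(is_k_anonymous(generalized, 4))  # True: smallest bin has 4 members
print(is_k_anonymous(ages, 2))         # False: every exact age is unique
```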




3.2 Naïve Bayes'
A Naïve Bayes' classifier is a simple probability-based algorithm [25]. It uses Bayes' theorem, but assumes that the attributes are independent of each other, which is a rather unrealistic assumption in practice. Despite their simplicity, Naïve Bayes' classifiers often work surprisingly well in complex real-world situations. The Naïve Bayes' algorithm offers fast model building and scoring, for both binary and multiclass problems, on relatively low volumes of data. It makes predictions using Bayes' theorem, which incorporates evidence or prior knowledge into the prediction.

3.3 Bagging
Bagging improves classification [23] by combining models learned from different subsets of a dataset. Applying bagging considerably reduces overfitting and variance. The instability of the classifier is exploited by perturbing the training set to produce different classifiers with the same learning algorithm. For a training set A of size t, n new training sets Ai of size t' (t' < t) are generated by sampling instances from A uniformly with replacement. Because of the sampling with replacement, some instances are repeated in each Ai; these subsets are called bootstrap samples. n models are fitted using the n bootstrap samples, and the output is obtained by aggregating the results of these models.

3.4 k-nearest Neighbor
In k-nearest Neighbor classification, the algorithm finds the set of k objects in the training set that are closest to the input and classifies the input by the majority class in that group [21]. The main elements required for this approach are a set of labeled objects, a distance metric, and the number k. The k-nearest neighbor classification algorithm is as follows:

Input: D, the set of training objects (x, y), and a test object (x', y').

3.5 J48
J48 is an implementation of the C4.5 decision tree learner [22]. J48 builds decision tree models using a greedy technique that analyzes the training data; the nodes of the tree evaluate the significance of individual features. An input is classified by following a path from the root to a leaf of the decision tree, the leaf giving the class of the input. The trees are built top-down by selecting the most suitable attribute at each step.
The classification power of each feature is computed with an information-theoretic measure. Once a feature is chosen, subsets of the training data are formed based on the different values of the selected feature, and the process is iterated on every subset until the majority of the instances in a subset belong to the same class. Decision tree induction generally learns a set of rules with high accuracy on the training data but may produce excessive rules; the trees are therefore pruned to generalize the decision tree and balance flexibility against accuracy. J48 uses two pruning methods: subtree replacement, in which nodes are replaced with a leaf, working backwards towards the root, and subtree raising, in which a node is moved upwards towards the root of the tree.
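The information-theoretic measure mentioned above can be sketched as entropy-based information gain. C4.5 itself uses the normalized gain ratio rather than raw gain, and the toy attribute and labels here are invented for illustration:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in class entropy from splitting on one attribute: the
    kind of measure a C4.5-style learner uses to pick the next split."""
    base = entropy(labels)
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    return base - sum(len(ys) / n * entropy(ys) for ys in split.values())

# Toy rows: a single binary attribute that predicts the class perfectly
rows = [{"edu": "high"}, {"edu": "high"}, {"edu": "low"}, {"edu": "low"}]
labels = [">50K", ">50K", "<=50K", "<=50K"]
print(information_gain(rows, labels, "edu"))  # 1.0, a perfect split
```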

3.6 PART
PART obtains rules from partial decision trees. Its advantage is that it avoids global optimization while still producing accurate, compact rule sets [22]. Decision tree learning uses a divide-and-conquer strategy, while rule learning uses a separate-and-conquer strategy: a rule is obtained, the instances it covers are removed, and rules are created recursively for the remaining instances.
In PART, only a partial decision tree is built instead of a fully grown one. The difference is that a partial decision tree contains branches to undefined subtrees. Construction and pruning operations are combined to find a "stable" subtree that cannot be simplified further; once such a subtree is found, tree building stops and one rule is read off from it.

(k-nearest Neighbor algorithm of Section 3.4, continued)

Process: compute the distance d(x', x) between the test object (x', y') and every training object (x, y) in D. Select Dz ⊆ D, the set of the k training objects closest to the test object.

Output: y' = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi)

where v is a class label and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.
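A minimal sketch of this majority-vote procedure in Python; the feature tuples and labels below are invented, and Euclidean distance is assumed as the metric:

```python
from collections import Counter

def knn_classify(train, test_x, k=3):
    """Majority vote among the k nearest training objects.
    train is a list of (x, y) pairs with numeric feature tuples."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda xy: dist(xy[0], test_x))[:k]
    votes = Counter(y for _, y in nearest)  # tallies I(v = yi) per class v
    return votes.most_common(1)[0][0]

# Hypothetical (age, hours-per-week) records with income labels
train = [((25, 40), "<=50K"), ((28, 38), "<=50K"), ((30, 45), "<=50K"),
         ((52, 60), ">50K"), ((48, 55), ">50K")]
print(knn_classify(train, (27, 39)))  # <=50K
print(knn_classify(train, (50, 58)))  # >50K
```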

3.7 OneR
1R or 1-rule (oneR) is an easy way to find very simple classification rules from a set of instances [22]. It generates a one-level decision tree, expressed as a set of rules on one particular attribute. OneR characterizes the structure in the data with a simple method, and the simple rules it produces often achieve surprisingly high accuracy. Each rule obtained from oneR tests a single attribute and branches on it: each branch corresponds to a different value of the attribute and assigns the most frequent class for that value. The error rate of a rule set is determined by counting the errors it makes on the training data. A set of rules is generated for every attribute, with one rule for every value of the attribute; after evaluating the error rate of each attribute's rule set, the best one is chosen.
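The procedure above can be sketched directly. The toy rows and labels are invented, and ties between classes are broken arbitrarily, as in simple OneR implementations:

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """OneR: for each attribute build one rule per value (predict the
    most frequent class for that value), then keep the attribute whose
    rule set makes the fewest errors on the training data."""
    best_attr, best_rules, best_errors = None, None, len(labels) + 1
    for attr in rows[0]:
        buckets = defaultdict(list)
        for row, y in zip(rows, labels):
            buckets[row[attr]].append(y)
        rules = {v: Counter(ys).most_common(1)[0][0] for v, ys in buckets.items()}
        errors = sum(1 for row, y in zip(rows, labels) if rules[row[attr]] != y)
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

rows = [{"edu": "high", "age": "Adult"},
        {"edu": "high", "age": "middle aged"},
        {"edu": "low",  "age": "Adult"},
        {"edu": "low",  "age": "middle aged"}]
labels = [">50K", ">50K", "<=50K", "<=50K"]
attr, rules = one_r(rows, labels)
print(attr, rules)  # edu's rule set makes zero training errors
```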

TABLE 4
COMPUTATIONAL TIME (MINUTES)

Technique       Without anonymization   With anonymization
Naïve Bayes'    0.08                    0.07
Bagging         3.46                    3.39
KNN             0.01                    0.01
PART            38.01                   9.46
oneR            0.13                    0.09
J48             2.56                    1.82

4 RESULTS

The classification accuracy and computational time obtained from Naïve Bayes', Bagging, k-nearest Neighbor, PART, oneR, and J48 are tabulated in Table 3 and Table 4 and shown in Fig 1 and Fig 2. From Fig 1 it is seen that the classification accuracy does not change considerably and remains within manageable limits for all the classifiers: the variation is not more than 0.4 for any classifier except Naïve Bayes'. Accuracy is best for J48, second best for PART, and lowest for KNN. From Fig 2 it is seen that the computational time also does not change considerably between the two data sets, except for PART, where there is a drastic change for the anonymized data compared to the original data: the execution time on the original data is 38.01 minutes, but on the anonymized data it is only 9.46 minutes. The execution time on anonymized data is thus efficient. Computationally, KNN performs best, Naïve Bayes' is second best, and PART is slowest.

TABLE 3
CLASSIFICATION ACCURACY

Technique       Without anonymization   With anonymization
Naïve Bayes'    82.85                   79.26
Bagging         84.67                   84.76
KNN             78.98                   79.12
PART            84.58                   85.15
oneR            80.16                   80.16
J48             85.75                   85.44

Fig 2: Computational time of various techniques.

Fig 1: Classification accuracy of various techniques.




The receiver operating characteristic (ROC) areas for both data sets are tabulated in Table 5 and shown in Fig 3. It is seen that performance is best for Bagging, second best for J48, and lowest for oneR.

TABLE 5
RECEIVER OPERATING CHARACTERISTIC (ROC) AREA

Technique       Without anonymization   With anonymization
Naïve Bayes'    0.89                    0.87
Bagging         0.90                    0.90
KNN             0.71                    0.72
PART            0.87                    0.90
oneR            0.60                    0.60
J48             0.89                    0.88

The TPR, TNR, Precision, Recall and F-measure are tabulated in Table 6, and the average performance of the algorithms is shown in Fig 4. The TPR and TNR measures are unchanged by anonymization for oneR, and for the remaining algorithms the variation is not more than 0.2% for TPR and 0.5% for TNR. Precision is unchanged for Bagging, KNN, and oneR, and varies by not more than 0.4% for the remaining algorithms. Recall is unchanged for Naïve Bayes', PART, and oneR, and varies by not more than 0.1% for the rest. The F-measure is unchanged for all algorithms except Naïve Bayes'.
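The Table 6 measures derive mechanically from a binary confusion matrix. A short sketch follows; the counts are hypothetical, not taken from the paper's experiments, and note that Recall coincides with TPR by definition:

```python
def binary_metrics(tp, fn, fp, tn):
    """Derive the Table 6 measures from confusion-matrix counts:
    tp/fn/fp/tn = true positives, false negatives, false positives,
    true negatives. Recall is by definition identical to TPR."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    precision = tp / (tp + fp)
    recall = tpr
    f_measure = 2 * precision * recall / (precision + recall)
    return {"TPR": tpr, "TNR": tnr, "Precision": precision,
            "Recall": recall, "F-measure": f_measure}

# Hypothetical counts for illustration only
m = binary_metrics(tp=90, fn=10, fp=20, tn=30)
print({k: round(v, 2) for k, v in m.items()})
```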

Fig 4: The average performance of various techniques.
Fig 3: ROC of various techniques.

TABLE 6
TPR, TNR, PRECISION, RECALL, F-MEASURE AND AVERAGE

Without anonymization:
Technique       TPR    TNR    Precision   Recall   F-measure   Avg.
Naïve Bayes'    0.93   0.52   0.85        0.93     0.89        0.646
Bagging         0.92   0.63   0.88        0.92     0.90        0.848
KNN             0.86   0.56   0.86        0.86     0.86        0.808
PART            0.91   0.65   0.89        0.91     0.90        0.844
oneR            1.00   0.21   0.79        1.00     0.88        0.78
J48             0.93   0.64   0.89        0.93     0.91        0.854

With anonymization:
Technique       TPR    TNR    Precision   Recall   F-measure   Avg.
Naïve Bayes'    0.95   0.57   0.81        0.93     0.87        0.826
Bagging         0.93   0.61   0.88        0.93     0.90        0.85
KNN             0.87   0.57   0.86        0.87     0.86        0.806
PART            0.93   0.61   0.88        0.91     0.90        0.846
oneR            1.00   0.21   0.79        1.00     0.88        0.776
J48             0.94   0.59   0.87        0.94     0.91        0.85




5 CONCLUSION

This paper compared the classification accuracy and computational time of various data mining algorithms on datasets with and without anonymization, and ranked the algorithms using the WEKA experimenter, which provides an environment for testing and comparing several data mining algorithms. The experimental results demonstrate that the classification accuracy of the classifiers is not diminished by the data anonymization.

REFERENCES

[1] Verykios V. S., Elmagarmid A., Bertino E., Saygin Y., Dasseni E.: Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 2004.
[2] Sweeney L.: Privacy-Preserving Bio-terrorism Surveillance. AAAI Spring Symposium, AI Technologies for Homeland Security, 2005.
[3] Newton E., Sweeney L., Malin B.: Preserving Privacy by De-identifying Facial Images. IEEE Transactions on Knowledge and Data Engineering, February 2005.
[4] Liew C. K., Choi U. J., Liew C. J.: A data distortion by probability distribution. ACM TODS, 10(3):395-411, 1985.
[5] Warner S. L.: Randomized Response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63-69, March 1965.
[6] Agrawal R., Srikant R.: Privacy-Preserving Data Mining. Proceedings of the ACM SIGMOD Conference, 2000.
[7] Agrawal D., Aggarwal C. C.: On the Design and Quantification of Privacy Preserving Data Mining Algorithms. ACM PODS Conference, 2002.
[8] Machanavajjhala A., Gehrke J., Kifer D., Venkitasubramaniam M.: l-Diversity: Privacy Beyond k-Anonymity. ICDE, 2006.
[9] Li N., Li T., Venkatasubramanian S.: t-Closeness: Privacy beyond k-anonymity and l-diversity. ICDE Conference, 2007.
[10] Verykios V. S., Elmagarmid A., Bertino E., Saygin Y., Dasseni E.: Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 2004.
[11] Moskowitz I., Chang L.: A decision theoretic system for information downgrading. Joint Conference on Information Sciences, 2000.
[12] Adam N., Wortmann J. C.: Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, 21(4), 1989.
[13] Oliveira S. R. M., Zaiane O., Saygin Y.: Secure Association-Rule Sharing. PAKDD Conference, 2004.






[14] Saygin Y., Verykios V., Clifton C.: Using Unknowns to Prevent Discovery of Association Rules. ACM SIGMOD Record, 30(4), 2001.
[15] C. C. Aggarwal and P. S. Yu (eds.): Privacy-Preserving Data Mining: Models and Algorithms. Springer, New York, 2008.
[16] Bayardo R. J., Agrawal R.: Data Privacy through Optimal k-Anonymization. Proceedings of the ICDE Conference, pp. 217-228, 2005.
[17] LeFevre K., DeWitt D., Ramakrishnan R.: Workload-Aware Anonymization. KDD Conference, 2006.
[18] A. Inan, M. Kantarcioglu, and E. Bertino: Using anonymized data for classification. ICDE, 2009.
[19] Fung B., Wang K., Yu P.: Top-Down Specialization for Information and Privacy Preservation. ICDE Conference, 2005.
[20] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
[21] Fix E., Hodges J. L., Jr. (1951): Discriminatory analysis, nonparametric discrimination. USAF School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 4, Contract AF41(128)-31, February 1951.
[22] Witten, I. H. and Frank, E. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Morgan Kaufmann, San Francisco, CA.
[23] L. Breiman: Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[24] J. Vaidya and C. Clifton: "Privacy-preserving data mining: Why, How, and When". IEEE Security and Privacy, 2004.
[25] Shawkat Ali A B M, Wasimi Saleh A. (2009). Data Mining: Methods and Techniques. CENGAGE Learning.

M.Sridhar is currently working as an Assistant Professor at the R.V.R & J.C College of Engineering, Guntur. His research interests are Data Mining, Information Security and Image Processing.

Dr. B.Raveendra Babu is currently working as a Professor at VNR VJIET, Hyderabad. His research interests include Data Mining, Image Processing and Information Security. He has published more than 50 papers in various international and national journals.







