
International Journal of Computational Intelligence and Information Security, October 2011, Vol. 2, No. 10

Classification Analysis of Anonymization for Privacy Data


Name: N. SambaSiva Rao, College: Vardhaman College of Engineering, Designation: Principal
Name: Syed Arfath Ahmed, College: Muffakham Jah College of Engineering & Technology, Designation: Asst. Prof., Email: ahmed_info521@yahoo.co.in

Name: Anupma Tammi, College: Vardhaman College of Engineering, Qualification: M.Tech, Email: anupamatammi@gmail.com

Abstract: Data mining extracts knowledge from huge amounts of data, and classification is a fundamental problem in the analysis of continuous and categorical values. Training a classifier requires access to a large collection of data. Mining person-specific data, such as employee records or bank customers' records, can pose a threat to individual privacy: even after removing explicit identifying information such as names, it is still possible to link released records back to their identities by matching combinations of non-identifying attributes such as Gender and Date of Birth. One approach to countering such linking attacks, called k-anonymization, anonymizes the linking attributes so that at least k released records match every value combination. Existing work has attempted to find an optimal k-anonymization that minimizes some data-distortion metric; however, minimizing the distortion to the training data is not the right objective for classification, which requires extracting structure that predicts well on future data. We analyze a k-anonymization algorithm for classification that protects privacy efficiently while preserving utility on future data.

Keywords: Data Mining, Privacy, K-anonymization, Classification

I. INTRODUCTION

Data mining can be performed on data represented in quantitative, textual, or multimedia forms. Data mining applications can use a variety of parameters to examine the data. These include association (patterns where one event is connected to another, such as purchasing a pen and purchasing paper), sequence or path analysis (patterns where one event leads to another, such as purchasing a refrigerator and then purchasing a stabilizer), classification (identification of new patterns, such as coincidences between duct-tape purchases and plastic-sheeting purchases), clustering (finding and visually documenting groups of previously unknown facts, such as geographic location and brand preferences), and forecasting (discovering patterns from which one can make reasonable predictions about future activities, such as the prediction that people who join an athletic club may take advantage of other services) [2].

As an application, data mining differs from other data analysis tools, such as structured queries (used in many commercial databases) or statistical analysis software, in kind rather than degree. Many simpler analytical tools use a verification-based approach, in which the user develops a hypothesis and then tests the data to prove or disprove it. For example, a user might hypothesize that a customer who buys a hammer will also buy a box of nails. The effectiveness of this approach is limited by the creativity of the user in developing hypotheses, as well as by the structure of the software being used.

Privacy-preserving data mining finds numerous applications in surveillance, which is naturally assumed to be a privacy-violating application. The key is to design methods [4] that remain effective without compromising security. In [4], a number of techniques are discussed for bio-surveillance, facial de-identification, and identity theft; more detailed discussions of some of these issues may be found in the literature. Most methods for privacy-preserving computation apply some form of transformation to the data. Typically, such methods reduce the granularity of representation in order to reduce the privacy risk, and this reduction in granularity results in some loss of effectiveness for data management or mining algorithms.

Data sharing in globally networked systems poses a threat to individual privacy and confidentiality. A well-known example shows that linking medication records with a voter list can uniquely identify a person's name and her medical information. To avoid such privacy conflicts, one measure introduced is the Personal Information Protection and Electronic Documents Act [6], which protects a wide spectrum of information such as age, race, income, and evaluations. Consider a table T of patient information on Birthplace, Birth year, Sex, and Diagnosis.


If a description on {Birthplace, Birth year, Sex} is so specific that not many people match it, releasing the table may lead to linking a unique record to an external record with an explicit identity, thus revealing the medical condition and compromising the privacy of the individual [5]. Suppose that the attributes Birthplace, Birth year, Sex, and Diagnosis must be released (say, to a health research institute for research purposes). One way to prevent such linking is to mask the detailed information in these attributes, for example by generalizing Birth year to a range of years.
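To see why such masking matters, here is a minimal sketch of the linking attack itself, using pandas on two hypothetical tables (the names, values, and library choice are illustrative assumptions, not data from the paper):

```python
import pandas as pd

# Hypothetical released medical table: explicit identifiers removed,
# but quasi-identifying attributes retained.
medical = pd.DataFrame({
    "Birthplace": ["Delhi", "Mumbai", "Delhi"],
    "BirthYear":  [1971, 1980, 1985],
    "Sex":        ["F", "M", "F"],
    "Diagnosis":  ["Flu", "Diabetes", "Hepatitis"],
})

# Hypothetical public voter list with explicit identities.
voters = pd.DataFrame({
    "Name":       ["A. Rao", "B. Khan", "C. Tammi"],
    "Birthplace": ["Delhi", "Mumbai", "Delhi"],
    "BirthYear":  [1971, 1980, 1985],
    "Sex":        ["F", "M", "F"],
})

# Joining on the quasi-identifier re-attaches names to diagnoses:
# each attribute combination here is unique, so every record is re-identified.
linked = medical.merge(voters, on=["Birthplace", "BirthYear", "Sex"])
print(linked[["Name", "Diagnosis"]])
```

Generalizing BirthYear to a range (or suppressing Birthplace) would make several records share each quasi-identifier value, defeating this join.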

II. MACHINE LEARNING

Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to learn from data, such as sensor data or databases. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data. Hence, machine learning is closely related to fields such as statistics, probability theory, data mining, pattern recognition, artificial intelligence, adaptive control, and theoretical computer science. Data mining interfaces support the following supervised and unsupervised functions.

2.1. Classification: A classification task begins with build data (also known as training data) for which the target values (or class assignments) are known. Different classification algorithms use different techniques for finding relations between the predictor attributes' values and the target attribute's values in the build data. Decision tree rules provide model transparency, so that a business user, marketing analyst, or business analyst can understand the basis of the model's predictions, and therefore be comfortable acting on them and explaining them to others. Decision Tree does not support nested tables; Decision Tree models can be converted to XML.

Naive Bayes (NB) makes predictions using Bayes' Theorem, which derives the probability of a prediction from the underlying evidence. Bayes' Theorem states that the probability of event A occurring given that event B has occurred, P(A|B), is proportional to the probability of event B occurring given that event A has occurred, multiplied by the probability of event A occurring: P(B|A)P(A). Adaptive Bayes Network (ABN) is an Oracle proprietary algorithm that provides a fast, scalable, non-parametric means of extracting predictive information from data with respect to a target attribute. (Non-parametric statistical techniques avoid assuming that the population is characterized by a family of simple distributional models, such as standard linear regression, where different members of the family are differentiated by a small set of parameters.)

2.2. Support Vector Machine: The support vector machine (SVM) is a state-of-the-art classification and regression algorithm. SVM has strong regularization properties; that is, the optimization procedure maximizes predictive accuracy while automatically avoiding over-fitting of the training data. Neural networks and radial basis functions, both popular data mining techniques, have the same functional form as SVM models; however, neither of these algorithms has the well-founded theoretical approach to regularization that forms the basis of SVM.

2.3. Association Rules: Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro [7] describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal et al. [12] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} => {beef} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef. Association models are often used for market basket analysis, which attempts to discover relationships or correlations in a set of items; a minimal sketch of the underlying measures appears below.
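As a rough illustration of the support and confidence measures behind such rules, here is a minimal Python sketch over hypothetical transactions (this shows only the definitions, not the search mechanism of any particular algorithm such as Apriori):

```python
# Hypothetical point-of-sale transactions (sets of items per basket).
transactions = [
    {"onions", "potatoes", "beef"},
    {"onions", "potatoes", "beef", "bread"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"onions", "beef"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Rule {onions, potatoes} => {beef}
antecedent = {"onions", "potatoes"}
both = antecedent | {"beef"}

# confidence = support(antecedent + consequent) / support(antecedent)
conf = support(both, transactions) / support(antecedent, transactions)
print(f"support={support(both, transactions):.2f}, confidence={conf:.2f}")
```

On these five baskets the rule has support 0.40 and confidence about 0.67; a mining algorithm keeps only the rules whose support and confidence exceed user-chosen thresholds.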
Market basket analysis is widely used in data analysis for direct marketing, catalog design, and other business decision-making processes. Traditionally, association models are used to discover business trends by analyzing customer transactions. However, they can also be used effectively to predict Web page accesses for personalization. For example, assume that after mining the Web access log, Company X discovered an association rule "A and B implies C" with 80% confidence, where A, B, and C are Web page accesses. If a user has visited pages A and B, there is an 80% chance that he or she will visit page C in the same session. Page C may or may not have a direct link from A or B. This information can be used to create a dynamic link to page C from page A or B so that the user can click through to page C directly. Such information is particularly valuable for a Web server supporting an e-commerce site, which can link different product pages dynamically based on customer interaction.

2.4. Clustering: A cluster is a number of similar objects grouped together. Clustering can also be defined as the organization of a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. A cluster is an aggregation of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it. There are two types of attributes associated with clustering: numerical and categorical. Numerical attributes have ordered values, such as the height of a person or the speed of a train. Categorical attributes have unordered values, such as the kind of a drink or the brand of a car.


Clustering comes in two flavors: hierarchical and partitional (non-hierarchical). In hierarchical clustering, the data are not partitioned into a particular set of clusters in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings.
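As a brief illustration of agglomerative hierarchical clustering, here is a minimal sketch using SciPy (the library choice and the data points are our assumptions; the paper does not prescribe an implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical numerical attributes (e.g., two measurements per object).
points = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.5], [7.8, 8.2], [4.0, 4.1]])

# Agglomerative (bottom-up) clustering: successive fusions of the
# n objects, from n singleton clusters toward one all-inclusive cluster.
Z = linkage(points, method="average", metric="euclidean")

# Cut the dendrogram into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g., [1 1 2 2 1], depending on where the cut falls
```

The linkage matrix Z records the whole sequence of fusions, so the same run can be cut at any level to obtain coarser or finer groupings.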

III. PROBLEM DEFINITION

Data mining comprises techniques for mining huge amounts of data. The problem we consider is that a data provider wants to release a person-specific table to the public for modeling a class label, where each attribute is either categorical or continuous. The data provider also wants to protect against linking an individual to sensitive information, either within or outside the table, through a set of identifying attributes called a quasi-identifier (QID).

[Figure 1: Taxonomy trees and anonymity requirements. Education: ANY -> Secondary (Junior Sec: 9th, 10th; Senior Sec: 11th, 12th), University (Bachelor; Grad School: Masters, Doctorate). Work_Hrs: ANY -> [1-37) (split into [1-35) and [35-37)), [37-99]. Sex: ANY -> Male, Female. Anonymity requirements: <QID1 = {Education, Sex}, 4>, <QID2 = {Sex, Work_Hrs}, 11>.]

Figure 1 shows an example of taxonomy trees and anonymity requirements for class-labeled data. Classification is a technique that predicts the continuous or categorical class value depending on the QID attributes.

3.1. Classification Techniques: In classification, training examples are used to learn a model that can classify data samples into known classes. The classification process involves the following steps: a. Create the training data set. b. Identify the class attribute and the classes. c. Identify the attributes useful for classification (relevance analysis). d. Learn a model using the training examples in the training set. e. Use the model to classify unknown data samples. A sketch of these steps appears below.

A. Decision Tree: A decision tree is a support tool that uses a tree-like graph or model of decisions and their consequences [13][14], including event outcomes, resource costs, and utility. Decision trees are commonly used in operations research and in decision analysis to help identify the strategy most likely to reach a goal. In data mining and machine learning, a decision tree is a predictive model that maps observations about an item to conclusions about its target value. The machine learning technique for inducing a decision tree from data is called decision tree learning.
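As a minimal sketch of steps (a) through (e) with a decision tree learner (the use of scikit-learn, the numeric attribute encoding, and the data are our assumptions; the paper prescribes no particular tool):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# (a) Training data: hypothetical rows of [years of education, work hours],
# (b) with a known class attribute, e.g., an income band 0/1.
X = [[9, 40], [12, 20], [16, 45], [10, 38], [16, 60], [12, 35]]
y = [0, 0, 1, 0, 1, 0]

# (c) Relevance analysis is omitted here; both attributes are kept.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# (d) Learn a model from the training examples.
model = DecisionTreeClassifier().fit(X_train, y_train)

# (e) Use the model to classify unknown data samples.
print(model.predict(X_test))
```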


[Figure 2: Example decision tree. Root node Outlook with branches Sunny, Overcast, and Rain: Sunny leads to a Humidity test (High -> No, Normal -> Yes); Overcast leads directly to Yes; Rain leads to a Wind test (Strong -> No, Weak -> Yes).]

Figure 2 shows an example of a decision tree. The tree in Figure 2 has five leaf nodes; in a decision tree, each leaf node represents a rule. The rules in Figure 2 are as follows. Rule 1: If it is sunny and the humidity is high, then do not play. Rule 2: If it is sunny and the humidity is normal, then play. Rule 3: If it is overcast, then play. Rule 4: If it is rainy and the wind is strong, then do not play. Rule 5: If it is rainy and the wind is weak, then play.

B. ID3 Decision Tree: Iterative Dichotomiser 3 (ID3) is an algorithm for generating a decision tree, invented by Ross Quinlan and based on Occam's razor: it prefers smaller decision trees (simpler theories) over larger ones. However, it does not always produce the smallest tree and is therefore a heuristic. The tree is built using the concept of information entropy. In outline, ID3 works as follows: 1) take all unused attributes and compute the entropy of the test samples with respect to each; 2) choose the attribute for which the information gain is maximum (i.e., the resulting entropy is minimum); 3) make a node containing that attribute. The full procedure is:

ID3(Examples, Target_Attribute, Attributes)
  Create a root node Root for the tree.
  If all examples are positive, return the single-node tree Root with label = +.
  If all examples are negative, return the single-node tree Root with label = -.
  If the set of predicting attributes is empty, return the single-node tree Root
    with label = the most common value of the target attribute in the examples.
  Otherwise:
    A = the attribute that best classifies the examples.
    Set the decision attribute for Root to A.
    For each possible value vi of A:
      Add a new tree branch below Root corresponding to the test A = vi.
      Let Examples(vi) be the subset of examples that have value vi for A.
      If Examples(vi) is empty:
        Below this branch add a leaf node with label = the most common
          target value in the examples.
      Else:
        Below this branch add the subtree
          ID3(Examples(vi), Target_Attribute, Attributes - {A}).
  Return Root.
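To make ID3's entropy-based attribute selection concrete, here is a minimal Python sketch of the information-gain computation (the play-tennis style data are hypothetical, and this illustrates the measure rather than the authors' implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attr_index] for r in rows):
        subset = [labels[i] for i, r in enumerate(rows) if r[attr_index] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Hypothetical examples: [Outlook, Wind] -> Play?
rows = [["Sunny", "Weak"], ["Sunny", "Strong"], ["Overcast", "Weak"],
        ["Rain", "Weak"], ["Rain", "Strong"]]
labels = ["No", "No", "Yes", "Yes", "No"]

# ID3 would pick the attribute with the largest gain as the root.
for i, name in enumerate(["Outlook", "Wind"]):
    print(name, round(information_gain(rows, labels, i), 3))
```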


C. CART: Classification and Regression Trees (CART) was introduced by Breiman (1984). It builds both classification and regression trees. The classification tree construction in CART is based on binary splitting of the attributes. It is also based on Hunt's model of decision tree construction and can be implemented serially (Breiman, 1984). It uses the gini index splitting measure in selecting the splitting attribute. Pruning in CART is done using a portion of the training data set (Podgorelec et al., 2002). CART handles both numeric and categorical attributes when building the decision tree and has built-in features that deal with missing attributes (Lewis, 2000). CART is unique among Hunt's-based algorithms in that it can also be used for regression analysis with the help of regression trees. The regression analysis feature is used to forecast a dependent variable (the result) given a set of predictor variables over a given period of time (Breiman, 1984). CART uses several single-variable splitting criteria, such as the gini index and symgini, and one multi-variable criterion (linear combinations) for determining the best split point, and the data are sorted at every node to determine the best splitting point. The linear-combination splitting criterion is used during regression analysis. Salford Systems implemented a commercial version of CART using the original code of Breiman (1984); it has enhanced features and capabilities that address the shortcomings of the original, giving rise to a modern decision tree classifier with high classification and prediction accuracy.

3.2. Applications of Classification Techniques: Classification learners typically: allow for continuous-valued attributes by dynamically defining new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals; handle missing attribute values by assigning the most common value of the attribute, or by assigning a probability to each of the possible values; and support attribute construction, creating new attributes based on existing ones that are sparsely represented, which reduces fragmentation, repetition, and replication. A sketch of the first two practices appears below.
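The following minimal Python sketch shows the first two practices: partitioning a continuous attribute into the discrete intervals of Figure 1, and imputing a missing value with the attribute's most common value (the data are hypothetical; this is not the authors' code):

```python
from collections import Counter

# Hypothetical Work_Hrs values; None marks a missing entry.
work_hrs = [12, 36, None, 40, 35, 80, None, 36]

# Handle missing values: assign the most common observed value.
observed = [v for v in work_hrs if v is not None]
most_common = Counter(observed).most_common(1)[0][0]
filled = [v if v is not None else most_common for v in work_hrs]

def discretize(v):
    """Partition the continuous range into the intervals of Figure 1."""
    if v < 35:
        return "[1-35)"
    elif v < 37:
        return "[35-37)"
    return "[37-99]"

print([discretize(v) for v in filled])
```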

IV. ANONYMIZATION

4.1. What is Anonymization: The randomization method is a simple technique that can easily be implemented at data collection time, because the noise added to a given record is independent of the behavior of other data records.

4.2. K-Anonymity Framework: In many applications, data records are made available after removing key identifiers, such as names and social security numbers, from personal records. However, other kinds of attributes can still be used to accurately identify the records. For example, attributes such as age, ZIP code, and sex are available in public records such as census rolls. When these attributes are also present in a given data set, they can be used to infer the identity of the corresponding individual. In k-anonymity techniques [8], we reduce the granularity of representation of these pseudo-identifiers using techniques such as generalization and suppression. In generalization, attribute values are generalized to a range in order to reduce the granularity of representation; for example, a date of birth could be generalized to the year of birth, so as to reduce the risk of identification. In suppression, the value of the attribute is removed completely. Such methods reduce the risk of identification from public records, while also reducing the accuracy of applications run on the transformed data.

4.3. Top-Down Refinement Algorithm: The Top-Down Refinement (TDR) algorithm begins with a preprocessing step that compresses the given table by removing all attributes not in the QIDs and collapsing duplicates into a single row with a class count. The steps are shown below.

Table 4.1: Algorithm Top-Down Refinement (TDR)
1: Generalize every value of Dj to the topmost value, or suppress every value of Dj to ⊥j, or include every continuous value of Dj in the full-range interval, where Dj ∈ ∪QIDi.
2: Initialize Cutj of Dj to include the topmost value, Supj of Dj to include all domain values of Dj, and Intj of Dj to include the full-range interval, where Dj ∈ ∪QIDi.
3: while some x ∈ ∪Cutj ∪ ∪Supj ∪ ∪Intj is valid and beneficial do
4:   find the Best refinement from ∪Cutj, ∪Supj, ∪Intj;
5:   perform Best on T and update ∪Cutj, ∪Supj, ∪Intj;
6:   update Score(x) and validity for x ∈ ∪Cutj ∪ ∪Supj ∪ ∪Intj;
7: end while
8: return the masked T and ∪Cutj, ∪Supj, ∪Intj.
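As a much-simplified sketch of the ideas underlying TDR, the following Python fragment checks whether a table is k-anonymous with respect to a QID and applies one generalization step over hypothetical taxonomy mappings. Note that TDR itself works top-down, starting from the most general state and refining; this sketch shows only the anonymity check and a single bottom-up generalization, with the Score and refinement machinery omitted:

```python
from collections import Counter

# Hypothetical records: (Education, Sex, Work_Hrs).
table = [("9th", "M", 30), ("10th", "M", 32), ("9th", "M", 36),
         ("Masters", "F", 45), ("Doctorate", "F", 50), ("Masters", "F", 44)]

def is_k_anonymous(table, qid_indices, k):
    """Every QID value combination must occur in at least k records."""
    groups = Counter(tuple(row[i] for i in qid_indices) for row in table)
    return all(count >= k for count in groups.values())

# Generalize by climbing the Figure 1 taxonomies: 9th/10th -> Secondary,
# Masters/Doctorate -> University, and Work_Hrs coarsened to intervals.
EDU_UP = {"9th": "Secondary", "10th": "Secondary",
          "Masters": "University", "Doctorate": "University"}

def generalize(row):
    edu, sex, hrs = row
    return (EDU_UP.get(edu, edu), sex, "[1-37)" if hrs < 37 else "[37-99]")

print(is_k_anonymous(table, (0, 1, 2), 2))   # False: raw table, all rows unique
masked = [generalize(r) for r in table]
print(is_k_anonymous(masked, (0, 1, 2), 2))  # True: two groups of three records
```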


4.4. Comparative Study of Clustering and Classification: Machine learning techniques are classified into supervised and unsupervised learning. Classification is supervised: based on a class label, continuous and categorical objects are mined, and the data are split into training and testing sets. Clustering is an unsupervised technique: it has no class label and groups objects based on a similarity function, e.g., Euclidean distance, so that similar objects fall in the same cluster and dissimilar objects in other clusters. Our proposed system mines the data using a unique identity as the class label; for example, with a QID over a student table, SID plays the role of the class label.

V. CONCLUSION

Our work addresses the problem of ensuring individuals' anonymity while releasing person-specific data for classification. Existing work on optimal anonymization, based on a closed-form cost metric, does not take classification requirements into account. The proposed system rests on two observations specific to classification: classification learns from the class labels of the training and testing data, so not all data items are equally useful for classification, and the less useful data items provide room for anonymizing the data without compromising its utility. With these observations, we presented a k-anonymization approach that iteratively refines the data from a general state into a special state, guided by maximizing the trade-off between information and anonymity. This approach provides a natural and efficient structure for handling categorical and continuous attributes and multiple anonymity requirements. Our analysis showed that the approach effectively preserves both information utility and individual privacy, and scales well to large data sets.

References
[1] Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, Third Edition, Potomac, MD: Two Crows Corporation, 1999; Pieter Adriaans and Dolf Zantinge, Data Mining, New York: Addison Wesley, 1996.
[2] For a more technically oriented definition of data mining, see http://searchcrm.techtarget.com/gDefinition/0,294236,sid11_gci211901,00.html.
[3] John Makulowich, "Government Data Mining Systems Defy Definition," Washington Technology, 22 February 1999, http://www.washingtontechnology.com/news/13_22/tech_features/393-3.html.
[4] L. Sweeney, "Privacy Technologies for Homeland Security," Testimony before the Privacy and Integrity Advisory Committee of the Department of Homeland Security, Boston, MA, June 15, 2005.
[5] P. Samarati, "Protecting respondents' identities in microdata release," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 13, no. 6, 2001, pp. 1010-1027.
[6] The House of Commons in Canada, "The Personal Information Protection and Electronic Documents Act," April 2000, http://www.privcom.gc.ca/.
[7] G. Piatetsky-Shapiro, "Discovery, analysis, and presentation of strong rules," in G. Piatetsky-Shapiro and W. J. Frawley, eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
[8] P. Samarati, "Protecting respondents' identities in microdata release," IEEE Trans. Knowl. Data Eng., 13(6): 1010-1027, 2001.
[9] A. Meyerson and R. Williams, "On the complexity of optimal k-anonymity," in Proc. of the 23rd ACM Symposium on Principles of Database Systems (PODS), 2004, pp. 223-228.
[10] L. Sweeney, "Datafly: A system for providing anonymity in medical data," in Proc. of the International Conference on Database Security, 1998, pp. 356-381.


N. SambaSiva Rao holds a Ph.D. (CSE) from Anna University, an M.E. (CSE) from MNREC, Allahabad, and an M.Tech (PSE) from REC Warangal. He has 30 years of academic experience, has guided many UG and PG students, and served as the Kakatiya University coordinator for an NSS summer camp. He has presented papers at national and international conferences and has published eight journal papers. He is currently working as the Principal of Vardhaman College of Engineering; his research areas include Databases, Software Engineering, Networks, Power Electronics, and Data Mining.

Syed Arfath Ahmed is an Asst. Prof. at Muffakham Jah College of Engineering & Technology, Hyderabad. He received his M.Tech from Shadan College of Engineering, Hyderabad, and his B.Tech from Kshatriya College of Engineering, Armoor. He has guided many UG and PG students; his areas of interest include Network Security, Data Mining, and Software Engineering.

Anupma Tammi is pursuing her M.Tech at Vardhaman College of Engineering and received her B.Tech from Kshatriya College of Engineering. Her areas of interest include Software Engineering, Databases, and Network Security; she is currently focusing on Data Mining and Warehousing.

