
International Journal of Computational Intelligence and Information Security, January 2014, Vol. 5, No. 1, ISSN: 1837-7823

A Survey On Anonymity Based Solutions For Privacy Issues In Web Mining

S. Shakila¹ and Dr. Gopinath Ganapathy²
¹ Assistant Professor, Government Arts College, Tiruchirappalli
² Professor and Head, School of Computer Science and Engineering, Bharathidasan University, Tiruchirappalli
E-mail: shakila.rrc@gmail.com¹

Abstract

Web mining has both advantages and disadvantages. This paper provides a general overview of the web mining process along with the techniques that support it. It also discusses the privacy issues that web mining raises and the solutions proposed to overcome them. Finally, it outlines research directions that can be explored to make web mining efficient and safe.

Keywords: Web mining; web farming; privacy issues; information fusion; k-anonymity; clickstream; anonymization



Introduction

The World Wide Web is an enormous storehouse of information, yet it is often said that we are data rich and information poor. With the rapid growth of the Web, an ever-increasing number of electronic commerce activities are conducted through Web sites, and these activities continuously generate large amounts of Web log data. Extracting valuable information from this data helps provide a better experience to users. Data on the Web is generally in raw form, i.e. it cannot be used directly. Web mining is the process of obtaining useful information from Web data. Data resources for Web mining include Web page content (such as the text, pictures and videos in a page), link structure (mainly the links between pages) and log files (such as Web server logs and agent server logs). Web mining analyzes this data and yields more valuable information.

The major purposes of extracting information from the Web are to help users find information, to improve site structure and to provide personalized services. This is a win-win situation, since both the user and the organization benefit. The major drawback is that adversaries can perform the same information fusion to uncover user identities. Hence anonymizing the data has become mandatory. Data anonymization, as considered here, preserves the original field layout (position, size and data type) of the data being anonymized, so the data still looks realistic in test data environments. Current mining systems have been refined from data mining systems into behaviour mining systems [11]. This paper discusses the methods that help perform information extraction, the privacy issues that information extraction raises, the need for anonymity, the challenges faced during anonymization, and future directions of research.


Web Mining: The Process

Web mining can be broadly divided into three types: Web content mining, Web structure mining and Web usage mining [1]. Web usage mining is the process of discovering usage patterns from the information in log files; it provides information about the origin, identity and browsing behavior of each user. Web content mining extracts the contents of Web documents, and is basically performed by applying text mining techniques to Web content. Web structure mining is the process of mining the pages, links and interconnections of hypertext in documents. Figure 1 shows the Web mining taxonomy as described in [1].

Web mining is one of the technologies that has gained prominence recently due to the introduction of the Semantic Web. The Semantic Web is the standard that encourages the use of semantic (meaningful) content in Web pages. According to the W3C, "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries" [2]. The use of semantic content adds a new dimension to conventional Web searches by giving users the much-needed flexibility of meaningful searches. Semantic content also helps in automation: [3] presents an ontology-based automated system that helps in service composition. It also facilitates efficient resource discovery [4] and helps in building efficient e-learning resources [5].

Web information fusion usually follows one of two approaches: one integrates Web content, structure and usage data for surfing behavior analysis; the other integrates Web usage data with traditional customer, product and transaction data for purchasing behavior analysis.

A clickstream is the recording of the parts of the screen a computer user clicks on while browsing the Web or using another software application. As the user clicks anywhere in a Web page or application, the action is logged on the client or inside the Web server, and possibly also by the Web browser, router, proxy server or ad server. Mining the user clickstream for user behavior, and using it to adapt the look and feel of a site to a reader's needs, was first proposed by Perkowitz and Etzioni [6]. Clickstream analysis is also useful for Web activity analysis, software testing, market research and analyzing employee productivity. Using this method, not only users' purchases but their complete behavior can be determined.
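As an illustration of how logged clicks become behavioral data, the following minimal sketch groups click events into per-user sessions. The event layout (user id, timestamp in seconds, URL), the function name `sessionize` and the 30-minute inactivity threshold are assumptions made for this example; they are not taken from the surveyed work.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed threshold


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp, url) click events into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, timestamp, url in events:
        by_user[user_id].append((timestamp, url))

    sessions = {}
    for user_id, clicks in by_user.items():
        clicks.sort()  # chronological order per user
        user_sessions = []
        current = []
        last_time = None
        for timestamp, url in clicks:
            # A long gap since the previous click starts a new session.
            if last_time is not None and timestamp - last_time > timeout:
                user_sessions.append(current)
                current = []
            current.append(url)
            last_time = timestamp
        if current:
            user_sessions.append(current)
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical click log: (user id, seconds since start, requested page).
events = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/home"),
    ("u1", 120, "/cart"),
]
sessions = sessionize(events)
```

Even this small amount of structure shows the order in which a user moved through a site, which is why clickstream data is both useful for personalization and sensitive from a privacy standpoint.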

Figure 1: Web Mining Taxonomy[1]

Clickstream analysis helps in the process of Web farming. Web farming is the systematic refining (or cultivating) of information resources on the Web for an effective mining process [7]. It focuses on how to design the Web site and collect information effectively for further mining, and concentrates mainly on collecting Web usage data. In contrast to Web mining, Web farming starts from the design of the Web site framework and extends to the placement of contents. Hence data collection becomes much easier and more efficient. It helps in mining information about a user, with which a customized user experience can be provided. This helps industries deal with each customer as a single entity rather than as a member of a group of similar entities. Web farming can be performed efficiently with clickstream analysis [12].

Further, Web mining has been taken to a whole new level by community mining [8][13], owing to the advent of many social networking sites. Community mining not only brings out the interests of a user, it can also build an accurate and complete user profile. [9] provides a collaborative agent-based recommender system that recommends products to users based on their interests. Collaborative Web search (CWS) exploits repetition and regularity within the query-space of a community of like-minded searchers in order to improve the quality of search results [10].


Privacy Issues

The most important problem in mining Web data is that users are not aware that they are providing personal information, nor of how the gathered data is being used. They have no opportunity to give or withhold consent for its collection and use. This invisible information gathering is common on the Web. Knowledge discovered by this means can pose a great threat to people's privacy. Further, the data can be misused, or used for purposes other than those consented to. Ethical issues can arise not only from the use of personal details but also from the use of technical details or other information.

Each of the data collection techniques (Web usage, content and structure mining), when used in isolation, poses no problem to the user and maintains the user's anonymity. But when they are combined, they provide a detailed user profile. The process of combining Web usage, content and structure data in order to determine a user's information is called information fusion. [21] describes an information foraging approach that helps in better information attainment. From the security perspective, Web farming also poses a great threat to users, since their behavior and location information can easily be obtained from it.

For confidential data such as medical, banking and employee data, allowing access to the complete information is not advisable. Hence anonymization is a very important step to be performed before sharing data. [25] presents the existing privacy issues and the legal issues to be considered before applying data extraction techniques. According to [25], a fundamental part of informational privacy is to determine whether private data qualify as personal data or personally identifiable data. In general terms, this concept covers any information referring to an individual or allowing in some manner a connection with, or identification of, a particular person. All the above-mentioned issues exist in Web mining, foraging, farming and information fusion. Hence various techniques have been introduced to cope with them.
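The information fusion risk described in this section can be illustrated with a small sketch. Two data sets that are harmless in isolation are joined on shared attributes, and any record that matches exactly one person becomes a detailed, identified profile. The function `fuse`, the field names and all sample records are hypothetical and serve only to show the mechanism.

```python
def fuse(usage_log, registrations, keys):
    """Join two record lists on shared attributes and keep unique matches."""
    index = {}
    for reg in registrations:
        index.setdefault(tuple(reg[k] for k in keys), []).append(reg)

    profiles = []
    for visit in usage_log:
        matches = index.get(tuple(visit[k] for k in keys), [])
        # Exactly one candidate means the visitor has been re-identified.
        if len(matches) == 1:
            profiles.append({**visit, **matches[0]})
    return profiles


# Hypothetical usage data without names.
usage_log = [
    {"zip": "620001", "birth_year": 1980, "pages": ["/loans", "/clinic"]},
    {"zip": "620002", "birth_year": 1975, "pages": ["/news"]},
]
# Hypothetical second source that carries names.
registrations = [
    {"zip": "620001", "birth_year": 1980, "name": "Alice"},
    {"zip": "620002", "birth_year": 1975, "name": "Bob"},
    {"zip": "620002", "birth_year": 1975, "name": "Carol"},
]
profiles = fuse(usage_log, registrations, keys=("zip", "birth_year"))
```

In this example the first visitor is re-identified because only one registered person shares the same attribute values, while the second visitor stays ambiguous between two candidates. That ambiguity is exactly what anonymization techniques try to guarantee for every record.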


Privacy Solutions

These privacy issues have also led to the development of privacy measurement techniques [14], which serve as the basis for Privacy Enhancing Technologies (PETs). Anonymous communication networks [15,16], anonymous credentials [17], anonymous electronic cash [18], multiparty computation [19] and oblivious transfer protocols [20] are some examples of general-purpose PETs whose development roughly originates from the fields of security and cryptography.

[26] discusses various privacy issues arising during information fusion. It covers data protection issues, methods to cope with them and ways to evaluate the protected data. It also discusses the use of information fusion in data protection, through aggregation methods and record linkage methods. Data protection approaches are divided into perturbative methods, non-perturbative methods, cryptographic methods and synthetic data generation. Additive noise, rank swapping, rounding, microaggregation, PRAM (the post-randomization method), etc. can be used for data protection.

Data anonymization is the process of destroying the tracks, or electronic trail, on the data that would lead an eavesdropper to its origins. The anonymization of query logs is an important step that must be performed before such sensitive data is published. Data sharing requires balancing many privacy, security and legal interests. Anonymization can mitigate privacy and security concerns and help comply with legal requirements. According to [27], a good anonymization is one that keeps the data more usable and offers higher performance. A generic algorithm cannot be proposed for this process, since usability and performance generally depend on the target application that uses the technique. Even though anonymization erases the user's tracks, the anonymized data is mostly used for research (data mining) purposes. Hence achieving a balance is very important.
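Of the perturbative methods listed above, microaggregation is perhaps the easiest to sketch. The example below is a minimal univariate, fixed-size variant, assuming numeric values and a minimum group size k: values are sorted, split into groups of at least k neighbours, and each value is replaced by its group mean. The function name and the sample incomes are hypothetical, and the multivariate clustering used in the literature is more involved.

```python
def microaggregate(values, k):
    """Replace each value with the mean of its group of at least k neighbours."""
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values")

    # Indices of the values in ascending order, so neighbours are similar.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)

    start = 0
    while start < len(order):
        end = start + k
        # Fold a short tail into this group so no group has fewer than k members.
        if len(order) - end < k:
            end = len(order)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result


# Hypothetical sensitive attribute (e.g. income in thousands).
incomes = [30, 52, 31, 50, 29, 51, 90]
protected = microaggregate(incomes, k=3)
```

Every published value is now shared by at least three records, and the overall sum (and hence the mean) of the attribute is unchanged, which illustrates the balance between protection and usability discussed above.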
Anonymization should be performed on the user's personal information and not on the actual data to be mined. The level of anonymity introduced in the data plays a crucial role: too much anonymization makes the data unusable, while too little leaves the user's personal information vulnerable.

One technique that has gained prominence is k-anonymity, which helps in protecting released microdata. According to [32], k-anonymity requires that every tuple in the released microdata table be indistinguishable, with respect to its quasi-identifier attributes, from at least k - 1 other tuples. The major advantage of this approach is that it preserves the truthfulness of the data.

[22] ensures the k-anonymity of the users in a query log while preserving its utility. To ensure anonymity, [22] uses a well-known principle in statistical disclosure control, i.e. k-anonymity [23,24], which states that every combination of quasi-identifier values appearing in the anonymized data must be shared by at least k records. [22] uses microaggregation: it clusters multiple queries and replaces their values with the centroids of the clusters. It does not remove queries, and can therefore be considered more efficient. Moreover, it uses user-level anonymization for better privacy.

A common feature of k-anonymity algorithms is that they manipulate the data using generalization and suppression. Datafly [29] and Argus [28] use heuristics to provide k-anonymity, but precision is not guaranteed in either approach. Iyengar [30] proposed a genetic-algorithm-based approach that gives good, though not provably optimal, results, but it assumes single-dimensional generalizations (SDG) and hence is not effective for all types of applications.
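Generalization and suppression can be sketched in a few lines. The example below is a minimal illustration, not any of the cited algorithms: it assumes that age and postal code are the quasi-identifiers, generalizes them with fixed rules (10-year age bands, 3-digit postal prefix), and suppresses any record whose generalized group is still smaller than k. All names, rules and records are hypothetical.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: age to a 10-year band, zip to a 3-digit prefix."""
    low = (record["age"] // 10) * 10
    return {
        "age": f"{low}-{low + 9}",
        "zip": record["zip"][:3] + "***",
        "query": record["query"],  # the data to be mined is left untouched
    }


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


def k_anonymize(records, quasi_identifiers, k):
    """Generalize every record, then suppress records in groups smaller than k."""
    generalized = [generalize(r) for r in records]
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in generalized)
    return [
        r for r in generalized
        if counts[tuple(r[q] for q in quasi_identifiers)] >= k
    ]


QI = ("age", "zip")  # assumed quasi-identifier attributes
records = [
    {"age": 23, "zip": "620001", "query": "flu symptoms"},
    {"age": 27, "zip": "620017", "query": "cheap flights"},
    {"age": 34, "zip": "620002", "query": "home loan"},
    {"age": 38, "zip": "620020", "query": "diabetes diet"},
    {"age": 61, "zip": "641001", "query": "pension rules"},
]
released = k_anonymize(records, QI, k=2)
```

The raw table is not 2-anonymous because every record is unique. After generalization, four records fall into two groups of size two and are released, while the fifth record remains unique and is suppressed. Coarser generalization rules would retain more records at the cost of precision, which is the trade-off the heuristic algorithms above try to manage.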

Algorithm                 Time complexity
Samarati [23]             Exponential
Sweeney [24]              Exponential
Bayardo-Agrawal [33]      Exponential
LeFevre et al. [34]       Exponential
Aggarwal et al. [35]      O(kn^2)
Meyerson-Williams [36]    O(n^(2k))
Aggarwal et al. [37]      O(kn^2)
Iyengar [30]              Limited number of iterations
Winkler [38]              Limited number of iterations
Fung-Wang-Yu [39]         Limited number of iterations

Table 1: K-anonymity Approaches (n is the number of tuples)[32]


Research Directions

Future research directions include techniques that anonymize the data to the maximum extent without altering the usable data. Onion-based approaches can be used to provide strong security, taking the Onion Routing approach [31] as the base. Data is repeatedly encrypted, layer upon layer, before it is used for analysis, which helps protect it efficiently. Because the processing is divided into layers, allocating less processing to each layer would be sufficient to provide the same amount of security.

Security based on partial anonymity can also be explored. Partial anonymity refers to performing the encryption only on the necessary data, which reduces processing and unnecessary transformations of data. This method requires preprocessing techniques that identify which attributes correspond to identity data and which correspond to normal data. The layered proposal above can be fine-tuned, and its complexity reduced, by introducing partial anonymity. Anonymization is performed before fusion to reduce its complexity.

Parallelization techniques will provide faster and more efficient results. Since the layered technique is implicitly parallelizable, the processing in each layer can be performed in parallel. This will help impose better and faster security on the system.
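The partial anonymity idea can be sketched as follows. This is a minimal illustration under stated assumptions: the set of identity attributes is taken as given (standing in for the preprocessing step), and a keyed hash (HMAC-SHA256 from the Python standard library) is used as a stand-in for the encryption mentioned above. Unlike encryption, a keyed hash is one-way, so this variant yields pseudonyms rather than recoverable values. The field names, key and record are hypothetical.

```python
import hashlib
import hmac

# Assumed result of the preprocessing step that separates identity data
# from normal data; these attribute names are hypothetical.
IDENTITY_ATTRIBUTES = {"name", "email", "ip"}


def partially_anonymize(record, secret_key, identity_attributes=IDENTITY_ATTRIBUTES):
    """Pseudonymize identity attributes only; leave the remaining data as-is."""
    protected = {}
    for field, value in record.items():
        if field in identity_attributes:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            # Truncated only to keep the example output short.
            protected[field] = digest.hexdigest()[:16]
        else:
            protected[field] = value
    return protected


record = {
    "name": "Alice",
    "email": "alice@example.com",
    "ip": "203.0.113.7",
    "pages": ["/loans", "/clinic"],
    "duration_s": 340,
}
protected = partially_anonymize(record, secret_key=b"demo-key-not-for-production")
```

Only three of the five fields are transformed, so the usable data (pages visited and duration) reaches the mining step unchanged. Because the same key maps the same identity to the same pseudonym, records of one user can still be grouped without revealing who the user is. Since each record is processed independently, the step could also be run in parallel over a large log.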



Conclusion

This paper discussed the process of Web mining, i.e. retrieving information from the data or information scent left by a user, along with its pros and cons. Its advantages are better customer service and more flexible service; its disadvantage is the leakage of users' personal data. To solve this problem, various techniques have been proposed to anonymize the available data. Striking a balance between anonymization and data usability is a must, because completely anonymized data becomes unusable for any mining process. This paper discusses various anonymization techniques and research directions that might prove useful for anonymizing data. It also proposes a layered approach that might prove efficient when multiple anonymizations are required.



References

[1] Jaideep Srivastava, Prasanna Desikan, Vipin Kumar, Web Mining - Concepts, Applications & Research Directions.
[2] "W3C Semantic Web Activity", November 7, 2011, World Wide Web Consortium (W3C).
[3] Han, S.N., Lee, G.M., Crespi, N., Feb. 2014, Semantic Context-Aware Service Composition for Building Automation System, IEEE Transactions on Industrial Informatics, Vol. 10, No. 1, 752-761.
[4] Ruta, M., Scioscia, F., Loseto, G., Di Sciascio, E., Feb. 2014, Semantic-Based Resource Discovery and Orchestration in Home and Building Automation: A Multi-Agent Approach, IEEE Transactions on Industrial Informatics, Vol. 10, No. 1, 730-741.
[5] Zhujuan Li, 2014, Construction of Personalized Network Learning Resource Service System Based on Semantic Web Services, Proceedings of the 2012 International Conference on Cybernetics and Informatics, Lecture Notes in Electrical Engineering, Volume 163, pp. 2431-2438.
[6] M. Perkowitz and O. Etzioni, 1999, Adaptive Web Sites: Conceptual Cluster Mining, in IJCAI, pp. 264-269.
[7] R. D. Hackathorn, 1998, Web Farming for the Data Warehouse, Morgan Kaufmann.
[8] Qiang Yang, Xindong Wu, 2006, 10 Challenging Problems in Data Mining Research, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 597-604, World Scientific Publishing Company.
[9] Daniela Godoy, Analía Amandi, 2008, Collaborative Web Search Based on User Interest Similarity, International Journal of Cooperative Information Systems, Vol. 17, No. 4, 495-521, World Scientific Publishing Company.
[10] B. Smyth, 2007, A community-based approach to personalizing Web search, IEEE Computer 40(8), 42-50.
[11] Zhengxin Chen, 2006, From Data Mining to Behavior Mining, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 703-711, World Scientific Publishing Company.
[12] Jia Hu, Ning Zhong, 2008, Web Farming With Clickstream, International Journal of Information Technology & Decision Making, Vol. 7, No. 2, 291-308, World Scientific Publishing Company.
[13] Yacine Slimani, Abdelouahab Moussaoui, Yves Lechevallier, Ahlem Drif, 2011, A community detection algorithm for Web Usage Mining Systems, Fourth International Symposium on Innovation in Information & Communication Technology.
[14] Javier Parra-Arnau, David Rebollo-Monedero, Jordi Forné, Measuring the privacy of user profiles in personalized information systems, Future Generation Computer Systems.
[15] X. Bai, 2011, Predicting consumer sentiments from online text, Decision Support Systems 50, 732-742.
[16] S. Bay, M. Pazzani, 1999, Detecting change in categorical data: mining contrast sets, Proc. 5th Internat. Conf. Knowledge Discovery Data Mining (KDD-99), pp. 302-306, San Diego, CA.
[17] D. Blei, 2012, Probabilistic topic models, Communications of the ACM 55, 77-84.
[18] D. Blei, A. Ng, M. Jordan, 2003, Latent Dirichlet allocation, Journal of Machine Learning Research 3, 993-1022.
[19] R. Bucklin, J. Lattin, A. Ansari, S. Gupta, D. Bell, E. Coupey, J. Little, C. Mela, A. Montgomery, J. Steckel, 2002, Choice and the Internet: from clickstream to research stream, Marketing Letters 13, 245-258.
[20] R. Bucklin, C. Sismeiro, 2003, A model of Web site browsing behavior estimated on clickstream data, Journal of Marketing Research 40, 249-267.
[21] J.A. McCart, B. Padmanabhan, D.J. Berndt, 2013, Goal attainment on long tail web sites: An information foraging approach, Decision Support Systems xxx (2013) xxx-xxx.
[22] Guillermo Navarro-Arribas, Vicenç Torra, Arnau Erola, Jordi Castellà-Roca, 2012, User k-anonymity for privacy preserving data mining of query logs, Information Processing and Management 48, 476-487.
[23] Samarati, P., 2001, Protecting respondents' identities in microdata release, IEEE Transactions on Knowledge and Data Engineering, 13(6), 1010-1027.
[24] Sweeney, L., 2002, k-Anonymity: A model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570.
[25] Juan D. Velásquez, 2013, Web mining and privacy concerns: Some important legal issues to be consider before applying any data and information extraction technique in web-based environments, Expert Systems with Applications xxx, xxx-xxx.


[26] Guillermo Navarro-Arribas, Vicenç Torra, 2012, Information fusion in data privacy: A survey, Information Fusion 13, 235-244.
[27] M. Ercan Nergiz, Chris Clifton, December 2007, Thoughts on k-Anonymization, Data & Knowledge Engineering, Volume 63, Issue 3, pp. 622-645, 25th International Conference on Conceptual Modeling (ER 2006).
[28] A. Hundepool and L. Willenborg, 1996, µ- and τ-ARGUS: software for statistical disclosure control, in Third International Seminar on Statistical Confidentiality.
[29] L. Sweeney, 1997, Guaranteeing anonymity when sharing medical data, the Datafly system, in Proc., Journal of the American Medical Informatics Association, Hanley & Belfus, Inc.
[30] V. Iyengar, 2002, Transforming data to satisfy privacy constraints, in Proc. of the Eighth ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining.
[31] Peng Zhou, Xiapu Luo, Ang Chen, Rocky K.C. Chang, 9 December 2013, SGor: Trust graph based onion routing, Computer Networks, Volume 57, Issue 17, pp. 3522-3544.
[32] V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati, 2007, k-Anonymity, Springer US, Advances in Information Security.
[33] Bayardo RJ, Agrawal R, 2005, Data privacy through optimal k-anonymization, in Proc. of the 21st International Conference on Data Engineering (ICDE'05), pp. 217-228, Tokyo, Japan.
[34] LeFevre K, DeWitt DJ, Ramakrishnan R, 2005, Incognito: Efficient full-domain k-anonymity, in Proc. of the 24th ACM SIGMOD International Conference on Management of Data, pp. 49-60, Baltimore, Maryland, USA.
[35] Aggarwal G, Feder T, Kenthapadi K, Motwani R, Panigrahy R, Thomas D, Zhu A, 2005, Anonymizing tables, in Proc. of the 10th International Conference on Database Theory (ICDT'05), pp. 246-258, Edinburgh, Scotland.
[36] Meyerson A, Williams R, 2004, On the complexity of optimal k-anonymity, in Proc. of the 23rd ACM-SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems, pp. 223-228, Paris, France.
[37] Aggarwal G, Feder T, Kenthapadi K, Motwani R, Panigrahy R, Thomas D, Zhu A, 2005, Approximation algorithms for k-anonymity, Journal of Privacy Technology, paper number 20051120001.
[38] Winkler WE, 2002, Using simulated annealing for k-anonymity, Technical Report 7, U.S. Census Bureau.
[39] Fung B, Wang K, Yu P, 2005, Top-down specialization for information and privacy preservation, in Proc. of the 21st International Conference on Data Engineering (ICDE'05), Tokyo, Japan.