Professional Documents
Culture Documents
=
(1)
Where || denotes the Frobenius norm of the enclosed
argument
4.2 Position Difference
The order of the value of the data element changes
after data distortion. We use several metrics to measure
the position difference of the data element.
4.2.1 RP- RP parameter is used to represent the average
change of rank for all attributes after data distortion. For
a dataset M with q data object and p attributes. Let
i
j
G is
the rank (in ascending order) of the j
th
element in
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 190
attribute i. Similarly
i
j G is the rank of the corresponding
distorted element. Then the RP parameter is given by:
= =
=
a
i
b
j
i
j
i
j
G G
b a
RP
1 1
*
1
(2)
4.2.2 RK- RK parameter represents the percentage of
elements ttheir rank in each column after distortion.
= =
=
a
i
b
j
i
j
Rk
b a
RK
1 1
*
1
(3)
Where 1
i
j
Rk =
if
i
j
G =
i
j G otherwise
0
i
j
Rk =
4.2.3 CP- CP parameter is used to measure how the rank
of the average value of each attributes varies after data
distortion. CP represents the change of rank of the
average value of the attributes. CP parameter is given by:
=
=
a
i
i i
ON ON
a
CP
1
1
(4)
Where
i
ON and
i ON represent the rank of the
average value of i
th
attribute before and after data
distortion respectively.
4.2.4 CK- Similar to RK, CK is used to measure the
percentage of the attributes that keep their rank of
average value after data distortion. CK parameter is given
by:
=
=
a
i
i
Ck
a
CK
1
1
(5)
Where
1
i
Ck = If
i
ON =
i ON
Otherwise 0
i
Ck =
4.3 Utility Measure
After the conduction of certain perturbation the data
utility measures indicate the accuracy of data mining
algorithms on distorted data. In this paper we choose the
accuracy of a J48 [5] as our data utility measure.
5. TANGENT HYPERBOLIC NORMALIZATION
(TANH)
Tanh normalization is a data transformation method.
Tanh normalization method maps the scores in the range
[-1, 1] in a non linear transformation. In this method the
values around the mean of the scores are transformed by a
linear mapping and a compression of the data is
performed for the high and low values of the scores. The
tanh Normalization is based on Hampel estimators [24]
and is given by
)
`
+ |
.
|
\
|
= 1 tanh
2
1
y
K Y
Tanh
(6)
Where and are, respectively, the mean and standard
deviation estimates, of the genuine score distribution
introduced by Hampel and K is a suitable constant. The
advantage of Tanh normalization is the suppression of the
effect of outliers, which is absorbed by the compression of
the extreme values.
6. DATASET
I am interesting to apply the proposed method on four
real life dataset obtained from the University of California
Irvine (UCI), Machine Learning Repository [6]. Datasets
are the Glass Identification, Habermans survival data,
Bupa Liver Disorders and Iris Dataset. The summaries of
the original database are given in Table 1.
Table 1: The summary of the database
7. THE PROPOSED METHOD
In the proposed method the original dataset is presented
by matrix of dimension m x n. The rows of the matrix
represent objects and the column of the matrix represent
attributes.
Now the original data matrix must be first transformed by
Tanh normalization to get transformed or distorted data
matrix whose size has the same size m x n as the original
data matrix. The Tanh normalization transformed each
element of the original data matrix into the specific range
between [-1, 1].
After applying the proposed Tanh normalization based
method on original data matrix each element of the
original data matrix has been now distorted into the
specific range between [-1, 1]. Now we obtained a
distorted data matrix which is similar to the original data
matrix, but not identical. More importantly the distorted
data matrix preserves the property of original data matrix.
Figure 1 Block Diagram of Privacy Preserving Method
S
N
Database
Number
of
Instances
Number
of
Features
Number
of
Classes
1 Iris 150 4 3
2
Glass
Identificatio
n
214 10 7
3
Bupa Liver
Disorders
345 6 2
4
Habermans
Survival
data
306 3 2
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 191
8. EXPERIMENT
I will conduct experiments to evaluate the performance of
data distortion method. I am interesting to choose four
real-life Databases obtained from the University Of
California Irvine (UCI), Machine Learning Repository
[6]. Datasets are the Glass Identification, Habermans
survival data, Bupa Liver Disorders and Iris Dataset. We
use WEKA (Waikato Environment for Knowledge
Analysis) [7] software to test the accuracy of distorted
method. The privacy parameters will measure by a
separate Java programme. I will construct the classifier
for J48 classification, and a 10-fold cross validation to
obtain the classification results.
9. CONCLUSION
In this paper we proposed Tanh (Tan Hyperbolic)
normalization based data distortion in privacy preserving
data mining system. In this method we distorted original
dataset before applying data mining application. I will
apply the data distortion method on four real life dataset.
The privacy parameters are used to calculate the degree of
privacy protection i.e. how much privacy will be preserve.
It is interesting to compare the proposed method with the
existing privacy preserving method and compare the
result with our proposed method .
References
[1] M. Chen, J . Han, and P. Yu, Data mining: An Overview
from a database Prospective, IEEE Trans. on Knowledge
and Data Engineering, vol. 8, no. 6, pp. 866-883, Dec.
1996.
[2] S. Xu, J . Zhang, D. Han, J . Wang, Data distortion for
privacy protection in a terrorist analysis system,
Proceeding of the IEEE International Conference on
Intelligence and Security Informatics, pp. 459-464, 2005.
[3] W. Frankes and R. Baeza-Yates, Information Retrieval:
Data Structures and Algorithms, PrenticeHall,
Englewood cliffs, NJ , 1992.
[4] J . Han, and M. Kamber, Data Mining: Concepts and
Techniques, Second edition, 2006, Morgan Kaufmann,
USA.
[5] Ian H. Witten, Eibe Frank,Data Mining Practical
Machine Learning Tools and Techniques, Second Edition,
2005.
[6] UCI Machine Learning Repository
http://archive.ics.uci.edu/ml/datasets.html
[7] The Weka Machine Learning Workbench.
http://www.cs.waikato.ac.nz/ml/weka
[8] J ie Wang, Weijun Zhong, J un Zhang, NNMF- Based
Factorization Techniques for High-Accuracy Privacy
Protection on Non-negative-valued Dataset, Proceeding of
IEEE Conference on Data Mining, International Workshop
on Privacy Aspects of Date Mining (PADM2006), pp.513-
517, 2006.
[9] T. Dalenius and S.P. Reiss, Data-Swapping: A Technique
for Disclosure Control, J ournal of Statistical Planning and
Inference, vol. 6, pp. 73-85, 1982.
[10] Saif M. A. Kabir, Amr M. Youssef, Ahmed K. Elhakeem,
On data distortion for privacy preserving data mining,
Proceedings of IEEE Conference on Electrical and
Computer Engineering (CCECE 2007), PP. 308-311, 2007.
[11] Shuting Xu, Shuhua Lai,Fast Fourier transform based
data perturbation method for privacy protection,
Proceeding of IEEE International Conference on
Intelligence and Security Informatics, pp. 221-224, 2007.
[12] Lian Liu, J ie Wang, J un Zhang,Wavelet-Based Data
Perturbation for Simultaneous Privacy-Preserving and
Statistics-Preserving, Proceeding of IEEE International
Conference on Data Mining Workshop, PP. 27-35, 2008.
[13] Zhenmin Lin, J ie Wang, Lian Liu, J un Zhang,
Generalized random rotation perturbation for vertically
partitioned data sets, Proceeding of the IEEE Symposium
on Computational Intelligence and Data Mining, pp:159-
162, 2009.
[14] Bo Peng, Xingyu Geng, J un Zhang, Combined data
distortion strategies for privacy-preserving data mining,
Proceeding of the IEEE International Conference on
Advanced Computer Theory and Engineering (1CACTE),
PP. V1-572 - V1-576, 2010.
[15] S. Xu, J . Zhang, D. Han and J . Wang, Singular value
decomposition based data distortion strategy for privacy
protection, ACM J ournal of Knowledge and Information
Systems, vol. 10, no. 3, pp. 383-397, 2006.
[16] Weimin Ouyang, Qinhua Huang, Privacy Preserving
Sequential Pattern Mining Based on Secure Multi-party
Computation, Proceedings of the 2006 IEEE International
Conference on Information Acquisition, pp. 149-154,
August 20 - 23, 2006.
[17] C. K. Liew, U. J . Choi, and C. J . Liew, A Data Distortion
by Probability Distribution, ACM Transaction on
Database Systems (TODS), vol. 10, no. 3, pp. 395-411,
Sep. 1985.
[18] R. Agrawal and R. Srikant, Privacy-preserving data
mining, Proceeding of the ACM SIGMOD Conference on
Management of Data, pp. 439450, May 2000.
[19] L. Sweeney, k-Anonymity: A Model for Protecting
Privacy, International J ournal on Uncertainty,Fuzziness,
and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-
570, 2002.
[20] K. Chen and L. Liu, Privacy preserving data classification
with rotation perturbation, Proceeding of the 5th IEEE
International Conference on Data Mining (ICDM 2005),
pp. 589-592, 2005.
[21] Yingjie Wu, Zhihui Sun, Xiaodong Wang, Privacy
Preserving k-Anonymity for Re-publication of Incremental
Datasets, 2009 World Congress on Computer Science and
Information Engineering, pp. 53 60,2009
[22] Chieh-Ming Wu, Yin-Fu Huang and J ian-Ying Chen,
Privacy Preserving Association Rules by Using Greedy
Approach, World Congress on Computer Science and
Information Engineering, pp 61-65, 2009
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 192
[23] Bo Peng, Xingyu Geng, J un Zhang, Combined data
distortion strategies for privacy-preserving data mining,
Proceeding of the IEEE International Conference on
Advanced Computer Theory and Engineering (1CACTE),
pp. V1-572 - V1-576, 2010.
[24] J ahan, Narsimha and Rao, Data perturbation and feature
selection in preserving privacy Ninth International
Conference on Wireless and Optical Communications
Networks (WOCN), pp 1-6, 2012
[25] Pui K. Fong and J ens H. Weber-J ahnke, Senior Member ,
Privacy Preserving Decision Tree Learning Using
Unrealized Data Sets , Proceeding of the IEEE
Transactions On Knowledge And Data Engineering,
VOL. 24, NO. 2, pp. 353 -364, FEB 2012
[26] Xuyun Zhang, Laurence T. Yang, Senior Member, A
Scalable Two-Phase Top-Down Specialization Approach
for Data Anonymization using MapReduce on Cloud ,
Proceeding of the IEEE Transactions On Parallel And
Distributed Systems, TPDSSI-2012-09-0909 , pp. 1-11,
2013
[27] Sara Hajian and J osep Domingo-Ferrer, A Methodology
for Direct and Indirect Discrimination Prevention in Data
Mining, Proceeding of the IEEE Transactions on
Knowledge and Data Engineering, VOL. 25, NO. 7, pp.
1445 -1459, J ULY 2013
AUTHOR
Mr. Santosh Kumar Bhandare presently working
as Asst. Professor in Department of Computer
Science & Information Technology at Shri Dadaji
Institute of Technology & Science Khandwa
(M.P.) 450001 India. The degree of B.E. secured
in Computer Science & Engineering from Madhav Institute of
Technology & Science Gwalior in 2006, M.Tech in Computer
Science & Engineering from Samrat Ashok Technological
Institute Vidisha in 2011. Research Interest includes Data
Mining, Network Security and Cloud Computing