
Preservation of the Privacy of Data on Cloud by Using Key-Map-Based Data Anonymization


L. Balamurali¹ & Dr. N. Jayaveeran²
¹ Research Scholar, PG & Research Department of Computer Science, KhadirMohideen College, Adirampattinam.
² Research Advisor
¹ muralikmccsphd@gmail.com
² njveeran@gmail.com

Abstract: Cloud Computing is an emerging technology that provides a vast number of services to
users from remote server machines at low cost. Among these services, data storage is the main
issue being addressed by researchers in the field of Cloud Computing. Various problems have arisen
involving the security and privacy of stored data. Data leakage is a serious threat to data privacy
and a source of concern for all cloud users who store their data on the cloud. Leading cloud
services such as Google and Amazon, as well as social network sites, implement several techniques
to protect cloud storage from leakage. This paper implements a framework for protecting against
data leakage with the help of a Key-Mapped Data Anonymization method.

Keywords: Cloud Computing, Data Leakage, Anonymization, Privacy, de-identification.

1. INTRODUCTION
Cloud Computing is a growing field that provides a vast number of services to
users through the Internet. Three types of services are offered by the Cloud Computing
environment, namely Platform as a Service (PaaS), Infrastructure as a Service (IaaS)
and Software as a Service (SaaS). Among these services, data storage is the main issue
being addressed by researchers in the field of Cloud Computing. Data centralization and
outsourcing have become a trend due to the advent of the Cloud Computing environment [1].
The Cloud Computing environment offers on-demand services to users and allows them to
keep their important data on remote servers. It also allows organizations such as
companies and hospitals to store their data on remote storage for research and
public processing [2]. Advances in hardware, software and infrastructure over the
past years have turned the Cloud Computing environment into a flourishing industry [3].
A key benefit of Cloud Computing for users is the ability to store and process data
online, on the fly, without any substantial investment [4]. The Cloud Computing
environment extends traditional distributed processing to offer information processing
and resources as a paid service over the Internet [5]. Even though Cloud Computing
reduces burdens such as scalability, resources and costs, there is great concern about
data privacy, arising mostly from data leakage. Around 88% of users worry about their
data on the cloud and demand more privacy in today's Cloud Computing environment [6].
The data stored on cloud storage contains sensitive information, called Personally
Identifiable Information (PII), which is easily disclosed through special identifiers
called quasi-identifiers [7]. Quasi-identifiers are attributes that can be used jointly
to identify personal information from various sources [2]. Several techniques have
been adopted to protect the privacy of data on the cloud and to prevent data leakage.
Volume 8, Issue 3, 2020 http://aegaeum.com/ Page No: 1251
AEGAEUM JOURNAL ISSN NO: 0776-3808

The present paper constructs a framework for protecting the privacy of data on the
cloud and thereby preventing data leakage from the cloud. The proposed framework
uses a well-known data anonymization technique called data pseudonymization, which
generates fake data for each original data item along with a mapping key that
preserves the original information.
Section 2 of the present paper discusses the various data anonymization
techniques and Section 3 covers the related work relevant to the present study. Section
4 discusses the proposed methodology and Section 5 discusses the
results obtained from the constructed framework. The last section concludes
the present study with future enhancements.

2. DATA ANONYMIZATION TECHNIQUES


Several data anonymization techniques, each with its own set of characteristics, have
been adopted and applied to protect the privacy of data. All of these techniques
share the aim of hiding the PII. Leading Cloud Computing environments
such as Amazon, Google and Yahoo use either one or a mixture of these
techniques to give reliable cloud storage to their users [8].
2.1 De-identification
De-identification is the process used to prevent PII from being connected with
personal information. De-identification is also referred to as Data Anonymization when
the techniques are applied to general data [8].
2.2 Masking
Masking, or suppression, is the most widely used anonymization technique in the field
of business, in which the important part of the data is hidden with a pattern of
characters. Masking allows the data to be identified without manipulating the actual data [8].
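As an illustration of masking, the following minimal sketch hides all but the last few characters of a value. The function name `mask_value`, its parameters and the sample value are hypothetical and are not part of the framework described in this paper.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide all but the last `visible` characters of `value`."""
    if visible <= 0:
        # Nothing may stay visible: mask the whole value.
        return mask_char * len(value)
    # Number of leading characters to hide (never negative for short values).
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


print(mask_value("9876543210"))  # ******3210
```

The masked value keeps its length and its trailing characters, so a record can still be recognized by its owner while the important part stays hidden.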
2.3 Generalization
Generalization is an anonymization technique that replaces individual values with
broader categories of values. Instead of a single value, generalization defines a
range in order to protect and hide the PII [8].
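Generalization of a numeric attribute can be sketched as follows. The bucket width of 10 and the function name `generalize_age` are assumptions chosen for illustration only.

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the range (bucket) that contains it."""
    if age < 0 or width <= 0:
        raise ValueError("age must be non-negative and width positive")
    low = (age // width) * width  # lower bound of the bucket
    return f"{low}-{low + width - 1}"


ages = [39, 50, 38, 53, 28]
print([generalize_age(a) for a in ages])
# ['30-39', '50-59', '30-39', '50-59', '20-29']
```

A wider bucket hides more detail but also loses more information, so the width is a trade-off between privacy and usefulness.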
2.4 K-Anonymization
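K-Anonymization combines generalization and masking in order to protect the
privacy of data held in data storage [8]. The goal is that each record becomes
indistinguishable from at least k-1 other records with respect to its quasi-identifiers.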
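Whether a table satisfies k-anonymity can be checked by counting how often each combination of quasi-identifier values occurs: a table is k-anonymous when every combination appears at least k times. The sketch below assumes the rows are plain dictionaries; the column names and sample rows are made up for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


rows = [
    {"age": "30-39", "sex": "Male"},
    {"age": "30-39", "sex": "Male"},
    {"age": "50-59", "sex": "Female"},
    {"age": "50-59", "sex": "Female"},
]
print(is_k_anonymous(rows, ["age", "sex"], k=2))  # True
print(is_k_anonymous(rows, ["age", "sex"], k=3))  # False
```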
2.5 Scrambling
Scrambling is an anonymization technique involving a random mixing or obfuscation
of characters [8].
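A minimal sketch of scrambling is given below, assuming that scrambling simply means placing the characters of a value in a random order. The function name `scramble` is hypothetical, and the fixed seed is used only to make the example repeatable.

```python
import random


def scramble(value: str, rng: random.Random) -> str:
    """Return the characters of `value` in a random order."""
    chars = list(value)
    rng.shuffle(chars)  # in-place random permutation of the characters
    return "".join(chars)


rng = random.Random(42)  # fixed seed only to make the example repeatable
scrambled = scramble("Adirampattinam", rng)
print(scrambled)
```

Because the scrambled value still contains exactly the same characters as the original, short or distinctive values may remain guessable, which is one reason scrambling is usually combined with other techniques.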
2.6 Pseudonymization
Pseudonymization is a widely used technique on data storage, including the Cloud
Computing environment, in which the data can no longer be attributed to a specific subject
without the use of additional information. The technique separates the de-identified data
from the other information and allows handlers to use personal data freely without the threat
of infringing on the rights of the data subjects [8]. In practice, effective pseudonymization
is deployed by generating fake data for the original data together with a mapping of keys.
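The idea of fake data with a key mapping can be sketched as follows. This is an illustrative assumption about how such a mapping might be built, not the implementation used in this paper; the function name `pseudonymize` and the sample names are hypothetical, and random hexadecimal tokens stand in for the fake data.

```python
import secrets


def pseudonymize(values):
    """Replace each distinct value with a random token and return the key map."""
    token_for = {}  # original -> token, so repeated values share one token
    key_map = {}    # token -> original; kept private, never uploaded
    pseudonymized = []
    for value in values:
        if value not in token_for:
            token = secrets.token_hex(8)
            while token in key_map:  # guard against an unlikely collision
                token = secrets.token_hex(8)
            token_for[value] = token
            key_map[token] = value
        pseudonymized.append(token_for[value])
    return pseudonymized, key_map


names = ["Alice", "Bob", "Alice"]
fake, key_map = pseudonymize(names)
print(fake)                          # random tokens; first and third are equal
print([key_map[t] for t in fake])    # ['Alice', 'Bob', 'Alice']
```

Only the list of tokens would be stored on the cloud; without the key map, a token cannot be attributed to the original value.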
Apart from the techniques described above, tokenization, encoding, blurring and hashing are
also used in the anonymization process to de-identify the personal data that could be the
major cause of data leakage on the cloud. Taken together, the anonymization techniques remove
the PII entirely from the original data, which can make the data more difficult to restore for
processing on the client side of the Cloud Computing environment. A proper mapping
should therefore be implemented in order to protect the privacy of data on cloud
storage.
The overall features of the anonymization techniques are summarized in Table 1.
Table 1. Data Anonymization Techniques

S.No.  Name                  Description
1      De-Identification     Prevents personal identities from being linked to the information
2      Masking/Suppression   Hides the information with a pattern of characters
3      Generalization        Replaces individual values with a range of values
4      K-Anonymization       Mixes both generalization and masking techniques
5      Scrambling            Randomly mixes or obfuscates characters
6      Pseudonymization      Splits the de-identified data from other meaningful information

3. RELATED WORKS
Protecting against data leakage on the cloud is a major problem, and several
works have been carried out on protecting privacy and preventing data leakage.
Renu Sara George and Sabitha proposed a method that performs the anonymization and
de-anonymization processes in a secured enclave, thereby reducing the computing
power required on the client’s side [1].
G. Ateniese et al. devised a Provable Data Possession (PDP) framework model
with the help of RSA-based homomorphic tags [9]. Gabriel Ghinita et al. proposed an
effective framework for privacy preservation [2]. Samarati and Sweeney
proposed the k-anonymity model for hiding the PII in the original data
[10]. The Proofs of Retrievability (POR) method was formulated by A. Juels and
B. S. Kaliski; it encodes files and inserts a set of random blocks called
sentinels [11].
In another work, G. Ateniese et al. formulated a highly efficient and secure PDP
method based purely on symmetric cryptography [12]. Q. Wang et al. constructed a
new approach that permits a Third Party Auditor (TPA) to verify the integrity of
dynamic data on cloud storage [13]. A remote integrity checking protocol with
public verifiability was proposed by Z. Hao et al. [14]. Data privacy is the greatest
concern of Cloud Computing users [6]. A decoupling process was carried
out by J.-S. Xu et al. for securing data on the cloud [15]. Safwan Mahmud Khan and
Kevin W. Hamlen proposed decoupling the data from its metadata in order to
maintain data privacy and thereby prevent data leakage [3].

An effective method called Tarzan was proposed by M. J. Freedman and R. Morris,
which protects data against timing attacks by generating artificial
cover traffic that masks timing patterns with noise [16]. Intel IT carries out data
anonymization by removing or encrypting the PII in the original data [17]. A
work that handles the data as ciphertext without any decryption was
proposed by Q. Liu et al. [18].
All of the related works relevant to the present study concentrate on protecting
the privacy of data on the cloud with different techniques and methods. The present
study uses pseudonymization to produce an equivalent fake value for every original
data item, and maintains the pseudonymized data on the cloud with a key mapping
back to the original in order to preserve its meaning.

4. METHODOLOGY
The present work proposes a framework in which the entire data is
pseudonymized with fake data, along with a mapped key for retaining the original
information. The pseudonymized data is stored on the cloud with fake details, from
which the original information cannot be recovered in case the data is
leaked. The overall approach is depicted in Figure 1.

Select Original Data

Identify PII and clean all PII

Generate Pseudonymized Data

Map the Pseudonymized data with original by using a key

Upload the Pseudonymized data on Cloud Storage

Figure 1. Overall Methodology

The first step of the proposed methodology shown in Figure 1 is the selection of the
original data to be stored on cloud storage. The second
step is the identification of PII and the removal of all PII from the data, keeping only
the general data, which does not give any meaningful information on its own. The third step
is the generation of pseudonymized (fake) data for every original data item,
and the next step is mapping the pseudonymized data to the original data by using a key.
After pseudonymization, the converted data is stored on cloud storage,
while the mapping is kept privately on the client’s side.
The present work used the annual income of adults dataset collected from Kaggle, which
contains 15 columns and 48,843 rows [19]. A sample of the data from the adults dataset
is shown in Figure 2.


Figure 2. Sample data from adults.csv dataset


The dataset shown in Figure 2 contains some PII columns that compromise the privacy of
the entity in each row. The PII columns considered are work-class, education,
marital-status and occupation, among others, which appear to threaten the privacy of the
entities. If the data were leaked without authorization, one could easily detect the personal
information of each adult represented in the dataset.
The methodology shown in Figure 1 is developed into an algorithm called
Map-Pseudonymize, whose detailed steps are shown in Table 2.
Table 2. Map-Pseudonymize Algorithm

Algorithm Map-Pseudonymize (Data)

Step 1: Select the dataset
Step 2: Identify the Personally Identifiable Information (PII) in the columns
Step 3: Clean the PII from the data
Step 4: Generate the pseudonymized data
Step 5: Map the pseudonymized data to the original dataset with a key
Step 6: Upload the pseudonymized data to cloud storage
Step 7: Stop
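The steps of Table 2 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: rows are assumed to be dictionaries, the function name `map_pseudonymize`, the six-digit range of the fake keys and the sample rows are made up, and the upload step is deliberately omitted. The fnlwgt column is used as the key column because Section 5.2 keeps it as the key.

```python
import secrets


def map_pseudonymize(rows, pii_columns, key_column):
    """Steps 2-5 of Table 2: drop PII columns, fake the key, build the key map."""
    key_map = {}     # fake key -> original key; stays on the client side
    cloud_rows = []  # pseudonymized rows intended for cloud storage
    for row in rows:
        # Steps 2-3: remove the columns identified as PII.
        cleaned = {c: v for c, v in row.items() if c not in pii_columns}
        # Step 4: generate a six-digit fake key that has not been used yet.
        fake = secrets.randbelow(900_000) + 100_000
        while fake in key_map:
            fake = secrets.randbelow(900_000) + 100_000
        # Step 5: remember which original key the fake value stands for.
        key_map[fake] = row[key_column]
        cleaned[key_column] = fake
        cloud_rows.append(cleaned)
    # Step 6 (upload) is left out: only cloud_rows would be sent to the cloud.
    return cloud_rows, key_map


# Made-up sample rows; the values are not taken from the dataset.
rows = [
    {"age": 39, "workclass": "Private", "fnlwgt": 111111, "education": "HS-grad"},
    {"age": 50, "workclass": "State-gov", "fnlwgt": 222222, "education": "Masters"},
]
cloud_rows, key_map = map_pseudonymize(rows, {"workclass", "education"}, "fnlwgt")
print(cloud_rows)  # PII columns removed, fnlwgt replaced by fake values
print(key_map)     # fake fnlwgt -> original fnlwgt, kept offline
```

The six-digit range limits this sketch to fewer than 900,000 rows, which is sufficient for a dataset of the size used here.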

5. RESULTS AND DISCUSSIONS


The algorithm given in Table 2 is applied to the adults.csv dataset, and the
following observations are made regarding effective privacy protection and the preservation
of the meaning of the data after pseudonymization. The overall work is explained step by step
as follows.
5.1 Removal of PII
The identified PII columns of the adults.csv dataset in Figure 2 are removed first, and
the result is shown in Figure 3.

Figure 3. Data after removing PII columns

The de-identified data shown in Figure 3 contains columns that do not
reveal any personal information to readers. The age column in Figure 3 still indirectly
reveals a little information, so it is generalized for further anonymization.
The data after age generalization is shown in Figure 4.


Figure 4. Data after the generalization of age column

5.2 Generate Pseudonymized data and map with key on original data
The data shown in Figure 4 is already in anonymized form but still contains a few
columns that should be subjected to anonymization. For example, fnlwgt represents the final
weight of each row and may make it possible to single out individual items. If that column
were removed, however, no column would be left in the dataset to preserve the identity of each
row after the removal of the PII elements.
The fnlwgt column is therefore not removed but hidden, serving as a key that protects and
preserves the meaning of each row in the dataset. The fnlwgt column is kept as the key and
subjected to pseudonymization, that is, the generation of equivalent fake values. The entire
data after pseudonymization is shown in Figure 5.

Figure 5. Data after Pseudonymization of fnlwgt column

The fnlwgt column after pseudonymization contains fake values, as shown in Figure 5.
A corresponding mapping to the original values of fnlwgt is also created to
preserve the original information of each row. The mapping of the original fnlwgt values to
the fake values is shown in Figure 6.

Figure 6. Mapping of original fnlwgt key with the faked values

After full anonymization with the help of pseudonymization (faking), the data
does not reveal any sensitive or private information, because all columns
contain only general information. The mapping is kept offline in order to preserve
the original information and meaning of the data after anonymization.
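How the offline mapping restores the original information can be sketched as follows. The function name `restore_keys`, the layout of the key map (fake value to original value) and all sample values are assumptions made for illustration.

```python
def restore_keys(cloud_rows, key_map, key_column="fnlwgt"):
    """Return copies of the rows with the fake key replaced by the original."""
    restored = []
    for row in cloud_rows:
        fake = row[key_column]
        if fake not in key_map:
            raise KeyError(f"no mapping stored for pseudonym {fake!r}")
        # Copy the row so the downloaded data itself is left unchanged.
        restored.append({**row, key_column: key_map[fake]})
    return restored


key_map = {900001: 111111, 900002: 222222}  # fake -> original, kept offline
cloud_rows = [
    {"age": "30-39", "fnlwgt": 900001},
    {"age": "50-59", "fnlwgt": 900002},
]
print(restore_keys(cloud_rows, key_map))
# [{'age': '30-39', 'fnlwgt': 111111}, {'age': '50-59', 'fnlwgt': 222222}]
```

Anyone who obtains only the cloud rows sees the fake keys; the original keys can be recovered only where the offline mapping is available.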

6. CONCLUSIONS
The present work proposed a framework for protecting the privacy of
data in the event of data leakage from cloud storage. The work showed that
removing all PII and converting columns into ranges hides all sensitive information from
readers in case the data is leaked from the cloud. The present work
preserves information from the original data offline by maintaining a separate data
source. If a larger number of keys must be maintained to preserve the meaning,
the offline side needs more memory. Anonymizing the data for cloud storage
without maintaining any offline supporting resource is beyond the scope of the present
work.

REFERENCES
[1] Renu Sara George and Sabitha S, “Data Anonymization and Integrity checking in
Cloud Computing”, IEEE, 2013.
[2] Gabriel Ghinita et al., “Fast Data Anonymization with low information loss”, ACM
library, 2007, pp. 758-769.
[3] Safwan Mahmud Khan and Kevin W. Hamlen, “AnonymousCloud: A Data
Ownership Privacy Provider Framework in Cloud Computing”, IEEE, 2012, pp. 170-176.
[4] M. Armbrust et al., “A view of cloud computing,” Communications of the ACM
(CACM), vol. 53, no. 4, pp. 50–58, 2010.
[5] Amazon, “Amazon elastic compute cloud (Amazon EC2),”
http://aws.amazon.com/ec2, 2012.
[6] Fujitsu Research Institute, “Personal data in the cloud: A global survey of
consumer attitudes,” http://www.fujitsu.com/global/news/publications/dataprivacy.html,
October 2010.
[7] A. Froomkin, “The Death of Privacy”, Stanford Law Review, 52(5):1461–1543, 2000.
[8] Ruslan Korniichuk, “Easy-to-use GDPR guide for Data Scientist. Part 2/2”,
https://medium.com/@korniichuk/gdpr-guide-2-7c399b44ba3, 2017.
[9] G. Ateniese et al., “Provable data possession at untrusted stores”, Proceedings of
the 14th ACM conference on Computer and communications security, CCS’07,
New York, USA, ACM, 2007, pp. 598-609.
[10] P. Samarati and L. Sweeney, “Generalizing Data to Provide Anonymity when
Disclosing Information (abstract)”, PODS, 1998.
[11] Juels A and Kaliski BS, “PORs: proofs of retrievability for large files”,
Cryptology ePrint archive, June 2007, Report 2007/243.
[12] G. Ateniese et al., “Scalable and efficient provable data possession”, Proceedings
of the 4th international conference on Security and privacy in communication
networks, Istanbul, Turkey, 2008.
[13] Q. Wang et al., “Enabling Public Auditability and Data Dynamics for Storage
Security in Cloud Computing”, IEEE Transactions On Parallel And Distributed
Systems, Vol. 22, No. 5, May 2011.
[14] Z. Hao et al., “A Privacy-Preserving Remote Data Integrity Checking Protocol with
Data Dynamics and Public Verifiability”, IEEE Transactions On Knowledge And
Data Engineering, Vol. 23, No. 9, September 2011.

[15] J.-S. Xu et al., “Secure document service for cloud computing,” in Proceedings of
the 1st International Conference on Cloud Computing (CloudCom), 2009, pp. 541–546.
[16] M. J. Freedman and R. Morris, “Tarzan: A peer-to-peer anonymizing network
layer,” in Proceedings of the 9th ACM Conference on Computer and
Communications Security (CCS), 2002, pp. 193–206.
[17] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy”, International Journal
on Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570,
2002.
[18] Q. Liu et al., “Secure and privacy preserving keyword searching for cloud
storage services,” Journal of Network and Computer Applications (JNCA), vol. 35,
no. 3, pp. 927–933, 2012.
[19] adults.csv, https://www.kaggle.com/wenruliu/adult-income-dataset, 2017.
