You are on page 1of 12

COPY RIGHT

2021 IJIEMR.Personal use of this material is permitted. Permission from IJIEMR must

be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new
collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted
component of this work in other works. No Reprint should be done to this paper, all copy
right is authenticated to Paper Authors
IJIEMR Transactions, online available on 20th Jan 2021. Link
:http://www.ijiemr.org/downloads.php?vol=Volume-10&issue=ISSUE-01

DOI: 10.48047/IJIEMR/V10/I01/32
Title: Privacy preservation for Abstracting Anonymization Techniques using
Generalization Algorithm
Volume 10, Issue 01, Pages: 157-161.
Paper Authors
B.M. Iran, Dr. K.Bhavana Raj, Deepu J.J.Lazarus

USE THIS BARCODE TO ACCESS YOUR ONLINE PAPER

To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic


Bar Code

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 www.ijiemr.org


Privacy preservation for Abstracting Anonymization
Techniques using Generalization Algorithm
1
B.M. Iran, 2Dr. K.Bhavana Raj, 3Deepu J.J.Lazarus
1
Research Scholar, Department of Computer Science & Technology, Sri Krishnadevaraya
University, Anantapuramu, A.P. India.
2
Assistant Professor, Institute of Public Enterprise, Shamirpet, Hyderabad.
3
Research Scholar, Department of Computer Science & Technology, Sri Krishnadevaraya
University, Anantapuramu, A.P. India.
bhavana_raj_83@yahoo.com

Abstract— The ongoing improvements around open information featured the significant issue of
anonymization with regards to information distributing. Many research endeavors were given to the
meaning of strategies performing such an anonymization. Anyway the determination of the most
applicable method and the satisfactory calculation is perplexing. Effective choice relies right off the
bat upon the capacity of information distributers to comprehend the anonymization systems and their
related calculations. In this paper, we center around the decision of a calculation among the various
ones executing one of the anonymization systems, to be specific speculation. Through a reflection
procedure displayed in this paper, we give information distributers improved depictions for the
speculation system and its calculations. These depictions encourage the comprehension of the
calculations by information distributers having low programming aptitudes. We present additionally
some other use instances of these deliberations just as an experimentation directed to approve them.
Keywords : Anonymization, Programming aptitude.

INTRODUCTION
The open information activity is an open door prompting execution assessment. In addition, a
for making an immense measure of data few reviews proposed in the writing are
available to end clients. In any case, information utilization arranged. They for the most part
distributers are confronting the issue of break down various anonymization procedures
discharging helpful information without trading featuring their focal points and downsides so as
off protection. The eventual fate of their to propose explore headings. Others are method
business relies upon their capacity both to offer situated. Their correlations of calculations are
helpful information to open, to examiners, to for the most part devoted to scientists wishing
scientists, and so on and to pick up the trust of to take a shot at information anonymization and
information proprietors. The last requires in this manner are not open to information
executing forms that forestall the abuse of their distributers having commonly low aptitudes in
delicate data. Thusly, information on information anonymization. Moreover, existing
anonymization methods gets essential for devices are obscure. Regardless of whether they
information distributers. In specific, through propose a few procedures, they execute, more
this information, they should pick the suitable often than not, just a single calculation for every
calculation given their unique circumstance. strategy without portraying it. Additionally, the
vast majority of these apparatuses don't give
Scrambling calculations are various. They are direction to the selection of calculations or
commonly portrayed gratitude to their launch in procedures.
a given setting. Papers portraying these
calculations center around the experimentations

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 157


Notwithstanding, this assistance comprises in A sensitive attribute (SA) represents data that
exhibiting an assessment of the anonymized individuals don’t want to divulgate, such as
informational collection coming about because medicalinformation. Non-sensitive attributes
of the client's unique informational collection. (NSA) are all attributes that are not included in
On the off chance that the client is unsatisfied, previous categories. Forinstance, at Fig. 1a
the instrument offers him/her the likelihood to representing the original data set to be
change the information parameters of the anonymized, the attributes “Age” and
calculation. In this way, its utilization requires a “Education” mayconstitute a QI. The attribute
lot of aptitude. At long last, the extent that we “Disease” is a sensitive attribute (SA).
know, there exists neither one of the
knowledges bases where information Explicit Quasi Identifier Sensitive
distributers could look for the missing data nor identifier attribute
Name Age Education Disease
ways to deal with direct them in the Ajay 20 10th Flu
anonymization procedure. Our exploration Chand 20 9th Flu
questions are: How can an information Ram 28 9th Diabetes
Krishna 30 9th Cancer
distributer pick an anonymization system and,
Sudhir 23 11th Cancer
in the arrangement of calculations actualizing Basha 23 11th Cancer
this procedure, an anonymization calculation? (a)
An initial move toward responding to these
inquiries is to furnish information distributers Age Education Disease
with improved depictions of the methods and [19,23] Junior Diabetes
the calculations, permitting them to get them, [19,23] Junior Cancer
even without the necessary programming [27,30] Junior Flu
abilities. We are persuaded that this progression [27,30] Junior Flu
is important for basic leadership. These [19,23] Senior Cancer
disentangled depictions are acquired through a [19,23] Senior Cancer
reflection procedure comprising in removing (b)
and formalizing information inserted in the
writing so as to make it accessible through data
assets. Confronting the wealth of the writing, in
this paper, we focus our exertion on the
speculation system for relational databases.
2 Preliminaries
Privacy is one of the major concerns when
publishing or sharing data. It refers to different (c)
forms of disclosureregarding the type of
published or shared content. Identity disclosure,
for instance, can occur when publishing
orsharing personnel data. In our research, we
focus on microdata (atomic data elements
describing the individualobjects) contained in
relational databases. Each tuple has a value
(microdata) for each relational attribute. The
(d)
lattercan be an explicit identifier, a quasi- Fig. 1. (a) Original data; (b) generalized data; (c) &
identifier, a sensitive attribute or a non-sensitive (d) generalization hierarchies
one. An explicit identifier (EI) A quasi-
identifier (QI) is an attribute set which, when Companies may implement specific techniques
linked to external information, enables the re- to protect their data from disclosure risk. Most
identificationof individuals whose identifiers of them are knownas privacy preserving data
were removed ({sex, zip code, and birthdate} is publishing (PPDP) or mining (PPDM)
a well-known quasi-identifier inmany data sets). techniques. To our knowledge, the mostfamiliar

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 158


techniques for microdata anonymization are: From a process point of view, we can notice
data swapping,adding noise, micro- that some algorithms are completely automatic.
aggregationandgeneralization. In this paper, we Moreover, some ofthem are bottom up
focus on the generalization technique and its processes i.e. small groups of tuples are
algorithms. The generalization isapplied on a constituted and then iteratively merged until
QI. Each QI attribute could be either continuous eachgroup contains at least k rows (k-anonymity
or categorical. A continuous attribute is satisfaction). Others are top down processesi.e.
numerical andmay take an infinite number of they start from agroup containing all rows and
different real values (e.g. “Age” in Fig. 1a). A iteratively split each group into two subgroups
categorical attribute takes a value in alimited set while preserving k-anonymity.Regarding the
and arithmetic operations on it do not make outputs, some algorithms compute an optimal k-
sense (e.g. “Education” in Fig. 1a). anonymity solution but they are limited to small
Generalization techniquerequires the definition datasets[20]. Others are based on heuristics and
of a hierarchy for each attribute of the QI. Each thus do not guarantee the optimality. Finally,
hierarchy contains at least two levels. Theroot is the algorithms perform threedifferent
the most general value. It represents the highest generalizations that we define as: full-domain,
level. The leaves correspond to the original data sub-tree and multidimensional generalization.
values andconstitute the lowest level. As an Full-domainmeans that, for a given QI column,
example, the tree at Fig. 1c represents a all the values in the output table belong to the
generalization hierarchy of the same level of the generalizationhierarchy. Sub-
attribute“Education”. The node “Junior” is at tree means that values sharing the same direct
the level 1 of the Education hierarchy. To avoid parent in the hierarchy are necessarily
possible re-identification ofindividuals, several generalized atthe same level, taking the value of
privacy models have been proposed: k- one of their common ancestors. Finally, in
anonymity, l-diversity, t-closeness, etc. In this multidimensional generalizations, twoidentical
paper, weplace special emphasis on k- values in the original table may lead to different
anonymity since all generalization algorithms generalized values (i.e. are not always
are based on it. Let k be an integer. Atable generalized at thesame level). In terms of usage
satisfies k-anonymity if each release of data is scenario, let us note that bottom up
such that every combination of values of QI can generalization, top down specialization
be indistinctlymatched to at least k andInfoGain Mondrian produce data for
individuals11. As an example, the table in Fig. classification tasks. LSD Mondrian is used
1b is a generalization of the original table of when regression must be performedon data.
Fig.1a satisfying 2-anonymity, regarding (Age,
Education) QI.The generalization technique is 3 Abstraction approach
implemented thanks to several different We aim at providing data publishers with a deep
algorithms. The best known are:-argus, Datafly, knowledge of generalization algorithms’
Samarati’s algorithm, Incognito, Bottom Up behavior. Hence, weperformed an abstraction
Generalization, Top Down process of all these algorithms, allowing us to
Specialization,Median Mondrian, Infogain map them into a common frame. Ashighlighted
Mondrian and LSD Mondrian. In our previous by Wing: “The abstraction process introduces
work, we have compared thesealgorithms in layers. In computer science, we work
terms of process models, having four main simultaneouslywith at least two, usually more,
constituents: pre-requisites, inputs, process layers of abstraction: the layer of interest and
logic, andoutputs. Regarding the inputs, all the layer below; or the layer of interestand the
generalization algorithms require at least: (i) to layer above”. In our case, using an example
set the value of k (corresponding tok- artifact of the algorithm (layer below), we
anonymity), (ii) to declare which columns defined anabstraction of the algorithm artifact
constitute the QI, and (iii) to provide the (layer of interest). Moreover, in software and
generalization hierarchies. information systems engineering,the
motivations for abstraction are manifold. In our

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 159


research, two main reasons strengthen our number of distinct values, (b) and checks
choices. First, throughthis abstraction process, whetherthe resulting table complies with the k-
we aim to describe the different steps of the anonymity. If the number of tuples which do not
generalization algorithms as simply as satisfy k-anonymity is equalor lower than k,
possibleto facilitate their appropriation and then these tuples are removed and the algorithm
adoption. Second, we wish to highlight the stops. Otherwise, the algorithm performs
requirements of each algorithm inorder to anotheriteration of generalization. In the
introduce them as meta-data. Taking into description of Fig. 2a, let PT represent the table
account these two motivations, we built an to be anonymized, k the kanonymityconstraint
abstraction byparameterization. As defined by threshold, DGHAi the generalization hierarchy
Navrat et al., “Abstraction by parameterization of the attribute Ai, and MGT the resulting Fig.
extracts an essential core of somecomputational 2(a) focuses more on the implementation of
elements and reifies them as a named element of Datafly than on its functioning principle.
its own, leaving parameters to be filled in when
theabstraction is instantiated”. Moreover, in
INPUT: Private Table 𝑃𝑇; quasi-identifier 𝑄𝐼 = (𝐴𝑖, … . , 𝐴𝑛), 𝑘 constraint; hierarchies 𝐷𝐺𝐻𝐴𝑖 , where
order to reach an overall understanding of these 𝑖 = 1, … , 𝑛.

algorithms, we conductedan inductive process, Output: 𝑀𝐺𝑇, a generalization of 𝑃𝑇𝑄𝐼 with respect to 𝑘
Assumes: 𝑃𝑇 ≥ 𝑘
also called generalization in Navrat et al that Method:
presents all these algorithms using a 1. 𝒇𝒓𝒆𝒒 ← a frequency list contains distinct sequences of values of 𝑷𝑻[𝑸𝑰], along with the number
of occurrences of each sequence.
commondescription. These two sub-processes 2. 𝑾𝒉𝒊𝒍𝒆 𝒕𝒉𝒆𝒓𝒆 𝒆𝒙𝒊𝒔𝒕𝒔 sequences in 𝑓𝑟𝑒𝑞 occurring less than 𝑘 times that account for more than
𝑘 tuples 𝒅𝒐
are described in the following paragraphs. 2.1. 𝑳𝒆𝒕 𝐴i be attribute in 𝑓𝑟𝑒𝑞 having the most number of distinct values
2.2. 𝒇𝒓𝒆𝒒 ← generalize the values of 𝐴i in 𝒇𝒓𝒆𝒒
3. 𝒇𝒓𝒆𝒒 ← suppress sequences in 𝑓𝑟𝑒𝑞 occurring less than 𝑘 times.
4. 𝒇𝒓𝒆𝒒 ← enforce 𝑘 requirement on suppressed tuples in 𝒇𝒓𝒆𝒒
3.1. Abstraction by parametrization of the 5. 𝑅𝑒𝑡𝑢𝑟𝑛 𝑀𝐺𝑇 ← construct table from 𝒇𝒓𝒆𝒒

generalization algorithms
In most papers, the generalization algorithms (a)
are presented in such a way that they can be
directly translated intoa program. Usually, they 1. 𝑨𝒍𝒈𝒐𝒓𝒊𝒕𝒉𝒎 𝑻𝑫𝑺
2. Initialize every value in 𝑻 to the top most value.
are usually partially instantiated using an 3. Initialize Cut i to include the top most value.
example of a table to be anonymized. Their 4. 𝒘𝒉𝒊𝒍𝒆 some 𝑥 ∈ ⋃Cut i is valid and beneficial 𝒅𝒐
basicprinciples are textually described. 5. Find the Best specialization from ⋃Cut i .
6. Perform Best on 𝑻 and update ⋃Cut i .
Therefore they are dedicated to computer 7. Update 𝑆𝑐𝑜𝑟𝑒(𝑥) and validity for 𝑥 ∈ ⋃Cut i .
scientists or to professionals having 8. 𝒆𝒏𝒅 𝒘𝒉𝒊𝒍𝒆
aprogramming background. To produce a more 9. 𝒓𝒆𝒕𝒖𝒓𝒏 Generalized 𝑻 and ⋃Cut i .
abstract description of the generalization (b)
algorithms, we have filteredaway all irrelevant
information and sometimes added information Fig. 2. (a) Datafly Algorithm13; (b) Top-Down
in order to facilitate the extraction of Specialization Algorithm (TDS)
content.Since an algorithm is a dynamic artifact,
we chose to represent it via a flowchart. The The abstractpresentation at Fig. 3a
latter is quite helpful inunderstanding the logic highlights the basic principle and therefore
of complex problems. For lack of space, we facilitates the understanding. In this
present the results of our abstraction process abstraction,Datafly generalizes one attribute
foronly three algorithms: Datafly, Top Down
at a time. At each iteration, it selects (as
specialization (TDS) and Median Mondrian
algorithms. Datafly was thefirst algorithm able
mentioned in step 2) the one that has thebest
to meet the k-anonymity requirement for a big score (the highest number of distinct values
set of real data. It combines generalization of in the current table). The execution stops
dataand suppression of tuples in order to avoid when the k-anonymity isreached. The
an excessive generalization which would reduce second example describes the TDS
data usefulness. At eachiteration, DataFly (a) algorithm extracted from Fung et al.16 (Fig.
generalizes the attributes having the highest 2b). It browses thegeneralization hierarchy

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 160


from top to bottom. A high generalization of candidate specializations (the process is a top
all the values of the down process in the sense that it browses the
generalizationhierarchies from top to down). At
the initialization step (line 3) CUTi is a single
User

set, denoted CUTi in thealgorithm. Valid and


beneficial specializations (denoted x) are
specializations that do not violate the k-
anonymityand that respect the classification
N>K constraint (TDS is dedicated to data mining
usage, and especially to
classificationalgorithms). Best specializations
Datafly

are valid and beneficial specializations that,


N≥K
moreover, reach the best score(denoted Score
(x) in the algorithm) in terms of security and
N=K
quality. This score is computed using the trade-
offmetric20. The abstraction process of this
algorithm leads to the model sketched at Fig.
3b.Through this abstraction we exhibit the fact
(a) that TDS starts from a table T’ which represents
a high level ofgeneralization, giving priority to
security at the expense of quality. Then in order
to find the best compromisebetween quality and
User

security (the trade-off metric serves this


purpose), TDS tries to get closer to the actual
values ofT.
The last example of generalization
algorithm is Median Mondrian[17]. Its principle
is to divide the set ofindividuals (tuples)
represented in the table into groups such that
each group contains at least k individuals.
Datafly

G>0 G=0
Then,the individuals of the same group will
have the same value for their QI via the
generalization process. Moreprecisely,
individuals (tuples of the original table) are
represented, thanks to the values of their QI, in
amultidimensional space where each dimension
corresponds to an attribute of the QI (Fig. 4a).
(b) The splitting of thespace into areas generates
the groups. It is performed using the median
Fig. 3. (a) Abstraction of Datafly algorithm; (b)
value of the attribute.
Abstraction of TDS algorithm
Age

30
original table preserves k-anonymitybut can 27

negatively impact the quality of the resulting 23


Area - 1
Division

table in terms of classification. Therefore 19


TDSperforms iterations to find the best Area - 2

specializations i.e. those that not only satisfy k- 9 10 11 12


Education

anonymity but also generate lessanonymity loss, (a)


thus enabling better classification quality.
Intuitively, in this algorithm, CUTi represents
all the

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 161


User
There is no division allowed
whatever the dimension

At least one division allowed for at least one


(b) dimension

Datafly
Fig. 4. Principle of Median Mondrian and its
algorithm

At each iteration, the algorithm chooses a At least one partition marked


«unprocessed»
No partition marked
«unprocessed»

dimension and checks the possibility of splitting


a group into twogroups (i.e. splitting the area on
the median value of this dimension). A group
can be divided into two groups if eachresulting Fig. 5. Abstraction of Median Mondrian
group contains at least k individuals. If the
division is not possible, the corresponding In order to validate our contribution, we
group is marked. Thesplitting process switches conducted an experiment. The objective was to
to another dimension when all groups are evaluate theunderstandability of our abstraction
marked for the current dimension. It stops by comparing it with those found in the
whenall dimensions have been explored. Then literature. We performed thecomparison for two
the algorithm performs the possible algorithms, namely Datafly and Median
generalizations, replacing the differentvalues in Mondrian. A total of 12 participants were
the same area with the value of their first recruited.They were all either post-graduate
common parent in the generalization hierarchy students or researchers in computer science.
(recodingprocess). As shown at Fig. 4b, the Therefore, all of them were familiarwith
algorithm is recursive. At the first iteration, algorithmic and programming techniques but
“partition” contains all the tuples ofthe table to not aware of anonymization algorithms. To
anonymize. The function “choose_dimension()” avoid any biasedinterpretation of the results, we
(resp. “frequency_set(partition, dim)”) returns have provided the participants with the same
thechosen dimension (resp. the set «fs» of types of representations: ourabstractions have
values taken by a given dimension “dim” in a been transformed into textual representation
given partition “partition”). (algorithms).

The function find_median(fs) returns the value The experiment lasted about four hours. First
“splitVal” of the median. “t.dim” is the value of the participants had to fill a questionnaire about
a given dimension“dim” for the tuple “t”. The their level ofknowledge in anonymization
summary consists of the generalization of a set techniques (they had to evaluate their level
of values belonging to the samepartition. It is using a scale of 1 to 10) and theirprogramming
defined by a value range where the lower limit skills (they had to say if they have ancient,
(resp. upper limit) corresponds to the smallest recent or very recent programming skills).
value(resp. to the largest value) in the partition. Second, allparticipants were given a brief
Our abstraction of Median Mondrian leads to presentation on anonymization with emphasis
the activity diagram at Fig.5. on the generalization technique. Thenthey
received a copy of the slides and sheets of
papers for taking down notes. Third, they were

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 162


divided into twohomogeneous groups based on ฀Do you think that one algorithm helped you in
their programming profiles. We have defined understanding another one?
three profiles (1 for ancient, 2 forrecent, and 3 The participants had also to assign to each
for very recent). Then we provided all the algorithm a level of difficulty on a scale from 1
participants with a small table to anonymize and to 10
asked themto execute manually three algorithms
mentioned in their sheet without time limit. For Following the experiment, we started analyzing
the first group of participantswe proposed, the collected information. 12 participants
successively, our abstraction of Datafly, then executed threealgorithms each. We rejected
the algorithm of Figure 2a and, finally, the three illegible executions out of the 36. For each
MedianMondrian of Figure 4b. The second algorithm, the legible executions havebeen
group had to execute our abstraction of Median grouped into three classes. The first (resp. the
Mondrian, followed by theMedian Mondrian of second one) gathered all the correct executions
Figure 4b and Datafly of Figure 2a. We attached (resp. all thepartial executions). The last class
to Median Mondrian and Datafly theexplanation contained all the erroneous executions. For the
proposed by their authors. Whenever a partial executions we have deducedfrom the
participant met a problem to complete the comments of the participants three reasons for
execution of an algorithm, he (or she) was blocking. The first one is related to the
invited to indicate on his/her sheet the reason interpretation of the while instruction of the
for blockage. In the last phase of theexperiment original Datafly (Fig. 2a). The second one was
process, the participants had to answer the the disability to understand the data
following questions: structure”freq” in the same algorithm. Finally,
the third reason for blocking was the double
฀Was the initial presentation sufficient for recursion of the original MedianMondrian (Fig.
being able to execute the algorithms? If your 4b). Table 1 summarizes the results. For each
answer is no, pleaseexplain which information algorithm and each profile, the percentages
was missing. oferroneous, correct and partially correct
฀Did you detect similarities between some of executions are mentioned.
the algorithms you had to execute?
฀Did you detect identical algorithms (even if
differently described)?

Datafly of Fig. 3a Datafly of Fig. 2a Median Mondrian of Fig. 5 Median Mondrian of Fig. 4b

Class Erroneous Correct Partial Erroneous Correct Partial Erroneous Correct Partial Erroneous Correct Partial

Profile 1 20% 0% 0% 20% 0% 0% 20% 20% 0% 20% 0% 10%

Profile 2 0% 40% 0% 0% 20% 20% 0% 20% 0% 0% 20% 10%

Profile 3 0% 40% 0% 0% 20% 20% 0% 40% 0% 0% 40% 0%

Table 1. Synthesis of data collected from the experiment

All the participants who have executed our Mondrian. They unanimously mentioned
abstractions didn’t face blocking in the post questionnaire that
difficulties. Moreover, only thosethat had ourabstraction helped them understanding
ancient knowledge obtained erroneous the original algorithms. Moreover they
executions and they are few. Only those have ranked the original algorithmsas more
who first performed theexecution of difficult to understand than our
Datafly abstraction have then proposed a abstractions. All the participants that met a
correct execution of the original Datafly. blockage in the original Datafly(40%) have
The sameobservation emerged for Median not executed our abstraction. It is the same

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 163


for the original Median Mondrian. Finally, allowed us eliciting similarities and
over the 20%who delivered an erroneous grouping algorithms into categories. Each
execution for the original Median category is also associated to anabstract
Mondrian, 50% had not used our representation. Hence we defined a one-to-
abstraction. many mapping between this representation
and the differentalgorithms belonging to
To sum up, this analysis revealed that the this category.
lack of intelligibility of the original
algorithms impacts all We have first grouped the algorithms into
participantswhatever their programming categories regarding their basic principles
skills. Threats to validity must be and their type of resultinggeneralization.
mentioned due to the limited size of The first category called MR Recoding1
ourexperimentation group and the groups together the algorithms generating
restriction of the study to only two multidimensionalgeneralizations (Median
algorithms. However, this first Mondrian, InfoGain Mondrian and LSD
analysisencourages us to persevere in this Mondrian). Their basic specificity is to
abstraction effort. We wanted to check if considertogether all the attributes of the
our algorithm representation was easierto QI.
understand than the classical one.
Therefore, our pool of testers was
composed of persons with
User

programmingskills. Thus, they were able


to understand both representations. We
evaluated both perceived usefulness
andobjective usefulness. Perceived
usefulness was captured through direct G=0

queries regarding how they could G>0

understandthe underlying logic of each


Datafly

algorithm. Objective usefulness was


measured through the correctness of the
resultobtained by participants. Thus, if
users with programming skills prefer better
perform our abstraction, how muchmore N>k N≤k

data publishers with less programming N=0

skills will be at ease with it.

3.2. Generalization process of the


abstracted algorithms. Fig. 6. Homogenized abstraction of Datafly
The nine algorithms reviewed in our algorithm
research are all based on generalization
techniques. Our abstraction processled us
to the identification of three categories They iteratively divide the set of tuples into
using the parameterization and the groups such that each group satisfiesk-
generalization process describedabove. anonymity. Conversely, in the two other
This abstraction process followed a categories (LR Recoding and TR Recoding),
the generalizations are notmultidimensional.
bottom-up logic and a one-to-many
Moreover, at each iteration, they only deal
mapping. Abstracting each algorithm with one attribute of the QI. LR Recoding

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 158


gathersalgorithms based on a lattice structure own metric to select the best generalization,
representing all the possible generalizations of we introduced a parameter called “appropriate
the original table. Samaratiand Incognito metric”
algorithms belong to this category. Finally, TR in the abstraction of TR Recoding (Fig. 7).
Recoding includes Datafly, μ-argus, bottom- This parameter takes the value “Distinct
upgeneralization and top down specialization metric” for DataFly and “Tradeoffmetric” in
algorithms. Their specificity is to build TDS. Thus, the abstraction of Fig. 7 may be
directly and iteratively ananonymized table. In instantiated into the four algorithms thanks to
order to provide each category with an a correctparameterization. The same process
abstraction, we had to homogenize the allowed us to obtain an abstraction for all nine
nineabstractions. For instance, we had to algorithms and for the threerecoding
transform TR Recoding abstractions since categories. For space reasons, we cannot
some algorithms of this category aretop down provide the reader with all the abstractions.
processes and others are bottom up, some of Following thisbottom-up logic, we have
them include local or global suppressions and performed the same abstraction effort to
others onlyperform generalizations. Thus, we homogenize the three types of recoding.
have introduced a parameter allowing or Wegenerated the abstraction of Fig. 8a which
disallowing suppressions. For example,Datafly in fact represents the generalization technique.
allows suppressions (Fig. 6). Figure 8b instantiates thisprocess for
Recoding TR. Step numbers refer to Fig. 7.
Thus our recurring abstraction process
User

resulted in ataxonomy of generalization


algorithms (Fig. 8). We have defined an
abstraction by parameterization
usingflowcharts for all the nodes of this
G=0 taxonomy.
G>0
Datafly

User

Top down
otherwise
algorithm
Datafly

Iteration Stop condition


condition
Suppression condition

Fig. 7. Abstraction of TR Recoding Fig. 8. (a) Abstraction of the generalization


algorithms technique;

We also rephrased some instructions by Parameterization Instantiation for


components TR recoding
defining new concepts in order to reach a Manual parameterization step 1
unique abstraction for eachcategory of Automatic parameterization step 2
algorithms and thus to allow parameterization. Pre generalization step 3 and 4
For instance, in order to homogenize the Generalization step 5
concepts “candidate generalization” and “best Post generalization step 6 and 7
candidate generalization” with Datafly Fig 8 (b) An instantiation of the generalization
technique
concepts, we introduced in the latter
theconcepts of “candidate attribute” and “best
candidate attribute”. As another example of
Thus our recurring abstraction process
standardization, since eachalgorithm has its resulted in a taxonomy of generalization
algorithms (Fig. 8). We havebuilt an

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 159


abstraction by parameterization using for thedevelopment of tutorials whose purpose
flowcharts for all the nodes of this is to assist data stewards, students and
taxonomy. researchers in learning how to
4. Conclusion useanonymization algorithms and then to
An extensive attention has been paid to provide them with a well-informed usage of
privacy protection by statistics and computer existing anonymization tools. Toensure a
science communities overthese past years. A better transfer of knowledge, these tutorials
large body of research deals with could be contextualized according to the level
anonymization techniques and algorithms. Our of expertise oftheir potential users. Moreover,
research is notdevoted to new algorithms or we are convinced that our taxonomy can be a
new techniques. We want to help data starting point for the design of anontology of
publishers with low programming skills anonymization techniques and algorithms.
inunderstanding the existing techniques and This ontology will exhibit conflicts between
algorithms. This help is made possible thanks techniques andthen will contribute to a
to simplifiedrepresentations associated to each definition of combined anonymization
technique and algorithm. In this paper, we scenarios for a whole database. Finally, we
focused on the generalizationtechnique and on believethat the availability of such knowledge
nine of its algorithms, since it is one of the should facilitate the adoption of new
most used techniques for tabular data. The anonymization techniques andminimize the
simplifiedrepresentations have been obtained loss of competencies that arise when an
through an abstraction process described in the employee leaves the company.
paper. The abstraction effortallowed us to Finally, we also expect to develop ontology of
detect similar behaviors between the nine techniques andalgorithms. This ontology could
algorithms and thus to define three main be the main component for a decision support
categories ofgeneralization algorithms. We system helping a data publisher inselecting the
also defined an abstract model for each suitable technique and algorithm given a
category and finally an abstraction of context.
thegeneralization technique. Finally, thanks to References
our categorization, we built a taxonomy of 1. Fung, B. C. M., Wang, K., Chen, R., Yu, P.
these generalizationalgorithms, helping a S.: Privacy preserving data publishing: a
novice to understand how all these survey of recent developments. In ACM
anonymization algorithms may be ComputingSurveys (CSUR), Vol. 42(14)
differentiated. For spacereasons, the paper (2010)
only contains some abstraction models. We 2. Ilavarasi. B., Sathiyabhama A. K., Poorani.
conducted a controlled experiment, leading S.,.A survey on privacy preserving data
toencouraging results since participants found mining techniques. In Int. Journal of Computer
that the proposed abstractions were very Scienceand Business Informatics, 7(1), (2013)
useful to understand thealgorithms whatever 3. Xu, X., Ma, T., Tang, M.,Tian, W.: A
their programming skills. These abstractions survey of privacy preserving data publishing
constitute a first step towards the design using generalization and suppression. In Int.
ofpatterns that will be part of a catalog. The Journal onApplied Mathematics &
latter will be made available through a Information Sciences, Vol 8(3) pp 1103-1116
guidance approach we plan todesign. A pattern (2014)
documents either a technique or an algorithm 4. Patel, L., Gupta, R.: A Survey of
through its intent, its context of use, its inputs, Perturbation Technique for Privacy-Preserving
and of Data. In Int. Journal of Emerging
its process with an illustrative example. The Technology andAdvanced Engineering, Vol
first three constituents of the pattern are 3(6) (2013)
extracted from our previousAnother use case 5. Vinogradov, S., Pastsyak, A.: Evaluation of
of our abstractions lies in the context of e- Data Anonymization Tools. The 4th Int.
learning. Our abstractions may serve as a basis Conference on Advances in Databases,

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 160


Knowledge, andData Applications DBKDA 17. Lefevre, K., Dewitt, D. J., Ramakrichnon,
(2012) R.: Mondrian multidimensional k-anonymity.
6. Poulis G., Gkoulalas-Divanis A., Loukides In 22nd IEEE Int. Conference on Data
G., Skiadopoulos S., Tryfonopoulos Engineering(ICDE) (2006)
C.:SECRETA: A System for Evaluating and 18. Lefevre, K., Dewitt, D. J., Ramakrichnon,
ComparingRElational and Transaction R.: Workload-aware anonymization. In 12th
Anonymization algorithms. EDBT 2014. ACM SIGKDD (2006)
7. Fienberg S, McIntyre J.: Data Swapping: 19. Ben Fredj F., Lammari, N., Comyn-
Variations on a Theme by Dalenius and Reiss. Wattiau, I.: Characterizing Generalization
J. Domingo-Ferrer and V. Torra (Eds.): PSD Algorithms- First Guidelines for Data
2004,LNCS 3050, pp. 14–29, Springer (2004) Publishers. In 6thInternational Conference on
8. Brand R.: Microdata protection through Knowledge Management and Information
noise addition. Inference Control in Statistical Sharing (KMIS) (2014)
Databases. LNCS 2316, pp 97-116, Springer 20. Benjamin, C. M., Fung, Wang, K., Chen,
(2002) R., Yu, P. S.: Privacy-Preserving Data
9. Defays D., Nanopoulos P.: Panels of Publishing: A Survey on Recent
enterprises and confidentiality: the small Developments. ACMComputing Surveys, Vol.
aggregates method. In Proc. 92nd Symposium 42(4) (2010)
on Design andAnalysis of Longitudinal 21. Wing, J. M.: Computational Thinking and
Surveys. (1993) Thinking About Computing. Philosophical
10. Samarati, P.: Protecting respondents’ Transactions of the Royal Society, vol. 366,
identities in microdata release. IEEE Trans. on pp. 3717-3725 (2008)
Knowledge and Data Engineering, Vol 13(6), 22. Návrat, P., Filkorn, R.: A Note on the Role
(2001) of Abstraction and Generality in Software
11. Sweeney, L.: k-Anonymity: A model for Development. Journal of Computer Science,
protecting privacy. In Int. Journal of Vol 1(1),pp 98-102 (2005)
Uncertainty, Fuzziness and Knowledge-Based
Systems 10 (5),pp 557-570 (2002)
12. Undepool, A., Willenborg, L.: μ- and ฀-
argus: Software for statistical disclosure
control. 3rd Int. Seminar on Statistical
Confidentiality (1996)
13. Sweeney, L.: Achieving k-Anonymity
Privacy Protection Using Generalization and
Suppression. International Journal of
Uncertainty,Fuzziness and Knowledge-Based
Systems 10(5), pp 571-588 (2002)
14. Lefevre, K., Dewitt, D. J., Ramakrichnon,
R.: Incognito: Efficient full-domain k-
anonymity. ACM Int. Conf. on Management
of Data (2005)
15. Wang, K., Yu, P. S., Chakraborty, S.:
Bottom-up generalization: A data mining
solution to privacy protection.4th IEEE Int.
Conf. on DataMining (2004)
16. Fung, B. C. M., Wang, K., Yu, P. S.: Top-
down specialization for information and
privacy preservation. In 21st IEEE Int. Conf.
on DataEngineering (ICDE) pp 205–216
(2005)

Vol 10 Issue01, Jan2021 ISSN 2456 – 5083 Page 161

You might also like