
Sanitization Techniques Used in Preserving Sensitive Data

Abstract: The discovery of association rules from large databases has proven beneficial for companies, since such rules can be very effective in revealing actionable knowledge that leads to strategic decisions. In tandem with this benefit, association rule mining can also pose a threat to privacy protection. The main problem is that from non-sensitive information or unclassified data, one is able to infer sensitive information, including personal information, facts, or even patterns that are not supposed to be disclosed. This scenario reveals a pressing need for techniques that ensure privacy protection while facilitating proper information accuracy and mining. In this paper, we present the different sanitization algorithms which are used for balancing privacy and knowledge discovery in association rule mining.

Introduction

The recent advent of Data Mining technology to analyse vast amounts of data has been playing an important role in marketing, business, medical analysis and other applications where pattern discovery is paramount for strategic decision making. Despite its benefits in such areas, data mining also opens new threats to privacy and information security if not implemented properly. Recent advances in this area and in machine learning algorithms have introduced new problems in privacy protection. The main problem is that, from non-sensitive data, one is able to infer sensitive information, including personal information, facts, or even patterns that are not supposed to be disclosed. The current status of data mining research reveals that one of the current technical challenges is the development of techniques that incorporate security and privacy issues.

Basic concepts

Transactional Database

A transactional database is a DBMS where one can roll back if the writes on the database are not completed properly. It is highly volatile, deals with day-to-day transactions, contains functional data and is used for OLTP systems.

Market-Basket Analysis (MBA)

Market basket analysis identifies customers' purchasing habits. It provides insight into the combination of products within a customer's 'basket'. The term 'basket' normally applies to a single order. The purchasing insights provide the potential to create cross-sell propositions:
• Which product combinations are bought
• When they are purchased
• In what sequence
Developing this understanding enables businesses to promote their most profitable products. It can also encourage customers to buy items that might otherwise have been overlooked or missed. MBA is the process of analyzing transaction-level data to determine the likelihood that a set of items/products will be bought together.
Association rules

Retailers use the results/observations from an MBA to understand the purchase behaviour of customers for cross-selling, store design, discount plans and promotions. MBA can, and should, be done across different branches/stores, as customer demographics/profiles and purchase behaviour usually vary across regions.

The most common technique used in MBA is Association Rules. The three measures of Association Rules are Support, Confidence, and Lift.

A --> B = if a customer buys A, then B is also purchased
LHS --> RHS
Condition --> Result
Antecedent --> Consequent

Support: Ratio of the number of transactions that include both A & B to the total number of transactions

Confidence: Ratio of the number of transactions with all items in the rule (A + B) to the number of transactions with items in the condition (A)

Lift: Indicates how much better the rule is at predicting the "result" or "consequent" as compared to having no rule at all, or how much better the rule does rather than just guessing

Lift = Confidence/P(result) = [P(A+B)/P(A)]/P(B)

EXAMPLE
If a customer buys milk, what is the likelihood of orange juice being purchased?

Milk --> Orange Juice

Customer Base: 1000
600 customers buy milk
400 customers buy orange juice
300 customers buy milk & orange juice

Support = P(milk & orange juice) = 300/1000 = 0.3

Confidence = P(milk & orange juice)/P(milk) = (300/1000)/(600/1000) = 0.5

Lift = Confidence/P(result) = 0.5/(400/1000) = 1.25

Interpretation: A customer who purchases milk is 1.25 times more likely to purchase orange juice than a randomly chosen customer.
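To make these measures concrete, here is a minimal Python sketch (representing baskets as sets of item names is an assumption; the toy data simply reproduces the counts above):

def support(transactions, items):
    # Fraction of all transactions that contain every item in `items`.
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, condition, result):
    # P(condition and result) / P(condition)
    return support(transactions, condition | result) / support(transactions, condition)

def lift(transactions, condition, result):
    # How much better the rule does than guessing the result outright.
    return confidence(transactions, condition, result) / support(transactions, result)

# Toy data: 1000 baskets, 600 with milk, 400 with orange juice, 300 with both.
baskets = ([{"milk", "oj"}] * 300 + [{"milk"}] * 300
           + [{"oj"}] * 100 + [set()] * 300)

print(support(baskets, {"milk", "oj"}))       # 0.3
print(confidence(baskets, {"milk"}, {"oj"}))  # 0.5
print(lift(baskets, {"milk"}, {"oj"}))        # 1.25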
Sensitive data

Sensitive data include protected information which should not be revealed in any transaction. In transactional databases, sensitive data will be in the condition or the result.

Following are some examples of sensitive data:

Social Security number (SSN)

Credit card number or banking information

Tax information

Credit reports

Anything that can be used to facilitate identity theft (e.g., mother's maiden name)
Data Sanitization
Data Sanitization techniques:

NULL'ing Out
Masking Data
Substitution
Shuffling Records
Number Variance
Gibberish Generation
Encryption/Decryption

Technique: NULL'ing Out

Simply deleting a column of data by replacing it with NULL values is an effective way of ensuring that it is not inappropriately visible in test environments. Unfortunately, it is also one of the least desirable options from a test database standpoint. Usually the test teams need to work on the data, or at least a realistic approximation of it. For example, it is very hard to write and test customer account maintenance forms if the customer name, address and contact details are all NULL values.

Verdict: The NULL'ing Out technique is useful in certain specific circumstances but rarely useful as the entire Data Sanitization strategy.
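As an illustration, a minimal Python sketch of NULL'ing out, with rows as dicts and hypothetical column names:

def null_out(rows, columns):
    # Blank the named columns in every row; the table structure
    # survives but the sensitive content is simply gone.
    for row in rows:
        for col in columns:
            row[col] = None
    return rows

customers = [{"name": "A. Smith", "phone": "555-0101", "balance": 120}]
null_out(customers, ["name", "phone"])
print(customers)  # [{'name': None, 'phone': None, 'balance': 120}]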
Technique: Masking Data

Masking data means replacing certain fields with a mask character (such as an X). This effectively disguises the data content while preserving the same formatting on front-end screens and reports. For example, a column of credit card numbers might look like:

4346 6454 0020 5379
4493 9238 7315 5787
4297 8296 7496 8724

and after the masking operation the information would appear as:

4346 XXXX XXXX 5379
4493 XXXX XXXX 5787
4297 XXXX XXXX 8724

The masking characters effectively remove much of the sensitive content from the record while still preserving the look and feel. Take care to ensure that enough of the data is masked to preserve security. It would not be hard to regenerate the original credit card number from a masking operation such as 4297 8296 7496 87XX, since the numbers are generated with a specific and well-known checksum algorithm. Also, care must be taken not to mask out potentially required information. A masking operation such as XXXX XXXX XXXX 5379 would strip the card issuer details from the credit card number. This may, or may not, be desirable.
Verdict: If the data is in a specific, invariable format, then Masking is a powerful and fast Data Sanitization option. If numerous special cases must be dealt with, then masking can be slow, extremely complex to administer, and can potentially leave some data items inappropriately masked.
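A minimal Python sketch of the masking idea (keeping the first and last four digits is an assumption; real masking rules vary by field type):

def mask_card(number, keep_first=4, keep_last=4, mask_char="X"):
    # Mask every digit except the first and last few, leaving spaces
    # and other formatting characters untouched.
    total = sum(ch.isdigit() for ch in number)
    seen = 0
    out = []
    for ch in number:
        if ch.isdigit():
            seen += 1
            if keep_first < seen <= total - keep_last:
                ch = mask_char
        out.append(ch)
    return "".join(out)

print(mask_card("4346 6454 0020 5379"))  # 4346 XXXX XXXX 5379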

Technique: Substitution

This technique consists of randomly replacing the contents of a column of data with information that looks similar but is completely unrelated to the real details. For example, the surnames in a customer database could be sanitized by replacing the real last names with surnames drawn from a largish random list.

Substitution is very effective in terms of preserving the look and feel of the existing data. The downside is that a largish store of substitutable information must be maintained for each column to be substituted. For example, to sanitize surnames by substitution, a list of random last names must be available; then, to sanitize telephone numbers, a list of phone numbers must be available. Frequently, the ability to generate known invalid data (phone numbers that will never work) is a nice-to-have feature.

Substitution data can sometimes be very hard to find in large quantities. For example, if a million random street addresses are required, then just obtaining the substitution data can be a major exercise in itself.

Verdict: Substitution is quite powerful, reasonably fast and preserves the look and feel of the data. Finding the required random data to substitute and developing the procedures to accomplish the substitution can be a major effort.
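A minimal Python sketch of substitution; the surname pool here is just a stand-in for the largish random list the text describes:

import random

SURNAMES = ["Smith", "Jones", "Taylor", "Brown", "Patel", "Nguyen"]

def substitute(rows, column, pool, seed=None):
    # Overwrite the column with values drawn at random from the pool,
    # preserving the look and feel but not the real data.
    rng = random.Random(seed)
    for row in rows:
        row[column] = rng.choice(pool)
    return rows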
Technique: Shuffling Records

Shuffling is similar to substitution except that the substitution data is derived from the column itself. Essentially, the data in a column is randomly moved between rows until there is no longer any reasonable correlation with the remaining information in the row.

There is a certain danger in the shuffling technique. It does not prevent people from asking questions like "I wonder if so-and-so is on the supplier list?" In other words, the original data is still present, and sometimes meaningful questions can still be asked of it. Another consideration is the algorithm used to shuffle the data. If the shuffling method can be determined, then the data can be easily "unshuffled". For example, if the shuffle algorithm simply ran down the table swapping the column data between every group of two rows, it would not take much work for an interested party to revert things to their unshuffled state.

Shuffling is rarely effective when used on small amounts of data. For example, if there are only 5 rows in a table, it probably will not be too difficult to figure out which of the shuffled data really belongs to which row. On the other hand, if a column of numeric data is shuffled, the sum and average of the column still work out to the same amount. This can sometimes be useful.
Verdict: Shuffle rules are best used on large tables and leave the look and feel of the data intact. They are fast and relatively simple to implement since no new data needs to be found, but great care must be taken to use a sophisticated algorithm to randomise the shuffling of the rows.
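A minimal Python sketch; a proper random permutation of the whole column avoids the trivially reversible pairwise-swap scheme criticised above:

import random

def shuffle_column(rows, column, seed=None):
    # Permute one column's values across rows, breaking their link to
    # the rest of each record while keeping sums and averages intact.
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    for row, value in zip(rows, values):
        row[column] = value
    return rows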
Technique: Number Variance

The Number Variance technique is useful on numeric data. Simply put, the algorithm involves modifying each number value in a column by some random percentage of its real value. This technique has the nice advantage of providing a reasonable disguise for the numeric data while still keeping the range and distribution of values in the column within viable limits. For example, a column of sales data might have a random variance of 10% placed on it. Some values would be higher, some lower, but all would be not too far from their original range.

Verdict: The Number Variance technique is occasionally useful and can prevent attempts to correlate true records using known numeric data. This type of Data Sanitization really does need to be used in conjunction with other options, though.
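A minimal Python sketch of a 10% number variance:

import random

def number_variance(values, pct=0.10, seed=None):
    # Nudge each value by a random amount within +/- pct of itself;
    # the column's range and distribution stay roughly plausible.
    rng = random.Random(seed)
    return [v * (1 + rng.uniform(-pct, pct)) for v in values]

print(number_variance([100.0, 250.0, 400.0]))  # e.g. [93.4, 262.1, 417.8]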
Technique: Gibberish Generation

In general, when sanitizing data, one must take great care to remove all embedded references to the real data. For example, it is pointless to carefully remove real customer names and addresses while leaving them intact in stored copies of correspondence in another table. This is especially true if the original record can be determined via a simple join on a unique key.

Sanitizing "formless", non-specific data such as letters, memos and notes is one of the hardest tasks in Data Sanitization. Usually these types of fields are just substituted with a random quantity of equivalently sized gibberish or random words. If real-looking data is required, either an elaborate substitution exercise must be undertaken or a few carefully hand-built examples must be judiciously substituted to provide some representative samples.

Verdict: Occasionally it is useful to be able to substitute quantities of random text. Gibberish Generation is useful when needed but is not a very widely applicable technique.
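A minimal Python sketch of gibberish generation, replacing each word with random letters of the same length so the field keeps its rough shape:

import random
import string

def gibberish(text, seed=None):
    # Swap every word for a random lowercase word of equal length.
    rng = random.Random(seed)
    def fake(word):
        return "".join(rng.choice(string.ascii_lowercase) for _ in word)
    return " ".join(fake(word) for word in text.split())

print(gibberish("please refund Mr Smith"))  # e.g. 'qzkxme wjqhpt na vlcrd'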
Technique: Encryption/Decryption

This technique offers the option of leaving the data in place and visible to those with the appropriate key, while remaining effectively useless to anybody without the key. This would seem to be a very good option – yet, as with all techniques, it has its strengths and weaknesses.

The big plus is that the real data is available to anybody with the key – for example, administration personnel might be able to see the personal details on their front-end screens, but no one else would have this capability. This "optional" visibility is also the technique's biggest weakness. The encryption password only needs to escape once and all of the data is compromised. Of course, you can change the key and regenerate the test instances – but stored or saved copies of the data are immediately available under the old password.
Encryption also destroys the formatting and the look and feel of the data. Encrypted data rarely looks meaningful; in fact, it usually looks like binary data. This sometimes leads to NLS character-set issues when manipulating encrypted varchar fields. Certain types of encryption impose constraints on the data format as well. For example, the Oracle Obfuscation Toolkit requires that all data to be encrypted have a length which is a multiple of 8 characters. In effect, this means that the fields must be extended with a suitable padding character, which must then be stripped off at decryption time.

The strength of the encryption is also an issue. Some encryption is more secure than others. According to the experts, most encryption systems can be broken – it is just a matter of time and effort. In other words, not very much will keep the national security agencies of largish countries from reading your files should they choose to do so. This may not be a big worry if the requirement is to protect proprietary business information.

Verdict: The security is dependent on the strength of the encryption used. It may not be suitable for high-security requirements or where the encryption key cannot be secured. Encryption also destroys the look and feel of the sanitized data. The big plus is the selective access it presents.
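A minimal sketch using the third-party Python cryptography package (an assumption; any symmetric cipher with sound key management would do), showing both the selective access and the loss of look and feel:

from cryptography.fernet import Fernet

key = Fernet.generate_key()   # whoever holds this key can see the real data
cipher = Fernet(key)

token = cipher.encrypt(b"4346 6454 0020 5379")
print(token)                  # opaque, binary-looking; original formatting is gone
print(cipher.decrypt(token))  # b'4346 6454 0020 5379', for key holders only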
