
Guide: Data Masking

Data Masking: 8 Techniques and How to Implement Them Successfully
Table of Contents

1 Data Masking
2 Snowflake Data Masking
3 Redshift Data Masking
4 Data Encryption
5 Data Tokenization
6 How K Anonymity Preserves Data Privacy
7 Data Generalization
8 Data De-Identification
9 Pseudonymisation
10 Data Anonymization
11 Dynamic Data Masking

What Is Data Masking?

Data masking is a technique used to create a version of data that looks structurally similar to the original
but hides (masks) sensitive information. The version with the masked information can then be used for
various purposes, such as user training or software testing. The main objective of masking data is to create
a functional substitute that does not reveal the real data.

The majority of organizations have stringent security controls that protect production data when it rests in
storage and when it is in business use. However, sometimes data is used for less secure operations like
testing or training, or by third parties outside the organization. This can put the data at risk, and might
result in compliance violations.

Data masking offers an alternative that can allow access to information, while protecting sensitive data.
Data masking processes use the same data format to emulate the original data, while changing the values
of sensitive information.

There is a wide range of ways that can be used to alter data, including character shuffling, word or character substitution, and encryption. Each method has its unique advantages. However, when masking data the values must always be changed in some manner that makes reverse engineering impossible.


Here are several examples of data masking:

Replacing personally-identifying details and names with other symbols and characters
Moving details around or randomizing sensitive data like names or account numbers
Scrambling the data, substituting parts of it for other parts from the same dataset
Deleting or “nulling out” sensitive values within data records
Encrypting the data to make it infeasible for unauthorized users to access it without a decryption key

In this article:

Which Data Requires Data Masking?

Types of Data Masking

8 Data Masking Techniques

What Are the Challenges of Data Masking?

Data Masking Best Practices

Which Data Requires Data Masking?

Here are the most common data types that require data masking:

Personally identifiable information (PII)—data that can be used to identify certain individuals. This
includes information like full name, passport number, driver’s license number, and social security
number.
Protected health information (PHI)—data collected by healthcare service providers for the purpose of
identifying appropriate care. This includes insurance information, demographic information, test and
laboratory results, medical histories, and health conditions.
Payment card information—the Payment Card Industry Data Security Standard (PCI DSS) requires merchants that handle credit and debit card transactions to appropriately secure cardholder data.
Intellectual property (IP)—data related to creations of the mind, including inventions, business plans, designs, and specifications, has high value for an organization and must be protected from unauthorized access and theft.

Types of Data Masking

Here are three common types of data masking:

Static data masking—involves creating a duplicated version of a dataset, containing fully or partially
masked data. The dummy database is maintained separately from the production database.
Dynamic data masking—alters information in real time, as it is accessed by users. This technique is
applied directly to production datasets. It ensures that the original data is seen only by authorized users,
and any non-privileged user sees masked data.


On-the-fly data masking—modifies sensitive information as it is transferred between environments, ensuring that sensitive information is masked before it reaches the target environment. This technique is ideal for organizations migrating data between systems, or maintaining continuous integration or synchronization of disparate data sets.

8 Data Masking Techniques

Here are a few common data masking techniques you can use to protect sensitive data within your
datasets.

1. Data Pseudonymization
Lets you replace an original data value, such as a name or an e-mail address, with a pseudonym or an alias. This process is reversible—it de-identifies data yet still enables re-identification later if needed.
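As an illustrative sketch (the table and column names here are hypothetical), pseudonymization can be implemented in SQL with a protected mapping table, which is also what makes the process reversible:

-- Hypothetical sketch: reversible pseudonymization via a protected mapping table.
-- The mapping table itself must be tightly access-controlled, since it enables
-- re-identification.
CREATE TABLE pseudonym_map AS
SELECT email AS real_email,
       'user_' || CAST(ROW_NUMBER() OVER (ORDER BY email) AS varchar)
               || '@example.com' AS alias_email
FROM customers;

-- Publish only the pseudonymized view; joining back to pseudonym_map reverses it.
CREATE VIEW pseudonymized_customers AS
SELECT c.id, m.alias_email AS email
FROM customers c
JOIN pseudonym_map m ON c.email = m.real_email;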

2. Data Anonymization
A method that lets you encode identifiers that connect individuals to the masked data. The goal is to
protect the private activity of users while preserving the credibility of the masked data.

3. Lookup substitution
You can mask a production database with an added lookup table that provides alternative values to the
original, sensitive data. This allows you to use realistic data in a test environment, without exposing the
original.
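For example, here is a minimal sketch, assuming a hypothetical fake_names lookup table populated with realistic substitute values:

-- Hypothetical sketch: substitute real names with realistic fake values
-- from a lookup table, keyed deterministically off the customer id.
CREATE VIEW test_customers AS
SELECT c.id,
       f.fake_name AS name,  -- realistic, but not the real value
       c.country_code
FROM customers c
JOIN fake_names f ON MOD(c.id, 1000) = f.lookup_key;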

4. Encryption
Lookup tables are easily compromised, so it is recommended you encrypt data so that it can only be
accessed via a password. The data is unreadable while encrypted, but is viewable when decrypted, so you
should combine this with other data masking techniques.
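As a hedged sketch of how this can look in practice, Snowflake (for example) provides built-in ENCRYPT and DECRYPT functions; the table names and passphrase below are illustrative, and a real deployment would manage the passphrase in a secrets store rather than in SQL text:

-- Illustrative only: keep an encrypted copy of a sensitive column.
CREATE TABLE customers_enc AS
SELECT id, ENCRYPT(ssn, 'my-secret-passphrase') AS ssn_enc
FROM customers;

-- Only holders of the passphrase can recover the clear text.
SELECT id,
       TO_VARCHAR(DECRYPT(ssn_enc, 'my-secret-passphrase'), 'utf-8') AS ssn
FROM customers_enc;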

5. Redaction
If the sensitive data is not necessary for QA or development purposes, you can replace it with generic
values in the development and testing environment. In this case there is no realistic data with similar
attributes to the original.

6. Averaging
If you want to reflect sensitive data in terms of averages or aggregates, but not on an individual basis, you
can replace all the values in the table with the average value. For example, if the table lists employee
salaries, you can mask the actual individual salaries by replacing them all with the average salary, so the
overall column matches the real overall value of the combined salaries.
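A minimal sketch of this technique, assuming a hypothetical employee_salaries table; a window-function average keeps the column total consistent with the real total:

-- Replace every individual salary with the overall average; the column
-- still sums to the real combined payroll.
CREATE VIEW masked_salaries AS
SELECT employee_id,
       AVG(salary) OVER () AS salary
FROM employee_salaries;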


7. Shuffling
If you need to retain uniqueness when masking values, you can protect the data by scrambling it, so that
the real values remain, but are assigned to different elements. Given the salary table example, the actual
salaries will all be listed, but it won’t be revealed which salary belongs to each employee. This method is
best suited to larger datasets.
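One way to sketch this in SQL, again with a hypothetical employee_salaries table: number the employees and the salaries in two independent random orders, then join on the sequence number, so every real salary survives but is assigned to a different employee:

-- Shuffle: keep the real set of salaries, but reassign them randomly.
CREATE TABLE shuffled_salaries AS
SELECT e.employee_id, s.salary
FROM (SELECT employee_id,
             ROW_NUMBER() OVER (ORDER BY RANDOM()) AS rn
      FROM employee_salaries) e
JOIN (SELECT salary,
             ROW_NUMBER() OVER (ORDER BY RANDOM()) AS rn
      FROM employee_salaries) s
  ON e.rn = s.rn;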

8. Date Switching
If the data in question involves dates that you want to keep confidential, you can apply policies to each
data field to obfuscate the real date. For example, you can set back the dates of all active contracts by 100
days. The drawback of this method is that, because the same policy applies to all values in a field, the
compromise of one value results in the compromise of all values.
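For instance, here is a hedged sketch of the 100-day policy above, on a hypothetical contracts table (DATEADD syntax as in Snowflake or Redshift):

-- Shift every active contract date back by the same fixed offset.
UPDATE contracts
SET start_date = DATEADD(day, -100, start_date)
WHERE status = 'active';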

What Are the Challenges of Data Masking?

Here are some of the key challenges involved in data masking:

Format preservation—the data masking solution has to understand the data (i.e. what it represents). When the masking system replaces the original data with inauthentic data, it should preserve the original format. This is especially important for data fields that require a specific order or format, such as dates.
Referential integrity—the tables in a relational database are connected via primary keys. When the masking solution obfuscates or replaces the values of a table’s primary key, these values must be modified consistently across the database.
Gender preservation—the masking system should have gender awareness when replacing a person’s
name in the database, and be able to detect if the name is male or female. The gender distribution in a
table will be altered if the masking system changes names randomly.
Semantic integrity—databases typically enforce rules that limit the range of values permitted (e.g. the range of salaries). Any masked data must fall within the specified range in order to preserve the semantics (meaning) of the data.
Data uniqueness—when masking unique data, the masking system should apply unique values for every data element. If the table in question stores employee SSNs, for example, each employee should receive a unique SSN after masking. The frequency distribution of the masked data should also be retained, especially if the distribution is meaningful (e.g. geographic distribution); each column in the table should have masked values that are, on average, similar to the original.

Data Masking Best Practices

Best practices for data masking include:

Data discovery—before you can protect your data, you need to have a grasp of the data you are holding,
and distinguish the various types of information with varying degrees of sensitivity. Security and business
experts typically collaborate to produce an exhaustive record of all the data components across an
enterprise.


Survey of circumstances—the security director responsible for determining the availability of sensitive data should oversee the circumstances in which the data is stored and used, and decide on the appropriate concealing strategy for each type of data.
Masking implementation—for large enterprises, it is not realistic to apply a single data masking technique across all datasets. Each type of data has to be considered in terms of the appropriate arrangement, engineering, and usage needs.
Masking testing—this involves testing the results of the data masking techniques. The QA and testing teams must verify that the masking techniques used deliver the desired outcomes. If a masking technique falls short of expectations, the DBA must restore the database to its original, unmasked state and apply a new masking procedure.

Data Masking with Satori

Satori enables dynamic masking over any data platform being accessed, based on your choice of security policies, which can be set according to identities, data locations, and data types. If you’d like to see it in action, schedule a demo here. If you’d like to read more about how our masking works, you can also visit our masking documentation.

Snowflake Data Masking: Static vs Dynamic
In today’s ever-evolving digital age, data masking is one of the most essential features of data security and
privacy. It’s the act of redacting and obfuscating sensitive information that’s being shared internally or
externally to maintain optimal functionality while reducing the risk of exposing sensitive data within an
organization. Overall, administrators are able to seamlessly allow access to some users, and restrict it for
others to maintain security control and reduce risk.

Motivations Behind Data Masking

There are several reasons why someone would want to opt for data masking. In many cases, there is more
than one type of motivation behind a data masking project. Some of those underlying reasons include:

Business Oriented: This is when the data owners, data governance, or another team decides that exposing certain data can have negative consequences. An example is masking financial data from teams who shouldn’t be exposed to it, as exposure could introduce business risk.

Compliance Oriented: The primary drivers of compliance-based data masking initiatives are usually data governance, GRC, and compliance teams, and in some cases, security teams. These masking projects ensure that the organization complies with certain security frameworks, such as the NIST Cybersecurity Framework.


Security Oriented: This motivation for data masking stems from security teams. The goal here is to
reduce risks, mainly ones involving data leaks.

Privacy Oriented: This is usually a motive pushed by the privacy office or legal teams, driven by privacy regulations or privacy risks. An example would be the masking of PII, or masking according to specific privacy requirements.

How Do You Mask Data In Snowflake?

When organizations run a Snowflake data warehouse, they commonly require anonymized data, as some data consumers only need partial access to the data. For instance, let’s consider a typical healthcare organization using Snowflake as its data warehouse. If we have a table with the treatments provided to patients, we may have the following data consumers:

Medical professionals using a medical application who need the exact treatment information, but don’t need to know things like the patient’s home address, insurance plan, SSN, or financial details.

A finance team who may need to know the financial data, but not the exact people it pertains to.

An accounting team who may need to know contact details as well as charges, but not the medical
details.

Data scientists who need to know statistical information, but not personal information.

There are several ways to do this in Snowflake, and we’ll review each one in detail:

Snowflake Dynamic Data Masking

Snowflake Data Masking using Views


Snowflake Static Masking

Satori Universal Masking

Snowflake Dynamic Data Masking

For starters, let’s approach this with a relatively new way to mask data in Snowflake, which is the Dynamic
Data Masking feature (available for the Enterprise plan). Dynamic Data Masking allows you to set data
masking policies, and apply them on certain columns.

When setting dynamic data masking in Snowflake, you define masking policies, which may return different results for a column based (in most cases) on the role of the user querying it. For example, a policy can be:

CREATE OR REPLACE MASKING POLICY phone_masking AS (val string)
RETURNS string ->
CASE
  WHEN CURRENT_ROLE() IN ('ADMIN_TEAM', 'ACCOUNTING_TEAM') THEN val
  ELSE '[REDACTED]'
END;

From here, you can apply the dynamic masking policies on any column:

ALTER TABLE customers MODIFY COLUMN home_phone SET MASKING POLICY phone_masking;
ALTER TABLE customers MODIFY COLUMN work_phone SET MASKING POLICY phone_masking;

To put this in perspective, think of dynamic masking as an abstraction over Snowflake Secure Views that provides a more reusable way to apply policies, making them easier to manage and scale.

Snowflake Data Masking Using Views

Now, let’s say that you don’t have an enterprise Snowflake account for some reason or another. If this is the
case for you, or maybe you have other logic you want to include in the same abstraction layer, you still have
options here. For instance, you can write your own custom dynamic data masking logic within views. As an
example, let’s say that we have the following table:

CREATE TABLE customers (id int, name text);
INSERT INTO customers VALUES (1, 'Ben'), (2, 'Karl');

If you want to apply dynamic data masking that gives your ACCOUNTING team full read access, your ANALYST team hashed data for statistical purposes, and everyone else redacted data, you can create a view abstraction such as:

CREATE SECURE VIEW v_customers AS
SELECT id, (
  CASE
    WHEN CURRENT_ROLE() IN ('ACCOUNTING') THEN name
    WHEN CURRENT_ROLE() IN ('ANALYST') THEN sha2(name)
    ELSE '[REDACTED]'
  END
) AS name FROM customers;


By revoking access from the underlying asset (customers) and granting access to the view (v_customers),
users will now have data masking per their roles and can only retrieve data based on the commands in place.

USE ROLE ACCOUNTING;
SELECT * FROM v_customers;
// returns Ben, Karl
USE ROLE ANALYST;
SELECT * FROM v_customers;
// returns hashed values
USE ROLE OTHER;
SELECT * FROM v_customers;
// returns redacted values

Snowflake Static Masking

Another option for masking data in Snowflake is to statically mask the data per use-case. That means if we
have a customers table, and we have several data access scenarios, we generate a version for each use-case.
We can do this by creating new objects that contain the data anonymized per use-case, such as:

We create a customers table in the accounting database which contains all details that the accounting
team needs.

We create a customers table in the analysts database with hashed and redacted values.

And so on for other teams.

We perform these data transformations as ETLs. The main advantage of doing so is a very clear separation between the different roles and the data each may view. You can also take a hybrid approach by building a view for each role, if that better suits your maintainability requirements.
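A minimal sketch of one such ETL step, with hypothetical database names, building the analysts’ copy with hashed names:

-- Illustrative only: rebuild the analysts' static copy from production.
CREATE OR REPLACE TABLE analysts_db.public.customers AS
SELECT id, sha2(name) AS name
FROM prod_db.public.customers;

A scheduled task or orchestration tool would re-run this statement whenever the copy needs refreshing.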

Snowflake data masking options comparison


Satori Universal Masking

Finally, this overview would not be complete without mentioning the Universal Masking feature for Snowflake, which can be used when your Snowflake account is integrated with Satori. Universal Masking goes a step further in ease of management, allowing you to define complete masking profiles in one central location and apply these profiles globally:

The profiles can be as generic as “mask anything that’s PII”, or as granular as setting a specific action
for each data type.

Simple masking conditions:


More granular masking conditions:

You can apply masking based on Snowflake Roles, or directly on IDP Roles.

You can be as granular as you want to be while setting the policy on specific columns (or even only for
specific rows!). In addition, you can also set them on several tables at once, on a complete schema or
database, or even on your entire account with one policy.

You will get transparent details about the actual masking performed in your audit screen.

If you’d like to learn more about Satori’s Universal Masking for Snowflake, please schedule a 30-minute demo, and we’d be happy to show you around.


Redshift Data Masking – Getting It Right
Data masking is the act of transforming parts of data to maintain its anonymity. There are several types of data masking, chosen according to the level of anonymity needed versus the value needed from the data. For example, if the need for anonymity is high, the data may be redacted to an empty string, whereas if the data needs to serve as an identifier for statistical purposes, it may be hashed.

In the past, data masking was done mainly by creating a redacted copy of the data—but with the growth of
data processing and available cloud resources, dynamic data masking is growing in popularity. Dynamic
masking means that a single copy of the data is stored and that the masking is done before serving the
data to the data consumers.


Reasons for Masking Data

Data is masked for different reasons, which usually fall into one of four categories:

Security. The main driver here is risk reduction: limiting the possibility of a sensitive data leak, according to guidelines set by security teams.

Commercial. This kind of data masking is done for business reasons, such as masking of financial data
that should not be common knowledge, even within the organization. This is guided by either the data
owners of specific data sets or by other teams, such as those in charge of data governance.

Compliance. This kind deals with masking projects driven by requirements or recommendations based
on specific standards, regulations, and frameworks (such as the NIST Cybersecurity Framework). The
projects are usually initiated by data governance, GRC, or compliance teams.

Privacy. This is to make sure that the organization meets privacy regulations when handling personally
identifiable information (PII). This is usually led by the privacy office or legal team.

Data Masking in AWS Redshift

Whether Amazon Web Services (AWS) Redshift is used as a standalone data warehouse solution or as a data processing engine for a data lake, it is commonly deployed in environments containing sensitive data of different types. Masking means that different teams in the organization may get different redaction levels for the same source data sets.

For example, in a healthcare company, specific medical professionals will receive the medical information about the patients but will not get certain data like the patients’ email addresses, payment cards, or home addresses. The accounting teams, on the other hand, may get a mirror image of that: the personal information without the health information. Another team, such as analysts or data scientists, may get hashed data so they can run statistical predictions and analytics without being exposed to sensitive data, and so on.

We will discuss several ways of masking AWS Redshift data, with or without Satori:
Redshift Static Masking

Redshift Virtual Static Masking

Redshift Dynamic Masking

Dynamic Universal Masking using Satori


Redshift Static Masking

The first method of masking data may seem archaic, but in some use cases it may be the best fit. Static masking means that you keep copies of the data at different levels of redaction and give data consumers access according to policies. The data can be stored in different columns in a specific table using Redshift’s column-based security (i.e., email and email_redacted) but is usually kept in different tables, or even in different databases. When you give users access, you grant it according to the sensitivity level required for each user.

In this method the data is redacted on creation, in most cases either as part of the extract, transform, load (ETL) procedure or by parallel insertion of data (for example, from a Kafka queue) into different destinations with different detail levels.

For example, suppose we have a customers table with first_name, last_name, country_code, and email columns, and we want to create a redacted version of the table with hashed names, the country code unaffected, and the email username replaced with the character “*”, leaving the domain name intact. We can do something along the lines of:

INSERT INTO redacted_customers (first_name, last_name, country_code, email)
SELECT sha2(first_name, 256) AS first_name,
       sha2(last_name, 256) AS last_name,
       country_code,
       REGEXP_REPLACE(email, '[^@]+@', '*@') AS email
FROM customers;

In many cases, especially where there are several use cases for the redacted data or a lot of sensitive data to redact, this approach has more cons than pros:

Depending on the specific way you implement it, it may create delays in the data and a chance of inconsistencies between data sets handled by different teams.

It sometimes carries a high operational cost, especially at scale. Even though storage is not very expensive, it still accumulates cost, and the compute and maintenance cost of each such ETL can grow over time.


Changes are costly. In many cases, introducing new use cases, or the need to mask additional data or mask it differently, creates a project that delays time-to-value from the data.

These reasons make static masking less attractive to data teams nowadays and push them toward dynamic data masking.

Redshift Virtual Static Masking Using Views

Similarly, you can create a masked version of the data without actually preparing the data ahead of time, by creating view overlays. This means that instead of putting the data in two places, you create a view that returns the filtered data and grant access to the view instead of to the source table.

With the same conditions as the previous example, it would look like this:

CREATE OR REPLACE VIEW redacted_customers AS
SELECT sha2(first_name, 256) AS first_name,
       sha2(last_name, 256) AS last_name,
       country_code,
       REGEXP_REPLACE(email, '[^@]+@', '*@') AS email
FROM customers;

This negates several of the issues we’ve had with truly static masking. We don’t have a delay in the time for
the data to move to the masked view, and we don’t need to perform a lot of manual ETL work.

In terms of performance, the masking functions (in this case sha2 and regexp_replace) must run on every query against the view, so in a scenario with many selects this may have a performance impact. In addition, since this is a view, you will need a workaround for use cases where data updates are needed. Finally, making a copy for each masking type for each table may be hard to scale, which may call for dynamic masking.

Redshift Dynamic Masking

Redshift does not have a native dynamic data masking capability. But fear not: this will not stop us from applying dynamic masking to data processed in Redshift. What we will do in this simplified example is create an abstraction layer that contains the logic required for serving different content based on the user pulling the data.


An example of a view that implements this logic:

CREATE VIEW v_customers AS
SELECT CASE WHEN CURRENT_USER = 'admin' THEN first_name
            ELSE sha2(first_name, 256) END AS first_name,
       CASE WHEN CURRENT_USER = 'admin' THEN last_name
            ELSE sha2(last_name, 256) END AS last_name,
       country_code,
       CASE WHEN CURRENT_USER = 'admin' THEN email
            ELSE REGEXP_REPLACE(email, '[^@]+@', '*@') END AS email
FROM public.customers;

As you can see, in several fields of the query we check whether the current user is “admin”; if so, we serve the clear-text value, and otherwise we serve a redacted version. This can easily be extended with more specific functionality, such as serving the full email to certain users, only the domain name to other users, and an empty string to everyone else.

This example is based on the CURRENT_USER. You can, of course, implement a more sophisticated version
with groups, link tables holding configuration, and make more adaptations to your usage.

The advantages here are that you do not need to create sophisticated ETL processes and can add all
different transformations in one view, so one view per table is the only thing that’s needed. This may still be
challenging to manage at scale, and the performance impact over static masking for a view with a lot of
reading operations may still be quite high.

Dynamic Universal Masking Using Satori

This article would not be complete without mentioning the Satori Universal Masking feature, which can be used when your Redshift is integrated with Satori. Universal Masking goes a step further in ease of management, allowing you to set complete masking profiles in one central location and apply these masking profiles globally:
The profiles can be as generic as “mask anything that’s PII”, or as granular as setting a specific action
for each data type.


Simple masking conditions:

More granular masking conditions:

You can apply masking based on Redshift Users, identity provider (IdP) Groups, or even Data Directory
Groups!

You can be as granular as setting the policy on specific columns (or even only for specific rows!), but
you can also set them on several tables at once, on a complete schema or database, or even on your
entire account with one policy.

You will get details about the actual masking performed in your audit screens.


Data Encryption
What Is Data Encryption?

Encryption is a method of data masking, used to protect data from cybercriminals, others with malicious intent, or accidental exposure. The data might be the contents of a database, an email note, an instant message, or a file retained on a computer.

Organizations encrypt data to ensure it remains confidential. Data encryption is a component of the broader set of cybersecurity practices known as data security. Data security involves ensuring that data is protected from ransomware lockup, malicious corruption (altering data to render it useless), breaches, and unauthorized access.

Encryption is also employed to safeguard passwords. Password encryption processes jumble up your
password, so that cybercriminals can’t read it.

In this article:

How Does Encryption Work?

Data Encryption Types

Symmetric vs Asymmetric Encryption

Data Encryption in Transit vs Data Encryption at Rest

Top 7 Encryption Algorithms

Blowfish

Twofish

Triple DES

The Advanced Encryption Standard (AES)

Rivest-Shamir-Adleman (RSA)

Elliptic Curve Cryptography (ECC)

Common Criteria (CC)

5 Data Encryption Best Practices

Build a Data Security Strategy

Choose the Right Encryption Approach for Your Data

Control All Access to Your Data

Encrypt Data in Transit

Build a Data Backup Strategy

Data Encryption with Satori


How Does Encryption Work?

When information is shared via the internet, it passes through a set of network devices from around the
world, which comprise a section of the public internet. As information ventures throughout the public
internet, the possibility exists that it might be exploited or stolen by cybercriminals. To stop this, users may
install network security software or hardware to secure the transfer of information.

These security tools encrypt data so that it is unreadable to anyone without an appropriate decryption key.
Encryption necessitates the conversion of human-readable plaintext to ciphertext, which is text that is
incomprehensible.

Encryption uses a cryptographic key, a series of mathematical values agreed upon by the recipient and sender. The recipient uses the key to decrypt the information, converting it back into its original plaintext.

The complexity of the cryptographic key determines the level of security. Stronger encryption makes it
harder for third parties to decrypt data using brute force attacks (which involve using random numbers
until the right combination is stumbled upon).

Learn about alternative methods of data masking in our guides to:

Pseudonymisation (coming soon)


Data tokenization (coming soon)

Data Encryption Types


Symmetric vs Asymmetric Encryption

Encryption techniques can be classified according to the type of encryption key they use to encode and
decode data:

Asymmetric encryption

This method is also called public-key cryptography. It encrypts and decrypts the information utilizing two
distinct cryptographic asymmetric keys (a private key and a public key).


Symmetric encryption

This method utilizes one private key for decryption and encryption. Symmetric encryption works faster than
asymmetric encryption. It is most suited for use by individuals or in a closed system. Employing symmetric
methodologies with several users in an open system, for example over a network, demands the
transmission of the key, which presents the possibility of theft. The most readily employed form of
symmetric encryption is AES.

Data Encryption in Transit vs Data Encryption at Rest

Data encryption solutions, including cloud data encryption and data encryption software, are often
categorized according to whether they are intended for data in transit or data at rest.

In-Transit Encryption
Data is deemed to be in transit when it moves between devices, including over the internet or within private networks. When being transferred, data is at increased risk of exposure, given that it must be decrypted prior to transfer. Encrypting data for the duration of the transfer process, called end-to-end encryption, ensures that even if the data is intercepted, it remains private.

At-Rest Encryption
Data is labeled at rest when it remains on a storage device and is not being transferred or actively used.
Data at rest tends to be less vulnerable than when it is in transit. Device security attributes restrict
access—however, this doesn’t mean that data at rest is unexploitable. This data often tends to be especially
valuable, so it is a more attractive target for attackers.

Encrypting data at rest minimizes the possibility of data theft as a result of lost or stolen devices, accidental
permission granting, or accidental password sharing. It lengthens the time needed to access data and
offers valuable time for the owner of the data to discover ransomware attacks, data loss, changed
credentials or remotely erased data.

One means of protecting data at rest is Transparent Data Encryption (TDE), a method used by Oracle, IBM, and Microsoft to encrypt database files. TDE safeguards data at rest, encrypting databases on backup media and on the hard drive. TDE does not safeguard data in transit.

Top 7 Encryption Algorithms

Today, the Data Encryption Standard (DES) is an outdated symmetric encryption algorithm. With DES, you use the same key to encrypt and decrypt a message. DES uses a 56-bit encryption key and encrypts data in units of 64 bits. Such sizes are generally not big enough for today’s purposes, so other encryption algorithms have superseded DES.


Blowfish
As with DES, Blowfish is now out-of-date—nevertheless, this legacy algorithm is still effective. This
symmetric cipher organizes messages into units of 64 bits and encrypts them one by one. Twofish has
superseded Blowfish.

Twofish
Utilized in both hardware and software applications, Twofish makes use of keys up to 256 bits in length, yet it remains one of the quickest encryption algorithms. This symmetric cipher is unpatented and free.

Triple DES
Triple DES (3DES or TDES) runs the DES algorithm three times. It encrypts, decrypts and re-encrypts to
produce a longer key. It may be run with just one key, two keys, or three distinct keys—the more keys, the
more security. 3DES utilizes a block cipher methodology, causing it to be vulnerable to attacks including
block collision.

The Advanced Encryption Standard (AES)
A symmetric encryption algorithm. It encrypts data in blocks of 128 bits at a time. There are three key-length options for decrypting the text:

128-bit key - encrypts the information in 10 rounds
192-bit key - encrypts in 12 rounds
256-bit key - encrypts in 14 rounds

Every round comprises a few steps of substitution, mixing of plaintext, transposition and more. AES
encryption standards are the most prevalent encryption methods today for data in transit and at rest.

Rivest-Shamir-Adleman (RSA)
RSA is an asymmetric encryption algorithm. It is founded on the difficulty of factoring the product of two large prime numbers. Only someone who knows these numbers can decode the message. RSA is typically employed when passing data between two distinct endpoints (such as a web connection). However, it is slow when encrypting large volumes of data.

Elliptic Curve Cryptography (ECC)
ECC, favored by agencies including the NSA, is a fast and powerful form of data encryption employed as a component of the SSL/TLS protocol. It uses an entirely different mathematical process that allows shorter key lengths for greater speed while offering strong security. For instance, a 3,072-bit RSA key and a 256-bit ECC key offer comparable levels of security.


Common Criteria (CC)
This is not an encryption standard, but rather a series of international guiding rules for verifying that product security claims hold up under testing. Encryption was not initially covered by CC, though it is now more commonly included in the security standards outlined for the project.

CC guiding rules were established to offer third-party, vendor-neutral verification of security products. Vendors voluntarily present products for evaluation, and their functionalities are studied either individually or as a whole. Once a product has been evaluated, its capabilities and features are checked against up to seven levels of rigor and compared to a set of standards based on product type.

5 Data Encryption Best Practices


The following practices can help you ensure your data is secured effectively.

Build a Data Security Strategy

Your security approach should take into account your organization’s size. For instance, organizations with a
lot of users should be employing cloud servers to retain their encrypted data. Alternatively, small
organizations can store their media on workstations.

The following are some points to consider when developing a security approach:

Know the regulations - PII requires robust encryption to comply with government regulations. See
which other governing rules apply to your organization and how they affect your security approach.
Choose the right tools - decide which encryption tools are most suitable for your organization (consider
your organization’s needs and data volume).
Use a strong encryption algorithm - see if the algorithm or technology utilized by your encryption
vendor adheres to international standards.
Manage decryption keys - find ways to store, replace and generate keys. Also, develop strategies to
erase the encryption keys if there is a security breach.
Audit your data - decide how you will find irregularities or isolate unauthorized access to the encryption
keys.

An additional point to consider is the speed of your encryption. You don’t want to have to wait hours for
your data to be encrypted, particularly if you need to urgently transfer it over the network. Check with your
vendor to see how fast the tool can encrypt the file, but ensure security is not compromised.


Choose the Right Encryption Approach for Your Data

When deciding which data you should encrypt, you must think about the worst outcome. How much
damage and loss would take place if a certain part of the data is exposed? If the risk is unacceptable, then
you have to encrypt that data.

Data that should be encrypted irrespective of the strength of your security systems includes, for example, sensitive details such as contact information, names, credit card information, and social security numbers.

You must also ensure that files you are accessing remotely or transferring over a network are encrypted.

Control All Access to Your Data

Provide access to encryption keys to your users according to the sort of data they require. For instance, your
financial data must only be accessible by individuals from within the finance department.

Furthermore, determine what a user may access from the files. For instance, your marketing group may
access your customer’s email from the PII file, but must not be allowed to see their credit card information
or passwords.

You can achieve this by encrypting each column in a file of its own, or by altering your vault access policies.

Encrypt Data in Transit

Storage and data collection are core parts of every organization. Data retained in your system or in
dedicated servers is simpler to safeguard than files that are in transit. While data is being transferred to and
from various locations, it is advisable to employ a VPN to mask your IP address.

Here are some additional reasons for using a VPN when you transfer data:

VPNs establish an encrypted connection between the internet and your device, masking all your online
undertakings
VPNs use security protocols to protect your data and devices from attacks via public Wi-Fi
A VPN modifies your IP address, so malicious actors cannot see when files are in transit
VPNs ensure secure access to storage devices (i.e. servers, cloud network) from workstations


Build a Data Backup Strategy

If data is lost or stolen, you must be able to access the files or recover the keys employed to encrypt the
information.

Store your decryption keys in a secure location and retain a backup of all files. Keep your decryption codes
separately from your backup keys.

You may also employ a centralized key management approach to reduce the possibility of isolation. Such a
system ensures all parts of key management (such as software, hardware and processing) are in one
physical place, thus reinforcing security.

Additional points to consider when putting in place an encryption approach:

Ensure your encryption vendor lets you scale your network with minimal disruptions
Your encryption approach should support data migration, particularly if your organization plans to move
to the cloud
Make sure you can easily integrate third-party technologies without affecting security
Establish several layers of security to protect your information in the event of a data breach
Ensure your encryption approach doesn’t adversely affect the accessibility, performance or functionality
of your data

Data Encryption with Satori

Satori always keeps connections encrypted in transit while simplifying your access to sensitive data, securing your data access, and keeping your data private and well-governed. Satori can also dynamically mask and decrypt data that is stored encrypted in data stores, keeping sensitive data encrypted while at rest. Learn more about what Satori does here.


Data Tokenization: Why It’s Important and How to Make It Great
What is the Meaning of Tokenization?

Tokenization is a form of data masking, which replaces sensitive data with a different value, called a token.
The token has no value, and there should be no way to trace back from the token to the original data.
When data is tokenized, the original, sensitive data is still stored securely at a centralized location, and must
be protected.

When applied, tokenization approaches can vary according to the security applied on the original data
values. The approach should also account for the process and algorithms used to create the token as well
as establish a mapping system between tokens and original data values.

In this article:

Why Is Tokenization Important?

How Data Tokenization Works

Tokenization and PCI DSS

Tokenization vs Encryption

4 Data Tokenization Best Practices

Secure Your Token Server

Combine Tokenization with Encryption

Generate Tokens Randomly

Don’t Use a Homegrown System

Data Tokenization with Satori

Why is Tokenization Important?

A tokenization platform helps remove sensitive data, such as payment or personal information, from a
business system. Each dataset is replaced with an undecipherable token. The original data is then stored in
a secure cloud environment - separated from the business systems.

When applied in banking, tokenization helps protect cardholder data. When a business processes a
payment using a token, only the tokenization system can swap the token with a corresponding primary
account number (PAN). The tokenization system then sends it to the payment processor for authorization.
This ensures that business systems never store, transmit, or record the PAN—only the generated token.


Cloud tokenization platforms can help prevent the exposure of sensitive data. This can prevent attackers
from capturing usable information. However, tokenization is not intended to stop threat actors from
penetrating networks and information systems. Rather, a tokenization system serves as a security layer
designed especially to protect sensitive data.

How Data Tokenization Works

Tokenization processes replace sensitive data with a token. The token itself has no use and it is not
connected to a certain account or individual.

Tokenization involves replacing the 16-digit PAN of the customer with a custom, randomly-created
alphanumeric ID. Next, the process removes any connection between the transaction and the sensitive
data. This limits exposure to breaches, which is why tokenization is highly useful in credit card processing.

Tokenization processes can safeguard credit card numbers as well as bank account numbers in a virtual
vault. Organizations can then safely transmit data across wireless networks. However, to ensure the
tokenization process is effective, organizations need to use a payment gateway for secure storage of
sensitive data. A payment gateway securely stores credit card numbers and generates random tokens.
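As a highly simplified, hypothetical sketch of the vault idea in SQL (a production token vault involves far more controls, and UUID_STRING() here is Snowflake syntax): a random token is generated, the token-to-PAN mapping is stored in a protected table, and only the token leaves the vault:

-- Hypothetical token vault: the only place the real PAN is stored.
CREATE TABLE token_vault (
    token varchar,
    pan   varchar
);

-- Issue a random, meaningless token for a card number.
INSERT INTO token_vault (token, pan)
SELECT UUID_STRING(), '4111111111111111';

-- Business systems store and transmit only the token.
SELECT token FROM token_vault WHERE pan = '4111111111111111';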

Tokenization and PCI DSS

Payment card industry (PCI) standards restrict merchants from storing credit card numbers on their POS
terminal or in their databases after a transaction. PCI compliance requires that merchants install
encryption systems.

Alternatively, merchants can outsource their payment processing to a service provider offering
tokenization. In this case, the service provider issues tokens and is held responsible for keeping cardholder
data secure.

Tokenization vs Encryption

Tokenization and encryption are data obfuscation techniques that help secure information in transit and at
rest. Both measures can help organizations satisfy their data security policies as well as regulatory
requirements, including PCI DSS, GLBA, HIPAA-HITECH, GDPR, and ITAR.

While tokenization and encryption share similarities, these are two different technologies. In some cases, a
business may be required to apply only encryption, while other cases might require the implementation of
both technologies.


Here are key characteristics of encryption:

Mathematically transforms plain text - encryption processes use math to transform plain text into
cipher text. This is achieved by using an encryption algorithm and key.
Scales to large data volumes - a small encryption key is all that is needed to decrypt data of any volume.
Structured fields and unstructured data - you can apply encryption to both structured and unstructured data, including entire files.
Original data leaves the organization - encryption enables organizations to transmit data outside the
organization in an encrypted form.

Encryption is ideal for exchanging sensitive information with those who have an encryption key. However,
format-preserving encryption schemes offer lower strength.

Here are key characteristics of tokenization:


Randomly generates a token value - tokenization systems generate random token values, which then
replace plain text. The mapping is stored in a database.
Difficult to scale securely - when databases increase in size, it becomes difficult to securely scale and
maintain performance.
Structured data fields - tokenization applies to structured data, such as Social Security numbers or
payment card information.
Original data does not leave the organization - tokenization helps satisfy compliance demands that require keeping the original data within the organization.

Tokenization enables organizations to maintain formats without diminishing the security strength.
However, exchanging data can be difficult because it requires direct access to a token vault that maps
token values.

4 Data Tokenization Best Practices


Here are some practices to help you make the most of tokenization.

Secure Your Token Server

To ensure your tokenization system is compliant with PCI standards, it is crucial to secure your token server
by maintaining network segregation. If the server is not adequately protected, the entire system could be
rendered ineffective. The token server is responsible for reversing the tokenization process, so it must be
safeguarded with robust encryption.

Combine Tokenization with Encryption

Encryption service providers are increasingly turning to tokenization as a means of complementing their
encryption capabilities. While some experts favor the use of either tokenization or end-to-end encryption,
the best approach may be to combine the two. This is particularly relevant for payment card processing.


Each technology offers different functions, which achieve different purposes. Tokenization works well with
database infrastructure and provides irreversible data masking. End-to-end encryption offers greater
protection for payment card data in transit.

Generate Tokens Randomly

To maintain the irreversibility of token values, they must be generated randomly. Applying mathematical
functions to the input in order to generate the output allows it to be reversed to reveal the original data.
Effective tokens can only be used to uncover PAN data through a reverse lookup in the database of the
token server.

Generating random tokens is a simple process, because the data type and size constraints are insignificant.
PAN data shouldn’t be retrievable from tokens, so randomization should be applied by default.

Don’t Use a Homegrown System

Tokenization may appear to be simple in theory, but it is still possible to make mistakes. Tokens must be
generated and managed properly, with the token server being secured in a PCI-compliant way. All this can
be complicated to manage entirely in-house.

Homegrown tokenization deployments carry a greater risk and often fail to meet compliance standards.
Tokens may be easily deciphered if they are reversible, or if the overall tokenization system isn’t properly
secured.

Data Tokenization with Satori

With Satori you can de-tokenize data, without having the de-tokenized data pass through your data store.
For example, if you have sensitive tokenized data in Snowflake, you can de-tokenize it by using Satori,
without the clear-text sensitive data passing through your Snowflake account. To learn more, contact us.


How K Anonymity Preserves Data Privacy
Customers today entrust organizations with their personal information, which is used to provide them with
better services while enhancing the company’s decision-making. However, a lot of the value in this
information still goes unused.

This information could be instrumental in helping third-party analysts and investigators answer questions ranging from urban planning to curing deadly diseases. Therefore, companies often want to share this information with other parties without compromising the confidentiality of their customers. At the same time, they also strive to maintain the utility of the information to guarantee precise analytical outcomes. So, how do you publicly release a database without compromising individual privacy? Many data owners just omit unique identifiers such as SSN and name, hoping that this is enough. However, it’s not the right approach.

According to Prof. Latanya Sweeney, the combination of date of birth, gender, and zip code is enough to uniquely identify at least 87% of the US population in publicly accessible databases. A real privacy guarantee must be proved and established mathematically, and this is where K Anonymity helps.

In this article, we’re going to talk about:

What is K Anonymity?

K Anonymization Examples

K Anonymity For Protecting Privacy

K Anonymity Implementation

K Anonymity vs L-Diversity

K Anonymity vs Differential Privacy

What is K Anonymity?

According to the definition of K Anonymity, it is a privacy model usually applied to safeguard the subject’s confidentiality in information-sharing situations by anonymizing data. In this model, attributes are suppressed or generalized until each row is identical, with respect to the quasi-identifying attributes, to at least K-1 other rows. At this point, the database is said to be K Anonymous.

K6 Example (Making subjects indistinguishable)


K Anonymization Examples

Let’s take a look at an example of how K Anonymization works.

Table 1 shows fictitious data of 12 patients admitted to a healthcare facility.

Table 1: Original Dataset

Table 2 shows the anonymized database. In this table, we’ve applied 4-Anonymization, which means that
the dataset contains at least 4 entries for the given set of quasi-identifiers (QI).

Table 2: 4-Anonymous Dataset

This data has 4-anonymity with respect to the attributes ‘Zip Code’ and ‘Age’. That’s because, for any combination of these attributes found in any row of the table, there are always at least 4 rows with exactly the same values for those attributes.
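A property like this can be checked directly in SQL. The sketch below (against a hypothetical patients table) lists every quasi-identifier combination that violates 4-anonymity; an empty result means the table is 4-anonymous with respect to these columns:

-- Find quasi-identifier groups with fewer than K = 4 rows.
SELECT zip_code, age, COUNT(*) AS group_size
FROM patients
GROUP BY zip_code, age
HAVING COUNT(*) < 4;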


K Anonymity For Protecting Privacy

In many privacy-preserving systems, the ultimate objective is the anonymity of the individuals. On the
surface, anonymity just means to be nameless. However, when you look at it closely, you’ll quickly
understand that only eliminating names from a dataset isn’t enough to attain true anonymization.

It’s possible to re-identify anonymized data by connecting it with another dataset. The data may include
information pieces that aren’t unique identifiers themselves, but can be identified when linked with other
datasets.

K Anonymity prevents definite database linkages. It defines attributes that indirectly point to the person’s
identity as quasi-identifiers and handles data by making at least K individuals have the same combination
of QI values. As a result, at worst, the data released narrows down an individual entry to a group of K
individuals.

K Anonymity Implementation

The most common implementations of K Anonymity use transformation techniques such as generalization,
suppression, and global recoding.

1. Generalization
Generalization is the practice of replacing a specific value with a more generic one. For instance, zip codes in a dataset can be generalized into counties or municipalities (e.g. changing 13053 to 130**). Ages may be generalized into an age bracket (e.g. grouping 22 into [20-30]).

This technique reduces the identifying information that can be gleaned from the dataset by lowering an attribute’s specificity. You may think of it as sufficiently ‘widening the net.’
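A minimal SQL sketch of both generalizations, assuming a hypothetical patients table with zip_code, age, and condition columns:

-- Generalize zip codes to their first three digits and ages to 10-year brackets.
CREATE VIEW generalized_patients AS
SELECT SUBSTRING(zip_code, 1, 3) || '**' AS zip_code,
       FLOOR(age / 10) * 10 AS age_bracket_start,
       condition
FROM patients;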

2. Suppression
Suppression is the process of eliminating an attribute’s value completely from a dataset. In the above
example of age information, suppression would mean eliminating age data from every cohort completely.

Bear in mind that suppression should only be used for data points that aren’t relevant to the purpose of the
data collection. For instance, if you’re gathering data to determine at which age individuals have the most
chances of developing a particular illness or condition, suppressing the age data would make the data itself
useless.

Suppression should be applied to mostly irrelevant data points, particularly on a case-by-case basis, instead
of using a set of overarching rules that apply universally.


3. Global Recoding
In this method, continuous or discrete numerical variables can be grouped into a predefined class. It
means that a given specific value is replaced with a more generic value that can be chosen from anywhere
in the whole dataset.

For instance, if global recoding is performed on a dataset, the zip code will be generalized regardless of
gender or any other descriptive variable. The recoding process can also be single-dimensional or
multi-dimensional.

In single-dimensional recoding, each attribute is individually mapped (such as zip code).


In multi-dimensional recoding, the mapping can be performed on a function of numerous attributes
together, like in quasi-identifiers (such as zip code, gender, and date of birth).

K Anonymity vs L-Diversity

L-diversity is a form of group-based anonymization used to maintain privacy in datasets by decreasing the
granularity of a data representation model via methods including generalization and suppression.

The L-diversity model is often used as a yardstick to measure whether K Anonymization efforts have gone
far enough to avoid re-identification. A dataset is said to satisfy L-diversity if there are at least L
well-represented values for every sensitive attribute in every group of records that share key attributes.

In other words, any attribute that is considered sensitive, such as what medical conditions a person has, or
whether a learner passed or failed an exam, takes on at least L distinct values within each group of K records
that share quasi-identifiers.

This model protects privacy even when the data holder or publisher doesn’t know what information a
malicious party may already have about the subjects in a dataset. It helps achieve true anonymity as the
values of sensitive attributes are well-represented in every group.
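Like k-anonymity, L-diversity can be checked with a grouped query. A sketch over the same hypothetical patients_anonymized table, treating diagnosis as the sensitive attribute and L = 3:

-- Find QI groups where the sensitive attribute takes fewer than 3 distinct values
SELECT zip_code, age_bracket, COUNT(DISTINCT diagnosis) AS l_value
FROM patients_anonymized
GROUP BY zip_code, age_bracket
HAVING COUNT(DISTINCT diagnosis) < 3;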

K Anonymity vs Differential Privacy

Differential privacy is a system that allows you to publicly share info about a dataset by defining the
patterns of groups within the dataset while suppressing info about people in that dataset.

The main idea is that if the consequence of making one arbitrary replacement in the database is
sufficiently small, the query result can’t be used to deduce much about any single subject, and thus
ensures confidentiality.

In other words, differential privacy is a limitation on the algorithms used to share aggregate info about a
statistical database that constrains the disclosure of sensitive info in the database.


For instance:

Some government agencies use differentially private algorithms to share demographic info or other
statistical aggregates while guaranteeing the discretion of survey responses.
Many businesses also use differentially private algorithms to gather information about customer
behavior while regulating what can be accessed even by in-house analysts.

To safeguard the privacy of subjects, differential privacy adds noise to the data to disguise the real values,
and therefore, makes them private. By doing so, it conceals the subject’s identity with little to no influence on
the information utility. This means that the statistical results from the dataset shouldn’t be affected by a
subject’s contribution as the information represents the characteristics of the whole population.
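In practice, the noise is often drawn from a Laplace distribution whose scale is calibrated to the privacy budget (epsilon). The sketch below is illustrative only; it uses inverse-CDF sampling in plain SQL against a hypothetical patients table, with sensitivity 1 and epsilon = 0.1:

-- Laplace(0, b) noise via X = -b * sign(u) * ln(1 - 2|u|), where u ~ Uniform(-0.5, 0.5)
WITH draw AS (SELECT RANDOM() - 0.5 AS u)
SELECT (SELECT COUNT(*) FROM patients WHERE diagnosis = 'Influenza')
       - (1.0 / 0.1) * SIGN(u) * LN(1 - 2 * ABS(u)) AS noisy_count
FROM draw;

Production systems rely on vetted differential privacy libraries rather than hand-rolled noise, but the principle is the same: the published count differs from the true count by random noise calibrated to epsilon.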

Conclusion

Data privacy has garnered a lot of attention in recent years. With leaks of customer data a constant concern,
businesses use different approaches to protect their user data.

Many businesses collect customer information for in-house usage, and they sometimes make this
information publicly accessible through datasets. To safeguard the customer’s identity, data engineers use
K anonymization, differential privacy methods, and other approaches to protect the customers’ private info.

K Anonymity is a robust tool when applied correctly and with the right protections implemented, such as
access control. It contributes significantly to privacy-improving technologies, together with alternative
methods like differential privacy algorithms. With big data becoming the norm, we see growing data
dimensionality along with more and more public datasets that can be used to support re-identification
efforts.


Data Generalization: The Specifics of Generalizing Data

Data mining is not a new concept that emerged with the digital revolution. The idea has been around for
about a century, although it gained prominence in the 1930s. In 1936, Alan Turing proposed a universal
machine that could perform computations comparable to those of modern computers, laying early
conceptual groundwork for data mining.

Data Mining, also called Knowledge Discovery in Data (KDD), is a technique for extracting patterns and
other useful information from huge data sets. Because of the advancements in data warehousing
technologies and the rise of big data, the use of data mining techniques has exploded in recent decades,
supporting businesses in turning raw data into valuable knowledge.

In this article, you will receive an in-depth view of a concept closely tied to data mining - data
generalization. Specifically:

What is Data Generalization?

When is Data Generalization Important?

Data Generalization vs. Data Mining

Data Generalization vs. Data Aggregation

Approaches to Data Generalization

Examples of Data Generalization

What is Data Generalization?

When faced with the question of generalization in data mining, one can simply answer that data
generalization is the process of broadening the classification of data in a database. This helps a user zoom
out from individual records to see a broader picture of trends and insights.

Here is an example of generalization in data mining: if you have a data set containing people’s ages, the
generalization process would replace each specific age (such as 34) with a broader bracket (such as 30-40).

Data generalization in data mining substitutes a precise value with a less accurate one, which may appear
counterintuitive. Still, it is a widely used and practical technique in data mining, analysis, and secure storage.


Two Forms of Data Generalization in Data Mining

There are two main forms of data generalization in data mining: Automated and Declarative.

Automated Generalization distorts values until a given value of k is reached. Because an algorithm can
apply the least amount of distortion required to obtain the stated value of k, this method may offer the
optimal balance between privacy and accuracy. You can select which fields are of most importance for your
use case, and those values can be blurred using one of various approaches to achieve any value of k.

Declarative Generalization, on the other hand, allows you to set the bin sizes upfront, such as always
rounding to entire months. Outliers sometimes get discarded from this procedure, which might skew the
data and add bias. You must also remember that declarative generalization does not always lead to
k-anonymity.

Even so, it is a good idea to use declarative generalization as a default, so that the recipient of the
de-identified material only sees the level of detail they need.
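A sketch of declarative generalization in SQL, with the bin sizes fixed upfront regardless of the k that results (the patients table and its columns are hypothetical):

-- Always round dates to whole months and ages to 5-year bins
SELECT
    DATE_TRUNC('month', admission_date) AS admission_month,
    (age / 5) * 5 AS age_bin
FROM patients;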

Identifiers used in Data Generalization in Data Mining

Identifiers are data points about a subject that can determine their identity and link to other personal
information. There are two main types of identifiers: direct identifiers and quasi-identifiers.

Direct identifiers are data points that can identify an individual while allowing other data to link to that
person. A data point can be a direct identifier even if the same value appears multiple times in the data. For
example, even if two people are named “Mary,” the name is still a direct identifier.

Quasi-identifiers, on the other hand, do not allow you to identify a person on their own, but you can use
them in conjunction with additional information to do so. Quasi-identifiers can be unique within a data
collection, and they are also likely to appear in other data sets, now or in the near future.

Suppose you have a data set that includes a person’s gender and zip code. There will be enough people of
that gender who live in that zip code that this person cannot be identified based on those two data factors
alone. However, suppose that person also appears in another data collection, including their gender, zip
code, and date of birth. Linking the two data sets may then narrow things down enough to identify them.

When is Data Generalization Important?

Data generalization in data mining allows you to abstract personal data by removing identifying
characteristics.


This generalization allows you to examine the data you have collected without jeopardizing the privacy of
the people in your dataset.
you should choose the one that makes the most sense for your case. In some circumstances, masking
direct identifiers is the best course of action, while in others, you want to keep the signal in data analytics.

Remember that there is no one-size-fits-all solution for retaining privacy. Due to this fact, you should learn
about different approaches like tokenization, redaction, and pseudonymization. Once you understand
those concepts, you can apply them as needed to get the most out of your data without jeopardizing
privacy.

Data Generalization vs. Data Mining

Treading the line between data generalization vs. data mining need not be difficult.

Data generalization is the process of summarizing data by replacing relatively low-level numbers with
higher-level concepts. In contrast, data mining involves investigating and analyzing vast data blocks to
uncover relevant patterns and trends. Data generalization is a type of descriptive data mining, to put it
simply.

Data Generalization vs. Data Aggregation

Data aggregation is a notion linked to, and frequently confused with, data generalization in data mining.

When treading the line between data generalization vs. data aggregation, the primary distinction is that
aggregation combines many values into a single summary value (for example, a total or an average),
whereas generalization replaces a specific value with a broader class that contains it.

Put simply: aggregation summarizes many records into one figure, while generalization makes each record
less specific.

Approaches to Data Generalization

There are two basic approaches to Data Generalization in Data Mining:


Data Cube Approach


In most cases, a data cube makes data easier to understand. It is very helpful when displaying data with
dimensions as specific gauges of business needs. Every cube dimension reflects a different aspect of the
database, such as daily, monthly, or yearly sales.

The data in a cube can be analyzed for virtually any combination of customers, sales agents, products, and
other dimensions. As a result, a data cube can assist in identifying trends and analyzing performance.

In a nutshell:
It is also known as the OLAP (Online Analytical Processing) approach.

It is a practical strategy because it helps build views such as historical sales graphs.

The data cube is used to hold the precomputed aggregations and results in this method.

Roll-up and drill-down procedures are employed on the data cube.

Aggregate functions like count(), sum(), average(), and max() are commonly used in these procedures.

These materialized views can then be used for decision-making, knowledge discovery, and various other
purposes.

Attribute Oriented Induction

Attribute Oriented Induction is a database mining technique that compresses the original data collection
into a generalized relation, providing a concise and comprehensive summary of large datasets.

Moreover, attribute generalization in data mining allows for the transition of similar data collections, origi-
nally stated at a low (primitive) level in a database, into more abstract conceptual representations.

In a nutshell:
Attribute generalization in data mining is a query-oriented, generalization-based technique for online data
analysis.

Generalizations are made based on the distinct values of each attribute within the relevant data set.
Identical tuples are then merged and their corresponding counts accumulated to perform aggregation.

It aggregates at query time, in contrast with the data cube approach, which performs offline aggregation
before an OLAP or data mining query is submitted for processing.

It is not restricted to particular measures or to categorical data.

Attribute-Oriented Induction uses two methods:

Attribute removal

Attribute generalization


Examples of Data Generalization

Market Basket Analysis is one of the most well-known examples of data generalization in data mining.
Market Basket Analysis is a method for analyzing the purchases made by a customer in a supermarket.

The idea is to identify the items that a customer buys together. What are the chances that if a person buys
bread, they will also buy butter? This analysis aids in the promotion of company offers and discounts, and
data mining techniques are used to perform it.

Moreover, business reporting for sales or marketing, management reporting, business process manage-
ment (BPM), budgeting and forecasting, financial reporting, and similar sectors commonly use Market
Basket Analysis. However, other sectors such as Agriculture are beginning to find new ways of using this
analysis as well.

Conclusion

With the realization of the significance of data, businesses are continuously finding ways to use and leverage
data to their advantage. As a result, data scientists have become increasingly important to companies
worldwide as they strive to achieve greater heights with data science than ever before. However, with this
comes the need to protect the privacy of individuals and maintain compliance, which brings the need for
data generalization, as well as other data anonymization strategies.


Data De-Identification
With the increasing demand for government-held data, organizations need effective processes and
procedures for removing personal information. A vital tool in this regard is data de-identification, which
involves removing personal information from a data set or record. It protects people’s privacy because, after
de-identification, a data set is considered to no longer include personal information.

In this article, we are going to discuss:


What is Data De-identification?

When & Why is Data De-identification Important?

Data De-identification vs Data Tokenization

Data De-identification vs Data Masking

Data De-identification vs Anonymization

Approaches to Data De-identification

Data De-identification Tools

What is Data De-identification?

Data de-identification is the process of eliminating Personally Identifiable Information (PII) from any
document or other media, including an individual’s Protected Health Information (PHI).

De-identification of data is the quickest and easiest way to ensure compliance and identification security
on communication methods that could be accessed by outsiders or the public.

Data de-identification allows information to be utilized by others without the likelihood of individuals being
recognized. It may be used to:

Safeguard the privacy of people and organizations, such as companies

Build community trust in how agencies store and handle data

Guarantee that the spatial location of archaeological or mineral findings or endangered species isn’t
publicly accessible

Reduce risk and minimize the damage caused to people from a data breach



When & Why is Data De-identification Important?

The main goal of data de-identification is to safeguard the confidentiality of people. If a record includes any
type or amount of personal data, it can’t be considered de-identified.

At the same time, one of the key reasons for releasing de-identified data is to give others a chance to study
the characteristics and values of the raw data for research purposes.

For instance, a state-owned educational organization may employ an agency to study the outcomes or
influence of educational policy like a recent expansion of state-subsidized kindergarten programs. The
investigators would then request access to data needed to conduct their study (such as records showing
the number of students enrolled in kindergarten programs over 10 years).

However, before providing access to the records, the educational organization would de-identify the data to
avoid individual identities from being exposed in the data provided to the external agency.

Therefore, data de-identification techniques should also focus on preserving as much value in the info as
possible, while safeguarding the privacy of people. This twofold purpose of de-identification makes it a
significant tool to be used in several contexts, including:

Responding to access to information requests in a privacy-protective manner.

Supporting improved marketing based upon customer activity data without disclosing info about the
individual customers from whom the information was gathered.

Open data initiatives that seek to promote research, innovation, and the development of new applications
and services.

Allowing for groundbreaking energy research with data on energy consumption that won’t disclose the
corresponding users.

Data sharing within and among organizations to break down silos.

Supporting leading-edge healthcare research with patient information without violating patient privacy.

Allowing libraries to maintain the privacy of their visitors concerning their reading and viewing activities
while maintaining trends and statistics about which items they access and read.

Data De-identification vs Data Tokenization

Data tokenization is a process of substituting personal data with a random token. Often, a link is main-
tained between the original information and the token (such as for payment processing on sites). Tokens
can be completely random numbers or generated by one-way functions (such as salted hashes).

Unlike encrypted data, tokenized data can’t be deciphered or reverse engineered. That’s because there’s no
mathematical relationship between the token and its original number. Simply put, tokens can’t be returned
to their initial form.
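Both variants can be sketched in SQL. The table, column, and salt names below are hypothetical, and SHA2 is the Redshift-style hash function used elsewhere in this guide:

-- Variant 1: fully random tokens, with the mapping held in a restricted vault table
CREATE TABLE token_vault AS
SELECT card_number, MD5(RANDOM()::VARCHAR) AS token
FROM (SELECT DISTINCT card_number FROM payments) c;

-- Variant 2: one-way salted hash, no vault needed (and no way back to the original)
SELECT SHA2(card_number || 'per-dataset-salt', 256) AS token, amount
FROM payments;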


Data De-identification vs Data Masking

Data Masking is a technique that removes or hides information, replacing it with realistic replacement data
or fake information. The objective is to create a version that can’t be decoded or reverse engineered. There
are a number of ways to change the data, including encryption, character shuffling, and word or character
replacement.

Masking is usually applied to things that directly identify an individual like their name and phone number.
However, the information must remain usable, appear real, and look consistent. For example, in a call
center, masking may be used so that operators can’t view credit card numbers in billing systems.

Data De-identification vs Anonymization

Data Anonymization is a kind of data sanitization process that intends to protect the privacy of individuals.
It is the process of removing PII from data sets to maintain the anonymity of individuals whom the data
describe. It is often the preferred method for making structured medical datasets secure for sharing.

For instance, you can run personal information such as names, addresses, and social security numbers
through a data anonymization process that preserves the data but keeps the source anonymous.

Here’s a table that gives you a snapshot of how de-identification, anonymization, tokenization, and masking
compare with one another.

Approaches to Data De-identification

When discussing de-identification techniques, it’s important to understand the two kinds of identifiers:
direct identifiers and quasi-identifiers (also called indirect identifiers).


Direct identifiers are variables that can uniquely identify a person, such as names, email addresses, and
social security numbers.

Quasi-identifiers are variables that can identify a person but are also beneficial for data analysis. For
instance, dates, demographic info (like race and origin), and socioeconomic variables.

Understanding this difference is significant as the approaches used to secure the identifiers will depend on
how they are categorized.

Now let’s take a look at some of the commonly used data de-identification techniques:

Redacting information, including via pixelation in digital recording and video footage

Omission (omitting data in the data set such as full names)

Differential privacy (describing or analyzing the patterns of groups within the data set while concealing
data about individuals)

Aggregating data

Suppression (removing values from the data set or substituting some values with ‘missing’)

Data swapping (for instance, swapping salaries between individuals within the same area, so the aggregate
is still valid)

Coding or pseudonymization (substituting identifiers with unique, temporary IDs or codes)

Hashing (one-way encryption of identifiers)

Micro-aggregation (forming groups with a certain number of observations and replacing the individual
values with the group mean; for example, group in threes, so ages 21, 22, and 23 each become 22)

Removing some variables

Generalization (such as by substituting the exact birth date with a month and a year)

k-anonymization (defining attributes that indirectly point to the person’s identity as quasi-identifiers (QI)
and handling data by making at least k individuals have the same combination of QI values)

Adding noise (generating a new variable with mean zero and positive variance and adding it to the
original variable)
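Several of these techniques reduce to simple column expressions. A minimal SQL sketch combining coding, generalization, and noise addition over a hypothetical employees table:

SELECT
    MD5(name || 'dataset-salt') AS pseudonym,                  -- coding via a salted hash
    DATE_TRUNC('month', birth_date) AS birth_month,            -- generalizing birth dates
    salary + ROUND((RANDOM() - 0.5) * 2000) AS salary_noisy    -- additive noise with mean zero
FROM employees;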

Data De-identification Tools

De-identification can be an intricate and technically challenging process. However, there are several
automated data de-identification tools or software that can facilitate the process. Some of these tools
include:

ARX Data Anonymization Tool

ARX is a comprehensive open-source tool that anonymizes sensitive personal information. It supports an
extensive range of privacy and risk models, techniques for data transformation, and techniques to analyze
the utility of output data. This tool is used in a variety of contexts including research projects, commercial
big data analytics platforms, clinical trial information sharing, and for training purposes.


deid software package

The deid software package includes code and dictionaries that automatically locate and remove PHI in free
text from medical records. It was developed and tested using a gold standard corpus of over 2,400 nursing
notes that were methodically de-identified by a multi-pass process including various automated methods
as well as scrupulous reviews by multiple experts working autonomously.

Google Differential Privacy (DP) Library

Google’s DP library offers a set of building blocks that enable developers to build differentially private
applications in C++, Go, and Java. It allows developers to easily access metrics related to how successfully
their apps are engaging their consumers, such as Daily Active Users and Revenue per Active User, in a way
that helps ensure individual users can’t be identified or re-identified.

Examples of Data De-identification

Let’s take a look at some examples of de-identification.

In the table above, we have:

Replaced individual names with unique codes so that they become unidentifiable.

Replaced ages with age brackets through the process of generalization.

Omitted street names to hide part of each individual’s current location.

Data De-Identification With Satori

To learn more about how Satori helps you solve privacy, security and governance challenges, visit our
product page.


Pseudonymisation:
9 Ways to Protect Your PII
What Is Pseudonymisation?

Pseudonymisation is a way of masking data that ensures it is not possible to attribute personal data to a
specific person, without using additional information subject to security measures. It is an integral part of
the EU General Data Protection Regulation (GDPR), which has several recitals specifying how and when
data should be pseudonymized.

The term personal data applies to information related to an individual known as a data subject. Data
subjects are identifiable based on attributes such as a person’s name, ID number, or location, or specific
identity factors such as the physical, genetic, physiological, mental, cultural, economic, or social characteris-
tics of the individual.

In this article:

Which GDPR Recitals Mention Pseudonymization?

5 Pseudonymization Techniques

Data Scrambling

Data Masking

Data Encryption

Data Tokenization

Data Blurring

4 Pseudonymization Policies

Deterministic Pseudonymization

Randomized Pseudonymization

Document-Randomized Pseudonymization

Establishing Your Pseudonymization Techniques and Policies

Pseudonymisation with Satori

Which GDPR Recitals Mention Pseudonymization?

The GDPR requires the implementation of pseudonymisation for the purpose of protecting personally
identifiable information (PII). The GDPR includes data protection principles that apply to identified or
identifiable individuals.

The GDPR treats personal data that is attributable to a data subject as PII, even if it has undergone
pseudonymization and requires additional information to re-identify. Organizations must consider all
reasonable means by which pseudonymized data could be traced back to data subjects using additional
information.


Note that data protection regulations don’t apply to anonymous data, because it cannot be traced to a data
subject.

Here are several GDPR recitals related to pseudonymization:

GDPR recital 28

This recital recommends using pseudonymization to reduce any risks threatening data subjects as well as
help data processors and data controllers meet data protection duties and achieve compliance. However,
pseudonymization is not a substitute for other data security and protection measures.

GDPR recital 29

This recital offers incentives to encourage data controllers to apply pseudonymization. Additionally, GDPR
recital 75 refers to unauthorized reversal of pseudonymization as a risk and violation of an individual’s
freedoms.

GDPR recital 78

This recital recognizes pseudonymization of PII as a means of demonstrating GDPR compliance. This is
similar to demonstrating compliance by adhering to data protection standards and other codes of conduct
that may impact assessments.

GDPR recital 85

This recital refers to unauthorized reversals of pseudonymization as a personal data breach that can trigger
a notification duty for the controller. Additionally, GDPR recital 156 states that pseudonymization of data is
a safeguard controllers can use when determining whether further processing of personal data is feasible.
This applies to processing for archiving purposes, as well as for historical, statistical, or scientific research
purposes, when the identification of data subjects is not permitted.

5 Pseudonymization Techniques

Here are several technical methods you can use to pseudonymize sensitive data.

Data Scrambling

This technique involves mixing and obfuscating letters. For example, the name Jonathan can be scrambled
into ‘Tojnahna’.


Data Masking

Data masking involves hiding important or unique parts of the information through the use of random
characters or other data. It makes it possible to work with the data without exposing actual identities. For
example, the credit card number “5600-0000-0000-0003” can be stored as “XXXX-XXXX-XXXX-0003”.
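This kind of partial masking is often a one-line expression. A sketch over a hypothetical payments table:

-- Keep only the last four digits of the card number
SELECT 'XXXX-XXXX-XXXX-' || RIGHT(card_number, 4) AS card_masked
FROM payments;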

Data Encryption

Encryption involves rendering original data into an unintelligible form. Ideally, this process cannot be
reversed without using the correct decryption key. The GDPR requires keeping additional information,
including the decryption key, separately from pseudonymized data.

Learn more in our detailed guide to data encryption

Data Tokenization

Tokenization processes replace sensitive information with a random token value, which is used to access
the original information. Tokens have no connection to the original information and can be used on a
one-time basis to increase their level of security. Tokens also enable organizations to minimize their access
to sensitive information and any related liability.

Learn more in our detailed guide to data tokenization

Data Blurring

This technique involves using an approximation of values to obscure the original meaning of the data, which
can also make it impossible to identify the individuals involved. For example, a blurred face in an image.


4 Pseudonymization Policies

Pseudonymization policies are different approaches to substituting real data with other data. Each policy
may have implications on the ease of implementation and the rigorousness of data protection.

Deterministic Pseudonymization

This policy requires replacing the original information with an identical substitution across all databases
and whenever it appears. This ensures the substitution is consistent within the database and between
multiple databases. When implementing this policy, you first extract the list of unique identifiers from the
database. Next, you map the list to substitutions. Finally, you substitute the original information in the
database.
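These three steps map naturally onto SQL. A sketch using hypothetical patients and pseudonym_map tables:

-- Steps 1 and 2: extract the unique identifiers and map each one to a substitute
CREATE TABLE pseudonym_map AS
SELECT name,
       'subject_' || CAST(ROW_NUMBER() OVER (ORDER BY name) AS VARCHAR) AS pseudonym
FROM (SELECT DISTINCT name FROM patients) n;

-- Step 3: substitute the original identifier consistently wherever it appears
SELECT m.pseudonym, p.diagnosis
FROM patients p
JOIN pseudonym_map m ON m.name = p.name;

Reusing the same pseudonym_map is what keeps the substitution consistent across databases; the map itself must be protected as strictly as the original identifiers.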

Randomized Pseudonymization

This policy replaces any occurrence of the original information within the database with a fully-randomized
substitution. The policy can serve as an extension of document-randomized pseudonymization: although
the two policies behave similarly when applied once to one document, fully-randomized pseudonymization
will produce a different output each time it is applied to the same document.

Document-randomized pseudonymization, on the other hand, results in the same output being repro-
duced. This means that document-randomized pseudonymization applies selective randomness, while
fully-randomized pseudonymization applies randomness globally to any record.

Document-Randomized Pseudonymization

This policy replaces the original information with a different value every time it appears in the database.
However, the original information is always mapped to the same set of substitutions in the dataset.


In this case, the substitution is consistent only between different databases. The mapping table is created
using all identifiers stored in the database, and each occurrence of an identifier is treated independently.

Establishing Your Pseudonymization Techniques and Policies

Here are several different parameters that can help you choose a pseudonymisation technique and policy:

The data protection level—RNG, encryption, and message authentication codes are generally considered
stronger techniques. However, pseudonymization can offer additional protection. Fully-randomized
pseudonymisation policies offer the highest protection level. However, they prevent comparisons between
databases.

The utility of the pseudonymized dataset—utility requirements might entail using a combination of several
approaches and variations of a chosen approach. Document-randomized and deterministic functions offer
utility. However, they enable records to be linked.

Pseudonymisation with Satori

Satori provides several ways to protect sensitive data, such as dynamic masking, de-tokenization and
decryption. This is done universally, regardless of the data platforms you may be using. This decoupled
approach allows you to streamline access to sensitive data without delays. To learn more, contact us.


Data Anonymization: Use Cases and 6 Common Techniques

What is Data Anonymization?

Data anonymization is a method of information sanitization, which involves removing or encrypting
personally identifiable data in a dataset. The goal is to ensure the privacy of the subject’s information.
Data anonymization minimizes the risk of information leaks when data is moving across boundaries. It
also maintains the structure of the data, enabling analytics post-anonymization.

The European Union’s General Data Protection Regulation (GDPR) demands the pseudonymization or
anonymization of stored information of individuals living in the EU. Anonymized data sets are not
classified as personal data, and so are not subject to the rules of GDPR. This permits organizations to
use the information for broader purposes while remaining compliant and protecting the rights of the
data subjects.

Data anonymization is also a core component of HIPAA requirements. HIPAA is a US regulation governing
the use of Protected Health Information (PHI) in the healthcare industry and its partners.

This is part of our series of articles about data masking.

In this article:

Data Anonymization Use Cases

What Data Should Be Anonymized?

6 Data Anonymization Techniques

Data Masking

Pseudonymization

Generalization

Data Swapping

Data Perturbation

Synthetic Data

The information provided in this article and elsewhere on this website is meant purely for educational
discussion and contains only general information about legal, commercial and other matters. It is not legal
advice and should not be treated as such. Information on this website may not constitute the most
up-to-date legal or other information.

The information in this article is provided “as is” without any representations or warranties, express or
implied. We make no representations or warranties in relation to the information in this article and all
liability with respect to actions taken or not taken based on the contents of this article are hereby expressly
disclaimed.


You must not rely on the information in this article as an alternative to legal advice from your attorney or
other professional legal services provider. If you have any specific questions about any legal matter you
should consult your attorney or other professional legal services provider.

This article may contain links to other third-party websites. Such links are only for the convenience of the
reader, user or browser; we do not recommend or endorse the contents of any third-party sites.

Data Anonymization Use Cases

Typical cases of data anonymization include:

Medical research - researchers and healthcare professionals examining data related to the prevalence of a
disease among a certain population would use data anonymization. This way they protect the patient’s
privacy and adhere to HIPAA standards.

Marketing enhancements - online retailers often seek to improve when and how they reach their
customers, via digital advertisement, social media, emails, and their website. Digital agencies use insights
gained from consumer information to meet the increasing need for personalized user experience and to
refine their services. Anonymization allows these marketers to leverage data in marketing while remaining
compliant.

Software and product development - developers need to use real data to develop tools that can deal with
real-life challenges, perform testing, and improve the effectiveness of existing software. This information
should be anonymized because development environments are not as secure as production environments,
and if they are breached, sensitive personal data is not compromised.

Business performance - large organizations often collect employee-related information to increase
productivity, optimize performance, and enhance employee safety. By using data anonymization and
aggregation, such organizations can access valuable information without causing employees to feel
monitored, exploited, or judged.

What Data Should Be Anonymized?

Not all datasets need to undergo anonymization. Every database administrator should identify which
datasets need to be made anonymous and which data can safely remain in their original form.

Choosing which datasets to anonymize may seem straightforward. However, “sensitive data” is a subjective
idea that changes according to the individual and the sector. For example, contact information could be
seen as impersonal by a marketing agency’s manager; however, it may be viewed as highly sensitive by
security personnel.

Most compliance standards and organizational policies agree that Personally Identifiable Information (PII)
should be treated as sensitive data and stored safely. Thus, such information is a perfect candidate for
anonymization. This still leaves some room for interpretation, because PII might mean different things in
different industries, and there is also debate around the legal definition of PII in different territories.


There is a broad consensus that certain data is deemed PII, irrespective of legal or industry influence. This
includes:

Name - no matter the context, a name is the most significant key identifier in a data set. If this information
is obtained by a cybercriminal, they can readily trace the source of a data set - even an encoded one. Thus,
names must be anonymized.

Credit card details - this field deals with credit card numbers, other details like expiration date and CVV,
and credit card tokens. They are regarded as highly personal, are unique to the individual, and can have
financial implications for the individual if compromised. They must always be protected.

Mobile numbers - if a cybercriminal gains access to a mobile number they could also gain access to
additional, more sensitive data about the individual. Thus, personal phone numbers should always be
anonymized.

Photograph - photographs are the perfect means of identification. Often, photographs are collected to
verify identity and to ensure security. A dataset containing photos of individuals must be safeguarded, and
thus it is a strong candidate for anonymization.

Passwords - a cybercriminal could easily impersonate someone and gain access to private data by
compromising their password. In any backend structure created to store passwords, you should encrypt
and/or anonymize the data.

Security questions - such data sets are also key identifiers. Many software services and web applications
use these questions as a step towards granting user access. Given this, it is important to encrypt them.

6 Data Anonymization Techniques


The following are common techniques you can use to anonymize sensitive data.

Data Masking

Data masking involves allowing access to a modified version of sensitive data. This can be achieved by
modifying data in real time, as it is accessed (dynamic data masking) or creating a mirror version of the
database with anonymized data (static data masking). Anonymization can be performed via a range of
techniques, including encryption, term or character shuffling, or dictionary substitution.

Pseudonymization

Pseudonymisation is a method of data de-identification. It replaces private identifiers with pseudonyms or
false identifiers; for example, the name “David Bloomberg” might be replaced with “John Smith”. This
ensures data confidentiality and statistical precision.

Related content: Read our guide to pseudonymisation (coming soon)


Generalization

Generalization requires excluding certain data to make it less identifiable. Data could be changed into a
range of values with logical boundaries. For instance, the house number at a specific address could be
omitted, or replaced with a range within 200 house numbers of the original value. The idea is to remove
certain identifiers without compromising the data’s accuracy.

Related content: Read our guide to data generalization

Data Swapping

Data swapping, also called shuffling or data permutation, rearranges dataset attribute values so that they
don’t match the initial information. Switching columns (attributes) that feature recognizable values,
including date of birth, can greatly influence anonymization.
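A sketch of shuffling a single column in SQL, assuming a hypothetical employees table with an id key:

-- Shuffle salaries independently of the rest of each row, breaking row linkage
-- while preserving the overall distribution of the column
WITH keys AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY RANDOM()) AS rn FROM employees
),
vals AS (
    SELECT salary, ROW_NUMBER() OVER (ORDER BY RANDOM()) AS rn FROM employees
)
SELECT k.id, v.salary AS swapped_salary
FROM keys k
JOIN vals v ON v.rn = k.rn;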

Data Perturbation

Data perturbation changes the initial dataset slightly by using rounding methods and random noise. The
noise added must be proportional to the values being disturbed. It is important to carefully select the base
used to modify the original values - if the base is too small, the data will not be sufficiently anonymized, and
if it’s too large, the data may not be recognizable or usable.
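A sketch of both perturbation styles in SQL, over a hypothetical employees table:

SELECT
    (salary / 1000) * 1000 AS salary_rounded,            -- rounding down to the nearest 1,000
    age + ROUND((RANDOM() - 0.5) * 6) AS age_perturbed   -- uniform noise of roughly +/- 3 years
FROM employees;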

Synthetic Data

Synthetic data is algorithmically produced data with no connection to any real case. The data is used to
create artificial datasets rather than utilizing or modifying the original dataset and compromising
protection and privacy.

This method uses mathematical models based on patterns or features in the original dataset. Linear
regressions, standard deviations, medians, and other statistical methods may be employed to create
synthetic outcomes.

Automated Data Classification with Satori

Satori is the first DataSecOps platform that does automated data classification and sensitive data discovery
by continuously examining your data access. This is done without adding any database objects, and helps
discover new sensitive data immediately, instead of on a scheduled scan. To read more, visit our product page.


Dynamic Data Masking: 8 Benefits and 3 Examples of Usage

What Is Dynamic Data Masking?

Dynamic data masking (DDM) is a technique for protecting sensitive data from exposure to unauthorized
users. Data masking can help simplify application design and secure coding by making data unreadable to
anyone without the proper privileges.

Dynamic data masking lets you specify the extent of sensitive data revealed to unauthorized users,
minimizing the impact on the application layer. You can configure DDM on designated database fields to
hide sensitive data in query result sets. DDM does not alter the information in the database but only masks
it to limit who can see it.

You can easily apply dynamic data masking to an existing application, applying masking rules in query results.
Many applications support data masking without requiring modifications to existing queries.

In this article:
8 Benefits of Dynamic Data Masking

Static Data Masking (SDM) vs Dynamic Data Masking (DDM)

3 Databases That Provide Dynamic Data Masking

SQL Server DDM

Snowflake DDM

Redshift DDM

Dynamic Data Masking with Satori

8 Benefits of Dynamic Data Masking

A central data masking policy deals directly with the sensitive fields in a database, assigning roles to users
without access privileges to the data. Dynamic data masking helps organizations mitigate the risks
associated with a data breach, providing the following benefits:

1. Minimizes the exposure of sensitive data by masking it to non-privileged users, including database
administrators and developers who require access to the database.

2. Enables organizations to specify how much of the sensitive data to reveal or restrict, with the policy
configured quickly on the server rather than in each application.

3. Supports database configuration to conceal sensitive data in query results without changing the data in
the database.

4. Applies masking rules in query result sets, making DDM easy to implement in existing applications.


5. Provides both partial and full masking options and allows you to protect numeric data with random
masking functions.

6. Helps enforce data privacy standards to maintain regulatory compliance.

7. Provides transparency to applications, with masking applied based on user privileges.

8. Provides agility, masking data on the fly while keeping underlying data intact in the database.

Related content: Read our guide to data anonymization (coming soon)

Static Data Masking (SDM) vs Dynamic Data Masking (DDM)

Static data masking (SDM) lets you create sanitized copies of your existing database, removing any
identifying information. The copy contains altered data that you can safely share without risking a privacy
breach. You create an entirely new database, devoid of any unnecessary data, and move it to a separate
location where you can mask the data and share it with a specific audience.

However, SDM can make it harder to maintain a single source of truth for your data. Having multiple,
different copies of your original data can result in data silos and confusion.

On the other hand, dynamic data masking allows you to stream data directly from the original location to
different systems (i.e., a development or testing environment). You don’t need to store data in a secondary
environment, as you would with SDM.

DDM thus allows operations and data engineering teams to maintain control over sensitive data, including
a single clear source of truth. DDM also offers scalability because you don’t need to copy or move data
when the number of users of data sets increases.

3 Databases That Provide Dynamic Data Masking

DDM functionality is offered by many modern databases and data warehouses. Here are some examples of
popular databases that provide DDM functionality.

SQL Server DDM

This feature lets you mask sensitive data in the result sets of queries. With masking rules applied to query
results, you can configure DDM on specific fields in the database to conceal data. You can also specify which
roles and users may view the unmasked data.

SQL Server 2016 or later and Azure SQL Database support DDM. You can apply DDM to existing applications
without modifying the data you mask. Most DDM processes allow for random, partial, and full data
masking. They also support simple Transact-SQL commands that you can use to specify masking jobs and
manage your DDM policy.

You can use four types of data masking on SQL Server:


Default masking - allows you to create rules to mask the entire value when a user with a read-only
privilege queries the data. To implement this function, you define the function name with empty
parentheses. This option masks columns according to their data type and does not accept arguments.

Email masking - allows you to create rules that expose the first letter of an email address and mask the
rest, replacing the domain with a generic .com suffix (for example, aXXX@XXXX.com).

Random masking - allows you to create rules to replace every numerical value with a random value from a
specified range, so that unauthorized users see plausible but incorrect figures.

Partial masking - allows you to mask data partially. This function only works for string-type data columns.
To implement it, you define the value in the “masked with” clause of the function parameter as
“partial(start characters, mask, end characters)”.
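The four functions above are declared per column in T-SQL; the masking function syntax is SQL Server’s own, while the table, column, and user names below are hypothetical:

CREATE TABLE Customers (
    CustomerID  INT IDENTITY PRIMARY KEY,
    FirstName   VARCHAR(50)  MASKED WITH (FUNCTION = 'default()'),        -- default masking
    Email       VARCHAR(100) MASKED WITH (FUNCTION = 'email()'),          -- aXXX@XXXX.com style
    CreditScore INT          MASKED WITH (FUNCTION = 'random(300, 850)'), -- random in a range
    Phone       VARCHAR(20)  MASKED WITH (FUNCTION = 'partial(3,"XXXXXX",2)') -- keep 3 + 2 chars
);

-- Users granted UNMASK see the real values; everyone else sees masked output
GRANT UNMASK TO PrivilegedAnalyst;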

Learn more in our detailed guide to SQL Server Data Masking

Snowflake DDM

Snowflake offers DDM as a security feature allowing you to define a policy that selectively masks plain-text
data in views or tables. You can use this feature, for example, to mask sensitive data and restrict access to
columns containing sensitive data on a need-to-know basis.

Snowflake DDM is a powerful and easy-to-use feature that lets you manage data governance by
automatically masking sensitive data objects.

You can use data classifications to implement DDM and define your masking policies. Policies can grant
access based on predefined roles or directly to specific users, providing greater flexibility.
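A sketch using Snowflake’s masking policy syntax (the policy, role, table, and column names are hypothetical):

-- Define the policy once...
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('PII_ANALYST') THEN val  -- the trusted role sees plain text
        ELSE '*** MASKED ***'                            -- everyone else sees a redacted value
    END;

-- ...then attach it to any column that holds sensitive data
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;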

Redshift DDM

While Redshift does not offer a native DDM function, you can apply dynamic masking to the data you
process in Redshift. You can create an abstract layer providing the logic needed to serve different content
based on which user requests to pull the data.

For example, you can use a view like the following to implement the necessary logic:

CREATE VIEW v_customers AS
SELECT
    -- Admins see real names; everyone else sees a one-way hash
    CASE WHEN CURRENT_USER = 'admin' THEN first_name
         ELSE SHA2(first_name, 256) END AS first_name,
    CASE WHEN CURRENT_USER = 'admin' THEN last_name
         ELSE SHA2(last_name, 256) END AS last_name,
    country_code,
    -- Admins see full emails; everyone else sees only the domain part
    CASE WHEN CURRENT_USER = 'admin' THEN email
         ELSE REGEXP_REPLACE(email, '[^@]+@', '*@') END AS email
FROM public.customers;


This process involves checking whether the user in various query fields is an admin—if yes, Redshift serves a
plain-text value; if not, it serves a redacted version.

You can use this mechanism to add functionality easily, such as serving full emails to specific users while
providing only domain names to some users and empty strings to everyone else. You can implement more
sophisticated masking functionality using groups.

Learn more in our detailed guide to Redshift data masking

Dynamic Data Masking with Satori

Satori offers dynamic data masking for all data platforms, regardless of their native support for data masking
policies. Dynamic data masking policies can be applied without the need for data engineering resources,
and without specific location configuration, based on Satori’s continuous sensitive data discovery
capabilities. Read more about Satori’s dynamic masking capabilities here, or schedule a demo to learn
more.
