DEPARTMENT OF INFORMATICS
TECHNISCHE UNIVERSITÄT MÜNCHEN

Master’s Thesis in Informatics: Information Systems

Privacy-Preserving Natural Language Processing: A Systematic Mapping Study

Privatsphäre erhaltende Verarbeitung natürlicher Sprache: Eine systematische Mapping-Studie

Author: Felix Viktor Jedrzejewski
Supervisor: Prof. Dr. Florian Matthes
Advisor: Oleksandra Klymenko
Submission Date: 15.07.2021

I confirm that this master’s thesis in informatics: information systems is my own work and I have documented all sources and material used.

Munich, 15.07.2021

Felix Viktor Jedrzejewski


Acknowledgments

I’d like to thank my family and friends for their support. I also want to thank my advisor
Oleksandra for her guidance and patience with me.
Abstract
The introduction of the General Data Protection Regulation (GDPR) shifted the focus of interest towards the research field of privacy preservation techniques. This thesis investigates challenges and solutions within the research field of Privacy-Preserving Natural Language Processing, the intersection of Natural Language Processing (NLP) and privacy. Since this research field is young, we want to know which privacy-related challenges it faces, which solutions it provides to deal with them, and whether there is a possibility to categorize them. Furthermore, our goal is to provide an overview of this research field to support other researchers in orienting themselves within the area. To achieve this goal, we apply a systematic mapping study focusing on the challenges and the solutions provided by the research field of Privacy-Preserving Natural Language Processing. We noticed two significant categories that interpret NLP either as a privacy enabler or as a privacy threat. To deepen our understanding of the privacy-related issues within the field, we also extracted use case categories and NLP-related features and mapped them onto the privacy-related challenges and solutions. Finally, we offer an overview of the privacy-related challenges, the use case categories in which they appear, and how they are solved.

Kurzfassung
Die Einführung der General Data Protection Regulation (GDPR) hat das Interesse am Forschungsfeld der Techniken zur Erhaltung der Privatsphäre geweckt. Diese Arbeit untersucht Herausforderungen und Lösungen innerhalb des Forschungsfelds der privatsphäre-erhaltenden natürlichen Sprachverarbeitung, der Schnittmenge aus natürlicher Sprachverarbeitung und Privatsphäre. Da dieses Forschungsfeld sehr jung ist, wollen wir wissen, welche Herausforderungen und Lösungen es für den Umgang mit privatsphäre-relevanten Herausforderungen bietet und ob es eine Möglichkeit gibt, diese zu kategorisieren. Darüber hinaus ist es unser Ziel, einen Überblick über dieses Forschungsfeld zu geben, um andere Forscher zu unterstützen, die sich auf diesem Gebiet orientieren wollen. Um dieses Ziel zu erreichen, wenden wir eine systematische Mapping-Studie an, die sich auf die Herausforderungen und Lösungen konzentriert, die das Forschungsfeld der privatsphäre-erhaltenden natürlichen Sprachverarbeitung bietet. Dabei sind uns zwei Kategorien aufgefallen, die NLP entweder als Unterstützer der Privatsphäre oder als Bedrohung der Privatsphäre interpretieren. Um unser Verständnis der datenschutzrelevanten Themen innerhalb des Feldes zu vertiefen, haben wir außerdem Anwendungsfallkategorien und NLP-bezogene Merkmale extrahiert, um sie auf die privatsphäre-relevanten Herausforderungen und Lösungen abzubilden. Abschließend bieten wir einen Überblick darüber, welche privatsphäre-relevanten Herausforderungen es gibt, in welchen Anwendungsfallkategorien sie auftreten und wie sie gelöst werden.

Contents
Acknowledgments iii

Abstract iv

Kurzfassung v

1. Introduction 1
1.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2. Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3. Research Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1. Systematic Collection of Potential Sources . . . . . . . . . . . . . . . . . 4
1.3.2. Inclusion and Exclusion Process . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3. Analysis of Included Sources . . . . . . . . . . . . . . . . . . . . . . . . . 5

2. Foundations 7
2.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1. Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2. Privacy Enhancing Technologies (PETs) . . . . . . . . . . . . . . . . . . . 8
2.1.3. Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.4. Privacy-Preserving Natural Language Processing . . . . . . . . . . . . . 10

3. Related Work 12
3.1. Systematic Mapping Studies Addressing Privacy . . . . . . . . . . . . . . . . . 12
3.1.1. Security and Privacy Concerns in Connected Cars . . . . . . . . . . . . . 12
3.1.2. Privacy-Related Challenges in mHealth and uHealth Systems . . . . . . 13
3.2. Applying Natural Language Processing in the Legal Domain . . . . . . . . . . 13
3.2.1. Natural Language Processing and Automated Handling of Privacy
Rules and Regulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3. Surveys on Applying Privacy Preservation on Machine Learning and Deep
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1. Privacy Preservation and Machine Learning . . . . . . . . . . . . . . . . 15
3.3.2. Privacy Preservation and Deep Learning . . . . . . . . . . . . . . . . . . 15
3.4. Workshops on Privacy in Natural Language Processing . . . . . . . . . . . . . . 19
3.4.1. Papers from the ”PrivateNLP 2020@WSDM 2020” Workshop . . . . . . 19
3.4.2. Papers from the ”PrivateNLP@EMNLP 2020” workshop . . . . . . . . . 20

4. Methodology 22
4.1. Definition of Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2. Search Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1. Search Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2. Selection of Electronic Data Sources . . . . . . . . . . . . . . . . . . . . . 24
4.2.3. Result Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.4. Validation of the Search Process . . . . . . . . . . . . . . . . . . . . . . . 27
4.3. Paper Screening for Inclusion or Exclusion . . . . . . . . . . . . . . . . . . . . . 28
4.4. Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.5. Keywording of Abstracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.1. Subcategories Natural Language Processing as Privacy Enabler . . . . . 34
4.5.2. Subcategories Natural Language Processing as Privacy Threat . . . . . 40

5. Results 48
5.1. Numbers of Publications per Year . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2. Numbers of Publications per Electronic Data Source . . . . . . . . . . . . . . . 48
5.3. What privacy-related challenges exist in the area of Natural Language Process-
ing (NLP)? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.1. Challenges for NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . . 50
5.3.2. Challenges for NLP as a Privacy Threat . . . . . . . . . . . . . . . . . . . 56
5.4. What approaches are used to preserve privacy in and with NLP tasks and how
can they be classified? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4.1. Privacy Solutions Provided by NLP . . . . . . . . . . . . . . . . . . . . . 58
5.4.2. Solutions Provided for NLP Privacy Issues . . . . . . . . . . . . . . . . . 61
5.5. What are the current research gaps and possible future research directions in
the area of privacy-preserving NLP? . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5.1. Research Gaps for NLP as a privacy Enabler . . . . . . . . . . . . . . . . 63
5.5.2. Research Gaps for NLP as a privacy Threat . . . . . . . . . . . . . . . . . 64

6. Discussion 83
6.1. Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.1.1. Main Privacy related Challenges for NLP as a Privacy Enabler . . . . . 83
6.1.2. Main Privacy related Challenges for NLP as a Privacy Threat . . . . . . 84
6.1.3. Solutions supported by NLP as a Privacy Enabler . . . . . . . . . . . . . 85
6.1.4. Main Solutions for NLP as a Privacy Threat . . . . . . . . . . . . . . . . 86
6.2. Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

7. Conclusion and Future Work 89


7.1. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.2. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

A. General Addenda 91
A.1. Work Sheets containing the analysis of the top categories . . . . . . . . . . . . . 91
A.1.1. NLP as a Privacy Enabler Analysis Sheet . . . . . . . . . . . . . . . . . . 91
A.1.2. NLP as a Privacy Threat Analysis Sheet . . . . . . . . . . . . . . . . . . . 98
A.2. Detailed Category Tables with Aggregation Mapping . . . . . . . . . . . . . . . 102
A.2.1. Privacy Issue Solutions for NLP as a Privacy Enabler . . . . . . . . . . . 102
A.2.2. Privacy Issue Solutions for NLP as a Privacy Threat . . . . . . . . . . . 106
A.3. Figures for Mapping Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.3.1. Use Case Classification for NLP as a Privacy Threat . . . . . . . . . . 108

List of Figures 110

List of Tables 112

Acronyms 113

Bibliography 114

1. Introduction
The following chapter elaborates the motivation of this research in section 1.1 and the research objectives in section 1.2; afterwards, in section 1.3, we illustrate how we intend to achieve these goals.

1.1. Introduction
One of the most crucial events in terms of privacy was the introduction of the General Data Protection Regulation (GDPR) in 2016. The European Union adopted this law, which is recognised across Europe, and the member states had two years, until May 2018, to adopt it on their national level. Now, an individual has legal protection for data that is collected based on different factors (e.g., online behavior, input data, etc.). This also means that there are crucial, legally binding consequences for anyone who does not act accordingly [1]. One main reason for this law is to protect the individual from the threats posed by Big Data and the insatiable hunger of companies collecting data in any possible manner [2]. The GDPR therefore enables the European Data Protection Authorities (DPAs) to punish companies for a violation of this right with a fine of up to 4 percent of the annual turnover [3]. There are already several examples of EU member countries that prosecuted companies for GDPR violations. For instance, France fined Google 50 million euros for an insufficient legal basis for data processing. The same happened to H&M in Germany and TIM (an Italian telecommunications operator), with fines of around 35 million and 27 million euros, respectively [4].
However, the GDPR was not the only reason for the increased interest in privacy, but rather the result of multiple past events that drew the public’s attention towards it. One of these events is the Cambridge Analytica scandal, which was disclosed to the public by the whistleblower Christopher Wylie. He described how his former employer harvested data of Facebook users without their consent. With this data, Cambridge Analytica was able to analyze the behavior of a large number of Facebook users and target those who were likely to change their mind and to convince others. This ”service” was purchased by prominent politicians to win voters for their side [5]. This is just one of many cases of data breaches and leaks that became public.
Living in the time of Big Data, companies collect ever more data from every possible source in order to analyze it. This fact is reflected in the number of publications in the area of big data between 2000 and 2019, as can be seen in Figure 1.1 [6].

Figure 1.1.: Trend of the number of publications in data science and big data taken from [6]

The collection of data happens automatically; therefore, it is imperative to also apply an automated approach to preserving data privacy, in order to protect not just individuals but also companies that intend to guarantee the privacy of every involved individual and to avoid tremendous penalty fines. However, data privacy comes with the main disadvantage of reduced usability, which is a key issue and is being intensively researched. We decided to focus on the medium that incorporates and transmits private information, namely natural language, which occurs either in spoken or in written form.
Filtering out private information is a difficult task, especially if the text is unstructured. Several pieces of sensitive information are easily detectable, namely email and postal addresses, names, phone numbers, etc., because they embody some form of uniqueness and can be detected algorithmically (e.g., by pattern matching). However, some information types require more sophisticated approaches to be detected. If the context or the user’s perception of the private information varies, it is complex to devise an optimal solution because of the lack of flexibility of current methods [7]. An approach is needed that is robust to varying complexity levels when finding sensitive information and also capable of handling the amount of data. We consider the Natural Language Processing (NLP) domain an adequate fit for the aforementioned challenge.
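To make the pattern-matching idea concrete, the following minimal sketch (our own illustration, not taken from any surveyed paper) masks two of the easily detectable information types with regular expressions; the patterns and placeholder tags are simplified assumptions, and a production de-identification system would need far more robust rules.

```python
import re

# Simplified, illustrative patterns; real systems require far more robust rules.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d ()/-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace every detected match with a placeholder tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

print(mask_pii("Contact Jane at jane.doe@example.org or +49 89 1234567."))
# -> Contact Jane at [EMAIL] or [PHONE].
```

Note how the name ”Jane” is left untouched: names carry no unique surface pattern, which is exactly why more sophisticated, context-aware NLP approaches are needed.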
Furthermore, we must point out that there are several vulnerabilities within the NLP domain. One paper demonstrates the ability to extract information about the training data used to train neural networks, a trait called unintended memorization, and presents a method to measure it for different neural networks. Thus, models retain information from the training set. In our case, this is especially critical if the training set contains private information in the form of natural language [8]. Additional privacy leaks were also detected within language models. Successful attacks have already been constructed to demonstrate the exploitability of popular state-of-the-art language models, namely BERT, Transformer-XL, XLNet, GPT, GPT-2, RoBERTa, XLM, and Ernie 2.0. Word embeddings can be reverse-engineered without any prior knowledge, and attackers are able to retrieve up to 75 percent of the sensitive data they were trained on [9]. Another research effort fully reconstructed 411 out of 600 patient data records that had been anonymized before being used to train the published word embeddings [10].
During our mapping study, we realized that there are two perspectives on NLP and privacy. On the one hand, large unstructured text corpora can be automatically redacted by applying NLP to extract or replace private information. There is a plethora of papers aiming at this challenge; thus, we concluded that a structured overview of this domain is essential, not just to show what kind of research has already been done but also which gaps exist. On the other hand, there is a need to preserve privacy within the tasks broadly applied in the NLP domain. Furthermore, we discovered that plenty of papers address the problems caused by privacy policies. All the mentioned aspects encouraged us to aggregate them into one domain: Privacy-Preserving Natural Language Processing (PP NLP).

1.2. Research Objectives


This thesis aims to provide a definition for PP NLP, since our preliminary analysis revealed that the term can be interpreted in multiple ways. Additionally, we think a clear definition will be helpful for the future of this research field in order to avoid repetitive research. We will provide further information in the form of answers to our research questions listed in Table 1.1.

ID   Research Question
RQ1  What privacy-related challenges exist in the area of Natural Language Processing (NLP)?
RQ2  What approaches are used to preserve privacy in and with NLP tasks and how can they be classified?
RQ3  What are the current research gaps and possible future research directions in the area of privacy-preserving NLP?

Table 1.1.: Table of Research Questions

In our overview of the research field of PP NLP, we include the privacy-related challenges that exist in the context of NLP; this simultaneously answers RQ1. This piece of information will help the community see which challenges have already been identified and ease the process of finding new ones. To answer RQ1, we will scan all papers found in the search process for the problem statements they are trying to solve and summarize them. Then, we propose an aggregation scheme for the findings with the help of keywording [11]. This will ease the understanding of the research field, since the aggregation scheme maps the papers to high-level terms. Further, we define the term Privacy-Preserving Natural Language Processing to clarify our understanding of the research field we analyze.


After extracting and gaining an overview of the different privacy challenges within the NLP domain, we aim to figure out which approaches currently exist to preserve privacy in NLP tasks and how NLP concepts can be utilized to establish privacy. We expect to derive patterns from these findings, indicating which privacy enhancing technologies (PETs) are popular for preserving privacy within which NLP concepts, and which NLP concepts were used to introduce an increased level of privacy in a certain scenario. Here, we will categorize our findings and bring them into context with the findings of RQ1. Furthermore, our research will point out which trends are currently emerging.
After answering RQ1 and RQ2, we will provide an overview of the privacy challenges and their solution approaches. The answer to RQ3 will be an investigation of possible gaps in the current state of research based on the aforementioned findings; it will indicate which research streams are rising and may even suggest new ones. Additionally, other researchers are supported in discovering further research gaps.
Finally, we would like to devise a graphical overview of the research field of PP NLP that summarizes it and eases the task of discovering new research opportunities. Therefore, it is necessary to introduce a categorization of the different research streams in order to better structure the overview and improve the readability of the graphical representation. Furthermore, we will highlight resulting trends in the research area by showing how many papers contributed to the field.

1.3. Research Approach


Since we want to obtain a broad overview of the research field of PP NLP and of the challenges and solutions prevalent in it, the application of a systematic mapping study (SMS) is suggested [12]. We describe the process roughly here and in more detail in chapter 4. We chose the systematic mapping study since it supports the idea of devising a classification scheme and structure for a software engineering field. According to Petersen, the mapping study roughly consists of three phases, namely planning, conducting, and reporting the mapping. The detailed process is displayed in Figure 1.2, and the process adaptations are further explained in chapter 4 [11]. We utilize the systematic approach of the SMS to collect papers matching our definition in a reproducible way. Mainly, the aim of an SMS is to categorize different research streams and to analyze publication-related information in a software engineering field of interest [13]. Instead of focusing on data regarding paper publication, we inspect the challenges and solutions in the area of PP NLP. This is the main inspiration for the research approach of this thesis. The number of relevant papers in the different phases of the SMS, which are explained in the next sections, is displayed in Figure 1.3.

Figure 1.2.: Overview of the Systematic Mapping Study (SMS) taken from [11]

Figure 1.3.: Amount of papers during the different SMS phases

1.3.1. Systematic Collection of Potential Sources

After the selection of the databases involved in our search process, defined in Table 4.3, we needed to decide which search queries to execute. As a result, we selected five search queries, listed in Table 4.1. Every database was confronted with each query, which resulted in 4424 papers. In the next step, the collected papers need to be further examined, as suggested by Petersen [11].
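As an illustration of how this fan-out can be automated, the sketch below is purely hypothetical: the source names, the query string, and the run_query helper are stand-ins, since the actual queries and sources are those listed in Table 4.1 and Table 4.3.

```python
# Hypothetical sketch of the query fan-out; the real queries and sources are
# listed in Tables 4.1 and 4.3. run_query() stands in for a source's search API.
DATABASES = ["SourceA", "SourceB"]                         # placeholder names
QUERIES = ['"privacy" AND "natural language processing"']  # placeholder query

def run_query(database: str, query: str) -> list:
    """Stub: would call the electronic data source's search interface and
    return paper metadata records (each including, e.g., a DOI)."""
    return []

def collect(databases, queries):
    papers = {}
    for db in databases:
        for q in queries:                     # every database sees every query
            for paper in run_query(db, q):
                papers[paper["doi"]] = paper  # deduplicate across sources
    return list(papers.values())
```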

1.3.2. Inclusion and Exclusion Process


After the collection of the relevant papers, we defined inclusion and exclusion criteria that supported us in scanning through the papers and sorting out those that do not fit our definition of PP NLP. Here, we applied the methodology of Petersen: we inspected the title and the abstract of the collected papers and included or excluded them according to the criteria we agreed on [11]. At the end of this phase, 665 papers were left that addressed our topic of interest. We then analyzed the remaining papers and started the categorization process.

1.3.3. Analysis of Included Sources


In the last phase of the mapping study, we sorted our findings into a classification scheme regarding the privacy-related challenges and solutions, following Petersen [11]. For this purpose, we applied the keywording of abstracts displayed in Figure 1.4. During this process we discovered two main streams within the selected papers: NLP as an enabler of privacy-preserving methods, and solutions to privacy-related issues within NLP concepts. The two streams comprise different numbers of papers: NLP as a privacy enabler counts 304 papers, while the second stream, privacy issues within NLP concepts, counts 127 papers.

Figure 1.4.: Keywording abstracts to build classification schemes taken from [11]

2. Foundations
In this chapter, we define our area of research and the terms applied in this work to establish a common understanding. We also propose a novel definition of Privacy-Preserving Natural Language Processing.

2.1. Definitions
This section contains the definitions of the key terms that are essential in the context of this thesis. First, the key terms privacy, its preservation techniques, and Natural Language Processing are delineated. As part of our contribution, we also propose a definition for Privacy-Preserving Natural Language Processing.

2.1.1. Privacy
The definition of privacy is always a complicated endeavor, especially in the legal domain, because it is a term that differs not just from country to country but from individual to individual [14]. According to Westin, there are three levels of privacy, namely the political, the socio-cultural, and the personal [15]. In this thesis, we focus on the personal level. Every individual has a set of information that is clearly attributable to them; however, this set is dynamic and changes from context to context and from time to time [14]. In this thesis, we call this set of information personal data. According to the General Data Protection Regulation (GDPR), personal data, or Personal Identifiable Information (PII), is any information about an identified or identifiable natural person, the data subject. Specifically, an identifiable natural person is a person who can be identified directly or indirectly by direct identifiers such as a name, an identification number, or one or more factors referring to their physical, physiological, mental, economic, cultural, or social identity [16]. The legal definitions of PII differ from country to country. In this thesis, we use the expansive interpretation of the term PII by the European Union instead of the very restricted definition of PII in the United States of America [17]. The expanded European definition of PII is, however, limited to the realistic possibility of linking the information to a data subject [16]. It has also been shown that the concept of PII is not sufficient to protect the privacy of an individual, especially since the sophistication of tools capable of inferring further information with little to no prior knowledge is constantly improving [7]. Therefore, we need an improved categorization of information related to an identifiable natural person. Any piece of information that might lead to revealing the privacy of an individual, based on that individual’s perception, is called Privacy Sensitive Information (PSI). The combination of PSI and PII forms the superset Privacy Revealing Information (PRI). Most importantly, it must be guaranteed at all times that PII cannot be mapped to PSI [7].

2.1.2. Privacy Enhancing Technologies (PETs)


Like Y. Shen and Pearson, we understand privacy enhancing technologies as technologies applied to protect the private information of a data subject [18]. This part describes our understanding of the privacy preservation techniques that frequently appear in this thesis, so that we share the same theoretical foundation.

De-identification of Personal Information

The process of de-identification focuses on the removal of identifying information from a data set. The data removed in this context is called Personal Information rather than Personal Identifiable Information (PII): according to Garfinkel et al., PII refers only to information attributable to one specific individual, leaving out information that can identify any individual, although such information is also important. This applies to any kind of data type, including structured and unstructured text or multimedia. After the removal of that information, the risk of an individual being linked to the data set is reduced. The goal is an acceptable trade-off between privacy and usability, in other words, sharing a data set without disclosing any information that might reveal the identity of an individual. Furthermore, the terms ”de-identification” and ”anonymization” are commonly used interchangeably; in general, both refer to the aforementioned process [19].

Differential Privacy

Differential Privacy guarantees a certain level of privacy for the input of an individual to a (randomized) function or a series of functions. It was designed to protect the privacy of an individual or a small group of individuals in a statistical analysis. A tremendous benefit of Differential Privacy is that even if an adversary has a lot of computational power and knowledge about the differentially private data set, it is impossible to re-identify the individuals [20].
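For reference, the standard formal definition can be stated as follows: a randomized mechanism $M$ satisfies $\varepsilon$-differential privacy if for all data sets $D_1$ and $D_2$ differing in at most one element, and for all sets of outputs $S$,

$$\Pr[M(D_1) \in S] \;\leq\; e^{\varepsilon} \cdot \Pr[M(D_2) \in S].$$

The smaller $\varepsilon$ is, the less the output may depend on any single individual’s contribution.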

Obfuscation

The paper by Pang, X. Xiao, and J. Shen illustrates our understanding of obfuscation: the actual data within a text corpus, such as a search query, is masked, for example by injecting fake queries, in order to blur the intentions of the searching user, i.e., not to fully disclose the data to a third party [21].
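A minimal sketch of this idea, with a hypothetical decoy pool of our own choosing: the real query is hidden among randomly drawn fake queries, so an observing third party cannot tell which intention is genuine.

```python
import random

# Hypothetical decoy pool; a real obfuscator would draw topically plausible queries.
DECOYS = ["weather munich", "pasta recipe", "train schedule", "book review"]

def obfuscate(real_query: str, n_fake: int = 3) -> list:
    """Mix the real query with fake ones and shuffle to blur the user's intent."""
    batch = random.sample(DECOYS, n_fake) + [real_query]
    random.shuffle(batch)
    return batch  # all queries are sent; only the user knows which one is real

print(obfuscate("symptoms of diabetes"))
```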


Synthetic Data Generation

According to El Emam, the generation of synthetic data is based on statistical models that are fit to the original data and then reapplied to generate a new data set exhibiting the same statistical traits as the original. This avoids sharing the actual data while still allowing the recipient to conduct analyses on the synthetic data set [22].
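As a toy illustration of this idea, assuming the original data is adequately described by a multivariate normal model, one can fit the model’s parameters and sample a fresh data set with the same statistical traits; the numbers below are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
original = rng.normal(loc=[50.0, 170.0], scale=[10.0, 8.0], size=(1000, 2))

# Fit a simple statistical model (mean and covariance) to the original data ...
mu = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# ... and reapply it to generate a synthetic data set with the same traits.
synthetic = rng.multivariate_normal(mu, cov, size=1000)
print(original.mean(axis=0), synthetic.mean(axis=0))  # statistically close
```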

Federated Learning

The general idea of Federated Learning is that the training of a model is not executed on a central server but is distributed among different parties, also called clients. Every client has a data set containing sensitive information that is never uploaded to a central server. Instead, each client trains the global model with its locally stored data set and updates the parameters of the model. The orchestration of the training among the different clients is managed by one central server [23].
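A minimal sketch of one such round, assuming a plain parameter vector and simple federated averaging; the local_update step is a stand-in for the client-side training on data that never leaves the client.

```python
import numpy as np

def local_update(global_params, local_data, lr=0.1):
    """Stand-in for client-side training: one gradient-like step on local data,
    which never leaves the client."""
    grad = global_params - local_data.mean(axis=0)  # toy objective
    return global_params - lr * grad

def federated_round(global_params, client_datasets):
    """The central server orchestrates: parameters go out, updates are averaged."""
    updates = [local_update(global_params, data) for data in client_datasets]
    return np.mean(updates, axis=0)  # only model parameters are ever shared

clients = [np.random.randn(20, 3) + i for i in range(4)]  # private local data
params = np.zeros(3)
for _ in range(50):
    params = federated_round(params, clients)
print(params)  # converges toward the clients' joint data mean in this toy setup
```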

Secure Multiparty Computation

As the term implies, the calculation of an agreed function $f$ involves several parties. Each party is in possession of a secret $x_i$ that it does not send to any of the other parties participating in the computation but uses as input to the function $f$. Two attributes are essential for Secure Multiparty Computation, namely the correctness of the computation of $f$ and the privacy of the inputs $x_i$ [24].
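A common introductory instance is a secure sum via additive secret sharing. The sketch below assumes three parties and arithmetic modulo a public prime; it only illustrates the two attributes named above: each $x_i$ stays hidden, yet $f(x_1, x_2, x_3) = x_1 + x_2 + x_3$ is computed correctly.

```python
import random

P = 2**31 - 1  # public prime modulus

def share(secret: int, n_parties: int) -> list:
    """Split a secret into n random shares that sum to it modulo P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

secrets = [42, 7, 13]                        # each party's private input x_i
all_shares = [share(x, 3) for x in secrets]

# Party j only ever sees the j-th share of every secret ...
partial_sums = [sum(column) % P for column in zip(*all_shares)]
# ... yet combining the partial sums reconstructs the agreed function f.
print(sum(partial_sums) % P)  # -> 62
```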

Homomorphic Encryption

According to Fontaine and Galand, Homomorphic Encryption needs to satisfy the following equation [25]:

$$\forall m_1, m_2 \in \mathcal{M}: \quad E(m_1 \odot_\mathcal{M} m_2) \longleftarrow E(m_1) \odot_\mathcal{C} E(m_2)$$

Let $\mathcal{M}$ and $\mathcal{C}$ represent the sets of all plaintexts and ciphertexts, respectively. $\odot_\mathcal{M}$ and $\odot_\mathcal{C}$ denote operations that are validly executable on plaintexts and ciphertexts, respectively. The arrow $\longleftarrow$ means that the result of the operation on the left is computable from the expression on the right. In other words, Homomorphic Encryption is the process of performing operations on encrypted data without decrypting it at any stage [25].
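A classic textbook instance of this property is unpadded RSA, where both operations are multiplication: with $E(m) = m^e \bmod n$,

$$E(m_1) \cdot E(m_2) \;=\; m_1^e \, m_2^e \bmod n \;=\; (m_1 m_2)^e \bmod n \;=\; E(m_1 \cdot m_2),$$

so the product of two plaintexts is obtained by multiplying their ciphertexts, without any intermediate decryption.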

2.1.3. Natural Language Processing


Our understanding of NLP is equal to the definition by Liddy [26]: through the application of computational techniques, naturally occurring text is analyzed on multiple levels of linguistic analysis in order to achieve human-like language processing and perform a specific set of tasks on it [26].
This definition is further explained in the source, since a few terms are rather vaguely formulated. ”Naturally occurring text” means any text spoken or written by a human. This language is then ”analyzed on multiple levels of linguistic analysis” to translate the human language into a form the computer can process [26]. This is an essential aspect for this work because we inspect not just written but also spoken text. It is commonly known that written text contains private information about us, such as our email address, postal address, or name. All of this information leads to a unique mapping pointing to an individual. The same holds for the uniqueness of our voice, which is also used in biometric authentication systems.
One of the aforementioned levels of linguistic analysis is called phonology, or speech processing. This area inspects voice sounds by applying a set of rules, not just to distinguish words in a sentence but also to take into account that pronunciation differs from one speaker to another. Phonetic rules concentrate on the sounds within a word; phonemic rules apply when words are spoken together and their pronunciation changes; prosodic rules address the fluctuation of stress and intonation in a sentence [26]. There are also novel approaches for the generation of voices that are based on vectors and probabilistic models [27].
For written text, multiple linguistic levels need to be considered. The first and most fundamental level is morphology, whose main subject are morphemes, the smallest meaning-bearing particles or word chunks in linguistics [26]. Part-of-Speech (POS) tagging is the process of assigning words in a sentence their role within it, for example noun, verb, or adverb; the tagging is based on the morphological analysis of words [28]. The word itself, as a composition of morphemes and lemmas, and its meaning are inspected on the lexical level. If we take the verb ”deliver” as a lemma and combine it with some morphemes, we receive ”delivers”, ”delivered”, etc. [29].
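As a hedged sketch of the morphological and lexical levels described above, assuming the small English spaCy model en_core_web_sm is installed, the ”deliver” example can be observed directly:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model has been downloaded
doc = nlp("She delivered the parcels yesterday.")

for token in doc:
    # surface form, its lemma (lexical level), and its POS tag (morphology-based)
    print(f"{token.text:10} lemma={token.lemma_:10} pos={token.pos_}")
# e.g. "delivered" maps back to the lemma "deliver" and is tagged as a VERB
```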
According to Liddy, another level of analysis within the NLP domain is the semantic one. Its main goal is to investigate the meaning contributed by sentences and larger text corpora, in order to detect and resolve ambiguity, taking into account the information delivered by the context of a given corpus [26]. Frequently used in the NLP domain is an information extraction task called Named Entity Recognition (NER). NER is a way of introducing structure into an unstructured text document by searching for certain meanings on the semantic level. It is used not just to identify people, organizations, and institutes, but to search for specific information within a text corpus [30].
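Staying with spaCy as an assumed toolkit, a short sketch shows how NER surfaces exactly the kind of semantic-level entities a privacy pipeline would care about:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Smith from Munich has worked at Siemens since 2018.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PERSON, GPE, ORG, DATE
# The flagged entities (a person, a place, an employer, a date) are precisely
# the information a de-identification step would mask or replace.
```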

2.1.4. Privacy-Preserving Natural Language Processing


Our definition of Privacy-Preserving Natural Language Processing is inspired by the accepted papers of the workshop series ”PrivateNLP” [31, 32, 33]. During the mapping study, we discovered two major topics handled by PP NLP. The first is the application of NLP to establish privacy in a certain scenario. Commonly, the scenario involves some form of sharing or publishing unstructured data sets with third parties in different contexts, or supports the process of coping with privacy policies or requirements. Such sharing cannot take place before an NLP-based privacy preservation technique has been executed on the data set to protect the privacy of the data subjects. The challenge with privacy policies or requirements is that the language in which they are written is complex and not comprehensible at first sight. In the context of agreeing to website cookies, privacy-related documents frequently challenge the user with their length. Because of that, NLP is applied to ease the handling of those kinds of documents in different ways, namely summarization or rephrasing of privacy policies, or an automatic comparison of a privacy policy against privacy requirements. The second topic we recognized is the privacy protection of the data involved in an NLP process. As already mentioned in the introduction, the exploitation of fundamental NLP concepts has been demonstrated. Therefore, it is important to extend this definition by including the statement of Q. Feng, D. He, Z. Liu, et al. that NLP should not disclose any information regarding the data it is trained on [34].

3. Related Work
In the following, we list research closely related to our work to stress the value of our contribution to the field. Since, to the best of our knowledge, there is no concrete definition of Privacy-Preserving Natural Language Processing, several papers merely border on the topic. In this chapter, we point out Systematic Mapping Studies and surveys that focus on privacy preservation techniques related to machine learning or deep learning, as well as some papers on the combination of privacy preservation and specific natural language techniques. In the first part, we look at Systematic Mapping Studies that are topically related to our research. Afterwards, we discuss the application of NLP in the legal domain. Then, we inspect surveys on privacy preservation techniques and deep learning. At the end, we examine a workshop series addressing NLP and privacy.

3.1. Systematic Mapping Studies Addressing Privacy


To the best of our knowledge, there are no systematic mapping studies addressing privacy preservation and Natural Language Processing, only surveys on privacy-preserving machine learning or privacy-preserving AI. We selected two Systematic Mapping Studies that analyze privacy challenges and their solutions within contexts that include natural language.

3.1.1. Security and Privacy Concerns in Connected Cars


We chose a Systematic Mapping Study inspecting privacy and security challenges within the domain of connected cars [35]. All research questions of this work are comparable to ours. The first one attempts to identify what kind of privacy- and security-related issues researchers have identified in connected cars. The problems identified regarding identity management refer to unauthorized access to the network and, simultaneously, to the personal data processed and stored within it. The vehicle provider does not provide any additional privacy measures within the network to secure the personal data against a malicious network intruder or an insecure central server. According to the paper, the solution to this threat is the implementation of privacy-preserving schemes like k-anonymity, encryption, bit-array encoding, or pseudonym-changing schemes [35]. Our work focuses instead on the privacy preservation of data in natural language, in written or spoken form, and we exclusively analyze the role of NLP in providing privacy.


3.1.2. Privacy-Related Challenges in mHealth and uHealth Systems


Another Systematic Mapping Study caught our attention because it aims at a subtopic of our research; therefore, we consider it noteworthy in the context of this work. The research of Iwaya, Ahmad, and Babar focuses on security and privacy in systems that handle natural language, considering the NIST 800-53 control families; more particularly, on the privacy and security of mobile health (mHealth) and ubiquitous health (uHealth) systems. The mapping study filtered out 365 papers matching its field of research. The main targets of the analysis phase were the identification of research themes, main challenges, and their most prominent solutions. Those findings were then categorized and synthesized. An overview of the different solutions, organized by themes, stresses the limitations of the current state of research. An evaluation phase was also conducted to check the practicability of the existing solution approaches [36]. Our approach differs mainly in the broader scope that we decided to adopt, and our interest is in NLP-based approaches for private data represented in natural language.

3.2. Applying Natural Language Processing in the Legal Domain


This section presents several research overview papers addressing the application of NLP in the context of law and privacy. We realized that NLP offers support in handling complex legal documents that explain privacy rights or in complying with them.

3.2.1. Natural Language Processing and Automated Handling of Privacy Rules and Regulations
There is considerable research on the combination of the legal domain and NLP. Papanikolaou, Pearson, and Mont survey natural language understanding and the automated enforcement of privacy rules and regulations in the cloud [37]: the automatic extraction of privacy knowledge or rules from legal, regulatory, or policy texts, combined with compliance and enforcement. The paper’s main objective is to guarantee a privacy level for customer data stored and processed in an enterprise cloud environment. To achieve this goal, the application of semantic analysis to text with NLP concepts and the enabling of automatic rule enforcement are imperative. Four categories are introduced within the paper: parsing and fundamental analysis of source texts; knowledge extraction from texts and learning; semantic models and representations; and policy enforcement and compliance. Legal source texts embody similar and repetitive patterns in the frequencies of different word clusters, and several parsing and analysis tools exploit this trait. Knowledge extraction from text and learning applies several machine learning concepts to analyze the semantic representation of legal text on a higher level and to extract rules. Semantic models and representations support the idea of setting up rules with a specific order on the semantic level; this category thus refers to a sophisticated framework of rules for the generation of legal items that are easier to comprehend. The last category, policy enforcement and compliance, highlights that there is so far no research focusing on the entire life cycle of natural language analysis in the context of privacy, in other words, research that focuses not just on the generation of clear and easily comprehensible rules but also on their enforcement [37]. Our work is more abstract and does not focus on the legal domain specifically: we graphically represent the status quo of research on privacy preservation involving NLP, including the legal field, without explicitly searching for a complete privacy life cycle.

Beyond privacy life cycles, tools were developed to detect privacy violations in legal documents like online contracts. The paper by Silva, Gonçalves, Godinho, et al. applies commonly known tools like NLTK, spaCy, and Stanford CoreNLP to develop a new tool that supports companies in detecting privacy violations within legal contracts. The fundamental NLP task executed here is Named Entity Recognition (NER), which automatically detects potential personal identifiable information (PII) in legal documents. The tool is then tested in an experiment to evaluate its accuracy [38]. This is a very particular example of applying the NLP paradigm of semantic analysis to preserve the privacy of data subjects within a legal document such as a contract.

A systematic mapping study has also been conducted on NLP for requirements engineering (NLP4RE) [39]. Its objective is to inspect the status quo of NLP4RE research and to detect the open challenges within the domain. The research extracts five significant findings. First, the large number of publications in the area indicates that the research area is thriving. Second, the research methods applied within the publications are either laboratory experiments or demonstrations of example applications. Most of the publications focus on the analysis phase, with requirement specification and detection as central linguistic analysis tasks. Furthermore, an overview of the NLP tools and concepts applied within the NLP4RE domain is presented. The paper concludes that there is a gap between state-of-the-art theory and practice: the views developed in papers lack application in industry [39]. Our work focuses more on privacy-related requirements and other application areas of NLP and also gives an overview of solved privacy issues within the NLP domain.

3.3. Surveys on Applying Privacy Preservation on Machine Learning and Deep Learning
This thesis unites multiple research domains; therefore, we considered a huge variety of papers from different disciplines applying similar methodologies. Because of that, we also took into account papers surveying topics related to ours. For example, machine learning and statistical methods are frequently applied in the NLP domain (e.g., in the context of word ambiguity) [40]. Since machine learning and deep learning are fundamental to the recent successes in the NLP research field, we saw a relation between surveys of privacy preservation techniques in these areas and our work.


3.3.1. Privacy Preservation and Machine Learning


This part of the thesis delineates related surveys on privacy-preserving machine learning, showing in which ways this research is similar to ours but also how our research contributes differently to the field. There are two surveys on the combination of machine learning and differential privacy.
One paper by Z. Ji, Lipton, and Elkan surveys the impact of differential privacy on the training process of machine learning algorithms [41]. The application of differential privacy supports the idea of not disclosing private user data to the algorithm while still providing valuable information by sharing or publishing differentially private data sets. Moreover, the paper analyzes what can be learned from a differentially private data set and which limits differentially private algorithms face in terms of the loss function [41]. This paper focuses on the possibilities of applying differential privacy in combination with data release and training; in contrast, we are eager to acquire an overview of which privacy preservation challenges and solutions exist specifically in the NLP domain.
Gong, Xie, K. Pan, et al. survey different differentially private approaches in combination with machine learning [42]. The paper identifies two categories: the first combines a non-private machine learning model with calibrated noise, while the second perturbs the objective function with random noise. Within the paper, the challenges regarding the model’s privacy level, usability, and applications are pointed out and evaluated [42]. Again, the privacy preservation techniques and their impact on the underlying machine learning model are inspected, whereas our interest targets the application scenarios of the different techniques.

3.3.2. Privacy Preservation and Deep Learning


Here, we discuss how the research landscape of deep learning in combination with privacy preservation is shaped in the form of surveys, and how our work differs from it. After inspecting a survey about collaborative deep learning, we elaborate on applying deep learning techniques to preserve privacy.
Several surveys deliver an overview of a privacy preservation technique in a way comparable to our work. For example, the paper by D. Zhang, X. Chen, D. Wang, and Shi elaborates potential problems in a collaborative privacy-preserving deep learning environment, which enables the participants of a model to contribute their data without actually sharing it [43]. Thus, the data does not leave the data owner. Either the model is updated by every participant, or the user is provided with an API that takes the user’s input and generates an output sent to the central server. Although this process sounds secure, it still harbors challenges that are addressed by potential privacy preservation techniques. Three major problem scenarios are presented in the paper, all based on the assumption that a participating unit is malicious: either the server, or the user, or both. To mitigate these problems, the paper presents three common techniques for privacy preservation, namely Secure Multiparty Computation (SMC), Homomorphic Encryption, and Differential Privacy. The paper elaborates that the privacy preservation techniques address two phases, namely training and usage. In the training phase, it is essential that the user’s data is not visible to any of the participating parties; a homomorphic encryption technique is applied to execute the training. Still, some communication between the participating parties is needed, which is solved by SMC. Thus, honest-but-curious participants are not able to spy on the sent data. To avoid statistical attacks when a malicious server and user collaborate, differential privacy on the user level is applied as a defense. Then there is also indirect collaboration in the training phase, which changes the challenge for privacy preservation: the data needs to remain in the user’s storage, and the parameters sent around among the participating parties must not be susceptible to statistical attacks. One solution averages over multiple outputs from users; another additionally applies differential privacy incorporated into the algorithms of the training stage; a third distributes the processes across the participating parties, with changes on the server and client side to form a pipeline. For the usage phase, the paper by D. Zhang, X. Chen, D. Wang, and Shi highlights the CryptoNets system that works with homomorphic encryption [43]. The user encrypts their data and sends it to the server; the server performs its algorithms and calculations on the encrypted data and sends the results back to the user, who decrypts them. This requires a few adaptations in the training stage of the neural network. Another solution avoids changes to the neural network training stage and applies additive homomorphic encryption; this solution represents an oblivious network that works with secret sharing and garbled circuits in the usage phase. The final solution approach for the usage phase incorporates Private Aggregation of Teacher Ensembles (PATE) and introduces random noise into the strategy [43]. We observed that the applied privacy-preserving technologies are similar to the ones we discovered in the mapping study for fixing privacy issues within NLP tasks. Still, we concentrate on the NLP domain, which also incorporates deep learning techniques, and attempt to give an overview of privacy preservation covered by NLP.
Another survey investigates privacy-preserving deep learning techniques and compares their benefits and drawbacks [44]. Figure 3.1 shows the graphical representation of the deduced classification. The methods are divided into two categories: classical privacy preservation and hybrid privacy-preserving deep learning, meaning that privacy preservation does not rely solely on one classical privacy preservation technique but on at least two. As classical privacy preservation techniques, the paper lists homomorphic encryption (HE), differential privacy, secret sharing, oblivious transfer (OT), and secure multiparty computation (MPC); none of these methods applies any deep learning. Hybrid Privacy-Preserving Deep Learning (PPDL) takes advantage of a mixture of classical privacy-preserving techniques and deep learning. According to Tanuwidjaja, Choi, and K. Kim, every category is represented by one or multiple papers implementing a privacy preservation technique that is evaluated by the metrics listed in Figure 3.2. The largest sub-category consists of papers based on HE, namely HE + Convolutional Neural Network (CNN), HE + Deep Neural Network (DNN), HE + Discretized Neural Network (DiNN), HE + Neural Network (NN), and HE + Binary Neural Network (BNN). Then there are a few more categories like Three-Party Computation (3PC) + Garbled Circuit (GC) + NN, Oblivious Transfer (OT) + secret sharing + secure MPC + DNN, and OT + secure MPC + CNN. This is followed by a paper mentioned by Tanuwidjaja, Choi, and K. Kim applying Differential Privacy and a Generative Adversarial Network (GAN). The metrics listed in Figure 3.2 measure the accuracy of the underlying algorithm after the privacy preservation technique was applied. The run time refers to the time the privacy-preserving method needs to execute its particular task. Data transfer inspects the amount of data sent from the client to the server. Privacy of Client (PoC) stands for the fact that no party other than the data owner is able to see the data. Privacy of Model (PoM) means that neither the client nor any participant is aware of the model classifier on the server. The papers based on HE are evaluated and compared with each other using those metrics, as are the papers based on differential privacy and SMC; the results of those comparisons are irrelevant for our work [44]. This work demonstrates a possibility to categorize and test different privacy-preserving techniques. Our thesis zooms out and gives an overview of NLP instead of comparing different papers with each other based on metrics, since we want to shape a landscape of privacy-preserving Natural Language Processing and point out the different challenges and solutions within the domain.

Figure 3.1.: Overview of privacy preservation methods in deep learning taken from [44]

Figure 3.2.: Overview of privacy preservation metrics in deep learning taken from [44]

The following paper discusses Privacy-Preserving Deep Learning (PPDL) in the research sector of Machine Learning as a Service (MLaaS) and provides an overview [45]. At first, the paper points out the most classical approaches of PPDL and explains them. Then, fundamental conflicts of deep learning and adversarial models in privacy preservation are delineated. Finally, the paper offers a classification of the current PPDL techniques and a timeline of publications to better follow the development. Additionally, challenges and weaknesses of state-of-the-art PPDL are displayed in Figure 3.3. The figure displays two approaches of PPDL, namely the model parameter transmission and the data transmission approach. The approaches are further divided by the PPDL methods applied in each. For model parameter transmission, these are federated learning and distributed machine learning, whose weaknesses are communication overhead, backdoor attacks, and GAN attacks. The data transmission approach is addressed by anonymization-, homomorphic encryption-, or differential privacy-based PPDL. The main weaknesses of anonymization are homogeneity and background knowledge attacks. Homomorphic encryption-based PPDL is complex and comes with a slow training process. The problem of differential privacy-based PPDL is its centralized character with a central coordinator, and thus a single point of failure [45]. Just as in this thesis, privacy preservation techniques are inspected to shape an overview of the current approaches in a particular field of research. The difference to our work is not just the focus on NLP but also the more abstract interpretation of privacy-preserving NLP and the aggregation of challenges and their solutions within our definition of PP NLP. In contrast, this paper has a strong focus on the techniques themselves, their evolution, their challenges, and their weaknesses.

Figure 3.3.: Overview of privacy preservation challenges and weaknesses taken from [45]


Virtual assistants like Siri are commonly used by people worldwide and in different languages, disclosing private information. One study focused on analyzing the literature on voice assistants and their impact on privacy and security; those systems are frequently cloud-based [46].

3.4. Workshops on Privacy in Natural Language Processing


In 2020, the first workshop of this kind was hosted, called ”PrivateNLP 2020@WSDM 2020”. In this workshop, several papers were presented addressing privacy-related challenges in NLP; they are briefly discussed in this section [33]. Furthermore, the second workshop of this series was held under the title ”PrivateNLP@EMNLP 2020”; the papers accepted there are also part of this section [31]. Additionally, a third workshop named ”PrivateNLP 2021” will be hosted [32].

3.4.1. Papers from the ”PrivateNLP 2020@WSDM 2020” Workshop


In this part of the thesis, we will elaborate on all the papers accepted by the ”PrivateNLP
2020” workshop. Since this workshop series is the first one discussing privacy and NLP, it is
a relevant aspect for this thesis.
One topic mentioned within the workshop refers to an approach that replaces privacy-sensitive information in email texts with gaps, inferred by an algorithm using a vector space trained on the inbox of the respective receiver. This method shall mitigate the disclosure of sensitive information caused by emails unintentionally sent outside of the corporate environment [47]. Another user-centric approach presents a framework in the form of a plugin that warns the user about disclosing potentially sensitive data in text written on, e.g., social media by analyzing its sentiment, authorship, and other linguistic features [48]. The next paper introduces the combination of Recurrent Neural Networks (RNN) and homomorphic encryption to enable a user to use Machine Learning as a Service (MLaaS) without sending plain text data but only an encrypted version [49]. At first sight, this approach seems susceptible to extended execution times. However, Podschwadt and Takabi show, in a comparison with the same model working on plain text data conducted on the IMDb movie review dataset, that this is not the case [49]. The last paper of this workshop surveys privacy preservation techniques based on machine learning [50]. The survey showed that two major streams are represented in the literature: one is privacy-preserving machine learning, and the other one is mechanisms to control the data of users. The former category contains approaches related to artificial intelligence (AI); its two main pillars are the models and the training data. Both pillars are addressed by the literature, meaning that sensitive data sets are protected by applying the concept of differential privacy, and distributed approaches reinvent the concept of training a model without disclosing the sensitive data sets in the form of federated learning or Private Aggregation of Teacher Ensembles (PATE). Further, the mechanisms to control users’ data incorporate the general idea of notifying the user or giving the user a choice.


In the literature, this is primarily done by serving users the information extracted from documents that explain the handling of their data. However, the language applied in such documents is mostly complex or ambiguous, which makes even the general idea hard to understand [50].

3.4.2. Papers from the ”PrivateNLP@EMNLP 2020” Workshop


The second workshop of the ”PrivateNLP” series accepted six papers on different subjects, enumerated here. One challenge in distributed or federated learning is the high potential of eavesdropping, while introducing a classic encryption scheme into the protocol might come at a significant cost in task performance. A proposed solution to this challenge is to introduce a small encryption step performed by the participants of the respective learning concepts. The encryption step is based on a ”one-time secret key” and is then part of the training of a pre-trained BERT model. Both aspects are wrapped up in a framework called TextHide [51]. Another challenge within the privacy research field concerns privacy policies, which every user should read before entering a website or downloading an app to identify third parties that collect data from the user. Privacy policies are known to be long and complex; therefore, a team of researchers tackled this issue with a model that automatically extracts the mentioned third parties out of them [52].
Further, the workshop accepted two papers that focus on detecting sensitive data or privacy-related settings with the assistance of semantic analysis [53, 54]. One example is the application of semantic analysis to user-generated content to avoid informational or emotional self-disclosure, which frequently occurs on social media platforms [53]. Another approach utilizes semantic analysis to locate settings relevant to the user’s privacy. This supports users in deciding on their own privacy settings instead of sticking with the default settings of the developer [54].
Differential privacy is another major topic of the workshop and was discussed by two papers [55, 56]. An advantage of differential privacy, especially in machine learning, is that the sensitive information users share as training data can be protected. Still, training language models on differentially private data diminishes the model quality. A few researchers discovered, however, that differentially private data is suitable for fine-tuning public base models [55]. Furthermore, there is also the possibility to apply differential privacy in privacy-preserving text analysis. Here, the concept of word embeddings is utilized: specific words of the customer data are mapped into a continuous word embedding space, which is then perturbed by applying a differentially private algorithm. But Z. Xu, Aggarwal, Feyisetan, and Teissier discovered that some words in sparse areas of the word embedding space remained unchanged although a large scale of perturbation was selected. Therefore, they designed a new variant based on the Mahalanobis metric to solve the aforementioned problem [56].
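To make this mechanism more tangible, the following Python sketch illustrates the general idea of perturbing a word via its embedding under metric differential privacy. It is a minimal illustration under our own assumptions (toy vocabulary, Euclidean metric, illustrative parameter names), not the exact algorithm of [56]:

import numpy as np

def perturb_word(word, vocab, emb, eps, rng):
    # Look up the word vector and add multivariate Laplace noise:
    # a uniformly random direction scaled by a Gamma-distributed magnitude.
    v = emb[word]
    d = v.shape[0]
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=d, scale=1.0 / eps)
    noisy = v + magnitude * direction
    # Map the noisy vector back to the nearest word in the vocabulary.
    return min(vocab, key=lambda w: np.linalg.norm(emb[w] - noisy))

rng = np.random.default_rng(0)
vocab = ["doctor", "nurse", "lawyer", "teacher"]
emb = {w: rng.normal(size=5) for w in vocab}
print(perturb_word("doctor", vocab, emb, eps=0.5, rng=rng))

With a small eps (strong privacy), the sampled noise magnitude tends to be large and the returned word is likely to differ from the input; the weakness addressed by [56] is that words in sparse regions of the embedding space may still map back to themselves.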
All in all, there are many approaches addressing privacy and its preservation techniques. On the one hand, we have several papers summarizing certain types of privacy preservation techniques from different angles and discussing their challenges and solutions, as well as a few papers addressing specific privacy-related situations solved by or with NLP; however, a summarization of privacy challenges and their solutions in the form of categories in the context of NLP still needs to be done. On the other hand, there are also a few papers that apply NLP to ease the handling of privacy-related text documents like privacy policies, rules, or privacy requirements. Until now, there is a plethora of privacy topics applying NLP that needs to be summarized to obtain an overview of the research field of Privacy-Preserving Natural Language Processing.

4. Methodology
We decided to conduct a systematic mapping study instead of a systematic literature review (SLR) because we seek to acquire a broader view of the research field of Privacy-Preserving Natural Language Processing [57]. The following chapter is therefore structured according to the guidelines of Petersen to increase the comparability of the conducted systematic mapping study. This chapter consists of a research question definition section in which we discuss the goals we are aiming for. Then, we delineate the search process, the strategy behind it, and the results we received. Afterward, we screen the papers to include or exclude them according to our criteria. Ultimately, we present our analysis and mapping results [11].

4.1. Definition of Research Questions


The classical approach of an SMS is to provide an overview of a research area and also illustrate current results and types of research [11]. We shifted our focus towards the challenges and their solutions in the domain of Privacy-Preserving Natural Language Processing. More specifically, we illustrate how NLP supports us in protecting our privacy and how NLP tasks can be secured against malicious users exploiting the sensitive information they were trained on. These thoughts resulted in the research questions listed in Table 1.1, which were explained in section 1.2. Usually, research questions answered by an SMS are of a high-level character and are therefore amended during the process because many findings cannot be foreseen before executing the study itself [12].

4.2. Search Process


This section elaborates on the search strategy we applied to collect all papers that address our interests. Three significant decisions need to be made regarding the search process: the impact of completeness, the validation of the search process, and the appropriate mix of search methods. Since we conduct a mapping study on a relatively new topic, completeness is rather challenging to achieve and also not always a significant hallmark of an SMS. We decided to take the papers mentioned in a series of workshops addressing NLP and privacy to validate the completeness of our search process [12]. First, we explain our thoughts on why we chose our search queries and the electronic data sources. Then, we present the initial results of the search queries applied to the respective electronic data sources and validate the search process.


4.2.1. Search Queries


We planned the queries that we would insert into the search bar of every electronic data source. All of our selected electronic data sources support the usage of boolean operators, in our case ”AND” and ”OR”. The ”AND” operator postulates that a result includes the terms on both sides of the operator; for the ”OR” operator, just one side needs to appear in the result list [58, 59, 60, 61, 62, 63, 64, 65]. The resulting queries are listed in Table 4.1. We were aware of the fact that PP NLP is a very specific research field. Because of that, we investigated at the very beginning of the mapping study how many papers address our topic directly. As shown in Figure 4.1, we divided our main term ”Privacy Preserving Natural Language Processing” into two parts and searched for synonyms or related terms that might be of interest and serve as a foundation for the generation of our search queries. Q1 represents the equivalent of the research field we analyze, consisting of the two terms ”Privacy Preserving” and ”Natural Language Processing” joined by the operator AND. Q2 also includes potential synonyms for ”Privacy Preserving”, namely ”Privacy Enhancing” and ”Privacy Guaranteeing”. Q3 aims for a very specific subtopic of our research field that investigates the interface between privacy and word embeddings. Q4 and Q5 address a wider scope of the research field. Q4 considers any combination of ”Privacy” or ”Private” and ”Natural Language Processing” or ”NLP”. The last search query, Q5, combines the synonyms of Q2 with the sub-fields of NLP handling either speech or text, inspired by the work of Pandey [66].

Figure 4.1.: Search Term Segmentation

In the next section, we will elaborate on the selection of electronic data sources we decided to consider within this thesis.


ID Query
Q1 ”Privacy Preserving” AND ”Natural Language Processing”
Q2 (”Privacy Enhancing” OR ”Privacy Guaranteeing”) AND ”Natural Language Processing”
Q3 (”Privacy” OR ”Private” OR ”Privacy preserving”) AND ”word embeddings”
Q4 (Private OR Privacy) AND (NLP OR ”Natural Language Processing”)
Q5 (”Privacy Enhancing” OR ”Privacy Guaranteeing” OR ”Privacy preserving”) AND (”Speech Recognition” OR ”Speech Segmentation” OR ”Text-to-Speech” OR ”Automatic Summarization” OR ”Machine Translation” OR ”Natural Language Generation” OR ”Natural Language Understanding” OR ”Question Answering”)

Table 4.1.: Table of all search queries applied in the search process


4.2.2. Selection of Electronic Data Sources


We selected the required sources from a collection of electronic data sources that provide broad coverage, listed in Table 4.2 [67].

Electronic Data Source Selected


IEEE Xplore Yes
ACM Digital Library Yes
EI Compendex & Inspec No
ISI Web of Science Yes
CiteSeer No
Google Scholar Yes
ScienceDirect Yes
SpringerLink Yes
Wiley InterScience Yes
SCOPUS Yes
Kluwer Online No

Table 4.2.: Table of all electronic data sources elected by [67] and their selection status in this
mapping study

Out of the eleven electronic data sources, we selected eight. There were several reasons why we did not proceed with some of the sources. EI Compendex & Inspec provides access to its database only if the sales team is contacted [68]. Since we have other sources at our disposal, we decided to exclude EI Compendex & Inspec. CiteSeer and Kluwer Online delivered empty result lists for a basic query ((privacy OR private) AND (NLP OR ”Natural language processing”)). Therefore, we concluded that further consultation would be pointless and did not proceed with them.
In Table 4.3, we listed the selected electronic data sources and the corresponding Uniform
Resource Locator (URL). The remaining eight electronic data sources were included in our
mapping study, and all search queries from Table 4.1 were executed. We will present the
results of each source in the next part according to the order of Table 4.3.

Electronic Data Source URL


IEEE Xplore https://ieeexplore.ieee.org/Xplore/home.jsp
ACM Digital Library https://dl.acm.org/
Google Scholar https://scholar.google.de/
SpringerLink https://link.springer.com/search
ISI Web of Science https://apps.webofknowledge.com/
Wiley InterScience https://onlinelibrary.wiley.com/action/doSearch?AllField=
ScienceDirect https://www.sciencedirect.com/search
SCOPUS https://www.scopus.com/search/form.uri?display=basic#basic

Table 4.3.: Table of all electronic data sources selected for the mapping study

Exporting Query Results

To acquire a good overview of our findings, we utilized Google Sheets for our mapping study. It supports a variety of analytical tools, is similar to Excel, and allows us to collaborate [69]. Based on the selected tool, we needed to export the search query results in the form of comma-separated values (CSV) or a similar format to insert them into our Google Sheets worksheet. In our experience, the only electronic data sources that did not support this export format were ACM Digital Library, ScienceDirect, and Wiley InterScience. Instead, exporting the search result lists in the form of BibTeX files was possible. Because of that, we used JabRef, a free reference manager that allows us to import the exported BibTeX files containing the search query results and export them in the form of a CSV file [70]; a sketch of this conversion step is shown below.
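A minimal Python sketch of this conversion step, assuming the third-party bibtexparser package and a hypothetical file name, could look as follows:

import csv
import bibtexparser

# Hypothetical BibTeX export from one electronic data source.
with open("acm_results.bib", encoding="utf-8") as f:
    database = bibtexparser.load(f)

# Keep only the columns used in the mapping study worksheet.
fields = ["title", "author", "abstract", "year"]
with open("acm_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for entry in database.entries:
        writer.writerow({k: entry.get(k, "") for k in fields})

In the next part, we will discuss the overview of results we achieved after executing the search queries and inserting the acceptable format into our worksheet.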

Query Results Overview

Here, we illustrate the results according to the consulted electronic data source. For this purpose, we generated Figure 4.2. In total, we received 6024 instances from all consulted electronic data sources. Most of the results were delivered by SCOPUS (1083), followed by Google Scholar (998) and ScienceDirect (956). Also, SpringerLink (881) and ACM Digital Library (822) contributed a significant number of instances. The fewest instances were contributed by Wiley InterScience (499), IEEE Xplore (398), and ISI Web of Science (387). Technical issues occurred during the execution of Q4 for SpringerLink; therefore, it was not possible to export the resulting list. We contacted the support team of the corresponding electronic data source, but the issue was not resolved in time to include the results in our mapping study. Next, we will explain how we filtered the initial results we acquired so far.

Figure 4.2.: Search Query Results

4.2.3. Result Filtering


When we had a look at the initial results, we realized that they contained many duplicates. Therefore, we decided to apply several functions from Google Sheets, namely COUNTUNIQUE, QUERY, and SORTN. COUNTUNIQUE allows us to count all entries in a column that are unique [71]. Each included electronic data source had its own sheet to avoid any form of format issue or accidental overwriting and to keep an overview of which source delivered most of the results. The QUERY function helped us to merge all sheets together without the risk of automatically overwriting entries from other electronic data sources. The function takes two inputs: the first specifies which data sheets are meant, and the second is a query that works precisely like a database SQL request. In our case, the query was ”select * where Col1 is not null or Col2 is not null”; ”Col1” and ”Col2” represented the paper title and the authors, respectively. If both of these were empty for an instance, it was not included in the validation process because both columns are essential and indicate the completeness of a source [72]. The SORTN function delivers n entries of a list after performing a specific sort task. This function is particularly vital for us because, with tie mode 2, it helps us to eliminate duplicates that exist within the result list output by the QUERY function [73]. Since we wanted to keep all entries of the resulting list output by the SORTN function, we inserted ”9^9” as n because we did not know how many entries to expect [74]. A spreadsheet-independent sketch of these filtering steps is given below, after the result discussion.
After the execution of SORTN, we received the results observable in Figure 4.3. At first sight, the number of delivered results dropped significantly. In total, we observed the removal of 1600 duplicates. Most of the duplicates were part of the result lists of SCOPUS (529), Google Scholar (390), and ScienceDirect (253), which is observable in the ∆ column of Table 4.4 as the difference of the Initial and the Filtered columns. Furthermore, we were interested in which electronic data source has the best ratio (so far) of unique results to all delivered results. According to Table 4.4, Springer (93.98%), IEEE (89.95%), and ACM (89.29%) embody the best ratios. After completing the search process, we face 4424 presumably distinct papers in the paper screening phase of the mapping study, which we will further discuss in the next section.
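As referenced above, the filtering logic can also be expressed independently of the spreadsheet. The following pandas sketch mirrors the QUERY and SORTN steps under our own assumptions (hypothetical file names, a two-column title/author layout):

import pandas as pd

# One exported CSV per electronic data source (hypothetical file names).
sources = ["acm.csv", "ieee.csv", "scopus.csv"]
frames = [pd.read_csv(path, usecols=["title", "author"]) for path in sources]
merged = pd.concat(frames, ignore_index=True)

# Equivalent of "select * where Col1 is not null or Col2 is not null":
# drop only instances for which both essential columns are missing.
merged = merged.dropna(subset=["title", "author"], how="all")

# Equivalent of SORTN with tie mode 2: keep one row per title/author pair.
unique = merged.drop_duplicates(subset=["title", "author"])
print(len(merged) - len(unique), "duplicates removed")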

Figure 4.3.: Filtered Search Query Results

4.2.4. Validation of the Search Process


To validate the completeness of the search process, we apply a suggested method that requires us to select papers addressing PP NLP that were known to us from our preliminary search [12]. We chose to utilize the papers mentioned in the workshop ”PrivateNLP 2020@WSDM 2020” [33] because those workshops are the foundation of the research field investigated by us. In Table 4.5, we highlight that all papers covered by the workshop ”PrivateNLP 2020@WSDM 2020” also appeared in the results of our search process, even though our search queries contained no terms specific to these papers except for ”word embeddings”. Since we validated the results of our search process, we can now execute the paper screening process, in which we decide which paper will be included in the mapping study or excluded from it; a small sketch of this title-matching check follows Table 4.5.


Electronic Data Source Initial Filtered ∆ Ratio (%)


ACM 822 734 88 89.29
Google Scholar 998 608 390 60.92
IEEE 398 358 40 89.95
ScienceDirect 956 703 253 73.54
SCOPUS 1083 554 529 51.15
Springer 881 828 53 93.98
Web of Science 387 247 140 63.82
Wiley 499 392 107 78.56
Total 6024 4424 1600 73.44

Table 4.4.: Table displaying the proportions after the filtering of the results


Paper Title Appearance


Privacy-Aware Personalized Entity Representations for Improved User Understanding [47] Yes
Classification of Encrypted Word Embeddings using Recurrent Neural Networks [49] Yes
A User-Centric and Sentiment Aware Privacy-Disclosure Detection Framework based on Multi-input Neural Network [48] Yes
Is It Possible to Preserve Privacy in the Age of AI? [50] Yes

Table 4.5.: Table of all papers selected for the search process validation
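The validation check referenced above amounts to testing whether each benchmark title from Table 4.5 has a (near-)exact match in the filtered result list. A small Python sketch with illustrative variable names is given here:

import difflib

def validate(benchmark_titles, result_titles, cutoff=0.9):
    # Report for each benchmark paper whether a close title match exists,
    # tolerating small formatting differences between data sources.
    for title in benchmark_titles:
        hit = difflib.get_close_matches(title, result_titles, n=1, cutoff=cutoff)
        print("Yes" if hit else "No", "-", title)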

4.3. Paper Screening for Inclusion or Exclusion


This section explains our self-defined criteria that determine whether a paper is included in or excluded from our study. Furthermore, we delineate the idea behind them. For a lone researcher with supervision, it is advised to go through the search query results multiple times in different orders and to ask the supervisor to randomly select some papers to assess the quality and give feedback. Since we have a tremendous number of research papers that we consider to be part of the research field of our interest, we examined just the paper title in the first round of validation and rarely the abstract, namely if the title of the paper did not indicate its content [12]. After that, the screening is executed according to self-defined criteria that allow the supervisor or other curious researchers to reproduce the results of the validation phase [11]. In Table 4.6, the aforementioned inclusion and exclusion criteria are listed; a hypothetical sketch of a scripted first-pass filter follows the table. We included papers that covered terms related to privacy preservation techniques in combination with NLP for natural language in the form of text or speech in their title or abstract. Then, if neither the title nor the abstract delivered clear information about the content of the paper, we inspected the keyword section for terms related to privacy or NLP. In general, we were interested in the synthetic generation of text or speech since this topic also contributes to our research field of interest. Ultimately, we also considered any type of involvement of NLP in the processing of privacy regulating documents like privacy policies and in complying with them. We excluded every paper that mentioned neither privacy nor NLP in its title or abstract. Deliberately, we omitted papers handling steganography because we question its privacy sustainability. Additionally, we left out all papers that processed privacy regulating documents merely aided by crowdsourcing. If an instance represented a conference or workshop summary, a content list, a chapter overview, a book, or a patent, we also excluded it because we want to analyze papers reviewed by other researchers of the conference or workshop program committee. If there was no possibility to access the paper securely or at all, or if it was not written in English, we also excluded the paper from the mapping study. In the next part, we will present the results of the screening process.

Inclusion Criteria:
The appearance of terms closely related to privacy preservation and NLP in the title or the abstract
Privacy and Natural Language Processing or related terms appear in the keyword section
Generation of speech or text synthetically
Processing of any privacy regulating document

Exclusion Criteria:
Lack of privacy in the title or abstract
Lack of Natural Language Processing or its sub-domains in the title or abstract
Conference or workshop summary, content list, chapter overview, books or patents
Restricted access to the content
Any topics related to steganography
Insecure access to the paper
Processing of Privacy Policies with crowdsourcing
Papers not written in English

Table 4.6.: Table of all criteria applied in the screening process
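The screening itself was performed manually. Purely as an illustration of the first-pass criteria from Table 4.6, a scripted approximation could look as follows (the term lists are our own simplification, not the exact criteria):

PRIVACY_TERMS = {"privacy", "privacy-preserving", "anonymization", "de-identification"}
NLP_TERMS = {"nlp", "natural language", "text", "speech", "word embedding"}

def first_pass(title, abstract=""):
    # Keep a paper only if both a privacy-related and an NLP-related
    # term appear in its title or abstract (case-insensitive).
    text = (title + " " + abstract).lower()
    has_privacy = any(term in text for term in PRIVACY_TERMS)
    has_nlp = any(term in text for term in NLP_TERMS)
    return has_privacy and has_nlp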

After screening the results of the search process according to our criteria and removing additional duplicates, 666 papers (15.1 percent) remained within our mapping study, as can be seen in Figure 4.4. The next step is to validate the screened papers.

4.4. Validation
To reduce the bias within the mapping study and also assess the quality of the search process results, the advisor of this thesis validated a random selection of the screened papers because of the vast number of papers potentially relevant for our work [12]. In this section, we will explain how we conducted the validation and which outcome we received.
Since we conducted the mapping study with the support of Google Sheets like [75], it was possible to share a link to the sheet containing the screened papers with the advisor for validation.

Figure 4.4.: Proportion of remaining papers

The advisor filtered the list to inspect just the screened papers that we considered valuable for the mapping study. We used three different categories to communicate in the validation process. Generally, we used ”1” as an equivalent for the inclusion of the paper in the mapping study and ”0” otherwise. Then, we applied the question mark (”?”) for uncertainty. Additionally, we commented on our decision for the options ”0” and ”?” in brackets after the corresponding sign. Table 4.7 displays an overview of the comments made by the advisor during the validation.
As can be observed in Table 4.7, the advisor discovered ten uncertainties among the 179 comments. Two uncertainties were caused by the papers titled ”A DNS Tunneling Detection Method Based on Deep Learning Models to Prevent Data Exfiltration” [76] and ”A Full-Text Retrieval Algorithm for Encrypted Data in Cloud Storage Applications” [77]. The belonging of Zhang’s paper to our mapping study is indeed debatable. However, data traffic carries private information, and the developed data exfiltration prevention is based on NLP [76]. Song’s team worked on an information retrieval algorithm on encrypted data that avoids the disclosure of private information in a cloud-based environment by applying NLP [77].
The following comment in the table questions the relevance of NLP and text data in Li’s review paper about applications applying federated learning [78]. After another review of the paper, we decided to exclude it from the mapping study because its focus did not highlight NLP-related applications.
Meenakshi’s team published a paper reviewing security attacks and protection strategies for machine learning, which lacked a privacy and NLP focus. Therefore, we agreed to exclude the paper from our work [79].

Comment Types Amount


? (addresses security issue) 2
? (does it mention NLP? / text data) 1
? (security rather than privacy issues) 1
? (security) 1
? 1
? (not textual data as far as I understand) 1
? (what data type do they work with?) 1
? (do they address privacy issues?) 1
? (what data type?) 1
0 (duplicate) 36
0 (no privacy?) 2
0 (doesn’t address privacy issues?) 1
0 (doesn’t seem to address privacy issues) 1
0 (spatial data) 1
0 (location data?) 1
0 (location privacy) 1
0 (not text?) 1
0 (not about NLP) 1
0 (doesn’t seem to talk about NLP/text data) 1
1 123
Total 179

Table 4.7.: Table of all comments made by the advisor during the validation

The same issue appeared for Liu’s survey paper focusing on security threats and defensive techniques for machine learning from a data-driven perspective [80]. The fifth entry of the table states a general uncertainty towards a paper developing a flexible text corpus for specific linguistic research that did not address privacy [81]. The following comment of uncertainty questions the data sets worked on in the paper of Rashid’s team; the data sets MIDUS and ADULTS contain personal information, also in text form, that is protected either by de-identification or differential privacy [82]. The paper of Tsai’s team, whose data type another comment asked about, is not part of this thesis since it works with transactional data that does not include text [83]. The work of Ji’s team does not address a privacy issue directly; rather, a privacy issue causes a scarcity of data that impedes the development of the medical research area [84]. The last entry refers to the question about the data type used within the paper, which is answered within the abstract: the approach to preserve privacy in linked data anonymization using queries is data-independent [85].
We continue with the comments that did not consider the selected papers to be part of this work. 36 comments pointed out that the selected papers are duplicates, which is true, and they were removed. This could easily happen given the number of papers being part of the screening process and not filtered out beforehand. The next comment group mentioned


the lack of privacy issues relevant for the papers titled ”A multi-task learning-based approach to biomedical entity relation extraction” [86], ”Towards Combining Multitask and Multilingual Learning” [87], and ”DUT-NLP at MEDIQA 2019: An Adversarial Multi-Task Network to Jointly Model Recognizing Question Entailment and Question Answering” [88]. The first paper was wrongly selected because of the applied shared-private model architecture, which refers to the visibility of the parameters within the system and not to our understanding of privacy in the sense of intentionally or unintentionally disclosing sensitive information to a third party [86]. The same issue also occurred for the second paper [87] and the third one [88]. The paper of Roberts’ team is likewise motivated by a lack of data in the cancer research field caused by privacy issues. The aim of the paper is therefore to propose annotations and a framework that works on the semantic level and, thus, introduces flexibility to the annotation scheme as well as an approach to preserve the privacy of annotated data sets [89]. Three comment types concern location-oriented privacy, which is describable in the form of numbers or words; they refer to the papers titled ”SpatialPDP: A personalized differentially private mechanism for range counting queries over spatial databases” [90], ”Towards privacy-driven design of a dynamic carpooling system” [91], and ”Quantifying location privacy” [92]. The following comment assumes that no text data is involved in the work of Gong’s team, which is not verifiable because there is no access to the entire content of the paper [93]. The last two comment types question the focus on NLP in the papers ”Perceived privacy” [94] and ”Distributed Latent Dirichlet allocation for objects-distributed cluster ensemble” [95]. Al-Fedaghi presents in his paper a model that connects privacy and security based on information exchange between two parties with measurable metrics for both domains; however, the application of NLP is missing [94]. The second paper works on privacy preservation applying distributed Latent Dirichlet allocation, a topic modeling method, and was published at the International Conference on Natural Language Processing and Knowledge Engineering in 2008 [95].
The last remaining comment type is the approving one. The advisor approved 123 papers; however, 118 papers made it into the mapping study in the end. Here, we will explain the five papers that we did not include in the mapping study. One of them was excluded because it was based on evaluating the numerical input of employees to investigate their stance on privacy by letting them choose from different levels of privacy in exchange for a payment; a connection to NLP is not present [96]. Hakkani-Tur’s team developed a patent for a privacy-preserving database that cooperates with natural language; however, we are interested in papers from conferences and workshops [97]. Another paper, titled ”Identifying textual personal information using bidirectional LSTM networks”, sounded promising, but just the title and the abstract were written in English; the rest was in Turkish [98]. The last two papers, titled ”Reasoning about unstructured data de-identification” [99] and ”Towards Efficient Privacy-Preserving Personal Information in User Daily Life” [100], did fit this thesis, but there was no access to their full content.
All in all, Table 4.8 illustrates the consensus between the paper screening phase and the validation. Out of the 123 approved papers, 118 were included in the mapping study, resulting in an inclusion rate of 95.9 percent. Also, just one out of the 46 excluded papers is a part

of the mapping study, and three papers out of the ten uncertain ones take part in the next stage of this thesis. During the advisor’s validation process, the author conducted further analysis tasks on the screened data, in which a few alterations occurred that led to a slight deviation. Still, all the results mentioned in the table highlight that the advisor and the author of the thesis have the same understanding of the research field of interest. In the next part of the thesis, we enter the keywording phase.

Comment Category Amount Coverage in SMS Ratio (%)

1 (Inclusion) 123 118 95.9
0 (Exclusion) 46 1 2.2
? (Uncertainty) 10 3 30.0

Table 4.8.: Overview of all comment categories and their coverage within the mapping study

4.5. Keywording of Abstracts


In the process of keywording, the author inspects the abstracts of the validated papers and derives suitable categories and schemes based on the findings. If the abstract of a paper does not deliver sufficient information, the introduction or the conclusion of the paper can be used for the keywording process [11]. This part will focus on the results extracted from the keywording process. One key outcome of a mapping study is to give an overview of a specific area by classifying the articles. There are two types of classification schemes: the topic-dependent one and the topic-independent one. We decided to stick with a topic-dependent one with a focus on issues and their solutions [13]. The complete analysis tables are located in section A.1 in the appendix of this thesis.
At the very beginning of this process, we realized that we need to distinguish two top categories for privacy-preserving NLP, namely NLP as a privacy enabler and NLP as a privacy threat. As a privacy enabler, NLP introduces privacy into a data set or an activity. As a privacy threat, NLP endangers the privacy of data engaging with it. Both top categories need to deliver an appropriate answer to the research questions from Table 1.1. With RQ1, we desire to know which privacy-related challenges exist for NLP. With RQ2, we attempt to display the solutions for the corresponding problems within the domain of Privacy-Preserving Natural Language Processing. After answering RQ1 and RQ2, we will have an overview of the research field and can detect gaps within the landscape of Privacy-Preserving Natural Language Processing.
Since we utilized Google Sheets for this thesis, we used the basic columns that were mainly provided by the export function of the corresponding electronic data source, namely title, author, abstract, and year. In addition to these basic columns in our worksheet, we added the columns displayed in Table 4.9 to the corresponding top category to answer the aforementioned research questions.


NLP as Privacy Enabler:
Domain (RQ1)
Classification of Use Cases (RQ1)
Generalized Privacy Issue (RQ1)
Generalized Privacy Issue Solution (RQ2)
Generalized Category of Applied NLP Concept (RQ2)
NLP Method Type (RQ2)

NLP as Privacy Threat:
Domain (RQ1)
Use Case Classification (RQ1)
Data Type (RQ1)
Generalized Issue/Vulnerability solved for NLP (RQ1)
PETs (RQ2)

Table 4.9.: Table of all sub-categories with the addressed research question

For both top categories, we observe the domain they engage with to better classify the different privacy challenges. We expect both top categories, the one utilizing NLP as a privacy enabler and the one viewing it as a privacy threat, to contain a wide variety of use cases that will help to delineate the potential of the inspected research field. Then, we highlight the prevalent privacy issues that are solved by the application of NLP. For the category in which NLP is a threat, we will also point out which data type is more affected by NLP, and we will provide an orientation for the different privacy challenges within this top category in the form of an overview with aggregated issue schemes. To answer RQ2, we propose an aggregated scheme to display which privacy issue solutions exist in both top categories. To solve the privacy issues caused by NLP, privacy-enhancing techniques (PETs) are applied [101]. For the other top category, we wanted to aggregate the different solutions in which NLP was involved and also which NLP concept and method type were applied. The following two subsections will specify which terms were applied within the subcategories according to the top category they appear in.

4.5.1. Subcategories Natural Language Processing as Privacy Enabler


In this part, we delineate all the terms appearing in each subcategory. The worksheet for the corresponding top category is located in subsection A.1.1. We start with the terms from the domain and continue with the use case classification and the general privacy issues. Afterward, we explain the terms included in the generalized privacy issue solution part, the category addressing the applied NLP concepts, and the NLP method type category.

Domain

The domain column in our worksheet supports our idea to classify the different validated papers from our mapping study. This gives us the opportunity to structure the overview better and also provides the recipient of this study with the option to pay attention to a specific area of expertise. A domain is well defined by Merriam-Webster as a sphere of knowledge [102]. Table 4.10 displays all the major domains appearing within the validated papers. We observed a high frequency of papers conducting research either on privacy policies [103, 104] or other privacy regulating documents [105, 106], or on data from the medical area [107, 108, 109, 110]. Also, mobile applications [111, 112] and social networks [113, 114] were addressed by multiple papers, which is reasonable since those topics are omnipresent in the daily routine of most of us. Then, we realized that several papers either address a rather specific topic, like privacy-preserving mining of search queries [115], big data [116], or finance [117], which we categorized as ”Other”, or work on a generic topic, which we then classified as ”General” [118, 119]. Next, we will specify the use case classification.

Domain
Law
Medicine
Mobile Application
Social Network
General
Other

Table 4.10.: Table of all terms occurring in the category Domain, a subcategory of NLP as a
privacy enabler

Classification of Use Cases

As with the domain category, we attempt to dive deeper into the classification of the papers and extract the different use cases applied within the papers we obtained from our search. Unlike the domain part, the use case classification helps us to grasp which scenario is supported with NLP to guarantee privacy. As can be seen in Table 4.11, we decided to apply two layers of generalization to keep the extracted information manageable because of the tremendous number of papers within this top category. The more specific layer of abstraction is depicted in the column ”Use Case Classes”, and the more generalized version is the second column, ”Generalized Category”. Where possible, we positioned a keyword in front of the term in order to map the terms better to aggregated schemes.
The first generalized category in the table is ”Annotations and Training Data”, aggregating all use cases aiming at the improvement of the annotation process or the collection of training data. The improvement is realized in the form of automation or support [118] or revision [120] of the annotation process. Also, annotations or specific data sets like privacy policies were collected to accelerate the research in a certain field [121], or a gold standard was developed [122].
The next category in the second column is ”Automation”. Here, we extracted two directions, namely the automation of flexible rule enforcement [123] and the automation of privacy regulations [124]. The following two categories were specific enough and did not embody any further specifications, namely ”Browsing in a private manner” [125, 126] and ”Definition of privacy requirements” [127, 128].


One major generalized category is the simplification of privacy-related regulations. We discovered several approaches in this area, ranging from the simple comprehension of privacy regulating documents [129, 130] to the comparison [131], compliance [132], writing [133], and implementation process [134].
Another category is related to investigations, mostly aiming at mobile applications or related topics, to expose or check their privacy impact [135]. Especially the rightfulness of access privileges and their coverage by the app description is inspected [136].
The next two categories work on measuring the compliance of apps with a privacy regulating document via a metric [117] and on mining according to rules that guarantee privacy [137].
Another category focuses on the increase of attention towards privacy policies [103], and we observed that some papers also combine this use case with the comprehension of privacy policies [138].
Then, there is a common category aiming at a privacy-preserving way to share data with third parties [139, 140, 141, 107]. We did not specify this category further because the added value would be negligible.
Further, there is another major category within the table that specifies the protection of sensitive information in unstructured textual data in general [142], in the clinical context [143], or on social media platforms [144]. Here, we distinguished between the protection of patient privacy and the protection of privacy unrelated to medicine. The next generalized category is searching in a privacy-preserving manner.
In an additional section, we introduced voice-related services that either protect the voice in public [145] or preserve the privacy of the user’s voice during a voice-based assistance service [146], predominated by the health sector [147, 148].
The last general category in the table is dedicated to the services that require cooperation with the voice.
Since we have now gathered enough categories describing the domain in which a specific scenario takes place, we next require the privacy issue appearing within a scenario in a certain domain.

Generalized Privacy Issue

With this column, we attempt to extract the privacy issue tackled within a paper in a generalized manner. For this purpose, we re-applied parts of the privacy issues specified by Boukharrou for cloud-based home assistance and Internet of Things (IoT) devices, a topic closely related to ours. Boukharrou divides privacy issues in the context of speech-based systems processing user requests in the form of natural language into five categories. Three of them have proven useful to this thesis, namely identification, profiling, and data linkage. Identification alludes to all information that is unique to an individual and might be sufficient to identify a person. Profiling outlines the possibility of collecting information about a person to increase the predictability of their actions and behavior. Data linkage refers to the aspect that information about an individual could be combined with other available data and lead to conclusions about other facts. Those three terms are part of the scheme listed in Table 4.12 [146].


Moreover, we extracted several more privacy issues from our results. One of these privacy issues handles the topic of disclosure of sensitive data for analysis [82, 149] or other reasons, intentionally [150, 146] or unintentionally [151, 152]. Then, we observed a high density of the issue of the complexity of privacy policies [141, 153, 103, 129] and difficulties with compliance with requirements [154, 155, 156]. We are aware of the fact that the issues with requirements compliance are caused by the complexity of privacy policies, but we observed a high frequency of this topic; therefore, we want to highlight this fact in our work. Then, there is a group of terms that represents privacy issues caused by the possession of sensitive data by a third party and a potential secondary use [107, 157]. The term ”secondary use of unstructured clinical data” was taken from [158]. Here, not just the handling of those data needs to be regulated [142, 159], but also the misusage of sensitive data needs to be pointed out [160, 161]. The inflexibility of anonymization tools [162] is another term within Table 4.12 resulting from the keywording session that caught our attention. We also observed that annotations and their quality were frequently mentioned in the literature we collected [122, 163, 121]. A more fundamental topic within the papers we found is the fact that sensitive information is contained in unstructured data sets [164, 165, 166].
So far, we devised categories depicting privacy-related challenges solved with NLP; now, we attempt a similar approach for the solutions of the aforementioned privacy issues. The next category we introduce aims at the generalization of privacy issue solutions.

Generalized Privacy Issue Solution

As for the privacy issues, we also require a generalized scheme for their solutions. We attempt to map privacy issues and their solution approaches in an overview to ease the extraction of patterns. Since we observed a wide variety of use cases within Privacy-Preserving Natural Language Processing, we deepened our investigation of the privacy solutions and introduced the categories standing for the applied NLP concepts and method types, explained in the subsequent sections. As for the use case classification, we discovered a variety of solutions. Thus, we again provide a two-level abstraction approach to structure the overview better. The mapping of this aggregated view and the specific terms we found are located in subsection A.2.1. Table 4.13 displays the top level of the extracted terms.
We observed that a plethora of papers tried to solve the privacy issues with automation. Not just the enforcement of privacy policies [123, 37, 167] and access control are desirable to automate, but also the analysis of privacy policies [168], the compliance check [154, 169, 132], and the anonymization of written text without any user interaction. Another interesting category was the collection of annotated corpora for different purposes [104, 170]. We also detected another niche within the validated literature, which focuses on the privacy analysis of mobile application code [155, 171, 172]. One of the largest sections we found focused on the de-identification of text data, mainly in the medical area to protect patient privacy [107, 108, 109, 110], or the obfuscation of authorship [173, 174]. There is one paper that demonstrates potential threats within a fitness application [175].
Another big block in the privacy issue solution category is dedicated to the design of search


engines [176, 177], metrics [117], frameworks [178, 48], or algorithms to protect the privacy of an acting user, for example, their gender [179], or of the data subject within a data set.
The next entry in Table 4.13 is ”Detection”. Specifically, this means the search for irregularities between privacy regulating documents [180, 181] and the actual behavior or description. On top of that, there is the aim of detecting sensitive data within unstructured data [182] or activities [183]. A smaller part of our findings is dedicated to encryption-based solutions [139, 184].
Another upcoming topic within this category is the generation of artifacts [185, 186] or synthetic data [187, 188] based on data sets or a privacy regulating document. Then, we observe interest in the information extraction topic. The most frequent goal is to extract information out of privacy policies [189, 190].
Another trend is to map the content of privacy policies to different metrics [191, 106] or to other privacy regulations like the General Data Protection Regulation (GDPR) [192, 193]. Additionally, there are ontology-based solutions focusing on privacy in general to shape a more universal approach to cope with the topic [194, 195].
A huge variety is also present in the aggregated scheme of overviews found in the literature we extracted from the search. The most frequent topics of the overviews are the applied methods of handling medical data in a privacy-preserving manner [196, 197] or a research field [198, 199].
The following aggregated term points out the solutions based on the semantic level. In order to extend the application domain of developed solutions and tools [200], the semantics of language is utilized because of its universal nature [136].
Earlier, we mentioned the de-identification of textual data; however, there is also research conducted in the area of speech de-identification, in which the content of a speaker’s utterance is needed in order to execute a task while the voice itself is negligible [148, 201].
The last category within Table 4.13 is the suggestion of security and privacy requirements in different ways and for diverse targets [202, 203, 204].
The next section will elaborate on the different levels of NLP applied to solve the privacy issues.

Generalized Category of Applied NLP Concept

In order to compare the extracted information precisely, we are obligated to map our findings to an equal level. Thus, we were curious about the NLP concepts applied within our findings. We define an NLP concept as in Liddy’s work on the ”Levels of Natural Language Processing” [26], which distinguishes between several levels based on the investigated linguistic domain, also discussed in this thesis in subsection 2.1.3, namely morphological, semantic, and phonological (speech processing). We understand that the morphological analysis focuses on the extraction of those morphemes that carry the meaning of a word or on removing all those parts of the word that blur its shape. The semantic analysis is dedicated to the meaning of a word more than to its shape. Speech processing is a topic of its own since voice is based on sound; however, it does communicate the words in which we are interested [26].


In Table 4.14, we see all the applied concepts we discovered within the selected literature. Some papers did not apply any of those concepts because they just pointed out possible improvements for privacy requirements or possible threats, like the work of Ye, which extracted different privacy and security challenges that occur when dealing with chatbots [205]. Morphological analysis is exemplified by the work of Fujita, who developed a framework of rules in which privacy policies are parsed in order to increase their readability [153]. Semantic analysis is conducted in Hasan’s work, which uses neural networks and word embeddings in order to capture the semantic relationship between words to detect sensitive information and anonymize it [206]. Di Martino’s paper shows a good example of the combination of morphological and semantic analysis by applying Part of Speech (PoS) tagging and Named Entity Recognition (NER), respectively, to build an automated annotation tool to anonymize sensitive information in documents in the Italian law domain [207]. The creation of an annotated corpus of privacy policies, like Wilson’s work, illustrates the combination of morphological analysis in the form of Part of Speech (PoS) tagging, semantic analysis with Named Entity Recognition (NER), and syntactic analysis through the identification of sentence boundaries [208]. We also treat speech processing as a lone category if just the voice is de-identified, for example, in order to keep a patient’s voice private as it is transmitted via the internet or a phone call while it is still possible to detect symptoms of depression [148]. A paper that contributes just to speech processing and morphological analysis is the work of Boukharrou, which injects random noisy speech requests, also utilizing Part of Speech (PoS) tagging, in a smart home environment in order to avoid profiling or possible linkage [146]. Semantic analysis and speech processing are well represented in the work of Ye, who developed a system that covers and identifies sensitive information with Named Entity Recognition (NER) so that people in public cannot read the text on the smartphone screen through peeking; however, when the smartphone owner requires the covered text, the system is able to read out the actual information applying a text-to-speech approach [151]. The combination of all three concepts within our validated papers was observable in the work of Qian, who conducts research on privacy-preserving speech data publishing: it is not sufficient to de-identify just the voice; it is also important to protect the content of the voice data [140]. A small illustration of the semantic level follows below.
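As a concrete illustration of the semantic level, the following sketch masks named entities with spaCy’s pre-trained NER model; this is our own minimal example, not the pipeline of any of the cited papers:

import spacy

nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline

def mask_entities(text, labels={"PERSON", "GPE", "LOC", "ORG"}):
    # Replace every named entity of the given types with its label,
    # e.g. "Alice lives in Munich" -> "[PERSON] lives in [GPE]".
    doc = nlp(text)
    out = text
    for ent in reversed(doc.ents):  # replace from the end to keep offsets valid
        if ent.label_ in labels:
            out = out[:ent.start_char] + "[" + ent.label_ + "]" + out[ent.end_char:]
    return out

print(mask_entities("Alice Miller was treated in Munich in 2019."))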
Furthermore, we also classified the different approaches in which NLP was applied, as discussed in the next section.

NLP Method Type

Liddy’s work also delineates the different approaches used to analyze text. It divides them into three different parts: the symbolic (rule-based), the statistical, and the connectionist approach, which corresponds to today’s understanding of neural networks [26]. We introduced this category in order to structure the overview better and to track the applied technologies within the privacy solution domain in the context of NLP. Table 4.15 lists eight different combinations of NLP method types. The first one is ”None” because no method is applied. ”NN” stands for neural network and refers to the papers that apply deep learning, like Hassan’s paper detecting sensitive data in unstructured text by the application of word


embeddings [206]. Tesfay and his team investigated and presented the existing challenges in detecting sensitive information in unstructured data sets without applying any NLP concepts or methods [7]. There are also clearly rule-based approaches, like Olsson’s paper designing a framework for a more efficient way of annotating corpora [118]. A rule-based and neural network-based approach is shown in Ravichander’s work, collecting a question answering corpus with 1750 questions about privacy policies of mobile applications annotated by 3500 expert answers [209]. The mix of a rule-based and a statistical approach is exemplified by Deleger’s work setting up an annotation gold standard and evaluating it with a statistical model to prove its value [163]. In general, we consider all papers applying the Natural Language Toolkit (NLTK) or similar auxiliaries as part of the ”Rule based & Statistical” category since all of them offer support based on rules or statistical concepts. There is also the Stanford CoreNLP natural language processing toolkit, which embodies all three categories [210]; therefore, we consider every paper that uses this tool or similar ones to be part of the respective category, like Pan’s paper that investigates the information flow of Android applications to detect privacy violations [211]. A purely statistical approach is used within Neto’s paper, which investigates the mining of personal query logs without taking into account work-unrelated queries [115]. The combination of statistical and neural network-based approaches is well illustrated in the work of Lindner and his team, which applies both categories in order to analyze the coverage and content of privacy policies presented on websites [129].
The next section will describe the other subcategories for the category addressing NLP as a privacy threat.

4.5.2. Subcategories Natural Language Processing as Privacy Threat


Here, we will specify the terms appearing in the subcategories of the second top category. The worksheet for the corresponding top category is located in subsection A.1.2. Again, we start with the terms within the domain subcategory, then cover the data type and the generalized privacy issue solved for NLP subcategories. The last terms delineated in this part belong to the PET subcategory.

Domain

This category is similar to the one mentioned for the other top category in section 4.5.1. We orient ourselves according to the definition of a domain by Merriam-Webster and see it as a sphere of knowledge [102]. We distinguished between five different domains listed in Table 4.16. We discovered a plethora of papers that referred to the cloud computing domain [212, 77, 213, 214]. Then, we noticed the presence of home automation [215, 216]. This top category also embodies ”Medicine” as a domain [217, 218]. As in section 4.5.1, we introduce the domains ”General” and ”Other”: ”General” stands for a more generic approach to solving a problem for which we could not find an adequate domain [219, 220], and the domain ”Other” contains domains that are very specific [221, 222]. The next subcategory we delineate is the data type.


Data Type

The data type is relevant to us because we attempt to find out whether a prevalent focus on text (”Written”) or rather on ”Speech” data is present within the research field of Privacy-Preserving Natural Language Processing. On top of that, we want to discover the different use cases, see how the issues were solved for NLP, and cluster these solutions.

Use Case Classification

We choose to extract use case categories as well, to support the understanding of potential
scenarios in which NLP preserves privacy and improve the orientation of this mapping study.
Table 4.17 displays an aggregated overview from the more detailed use case list located in
subsection A.2.2.
A frequent use case appearing in the papers we found is the classification of documents
without disclosing the data [223] or based on encrypted data [49]. Another block within
our categorization was dedicated to investigating the impact of NLP concepts such as word
embeddings [9] or collaborative deep learning [224] on privacy. Additionally, we introduced
a term for model training without the necessity to contribute the actual data: either the
neural network is separated into two parts, with the part containing sensitive information in
its parameters located on the client side and the other part on the server side [225], or the
concept of federated learning is applied [226, 227]. Another topic we discovered is the
privacy-utility trade-off, which applies a privacy-preserving concept to neural text
representations [228] or word embeddings [229] in order to protect the privacy of the data
subjects while leaving enough information to perform an analysis on the respective data.
Moreover, we noticed a group of papers about secure communication without the necessity
to disclose sensitive information [47] or every feature of the voice [230]. Then, we noticed a
niche referring to similarity detection without exposing the data to the analysing party
[231, 232]. One of the largest segments in this section is dedicated to speech-related services
such as emotion recognition [233], speech transcription [234] or speech verification [235]
that avoid the complete disclosure of all voice features to a third party. Ultimately, we
devised a category focusing on storing and searching data without its exposure, mostly
rendered on encrypted data [236, 237], and one on summarization without document
disclosure [238]. The next section is dedicated to the generalization of issues or
vulnerabilities for NLP.
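
Before turning to those generalized issues, the privacy-utility trade-off just mentioned can be
made concrete with a minimal sketch. Everything in it is invented for illustration: the toy
three-word vocabulary, its three-dimensional embeddings, and the plain Gaussian noise; works
such as [229] instead use high-dimensional pre-trained embeddings and calibrate the noise to
formal privacy guarantees.

```python
import numpy as np

# Toy vocabulary with hypothetical 3-dimensional embeddings.
VOCAB = {
    "flu":  np.array([0.9, 0.1, 0.0]),
    "cold": np.array([0.8, 0.2, 0.1]),
    "car":  np.array([0.0, 0.9, 0.8]),
}

def perturb_word(word, scale, rng):
    """Add Gaussian noise to a word's embedding and return the nearest
    vocabulary word: small noise keeps utility, large noise adds privacy."""
    noisy = VOCAB[word] + rng.normal(0.0, scale, size=3)
    return min(VOCAB, key=lambda w: np.linalg.norm(VOCAB[w] - noisy))

rng = np.random.default_rng(0)
print(perturb_word("flu", scale=0.05, rng=rng))  # usually "flu" or "cold"
print(perturb_word("flu", scale=2.0, rng=rng))   # often an unrelated word
```

The noise scale controls the trade-off: small noise mostly returns the original word or a close
neighbor, preserving analysis value, while large noise returns near-random words, maximizing
deniability but destroying utility.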

Generalized Issue/Vulnerability solved for NLP

The generalized view on privacy issues caused by the involvement of NLP supports our aim
to provide a better orientation for the recipients of this mapping study. As can be seen
in Table 4.18, this category was inspired by the research on the vulnerabilities of word
embeddings, information leakage in language, and the unintended memorization of neural
networks conducted by Pan [9], Koppel [239] and Carlini [8], respectively.
The paper by X. Pan, M. Zhang, S. Ji, and M. Yang describes the exploitability of the learned
parameters of word embeddings, which allows an attacker to reverse-engineer those
parameters and extract the data used for training the word embeddings [9]. An example
for this category is the work of Podschwadt and Takabi, who apply homomorphic encryption
in order to protect the sensitive information of a user attempting to train a Recurrent Neural
Network (RNN) by using encrypted word embeddings [49].
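
The schemes actually used for encrypted training operate on real-valued vectors; purely as a
minimal illustration of the homomorphic principle, the following toy textbook Paillier sketch
(tiny, insecure parameters chosen for readability, not the scheme used in [49]) shows how two
ciphertexts can be combined so that the result decrypts to the sum of the plaintexts:

```python
from math import gcd

# Textbook Paillier with tiny, insecure parameters (illustration only);
# requires Python 3.8+ for pow(x, -1, n) modular inverses.
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
g = n + 1

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m, r):
    # r must be co-prime with n; fixed here for reproducibility.
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(5, 7), encrypt(12, 11)
# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
assert decrypt((c1 * c2) % n2) == 5 + 12
print(decrypt((c1 * c2) % n2))  # 17, computed without decrypting c1 or c2
```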
According to Koppel, Argamon, and Shimoni, it is feasible to apply simple lexical operations
in order to determine the gender of the author of a given text [239]. The work of Beigi, Shu,
R. Guo, et al. exemplifies this category by pointing out that simple tweets might lead to the
disclosure of identity [221].
The paper by Carlini, C. Liu, Erlingsson, et al. describes the possibility of exploiting neurons
within a neural network: every neuron is trained on data and carries parameters that can be
exploited [8]. One example for this category is presented by Shao, S. Ji, and T. Yang, who
elaborate on mitigating the leakage of neural rankings [231].
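
Carlini et al. quantify such memorization with an exposure metric: a synthetic ”canary” secret
is inserted into the training data, and its perplexity rank among a set of candidate secrets
measures how strongly the model memorized it [8]. A minimal sketch of the metric follows;
the ranking itself would come from querying the trained model, so the numbers below are
hypothetical:

```python
import math

def exposure(canary_rank, num_candidates):
    """Exposure metric from Carlini et al. [8]: log2 of the candidate-space
    size minus log2 of the canary's perplexity rank among all candidates.
    High exposure means the model memorized the inserted canary."""
    return math.log2(num_candidates) - math.log2(canary_rank)

# Hypothetical numbers: 10,000 candidate secrets were scored by the trained
# model, and the true canary achieved the 3rd-lowest perplexity.
print(exposure(canary_rank=3, num_candidates=10_000))  # roughly 11.7 bits
```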
Furthermore, we include two additional terms that differ from the three aforementioned
ones. The first one is the disclosure of sensitive data to an NLP model for training purposes,
such as the work of Feyisetan's team, who develop an active learning approach to reduce
the required amount of annotated training data while still achieving acceptable model
performance [240]. The second term we would like to introduce is the general disclosure of
sensitive data to an NLP task that processes it. An example of this classification is Reich's
paper on the privacy-preserving classification of personal text messages: the classification
model learns nothing about the author's message, and the author in turn has no access to
the classification model except the resulting output [223]. With this category, we also try to
point out current trends within the research field and visualize their maturity. After the
classification of the privacy issues, we investigate whether there is a particularly preferred
solution to a specific privacy issue within the NLP domain.

PETs(RQ2)

We realized that Dilmegani published a very thorough overview of applied privacy enhancing
technology examples, which we use within this thesis for the classification in the category
PET. He lists Differential Privacy (DP), Federated Learning, Homomorphic Encryption (HE),
Obfuscation, Secure Multiparty Computation and Synthetic Data Generation as PETs. Table
4.19 displays all classifications we encountered during the keywording phase [101].
Significantly, we noticed that mixed PETs also appear in our selected literature. For instance,
Pathak's work combines secure multi-party computation (SMC) and differential privacy in
order to publish classifiers trained on sensitive data [241]. Furthermore, we detected a high
frequency of papers that combine federated learning (FL) and differential privacy (DP) for
NLP based on text [226, 242] and speech [243]. The last combination we extracted from the
literature applies homomorphic encryption (HE) and federated learning, analyzing the
concept regarding its security and efficiency [244]. In addition, we realized that some papers
raise attention to vulnerabilities without solving the issue [245, 224] or suggest concepts to
preserve the privacy of data [246, 247]. After the extraction and the mapping of the relevant
information, we will be able to analyze which solutions were applied for which issue,
answering RQ2. In the following section, we will discuss the results of the mapping phase.
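
Since the combination of federated learning and differential privacy appears so frequently in
the selected literature, a minimal sketch of the shared idea may help before moving on: each
client computes a model update locally, the updates are clipped so that no single client
dominates, and noise is added to the average. This is a simplified DP-FedAvg-style
aggregation with invented parameters and an arbitrary noise scale; real work calibrates the
noise to a formal (epsilon, delta) budget.

```python
import numpy as np

def dp_federated_average(updates, clip_norm, noise_scale, rng):
    """Clip each client's update to bound its influence, average the
    clipped updates, and add Gaussian noise to the aggregate."""
    clipped = [u * min(1.0, clip_norm / max(np.linalg.norm(u), 1e-12))
               for u in updates]
    avg = np.mean(clipped, axis=0)
    sigma = noise_scale * clip_norm / len(updates)
    return avg + rng.normal(0.0, sigma, size=avg.shape)

rng = np.random.default_rng(0)
client_updates = [rng.normal(size=4) for _ in range(5)]  # toy local updates
print(dp_federated_average(client_updates, clip_norm=1.0,
                           noise_scale=0.5, rng=rng))
```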
All in all, we extracted information from the papers we collected in the previous phases, as
described by Petersen [11]. We detected two different top categories and several subcategories.
For some, we needed a two-layer abstraction scheme to structure the resulting schemes better.
The two top categories interpret the role of NLP in two different ways, namely as a privacy
enabler or as a privacy threat. The subcategories therefore had to differ, because we pursued
different purposes when setting them up. For NLP as a privacy enabler, we wanted to know
which use cases are affected, in which domains it appears, and which solutions, NLP method
types, and analysis levels were applied. For NLP as a privacy threat, we likewise investigate
which domains and use cases are affected and which solutions were applied. Now that we
have our categories, we continue with the analysis of the mapping results in the next chapter.


Use Case Classes, grouped by Generalized Category:

Annotations and Training Data: Annotation automatization; Annotation collection; Annotation process support; Annotation revision; Annotations for research acceleration; Annotations with gold standard; Collection of privacy policies

Automation: Automated and flexible rule enforcement; Automatic modeling of privacy regulations

Browsing in a private manner: Browsing in a private manner

Definition of privacy requirements: Definition of privacy requirements

Simplification of privacy related regulations: Ease the automation process of privacy regulating documents; Ease the comparison of privacy policies; Ease the compliance process through automation; Ease the comprehension of privacy policies; Ease the comprehension of privacy related user reviews for developers; Ease the implementation of written privacy policies; Ease the writing process of privacy policies based on code

Investigations: Investigating the access legitimacy of smart phone apps; Investigating the impact of GDPR with privacy policy analysis; Investigating the permission usage coverage of app descriptions; Investigating the potential privacy impact of smart phone apps

Measuring the compliance of apps: Measuring the Compliance of apps

Mining according to privacy policies: Mining according to privacy policies

Increase attention towards privacy policies: Paying more attention to privacy policies; Paying more attention to privacy policies & Ease the comprehension of privacy policies

Privacy preserving information sharing: Privacy preserving information sharing

Protecting sensitive information in unstructured data: Protecting patient privacy in unstructured clinical text; Protecting sensitive information in unstructured text; Protecting sensitive information on social media platforms

Searching in a private manner: Searching in a private manner

Voice related services: Voice protection in public; Voice-based assistance in a privacy preserving manner; Voice-Based healthcare in a private manner; Voice-based service in a private manner

Table 4.11.: All terms occurring in the Classification of Use Cases



Generalized Privacy Issue


Data Linkage
Identification
Profiling
Disclosure of sensitive Data
Disclosure of sensitive data for analysis
Unintended data disclosure
Complexity of Privacy Policies
Compliance with Requirements
Handling of sensitive data by applications
Misusage of Sensitive Information by Data Collector or Adversary
Secondary use of unstructured clinical data (Research)
Inflexibility of anonymization tools
Annotations and their Quality
Sensitive Information in unstructured Data

Table 4.12.: All terms occurring in the category of Generalized Privacy Issue, a subcategory of
NLP as a privacy enabler

Generalized Privacy Issue Solution Overview


Automation
Code-based Analysis
Collect annotated corpus
De-identification
Demonstration of Threats
Designing
Detection
Encryption based Solutions
Generation
Information Extraction
Mapping
Ontology based Solutions
Overviews
Semantic Solutions
Speech de-identification
Suggestions

Table 4.13.: All terms occurring in the category of Generalized Privacy Issue Solution, a
subcategory of NLP as a privacy enabler


Generalized Category of Applied NLP Concept


None
Morphological Analysis
Semantic Analysis
Semantic Analysis & Morphological Analysis
Semantic Analysis & Morphological Analysis & Syntactic Analysis
Speech Processing
Speech Processing & Morphological Analysis
Speech Processing & Semantic Analysis
Speech Processing & Morphological Analysis & Semantic Analysis

Table 4.14.: All terms occurring in the category of Generalized Category of Applied NLP
Concept, a subcategory of NLP as a privacy enabler

NLP Method Types


None
NN
Rule based
Rule based & NN
Rule based & Statistical
Rule based & Statistical & NN
Statistical
Statistical & NN

Table 4.15.: All terms occurring in the category of NLP Method Types, a subcategory of NLP
as a privacy enabler

Domain
Cloud Computing
Home Automation
Medicine
General
Other

Table 4.16.: All terms occurring in the category of domain, a subcategory of NLP as a privacy
threat


Use Case Category


Classification without Data Disclosure
Investigating the Impact of NLP related Concept on Privacy
Model Training without Sharing Data
Privacy-Utility-Trade-off for NLP related Concept
Private Communication
Similarity detection without Data Exposure
Speech related Service without complete Voice Feature Disclosure
Storing and Searching Data without Data Exposure
Summarization without Document Disclosure

Table 4.17.: All classes occurring in the category of use case classification, a subcategory of
NLP as a privacy threat

Generalized Issue/Vulnerability solved for NLP


Exploitability of Word Embeddings
Information Disclosure by Statistical Language Models
Memorizability of Neural Networks
Direct Disclosure of Sensitive Data for NLP Model Training
Direct Disclosure of Sensitive Data for NLP Tasks

Table 4.18.: All classes occurring in the category of Generalized Issue/Vulnerability solved for NLP, a subcategory of NLP as a privacy threat

Privacy Enhancing Technology (PET)


Differential Privacy (DP)
Federated Learning
Homomorphic Encryption (HE)
Obfuscation
Secure Multiparty Computation
Synthetic Data Generation
Differential Privacy (DP) & Secure Multiparty Computation
Federated Learning & Differential Privacy (DP)
Federated Learning & Homomorphic Encryption (HE)
None

Table 4.19.: All classes occurring in the category of Privacy Enhancing Technology 1 and 2, a
subcategory of NLP as a privacy threat

5. Results
Since we have the categories and the schemes, we start to extract the information from the
validated papers and map it accordingly [11]. In the following sections, we will present
the outcome of the mapping phase and explain the results. We start with the mapping
results showing the distribution of publications on our topic in general and divided into the
two top categories over the years, as well as per electronic data source, as suggested by
Kitchenham, Budgen, and Brereton [12]. Then, we present the mapping results of the top
category in which we see NLP as a privacy enabler and iterate through its subcategories to
answer the research questions RQ1 and RQ2. Afterward, we repeat this procedure for the
second top category, in which NLP is regarded as a privacy threat. All plots are based on the
findings listed in the worksheets located in subsection A.1.1 and subsection A.1.2.

5.1. Numbers of Publications per Year


Before answering the research questions, we first want to stress the interest in PP NLP within
the research community by pointing out the number of publications per year, as suggested by
Kitchenham [12]. Since we discovered two major top categories in the research field of our
interest, we were curious about the publications per year for each of those two categories.
Figure 5.1 stresses the fact that the topic we are interested in, as well as its substreams,
continuously gains attention. We assume that the increase is caused by multiple factors, such
as the introduction of the GDPR on the European level in 2016 and the adjustment on the
national level two years later, which legally binds the private and public sectors.
Since the search process was completed at the beginning of March, the number of publications
in 2021 is limited. We could not find a publication year for the publications titled ”Towards
integrating the FLG framework with the NLP combinatory framework” [248] and
”Privacy-Preserving Character Language Modelling” [249].

Figure 5.1.: Publications per year

5.2. Numbers of Publications per Electronic Data Source


According to Kitchenham, it is interesting to point out which electronic data source delivered
the most publications [12]. The resulting Figure 5.2 depicts an overview of the electronic
data sources and their contribution of publications to this mapping study. Most of the
publications were delivered by Google Scholar. However, Google Scholar searches the web
for academic publications held in other electronic data sources [250]. Therefore, we
distinguish this fact in the chart by representing the distribution of papers twice: once
including Google Scholar and once excluding it, redistributing its papers to their original
sources or assigning them to the category ”Other” if the original data source is not
represented within our selection of electronic data sources.
We observed that of the 127 publications contributed by Google Scholar, 53 were part of an
electronic data source from our selection, but 74, a majority, were from other sources. As can
be seen, 14 publications found by Google Scholar were also part of ACM, which represents
the highest ratio, at 240 percent of the otherwise assigned papers. 22, 6, 10, and 1 publications
from IEEE, ScienceDirect, Springer, and Wiley, respectively, were assigned to Google Scholar.
Without those publications, Google Scholar would still be in third place regarding the
contribution, behind SCOPUS and IEEE.

Figure 5.2.: Publications per electronic data source

5.3. What privacy-related challenges exist in the area of Natural Language Processing (NLP)?

In this section, we will attempt to answer the first research question. In the keywording
phase, we realized the need to distinguish between two categories to answer it, namely NLP
as a privacy enabler and NLP as a privacy threat. Hence, we have two different views. First,
we will present our results for the privacy-enabling part and then for the privacy-threat part.

5.3.1. Challenges for NLP as a Privacy Enabler


In this section, we will answer RQ1 for this top category with the mapping results collected
in the following parts. First, we want to acquire an overview of the domains interested in
the application of NLP as a privacy enabler. Then, we will inspect the distribution of use case
classes discussed within the publications we selected. After gaining a better understanding
of the environments in which NLP is utilized as a privacy enabler, we will present the
fundamental privacy-related challenges this top category faces. Afterward, we map the use
case categories and the domains onto the privacy issues. At the end of this section, we will
look at the privacy issues mapped on a timeline to observe the development of privacy issues
solved with NLP.

Domain Mapping Results

The mapping results in Figure 5.3 regarding the domain part for NLP as a privacy enabler
display that 58.2 percent of the papers we found contribute either to the domain ”Law” or to
”Medicine”. 28.6 percent of the papers in this part addressed either a very particular topic or
conducted research on a generic or theoretical topic. 8.9 percent of the papers we extracted
focused on social networks, and the smallest share in the chart is represented by mobile
applications. In the next part, we will present the mapping results for the use case categories
we collected.

Figure 5.3.: Mapping Results for the Domain category in the top category NLP as a Privacy Enabler

Classification of Use Cases Mapping Results

Figure 5.4 represents the distribution of papers among the classes of use cases we delineated
in section 4.5.1. We decided to depict the results with a bar chart to better illustrate the
popularity of use cases within the literature we extracted in the earlier parts of the thesis. The
top three consist of the topics ”Protecting sensitive information in unstructured data” with 81
papers, ”Privacy Preserving Information Sharing” with 73 papers, and the ”Simplification
of privacy-related regulations” with 50 papers. These are followed by 20 papers from
”Investigations” and 17 papers from ”Annotations and Training Data”. ”Voice related
services”, ”Searching in a private manner” and ”Increase Attention towards Privacy Policies”
received 15, 14, and 14 papers, respectively. Twelve papers were mapped to the ”Automation”
category and 10 to ”Mining according to privacy policies”. Less attention was received by
the topics ”Definition of Privacy Requirements”, ”Browsing in a private manner” and
”Measuring the Compliance of apps” with five, three, and two papers, respectively. In the
following section, we will present the mapping results for the privacy issue section of the top
category NLP as a privacy enabler.


Figure 5.4.: Mapping Results for the Use Case Classification category in the top category NLP as a Privacy Enabler

Generalized Privacy Issue Mapping Results

In this section, we will discuss the mapping results for the privacy issues that are handled
with the support of NLP. In Figure 5.5, we can observe a clear dominance of the secondary
usage of unstructured clinical data, primarily for research reasons, and the complexity of
privacy policies. Considerable popularity was also dedicated to the topics ”Unintended data
disclosure”, ”Profiling” and ”Compliance with Requirements”. Essential privacy issues in
this section are also ”Identification”, sensitive information in unstructured data, and the
handling of sensitive information by applications. Less frequently mentioned within the
literature were the topics ”Disclosure of sensitive data for analysis”, ”Data Linkage”,
”Misusage of Sensitive Information by Data Collector or Adversary”, ”Disclosure of sensitive
Data” and ”Annotations and their Quality”. Next, we will display the results showing the
mapping of use case classes onto privacy issues.

Figure 5.5.: Mapping Results for the Generalized Privacy Issue category in the top category NLP as a Privacy Enabler

Use Cases requiring NLP as a Privacy Enabler

Figure 5.6 delivers an aggregated overview of the relation between the privacy issues and the
corresponding use cases they appear in or cause. On the left of the graph, the privacy issues
are listed. The bar next to each issue displays the distribution of use cases, with the
corresponding color given on the right of the graph, addressing the respective privacy issue.
If a privacy issue contains a use case category, the order is set by the order given on the right
of the graph. This means that if a privacy issue category contains one publication addressing
it for each use case category, the graph will depict a bar with 13 boxes, each filled with a ”1”.
At the bottom of the graph, a percentage is given to illustrate the proportion of use cases
within publications. The numbers in the bar next to a privacy issue correspond to the number
of contributions addressing a specific use case category.
For instance, consider the use cases addressing the privacy issue ”Complexity of Privacy
Policies”: six use cases are part of the category ”Annotations and Training Data”, two of
”Automation”, one of ”Browsing in a private manner”, 13 of ”Increase Attention towards
Privacy Policies”, nine of ”Investigations”, one of ”Privacy-Preserving Information Sharing”
and one of ”Protecting sensitive information in unstructured data”. 33 contributions, or 50
percent of the use cases addressing the complexity of privacy policies, are dedicated to the
simplification of privacy regulating documents.
This graph provides an overview of all the privacy issues and the use cases they occur in or
cause. Because of the annotation quality, issues regarding the scarcity of data or the training
of sensitive data detection models occur in use cases related to the annotation process.
Annotated data is essential to train algorithms to detect sensitive information in a data set.
We can observe that the complexity of privacy policies is a significant concern, and many
contributions were dedicated to this issue. In particular, the simplification of and the increase
of attention towards privacy policies received a fair amount of interest. Compliance with
requirements affects many use cases, especially in the area of automation and the
simplification of privacy regulating documents. Data linkage affects all the use cases that aim
for a privacy-preserving way of sharing data or that protect sensitive information in
unstructured data structures.
The disclosure of sensitive data might occur in use cases that attempt to protect sensitive
data, share it in a private manner, or during the interaction with a voice-related service.
In mining or protecting sensitive data, the disclosure of sensitive data is an issue because
the analysis bears the chance of inferring sensitive information about the data subject.
The privacy issue resulting from the handling of sensitive information by applications
causes many publications to render investigations to evaluate the privacy impact and raise
awareness of this issue. The issue of ”Identification” accompanies use cases addressing the
sharing of information, the protection of sensitive information, the application of search
engines, and the encounter of voice-related services.
The inflexibility of anonymization tools hinders the process of privacy-preserving
information exchange. To prevent the misuse of sensitive data, use cases were devised that
address the issue with a privacy-preserving manner of data exchange, the general protection
of sensitive information within unstructured data sets, or the simplification of privacy
regulations. Profiling might originate in use cases that focus on browsing, mining, or
protecting sensitive information in unstructured data sets. Most frequently, profiling occurs
in the context of search engines and voice-based services.
One major privacy issue within this graph is the secondary use of unstructured clinical
data, occurring in use cases that aim for sharing information or protecting it. The existence
of sensitive information within non-medical data occurs in or is caused by the collection of
annotations, the definition of privacy requirements, the mining of data, and privacy-preserving
information sharing. The issue of unintended data disclosure originates in use cases that
share sensitive information or protect it. It is also present in use cases involving search
engines or voice-based services.
This graph illustrates the diversity of privacy-related issues and in which categories of use
cases they either occur or which use cases they cause. There is not one category of use case
that is affected by just one issue. However, most of the privacy issues occur in use cases in
which either sensitive information is to be protected in unstructured data sets or information
is shared. The following charts will elaborate on the domains involved and the development
of privacy issues during the last years.

Domains involving NLP as a Privacy Enabler

After extracting the information about the privacy issues dealt with by NLP and the
corresponding use cases, we investigated in which domains the privacy issues are present.
The result of our work is Figure 5.7. On the left side of the graph are the privacy issues. Next
to each privacy issue is a bar that represents 100 percent of the publications dedicated to it.
Each color in the bar represents one domain, and the number in a box stands for the absolute
number of publications from the respective domain.
At first sight, the graph highlights the fact that NLP is frequently used within the medical
domain to solve privacy issues. Likewise, the presence of the law sector in different privacy
issues is revealed by this graph, caused by the intense involvement of privacy policies and
other privacy regulating documents. We observe that the categories ”Other” and ”General”
are scattered across all privacy issues in high proportions. The privacy issues ”Complexity of
privacy Policies”, ”Compliance with Requirements” and ”Secondary use of unstructured
clinical data” are naturally predominated by the domains ”Law”, ”Law” and ”Medicine”,
respectively.
All in all, this graph shows us that privacy issues solvable with NLP can occur in any domain
listed here. But this also implies that other domains could utilize NLP more in order to
preserve privacy.

Development of Privacy Issues solved with NLP as a Privacy Enabler

Figure 5.8 demonstrates the development of the variety of privacy-related issues that are
solved with the support of NLP. On the x-axis of this chart, the years of publication are
displayed. One instance in the graph has no assigned publication year because we were not
able to find it. The numbers in the bars represent the number of publications contributed to
the privacy-related issue with the corresponding color on the right side of the chart. If a box
is too small to fit the number of contributions, the number is written in the next bigger box
with enough space and highlighted by a white outline. For instance, the column assigned to
2020 has the following numbers, starting from the bottom of the column: ten and five, which
belong to the categories ”Complexity of Privacy Policies” and ”Compliance with
Requirements”, respectively. Then, there are two empty fields. After these, there are two ”1”s
in the same colors as the empty fields, indicating their reference, in this case the privacy
issues ”Data Linkage” and ”Disclosure of sensitive data for analysis”.
The same concept applies to the following numbers highlighted by the white outline. The
y-axis of this graph represents 100 percent of the publications made in the corresponding
year. At the top of every column, the total number of publications for each year is written.
We observe that the variety and the number of privacy-related issues solved with NLP
increase. The chart illustrates not only the almost yearly presence of the privacy issues
”Unintended data disclosure” and ”Complexity of Privacy Policies”, but also the increased
attention contributed to these issues in the form of publications. A constant presence in
almost every year is the secondary use of unstructured clinical data, also with growing
numbers. In 2016, the variety and number of publications increased significantly compared
to the year before. This is the year in which the General Data Protection Regulation was
introduced by the European Union; two years later, the member countries needed to adapt
to the law [16]. In the years afterward, the dedication to the challenge of compliance also
gained more attention. The research field evolves: publications start to address privacy
concerns more specifically, because privacy issues like identification, profiling, and data
linkage appear in more publications, delineating the problem of privacy as more than just a
violation and addressing its potential consequences.

5.3.2. Challenges for NLP as a Privacy Threat


In this section, we attempt to identify the privacy-related challenges caused by NLP. First,
we want to inspect the domains in which the privacy issues caused by NLP occur, and we
will point out which data type was mostly involved. Then, we present the issues and their
distribution within this thesis. The last point will be a mapping of domains, data types and
use cases onto the respective privacy issues.

Domain Mapping Results

The mapping results for the subcategory Domain are depicted in Figure 5.9. With 55.9
percent, a majority of our findings work on generic or theoretical aspects for which we did
not find an adequate domain. The ”Cloud Computing”, ”Medicine” and ”Home Automation”
domains received 15.7, 14.2 and 4.7 percent, respectively. The ”Other” domain achieved 9.4
percent of the papers. Now, we proceed with the results for the data type subcategory.

Data Type Mapping Results

Figure 5.10 illustrates the distribution of papers that conducted research on text or speech
data. Apparently, a majority is dedicated to work with written data. The next section will
discuss the results for the use case classification.


Use Case Classification Mapping Results

The specific mapping results for the use case classification are depicted in a figure located
in A.3.1. Here, we present the aggregated mapping results for the use case classification
category contained in Figure 5.11. The dominant use case class is the topic referring to
speech-related services without disclosing all the voice features, with 42 papers matching it.
A moderate number of papers were mapped to the topics ”Model Training without Sharing
Data” and ”Privacy-Utility-Trade-Off for NLP related Concept”, counting 26, 20 and 15
papers, respectively. The topics ”Classification without Data Disclosure” and ”Similarity
detection without Data Exposure” achieved seven papers each. Fewer papers were dedicated
to the topics ”Investigating the Impact of NLP related Concept on Privacy”, ”Private
Communication” and ”Summarization without Document Disclosure”. Next, we will present
the mapping results for the privacy issues subcategory.

Mapping Results for Generalized Issue/Vulnerability solved for NLP

In Figure 5.12, the mapping results for this subcategory are depicted. 31.5 and 29.9 percent of
the papers we found were mapped to the terms ”Direct Disclosure of sensitive data for NLP
Tasks” and ”Direct Disclosure of sensitive data for NLP model training”, respectively. The
last three topics within the chart are ”Exploitability of word embeddings”, ”Memorizability
of NN” and ”Information Disclosure by Statistical Language Models”, receiving 17.3, 15.0
and 6.3 percent of the papers included in this mapping study. In the following part, we will
discuss the mapping results of the final subcategory of this top category.

Domain Appearance of NLP as a Privacy Threat

Figure 5.15 depicts the mapping of domains onto the privacy issues caused by NLP. On the
left side, the privacy issues caused by NLP are listed, and on the right side of the chart, the
domains are presented. Each color in the bar to the right of a listed privacy issue represents
one domain. The number within a box describes the number of publications dedicated to
the domain.
It is observable that the ”General” domain is predominant in all privacy issues. This implies
that the research field is still in its beginnings. Since the medical domain has already
conducted a lot of research on privacy preservation in combination with NLP, as evidenced
by [251], it is present in most of the detected privacy issues. A small proportion of specific
domains is also present within the privacy issues. The next section will cover our analysis
of the data types involved in privacy issues caused by NLP.

Data Types addressed by NLP as a Privacy Threat

This section shows the distribution of data types addressed by the privacy issues caused by
NLP, depicted in Figure 5.14. This chart follows the same structure as Figure 5.15, but here
we inspect whether the data type is written or spoken. As can be seen, there is a clear focus
on written data among the publications we included in this SMS. An interesting fact is that
the direct disclosure of sensitive data for NLP tasks is the only privacy issue predominated
by speech data, mainly because the voice itself is a biometric trait that needs to be treated
with caution before disclosing it to any third party. In the next section, we will elaborate on
the use cases affected by NLP as a privacy threat.

Use Cases addressed by NLP as a Privacy Threat

Figure 5.13 shows the mapping of use cases onto the privacy issues caused by NLP. It follows
the same structure as Figure 5.15; however, on the left side there is a legend listing the use
case categories and the corresponding colors.
The direct disclosure of sensitive data for NLP model training is mainly addressed by use
cases that want to train a model based on sensitive data or by speech-related services without
the disclosure of the complete voice. A minority of use cases aim for the privacy-utility
trade-off of training data. The next privacy issue in the chart refers to the direct disclosure of
sensitive data for NLP tasks. The majority of these use cases aim to protect voice traits or
want to store and search data in a privacy-preserving way. The exploitability of word
embeddings occurs in the use case category of the privacy-utility trade-off for NLP concepts.
The ”Information Disclosure by Statistical Language Models” also occurs in the use case
category of the privacy-utility trade-off for NLP concepts, among use cases of the type
”Speech related Service without complete Voice Feature Disclosure” and the summarization
without the disclosure of documents. The ”Memorizability of NN” mostly causes the use
cases of model training without sharing data and speech-related services without fully
disclosing all voice traits.
Again, this graph shows us the different use case categories which are caused by or occur
together with the privacy issues we discovered. In the following part, we will discuss the
solutions applied to preserve privacy with NLP or for NLP.

5.4. What approaches are used to preserve privacy in and with NLP tasks and how can they be classified?

In this section, we will elaborate on the solutions provided with the support of NLP and on
the PETs applied to preserve privacy within NLP tasks. For the section dedicated to NLP as
a privacy enabler, we will start with the mapping of privacy issues onto the corresponding
solutions. Then, we will specify the solutions based on the NLP concepts and methods
applied. For the part dedicated to solutions for NLP as a privacy threat, we also start with
the mapping of privacy issues onto the solutions provided by the literature we extracted.

5.4.1. Privacy Solutions Provided by NLP


To answer the question of which privacy solutions are supported by NLP and how they can
be classified, we start with the scheme we elaborated in the previous chapter; then, we
continue with the results for the applied NLP concepts and the method types.

Generalized Privacy Issue Solution Mapping Results

Figure 5.16 shows an aggregated overview of the table located in subsection A.2.1. The most
frequent class in this category was ”De-identification” with 73 publications. In second place
is the category ”Detection” with 38 publications. A comparable number of papers was
dedicated to the topics ”Designing” (27), ”Information Extraction” (26), ”Automation” (23)
and ”Generation” (22). On a similar level are the topics ”Semantic Solutions” (19), ”Collected
annotated corpus” (17), ”Speech de-identification” (16) and ”Overviews” (14). Less attention
was directed towards the topics ”Mapping” (8), ”Suggestions” (7), ”Code-based Analysis” (7),
”Ontology based Solutions” (4), ”Encryption based Solutions” (2) and ”Demonstration of
Threats” (1). The following part will discuss the mapping results for the applied NLP
concepts.

Generalized Category of Applied NLP Concept Mapping Results

Here, we illustrate the mapping results for the applied NLP concepts at two different levels
of abstraction. First, we discuss the results for the various combinations we explained in
section 4.5.1, shown in Figure 5.17; then, we aggregate those results and map them to the
major categories without any combinations, displayed in Figure 5.18. Figure 5.17 depicts the
fact that the papers we found have a strong focus on the semantic level and also on the
morphological level, or a combination of both. Speech processing seems to be less popular.
We attempted to find more appearances of the combinations we delineated, but this was not
the case. Figure 5.18 stresses the fact that most of the research is done on the semantic level
because of its more universal approach towards natural language, which also improves the
flexibility of potential solution approaches. Another significant part is played by the
morphological level, which supports understanding the role of words within a sentence or
even an entire text corpus. The next section will elaborate on the distribution of applied NLP
method types within the extracted literature.

NLP Method Type Mapping Results

The last mapping results of the top category NLP as a privacy enabler concern the method
types applied in the papers of this top category. Again, we used a two-level abstraction:
Figure 5.19 shows the more specific results, and Figure 5.20 displays the aggregated overview.
We chose a pie chart, since the distribution of results is more suitable for it. The top three are
the pure methods ”Statistical”, ”NN” (Neural Networks) and the ”Rule based” approaches
with 36.4, 21.4 and 12.9 percent, respectively. The most frequent combination we found was
the combination of all three method types with 9.2 percent. Next, we will map the privacy
issues onto the solutions provided by NLP.


Mapping of Privacy Issues on Solutions provided by NLP

In this part, we focus on the privacy preservation solutions contributed by NLP. On the left
side of Figure 5.21, the privacy solutions are listed. The bar next to each category represents
the privacy issue classes with the corresponding colors on the right side of the graph. The
number within a box represents the number of contributions addressing the privacy issue
category.
”Automation” was mostly used to solve issues regarding the complexity of privacy policies
or the compliance with requirements. ”Code-based Analysis” was used to resolve the same
privacy-related issues and, additionally, the issue of the handling of sensitive information by
applications. The collection of annotated corpora was applied to resolve the issues of the
complexity of privacy policies and annotation quality. ”De-identification” was applied mostly
to make unstructured clinical data available for research purposes. The solution categories
”Designing” and ”Detection” provided an immense variety of solutions for multiple privacy
issue categories. ”Encryption based Solutions” were used to mitigate either the disclosure of
sensitive data or profiling. ”Generation” was mostly applied to solve the issues of the
complexity of privacy policies and the compliance with requirements, or to avoid the
complete disclosure of data for analysis purposes. The category ”Information Extraction”
was applied mostly to challenge the complexity of privacy policies, the same as the ”Mapping”
solution category. 50 percent of the issues solved by an ontology-based approach concerned
profiling. The ”Overviews” and ”Suggestions” categories offered a variety of solutions for
different privacy issues. Most of the solutions offered by ”Semantic Solutions” addressed
issues related to privacy regulating documents or the compliance with them. ”Speech
de-identification” was applied to mitigate ”Identification” or ”Profiling”.
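
As an illustration of the dominant ”De-identification” class, the following minimal rule-based
sketch replaces a few personal identifiers in a clinical note with placeholders. The patterns
and the note are invented for illustration; real clinical de-identification systems combine such
rules with statistical and neural sequence taggers.

```python
import re

# Illustrative patterns only; real systems cover many more identifier types.
PATTERNS = {
    "[DATE]":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "[MRN]":   re.compile(r"\bMRN[:\s]*\d+\b"),
}

def deidentify(text):
    """Substitute each matched identifier with its placeholder."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

note = "Patient seen on 12/03/2019, MRN: 445521, call 555-123-4567."
print(deidentify(note))
# Patient seen on [DATE], [MRN], call [PHONE].
```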

Mapping of NLP Concepts on Solutions provided by NLP

Next, Figure 5.22 delineates the analysis levels on which the NLP tasks were applied. The
chart follows the same rules as the charts discussed before, but in this case we inspect the
different NLP analysis levels listed in the legend on the right side of the chart. We are able to
observe that most of the solution approaches apply semantic analysis or a combination of it
with morphological analysis. A triple combination seldom appears.

Mapping of NLP Method Types on Solutions provided by NLP

In Figure 5.23, we can observe the distribution of method types applied to provide a solution
to a privacy issue. This chart follows the same logic as the previously mentioned ones in this
chapter. Almost every solution approach with NLP contains a variety of method types used
to achieve its goal. Notably, almost every solution approach applies neural networks or a
mixed version including them. Highly prominent is also the application of statistical
approaches or mixed versions including them. Rather less frequent are approaches that are
mixed versions of rule-based approaches.


5.4.2. Solutions Provided for NLP Privacy Issues


This section elaborates on the mapping of the provided solutions in the form of PETs. We
will map them onto the threats they solve and the use case categories in which they were
applied. Also, we were curious about the development of contributions applying PETs over
the last few years.
In Figure 5.26, we observe the mapping of privacy issues caused by NLP onto PETs. This
chart also follows the same representation logic as the previous charts. Here, however, the
PETs are listed on the left side of the chart, and the bars next to them are filled with different
colors representing the privacy issues caused by NLP. The numbers delineate the number of
publications addressing the corresponding privacy issue category.

PETs(RQ2) Mapping Results

For the last subcategory in this section, we inspect the mapping results in Figure 5.24, which
contains the specific mapping results, while Figure 5.25 represents the aggregated overview.
Figure 5.24 displays the dominance of ”Homomorphic Encryption” and ”Obfuscation”, with
mapping results of 36 and 30 papers, respectively. A significant number of papers was
mapped to the topics ”Differential Privacy (DP)”, ”Federated Learning”, ”None” and ”Secure
Multiparty Computation”. Just a few papers were mapped to the combined PETs. Less
popular was the topic ”Synthetic Data Generation” with just four papers.
Figure 5.25 displays a different view, since we counted the combinations of PETs towards
their respective topics. The results do not differ greatly from the specific view, but the figure
shows the share of the different PETs. Homomorphic Encryption still holds the biggest share
with 27.6 percent, followed by Obfuscation with 22.4 percent and Differential Privacy with
14.9 percent. 9.7 percent of the publications did not include any PETs. 8.2 percent were
dedicated to Secure Multiparty Computation, and the smallest number of publications
addressed Synthetic Data Generation.
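
As a minimal illustration of the idea behind Secure Multiparty Computation, the following
toy additive secret sharing sketch lets two parties jointly compute a sum without either party
seeing the other's input. All values are invented, and real SMC protocols support richer
computations, more parties, and active adversaries.

```python
import random

# Toy additive secret sharing modulo a prime; use a cryptographic RNG
# (e.g., the secrets module) in any real setting.
PRIME = 2_147_483_647

def share(secret, n_parties):
    """Split a secret into n random shares that sum to it modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Each party locally adds its shares of x and y; no single share reveals
# anything about the inputs, yet the combined shares reconstruct x + y.
x_shares, y_shares = share(42, 2), share(100, 2)
combined = [(a + b) % PRIME for a, b in zip(x_shares, y_shares)]
assert reconstruct(combined) == 142
```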
All in all, we started this chapter with the definition of the research questions that we
attempt to answer with the results of this mapping study. Then, we defined the parameters
of the search process, requiring search queries and an adequate selection of electronic data
sources. After executing the search queries in the respective electronic data sources, we
filtered the outcome of the search process, screened the papers, included all those which
fulfilled our criteria, and validated them with the support of the advisor of this thesis.
Ultimately, we executed the keywording-of-abstracts process to extract all categories that we
map onto the papers. Finally, we started the mapping process and described its results,
executing the methodology according to Petersen's paper [11]. The next chapter will contain
our interpretation of the mapping results and further analytical procedures.


Mapping of NLP Privacy Issues on PETs

”Differential Privacy” was applied to solve the exploitability of word embeddings and the
disclosure of sensitive data for the training of a model or for an NLP task. Some use cases
involving the issue of ”NN memorizability” were also solved with it. To solve the issue of
sharing sensitive data for model training, the combination of differential privacy with secure
multiparty computation or with federated learning, as well as synthetic data generation, were
applied. The direct disclosure of sensitive data for NLP tasks was mainly solved with the
application of homomorphic encryption, obfuscation and secure multiparty computation.
The issue of information disclosure by statistical language models was mostly solved with
differential privacy, homomorphic encryption and obfuscation. The exploitability of word
embeddings is tackled by diverse PETs, namely differential privacy, federated learning, a
combination of federated learning and homomorphic encryption [244], homomorphic
encryption and obfuscation. The issue of memorizability of neural networks is likewise not
solved with just one PET but with multiple. Most publications here were contributed in the
solution category ”Obfuscation”, followed by federated learning, secure multiparty
computation, homomorphic encryption, differential privacy and a combination of federated
learning and differential privacy [252]. The next chart will disclose the mapping of use case
categories onto PETs.

Mapping of Use Case Categories on PETs

Figure 5.27 provides an overview of the PETs and the use case categories they occur in. This
illustration has a list of PETs and a few combinations on the left side; on the right are the
different use case categories with their given colors.
In the case of classification without data disclosure, the PETs differential privacy,
homomorphic encryption and secure multiparty computation were applied. The use cases
that conducted an investigation of the impact of NLP on privacy did not apply any PETs.
The use cases focusing on model training without sharing data were mostly addressed by
the application of federated learning or a combination including it; the application of
differential privacy, homomorphic encryption or secure multiparty computation was also an
option. The use cases referring to the privacy-utility trade-off were most frequently addressed
by differential privacy; homomorphic encryption and obfuscation were also part of the
solution domain. To enable private communication, obfuscation was applied. Use cases
addressing similarity detection without the disclosure of data were solved with different
PETs; most publications addressing it used obfuscation, while less frequent approaches used
differential privacy and homomorphic encryption. Speech-related services were mostly
handled with the application of homomorphic encryption or obfuscation. Storing and
searching data was mostly solved with homomorphic encryption. The next chart will focus
on the development of publications over time.


Mapping of PETs on Years

Figure 5.28 shows the number of publications made during the last 14 years. On the left side
of the graph is the list of PETs with their colors within the bars to the right of the year
numbers. Each bar represents 100 percent of the publications made in the respective year.
On the left of the chart, the total number of publications made in the respective year is given.
PETs gained a lot of interest in the last three years. This is especially true for differential
privacy, which started to appear in 2018 and has since contributed over 10 percent of the
publications per year. A similar trend is observable for federated learning and its
combinations after a two-year absence between 2015 and 2018: since 2018, at least one
publication per year has involved federated learning, and in 2020 alone twelve such
publications were made. Still, the biggest coverage is provided by homomorphic encryption
and by obfuscation. Secure multiparty computation had its first appearance in 2007 and has
since had a few publications per year. Synthetic data generation had a few publications in
2019 and 2020.

5.5. What are the current research gaps and possible future research directions in the area of privacy-preserving NLP?

5.5.1. Research Gaps for NLP as a Privacy Enabler

The complexity of privacy policies is one of the major pillars we detected as a privacy issue
within the top category NLP as a privacy enabler. The focus of future research should be to
develop pipelines that support formulating privacy policies or privacy regulating documents
in cooperation between technical and law experts, in order to create a standard that is easy
to automate.
Not just for the privacy-related issues solved with NLP, but for all privacy-related research
fields, the setup of a standard schema is imperative. The standard should require every
contribution addressing a standardized privacy issue to elaborate at least a few consequences
that might arise if a certain privacy mechanism is not applied or is not sufficient, avoiding
the simple but seemingly legitimate justification of a privacy violation. This would support
the common understanding of the importance of privacy and of the consequences that a
privacy violation might cause.
In the medical domain, we noticed a high diversity of terms that needs to be standardized
by the medical research community to avoid repetitive research, especially for unstructured
documents, because we observed a lot of synonyms.
Our results show that the solution approaches frequently rely on statistical method types,
which require large amounts of data. An alternative is presented by Feyisetan, Drake, Balle,
and Diethe, who apply active learning to reduce the amount of data required for model
training [240]. Of course, this is not the solution to the problem, but an indication of the
direction research might take. Another opportunity is the synthetic generation of data to
tackle the problem of requiring large amounts of data.
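
As a toy illustration of the synthetic data idea, the following sketch learns word-bigram
transitions from an invented two-sentence corpus and samples new sentences from them.
Note that such a naive model can reproduce memorized phrases verbatim, which is exactly
why serious synthetic data generation couples generative models with privacy guarantees.

```python
import random
from collections import defaultdict

def train_markov(corpus):
    """Learn word-bigram transitions from a (potentially sensitive) corpus."""
    model = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            model[a].append(b)
    return model

def generate(model, start, length, rng):
    """Sample a synthetic sentence by walking the bigram transitions."""
    out, word = [start], start
    for _ in range(length - 1):
        if word not in model:
            break
        word = rng.choice(model[word])
        out.append(word)
    return " ".join(out)

corpus = ["the patient reported mild symptoms",
          "the patient recovered quickly"]
model = train_markov(corpus)
print(generate(model, "the", 5, random.Random(0)))
```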


5.5.2. Research Gaps for NLP as a Privacy Threat


Since this top category is, as we discovered, an upcoming one, we noticed many general
approaches within it; a potential gap is to apply these general solutions to specific scenarios.
The combination of federated learning and differential privacy could enable business models
in which customers contribute their data in a privacy-preserving manner in exchange for
some benefit, in order to build a training data set. The application of privacy-preserving
concepts in a business model would introduce the research field into society faster and
increase its popularity.
We detected one paper applying quantum computing [253], which made us curious whether
the concept is also applicable to other NLP-related tasks to preserve privacy in them, for
example whether it can also be applied to text-based tasks.
Based on our findings in section 5.3.2, speech data is still a topic that needs further research;
especially because of its tremendous privacy value, it is imperative to protect it.

Figure 5.6.: Privacy Issues and Use Cases

Figure 5.7.: Privacy Issues and Domains

Figure 5.8.: Development of Privacy Issues

Figure 5.9.: Mapping Results for the Domain category in the top category NLP as a Privacy Threat

Figure 5.10.: Mapping Results for the Data Type category in the top category NLP as a Privacy Threat

Figure 5.11.: Aggregated Mapping Results for the Use Case Classification category in the top category NLP as a Privacy Threat

Figure 5.12.: Mapping Results for the Generalized Issue/Vulnerability solved for NLP category in the top category NLP as a Privacy Threat

Figure 5.13.: NLP Privacy Issues and Use Cases

Figure 5.14.: NLP Privacy Issues and Data Type

Figure 5.15.: NLP Privacy Issues and Domains

Figure 5.16.: Mapping Results for the Generalized Privacy Issue Solution category in the top category NLP as a Privacy Enabler

Figure 5.17.: Mapping Results for the Generalized Applied NLP Concept category in the top category NLP as a Privacy Enabler

Figure 5.18.: Aggregated mapping results for the Generalized Applied NLP Concept category in the top category NLP as a Privacy Enabler

Figure 5.19.: Mapping Results for the Generalized NLP Method category in the top category NLP as a Privacy Enabler

Figure 5.20.: Aggregated mapping results for the Generalized NLP Method category in the top category NLP as a Privacy Enabler

Figure 5.21.: Privacy solutions aiming to solve privacy issues

Figure 5.22.: Privacy solutions and their analysis levels

Figure 5.23.: Privacy solutions and their applied method category

Figure 5.24.: Mapping Results for the Privacy Enhancing Technologies (PETs) category in the top category NLP as a Privacy Threat

Figure 5.25.: Aggregated Mapping Results for the Privacy Enhancing Technologies (PETs) category in the top category NLP as a Privacy Threat

Figure 5.26.: NLP Privacy Threats and their Solutions

Figure 5.27.: NLP Privacy Threat Solutions and their Use Cases

Figure 5.28.: NLP Privacy Threat Solutions and their Development over time
6. Discussion
This chapter lists the essential findings from the previous chapter, split into four parts. Two
parts refer to the findings on the privacy-related challenges, one dedicated to the
interpretation of NLP as a privacy enabler and the other to NLP as a privacy threat. The
remaining two parts refer to the findings on the solution approaches. Finally, the last part of
this chapter delineates the limitations.

6.1. Summary of Findings


This section covers a summary of the results we presented in the previous chapter. The first
two parts follow the suggestion of Kitchenham to demonstrate the relevance of the topic by
plotting the publication years and checking the distribution of publications per electronic
data source [12]. We observed that our topic and its subtopics are relevant and gaining
interest. Additionally, we realized that Google Scholar adds value as an electronic data
source for an SMS. Furthermore, we noticed the need to distinguish between two top
categories, namely NLP as a privacy enabler and NLP as a privacy threat. The next part lists
the essential findings, categorized by the research questions and by their top category.

6.1.1. Main Privacy related Challenges for NLP as a Privacy Enabler


In this part of the thesis, we will point out the most important findings based on the results
of the SMS.

Domains Dealing with Privacy related Challenges and NLP

We discovered that the main domains applying NLP within their solutions are law and
medicine with 27.6 and 30.6 percent, respectively, meaning that over half of the publications
we found were related to these two domains. A further 8.9 and 4.3 percent were located in
the domains of social networks and mobile applications, respectively.

Most frequent Use Case Categories affected by NLP solvable Privacy related Challenge

The three most frequent use case categories we discovered are the protection of sensitive
information in unstructured data sets, the privacy-preserving sharing of information, and the
simplification of privacy-related regulations.


Most frequent Privacy related Challenges Solved by the application of NLP

According to our mapping results, the most frequent privacy issues are the secondary use
of unstructured clinical data, primarily for research purposes, the complexity of privacy-
regulating documents, unintended data disclosure, profiling, and compliance with
requirements.

Domain Diversity of Privacy related Challenges solvable with NLP

Most of the privacy-related challenges we detected affect two or more domains. This
highlights that a single privacy-related challenge is often applicable to multiple domains.

Use Cases addressed by the highest variety of Privacy related Challenges

The use case categories affected by the most privacy-related challenges are privacy-preserving
information sharing, addressed by 12 out of 14 privacy-related challenges, and the protection
of sensitive data in unstructured data, addressed by 10 out of 14.

Increased Attention of Privacy related Challenges solvable with NLP

Since 2018, both the number of publications addressing privacy-related challenges solvable
with NLP and the diversity of categories of privacy-related challenges have been increasing.

Increased Attention towards the Complexity of Privacy Policies

Since 2016, the number of publications addressing the complexity of privacy-regulating
documents solvable with NLP has been increasing.

6.1.2. Main Privacy related Challenges for NLP as a Privacy Threat


This section will summarize the findings from the results chapter.

General Domain mostly affected by NLP as a Privacy Threat

The mapping results regarding the domains affected by NLP as a privacy threat highlight
that most of the publications included in the SMS were mapped to the "General" domain,
indicating that NLP as a privacy threat is a relatively new topic.

Popularity of Publications Towards the Written Data Type

Most of the papers in the top category that interprets NLP as a privacy threat deal with the
written data type rather than with speech data.

84
6. Discussion

Most frequent Use Case Classes

The three most frequent use case classes identified within the thesis are speech-based services
without complete voice feature disclosure, model training without the disclosure of data,
and the privacy-utility trade-off for NLP-related concepts.

Most frequent NLP Threats

The four most frequent NLP privacy threats are the direct disclosure of sensitive data for
NLP model training, the direct disclosure of sensitive data to an NLP task, the exploitability
of word embeddings, and the memorizability of neural networks (NN).

Direct Disclosure of Sensitive Data dominated by Speech Data

Most of the privacy issues were dominated by the written data type. However, the
disclosure of speech to an NLP task was the only one dominated by the speech data type.

Generality of Privacy Issues

As a result of our mapping of domains and use case classes, we discovered that every
privacy issue affects at least three out of five domains. However, the largest proportion was
covered by the general domain, meaning that rather theoretical publications addressed the
privacy issues without a specific application.

Most affected Use Case Classes

As a result of this study, we identified that speech-related services are affected by every
privacy issue we identified, and that all privacy issues are addressed by the use case class
"Privacy-Utility Trade-off for NLP-related concepts", meaning that for every privacy issue,
research is conducted to find a balance between utility and privacy.

6.1.3. Solutions supported by NLP as a Privacy Enabler


Here, we summarize the results we collected for the top category in which privacy solutions
are supported by NLP.

Most frequent Solutions supported by NLP

Based on the publications we included, this study received the most mapping results for the
solution categories De-identification, Detection, and Designing.
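To make the most frequent solution category concrete, the following is a minimal sketch of rule-based de-identification in Python. The regular expressions and placeholder tags are illustrative assumptions; the systems surveyed combine far richer rule sets with statistical or neural named-entity recognition.

# Minimal rule-based de-identification sketch: replace a few common
# PHI patterns (dates, phone numbers, e-mail addresses) with tags.
import re

PHI_PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def deidentify(text: str) -> str:
    # Apply each pattern in turn, substituting the placeholder tag.
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(deidentify("Seen on 03/12/2021, reachable at 555-123-4567 or jane@example.org."))
# -> "Seen on [DATE], reachable at [PHONE] or [EMAIL]."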

85
6. Discussion

Most frequent Analysis Level supported by NLP

The two most often applied NLP concepts were semantic analysis and the combination with
the morphological analysis covering over 50 percent of the publications in the top domain in
which NLP is interpreted as privacy enabler.
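The difference between these two analysis levels can be illustrated with a short Python sketch. It assumes spaCy and its small English model (en_core_web_sm) are installed; spaCy serves here only as a convenient stand-in for the morphological and semantic components found in the surveyed tools.

# Contrast of the two analysis levels on a single sentence.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Smith treated the patient in Boston on Monday.")

# Morphological level: word forms, lemmas, part-of-speech tags.
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Semantic level: named entities, i.e. which spans denote persons or places.
for ent in doc.ents:
    print(ent.text, ent.label_)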

Most frequently applied NLP Method

Our results show that almost 50 percent of the publications in which NLP is utilized as a
privacy enabler are based on statistical approaches. Neural network and rule-based
approaches received 25.1 and 24.1 percent, respectively.
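As an illustration of what a statistical approach looks like in this context, the following sketch trains a bag-of-words naive Bayes classifier to flag privacy-sensitive sentences. It assumes scikit-learn; the four training sentences and the two labels are invented for demonstration only.

# Minimal statistical approach to detecting privacy-sensitive sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented purely for illustration.
texts = [
    "My social security number is 123-45-6789",
    "The patient was diagnosed with diabetes",
    "The weather in Munich is sunny today",
    "Our next meeting is scheduled for Friday",
]
labels = ["sensitive", "sensitive", "neutral", "neutral"]

# Bag-of-words features feed a multinomial naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["The patient reported chest pain"]))  # likely 'sensitive'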

Diversity of Solution approaches towards Privacy Policy Complexity

We discovered that the widest variety of solution approaches addresses the privacy-related
challenge of the complexity of privacy policies: 12 out of the 16 categories for solving
privacy-related challenges address it. Most of these publications fall into the information
extraction category.

High Coverage of Semantic Analysis

Almost all privacy issue solutions supported by NLP are covered by either semantic analysis
alone or its combination with morphological analysis. Not every solution approach is based
purely on one level of analysis, but mixed-level approaches are less frequent than pure ones.

High Coverage of Solutions Approaches supported by NLP based on Neural Networks

All solution categories include approaches in which neural networks are applied. The highest
coverage, however, still belongs to statistical approaches. Rule-based and mixed approaches
are a minority.

6.1.4. Main Solutions for NLP as a Privacy Threat


In this section, we will summarize the results we received for the top category in which
privacy issues caused by NLP are solved.

Homomorphic Encryption as most applied PET

The three most frequent PETs according to our mapping results are homomorphic encryption,
obfuscation, and differential privacy, with 36, 30, and 14 publications, respectively.
Including combinations in the count, homomorphic encryption, obfuscation, and differential
privacy achieve an overall coverage of 27.6, 22.4, and 14.9 percent, respectively.
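As a toy illustration of why homomorphic encryption fits these use cases, the following sketch sums two encrypted word counts without decrypting them. It assumes the python-paillier package (phe) is installed and demonstrates only additive homomorphism, not a complete privacy-preserving NLP pipeline.

# Additively homomorphic encryption demo with python-paillier (pip install phe).
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Each party encrypts its local count of some sensitive term.
count_a = public_key.encrypt(17)
count_b = public_key.encrypt(25)

# An untrusted aggregator adds the ciphertexts without seeing the counts.
encrypted_total = count_a + count_b

print(private_key.decrypt(encrypted_total))  # -> 42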

86
6. Discussion

Most frequent PET combinations

The PETs mainly used in combinations are federated learning and homomorphic
encryption. The most prominent combination in the mapping results is federated learning
with differential privacy, with a count of five.

PETs with Highest Solution Coverage

Three PETs cover all five privacy issues caused by NLP, which demonstrates their universal
applicability: differential privacy, homomorphic encryption, and obfuscation.

Differential Privacy for Privacy-Utility-Trade-off

Differential privacy is the PET applied most often to introduce a trade-off between privacy
and utility into a data set, so that the disclosed data still has value for an NLP task.
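A minimal sketch of this trade-off is the Laplace mechanism, in which noise drawn from Laplace(0, sensitivity/epsilon) is added to a statistic before release: smaller epsilon means more noise and more privacy, larger epsilon means more utility. The epsilon values and the count below are illustrative assumptions.

# Laplace mechanism sketch for an epsilon-differentially private count.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_count = 128  # e.g. how many documents mention a sensitive term
print(dp_count(true_count, epsilon=0.5))  # noisier, more private
print(dp_count(true_count, epsilon=5.0))  # closer to 128, more useful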

Federated Learning as main PET for Model Training

Federated learning is applied exclusively for training models on data without sharing the
raw data and does not solve any further privacy issues caused by NLP.
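The following numpy sketch illustrates the federated averaging idea behind this finding: each client updates a model on its private data, and only weight vectors are shared with the server. The linear least-squares model and the two synthetic clients are assumptions for demonstration, not the setups of the surveyed papers.

# Minimal federated-averaging sketch: raw data never leaves the clients.
import numpy as np

def local_update(w, X, y, lr=0.1, steps=10):
    """A few gradient steps of least squares on a client's private data."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(0)
# Two clients, each holding a private feature matrix X and targets y.
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(2)]

w_global = np.zeros(3)
for _ in range(5):
    # Each client starts from the current global model; (X, y) stays local.
    local_weights = [local_update(w_global.copy(), X, y) for X, y in clients]
    # The server only averages weight vectors, never seeing the data.
    w_global = np.mean(local_weights, axis=0)

print(w_global)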

Highest Diversity regarding Application by Use Cases

Homomorphic encryption and obfuscation are the PETs with the highest diversity, being
applied in six and seven out of nine use case categories, respectively.

Synthetic Data Generation as Solution for Speech related Services

We realized that the PET synthetic data generation is applied mainly in the use case category
referring to voice-based services.

2018 as the beginning of the Diverse Application of PETs

Since 2018, not just the number of publications regarding the application of PETs to solve
NLP-related privacy issues has been increasing, but also the diversity of approaches.

6.2. Limitations
Our systematic mapping study covers a great deal of information and raises many further
questions. However, this thesis was limited by time, so not every line of thought could be
pursued. Another major issue we faced during this thesis was the limited access to online
papers that are either not covered by the university's online services or hosted by third-party
libraries with service problems, which made access to specific papers impossible.
Furthermore, the quality of abstracts and titles was crucial for us, mainly because of the
tremendous number of papers, and the title or abstract did not always provide a sufficient
amount of information. Privacy-Preserving NLP is not a mature and clearly defined term;
therefore, some papers may not have been inspected because they did not use the required
keywords and were thus not detected by the search engines of the electronic data sources.
Another drawback was that not all electronic data sources provide optimal search
functionality, such as filtering keywords within a particular part of the document. This led
to results that included the keywords but whose topic differed significantly from ours.
Consequently, the methodology left out some papers that are relevant to the topic. However,
finding those papers would have required an extended search query, which would have led
to an even larger number of papers to go through, which would not have been realistic in
the given amount of time. Some examples of papers from the PrivateNLP Workshop [254]
that were not found have the following titles:

• On Log-Loss Scores and (No) Privacy [255]

• A Differentially Private Text Perturbation Method Using Regularized Mahalanobis Metric [56]

• Identifying and Classifying Third-party Entities in Natural Language Privacy Policies [52]

• Surfacing Privacy Settings Using Semantic Matching [54]

• Differentially Private Language Models Benefit from Public Pre-training [55]

7. Conclusion and Future Work
In this chapter, we will conclude the thesis and delineate possible future work.

7.1. Conclusion
All in all, we asked ourselves which privacy-related challenges exist in Natural Language
Processing. To answer this question, we conducted a systematic mapping study to investigate
the literature addressing privacy preservation and natural language processing. We included
431 publications in our mapping study and realized that they follow two different
interpretations of privacy and NLP: 304 publications see NLP as a privacy enabler, which
faces different challenges than NLP as a privacy threat, the interpretation held by the
remaining 127. We were curious about the use cases and privacy issues those two
interpretations contain and which methods and concepts were applied. In the end, we used
five major NLP-related privacy threats, partially inspired by Pan, Carlini, and Koppel
[8, 239, 9], and 14 privacy-related challenges, partially inspired by the paper by Boukharrou,
Chaouche, and Mahdjar [146], within our selected literature. Through mappings of the use
case categories onto the privacy-related challenges, we gained a better understanding of the
challenges and their environment. The second question we attempted to answer addressed
the solutions the literature proposes for the privacy-related challenges and the possibility of
categorizing them. We applied the categories suggested by Dilmegani [101] for our solutions
to NLP privacy issues and applied Petersen's keywording approach [11] to extract the
categories that describe privacy issue solution approaches supported by NLP. To see the
potential of the different solution approaches for NLP-related privacy issues or NLP as a
privacy enabler, we mapped the privacy issues onto the solutions to gain a better
understanding of their applicability and the scenarios in which they are used. In the end,
we reported our findings and observations to provide an overview of the research field of
Privacy-Preserving Natural Language Processing and its challenges and solutions. With this
overview, we proposed different research activities and indicated in which direction the
research field might head.

7.2. Future Work


During the mapping study, we realized that privacy preservation techniques are frequently
newly developed or reapplied. It would be interesting to investigate whether there is a trend,
and whether some tools are used more than others and why that is the case. Furthermore,
our systematic mapping study delivers an overview for researchers who are interested in the
topic of privacy preservation and NLP. This provides the opportunity to conduct systematic
literature reviews on one of the top categories identified during this thesis. Ours is a
high-level analysis of a significantly larger research field that can be extended; one example
is to inspect further the applied NLP concepts in the context of privacy. Our research shows
that the intersection between privacy and NLP gains more and more interest every year.

A. General Addenda

A.1. Work Sheets containing the analysis of the top categories


A.1.1. NLP as a Privacy Enabler Analysis Sheet

Document Title Authors Abstract Year Source Generalized Domain
Generalized
1 Data Category (RQ1)Classification of Use Case Generalized Privacy Issue(RQ1)
Generalized Privacy Issue Solution
Generalized
(RQ2)Category of Applied
NLP Method
NLP Concept(RQ2)
QTIP: multi-agent NLP and privacy architecture
V. Keselj;for
D. information
JutlaWe present
retrieval
a generic
in usable
natural
Weblanguage
privacy
2005
software
processing
IEEE (NLP) architecture,
General acronym QTIL,
complex
based
sensitive
on a system
information
of cooperating
/Browsing
unstructured
multiple
in a data
private
agents
manner
(Q/A, T, I, and
Profiling
L agents) which can be used
Designing
in any information
a multi-agent
system
architecture
incorporating
Morphological
coupled
Internet
Analysis
with privacy
information
preserving
Rule
retrieval.
based
techniques
We then introduce a hybri
The Effects of OCR Error on the Extraction
Kazemof Private
TaghvaRussell
Information
OCR BeckleyJeffrey
error has been Coombs
shown not to affect
2006 the
Springer
average accuracyOther
of text retrieval or text
complex
categorization.Recent
sensitive information
studies
protecting
however sensitive
have indicated
Information
that in
information
unstructured
Sensitive
extraction
Information
text is significantly
in unstructured
Overview
degraded
Data
of the
by OCR
current
error.
state
Weof
Semantic
experimented
the legalAnalysis
regulations
with&information
Morphological
and analyse
Rule
extraction
based
different
Analysis
&
software
Statistical
data protection
on two collectio
and pri
Theoretical considerations of ethics in text
Suominen,
mining of
H;nursing
Lehtikunnas,
This documents
paperT;discusses
Back, B; Karstena,
theoreticalH;
considerations
2006
Salakoski,
Web T;ofof
Salantera,
Science
ethics in Sbuilding
Medicineand using a text
patient
mining
health
application
information
in nursing
(PHI)Mining
/documentation.
nursing
according
document
Nursing
to privacy
documentation
policies Secondary
is based
useonofthe
unstructured
process of
Suggestion
clinical
gathering
data
of
information
(Research)
security and
from
privacy
the
Semantic
patient,
requirements
Analysis
setting goals
for text
formining
care,
Ruledocumenting
based & Statistical
nursing & interventions
NN
Evaluating the state-of-the-art in automatic
Uzuner,
de-identification
O; Luo, Y;ToSzolovits,
facilitate P
and survey studies in automatic
2007 Web de-identification,
of Science as
Medicine
a part of the i2b2patient
(Informatics
healthfor
information
Integrating(PHI)
Biology
protecting
/ medical
to the patient
discharge
Bedside)privacy
project,
records
in unstructured
authorsDisclosure
organized
clinicala
oftext
Natural
sensitive
Language
Data De-identification
Processing (NLP) challenge Semantic
on automatically
Analysisremoving private
Rulehealth
basedinformation
& Statistical(PHI) from me
Reconciling privacy policies and regulations:
Krachina,
Ontological
O; Raskin,
How
semantics
V;well
Triezenberg,
the
perspective
privacyKpolicy follows 2007
a regulation
Web ofisScience
one of the current
Law concerns of the
Privacy
user. Policies
Such a task can be accomplished
Ease the automation
by directlyprocess
querying
of the
privacy
policy
Compliance
regulating
statement
with
documents
with
Requirements Ontological
the regulation semanticsof
text. Automation perspective forrequires
the process
Semantic checking
anPrivacy
Analysis Policies
expressive Statistical
meaning-based framework for Natura
Privacy and Artificial Agents, or, Is Google
Chopra,
Reading
S; White,
My Email?
WeL investigate legal and philosophical
2007notions
Web ofof Science
privacy in the
Law
context of artificialUser
agents.
Generated
Our analysis
Content
utilizes
(UGC)
a protecting
normative
/ emails sensitive
account of
Information
privacy that
in defends
unstructured
Profiling
its value
text and the extent Suggestion
to which it should
of security
be protected:
and privacy
Semantic
privacy
requirements
Analysis
is treated&as
Morphological
an interest
Rulewith
based
Analysis
moral
& Statistical
value, to supplement
& NN t
Distributed Latent Dirichlet allocation for
H.objects-distributed
Wang; Z. Li; Y.The
Cheng
cluster
paper ensemble
introduces the model of 2008
distributed
IEEElatent Dirichlet location
General(D-LDA) for complex
objects-distributed
sensitive information
cluster ensemble
/Mining
unstructured
which
according
can
data
handle
to privacy
the problems
policies Sensitive
of privacy
Information
preservation,
in unstructured
distributed
Semantic
computing
Data
similarityand
based
knowledge
on clustering
Semantic
reuse.and
Analysis
First,
k-annonymity
the latent variables
Statistical
in D-LDA and some terminologi
Finding Defects in Natural Language Confidentiality
J. H. Weber-Jahnke;
Requirements
Large-scale
A. Onabajo
software systems must2009
adhere
IEEE
to complex, multi-lateral
Law security and privacy
Privacy requirements
Policies from regulations.
Definition
It is
of industrial
Privacy Requirements
practice to define
Compliance
such requirements
with Requirements
in form G
ofeneration
natural language
of Artefacts
(NL)based
documents.
Semantic
on Privacy
Currently
Analysis
Policyexisting
Analysis
& Morphological
approaches
Rule based
Analysis
to analyzing NL confidentiality
Accurate Synthetic Generation of Realistic
PeterPersonal
ChristenAgus
Information
A large
Pudjijono
proportion of the massive amounts
2009 Springer
of data that are being
General
collected by many
Synthetic
organisations
Data Generation
today is about Privacy
people, Preserving
and often contains
Information
identifying
Sharing
Data
information
Linkage like names, addresses,
Generation
dates
ofof
Synthetic
birth, or Data
socialMorphological
security numbers.
Analysis
Privacy andRule
confidentiality
based are of great concern
Voice convergin: Speaker de-identification
Q. Jin;
by voice
A. R. Toth;
transformation
Speaker
T. Schultz;
identification
A. W. Blackmight be a suitable
2009 IEEE
answer to prevent unauthorized
General access Speech
to personal
Datadata. However we Privacy
also need
Preserving
to provideInformation
solutions toSharing
secure
Identification
transmission of spoken information.
Speech De-Identification
This challenge divides
Speech
into two
Processing
major aspects. First,
Statistical
the secure transmission of the con
A System for De-identifying Medical Message
A. Benton;
Board
S. Text
Hill;There
L. Ungar;
are millions
A. Chung;
of public
C. Leonard;
posts to
C.
2010
medical
Freeman;
IEEEmessage
J. H. Holmes
boards Medicine
by users seeking support
User Generated
and information
Content
on (UGC)
a wide
Protecting
/range
user posts
of Sensitive
medical conditions.
Information
It on
hasSocial
Secondary
been Media
shown
use
Platforms
that
of these
unstructured
posts De-identification
can
clinical
be used
datato(Research)
gain a greater understanding
Semantic Analysis
of patients'
& Morphological
experiences
Statistical
Analysis
and concerns. As investigat
Detecting Revelation of Private Information
N. Watanabe;
on OnlineH.
Social
Yoshiura
Online
Networks
social networks are being used
2010byIEEE
more and more people.
Social
While
Network
they enable
Userrich
Generated
communication,
Contentthey
(UGC)
cause
Protecting
privacySensitive
problems.Information
One problem
on Social
Unintended
that has
Media
notdata
been
Platforms
disclosure
previously researched
Detecting privacy-sensitive
is the revelation information
of private
Morphological
information
/ activities
Analysis
on blogs and
Statistical
comment parts of online social net
Data Leak Prevention through Named Entity
J. M. Gómez-Hidalgo;
RecognitionThe rise
J. M.
ofMartín-Abreu;
the social webJ.has
Nieves;
brought
2010
I. Santos;
a series
IEEEF.ofBrezo;
privacy
P. concerns
G. Bringas
Socialand
Network
threats. InUser
particular,
Generated
data Content
leakage (UGC)
is a risk
Protecting
that affects
Sensitive
the privacy
Information
of not only
on Social
Unintended
companies
Mediadata
but
Platforms
individuals.
disclosure Although
Detecting
there
privacy-sensitive
are tools that can
information
Semantic
prevent data
/ Analysis
activities
losses, they require
Ruleabased
prior step
& Statistical
that involves
& NNthe sen
Privacy Domain-Specific Ontology Building
L. Cai;
andC.Consistency
Lu; C. Zhang
WithAnalysis
the rapid development of the Internet,
2010 IEEEpeople increasingly
General
pay attention to the
complex
users' online
sensitive
privacy
information
issues. /And
Privacy
Privacy
so , Preserving
building
Ontologyprivacy
Information
domain-specific
Sharing
Profiling
ontology is a necessary method
Ontology
forbuild
usingfor
and
privacy
protectingSemantic
privacy data
Analysis
among computerStatistical
systems. Ontology is an explicit con
A Notation for Policies Using Feature Structures
Fujita, K; Tsukada,
New
Y security and privacy enhancing
2011
technologies
Web of Science
are demanded
Lawin the new information
Privacyand
Policies
communication environments
Automatedwhere
and Flexible
a huge Rule
number
Enforcement
of computers
Complexityinteract
of Privacy
withPoliciesAutomated
each other in a distributed
Access Control
and ad hoc
based
Morphological
manner
on notation
to access
Analysis
various resources.
Rule based
In this paper, we focus on ac
Privacy Measures for Free Text Documents:
Liqiang
Bridging
GengYonghua
thePrivacy
Gap YouYunli
between
compliance
WangHongyu
Theory
forand
freePractice
text
Liu documents
2011 Springer
is a challenge facing
Medicine
many organizations.
patient
Named
healthentity
information
recognition
(PHI)techniques
Measuring
/ Free Textand
the Compliance
machine learning
of apps
methods
Secondary
can be
use
used
of unstructured
to detect private
Automated
clinical
information,
data
Compliance
(Research)
such as
Checker
personally
Semanticidentifiable
Analysis information Rule
(PII) based
and personal health information

Towards Natural-Language Understanding


Papanikolaou,
and Automated
N; In
Pearson,
this
Enforcement
paper
S; Mont,
we of
survey
Privacy
MC existing
Ruleswork
and
2011
Regulations
on automatically
Web of Science
in theprocessing
Cloud:
LawSurvey
legal,and
regulatory
Bibliography
Privacy
and other
Policies
policy texts for the extraction
Automatedand
andrepresentation
Flexible Rule of
Enforcement
privacy
Compliance
knowledge
with
and
Requirements
rules. Our objective
Automaticispolicy
to linkenforcement
and apply some
Semantic
of these
Analysis
techniques
& Morphological
to policy
Ruleenforcement
based
Analysis and compliance, to p
Unlocking data for clinical research–theGanslandt,
German i2b2
Thomas;
experience
Objective:
Mate, Sebastian;
Data from Helbing,
clinical care
Krister;
is2011
increasingly
Sax,
Google
Ulrich;being
Scholar
Prokosch,
used for
HU;
Medicine
research purposes.
patient
The i2b2
health
platform
information
has been
(PHI)introduced
Privacy
/ clinicalPreserving
notes
in some US
Information
researchSharing
communities
Secondaryasuse
a tool
of unstructured
for data integration
De-identification
clinicaland
dataquerying
(Research)
by clinical
Semantic
users. The
Analysis
purpose of this project
Rule based
was to assess the applicability
On the Declassification of Confidential Abril,
Documents
D; Navarro-Arribas,
We introduce
G; Torra,
the anonymization
V of 2011
unstructured
Web of documents
Science toGeneral
settle the base of automatic
complex sensitive
declassification
information
of confidential
/protecting
unstructured
documents.
sensitive
data Information
Departing in
from
unstructured
Sensitive
known ideas
Information
textand methods
in unstructured
of De-identification
data privacy,
Data we introduce theSemantic
main issues
Analysis
of unstructured document
Statisticalanonymization and propose
Text Classification for Data Loss Prevention
Michael HartPratyusa
Businesses,
ManadhataRob
governments,
Johnson
and individuals
2011 Springer
leak confidential information,
General both accidentally
complexand
sensitive
maliciously,
information
at tremendous
/protecting
unstructured
cost
sensitive
in
data
money,
Information
privacy, in
national
unstructured
Unintended
security,
text
data
and disclosure
reputation. Several
Detecting
security
privacy-sensitive
software vendors
information
Semantic
now offer
/ Analysis
activities
“data loss prevention”
Statistical
(DLP) solutions that use simple
Automatic Anonymization of Natural Languages
H. Nguyen-Son;
Texts Posted
Q.
OneNguyen;
approach
on Social
M. Tran;
to
Networking
overcoming
D. Nguyen;
Services
theH.problem
2012
Yoshiura;
and Automatic
IEEE
of too
I. Echizen
much
Detection
information
Social
of Disclosure
about
Network
a user being
Userdisclosed
GeneratedonContent
social networking
(UGC)Protecting
/ user
services
posts
Sensitive
(by theInformation
user or by the
on Social
user's
Sensitive
friends)
Media
Information
Platforms
throughinnatural
unstructured
Automatic
language
Data
texts
Annonymization
(blogs, comments,
of Text
Morphological
Posts
status updates,
Analysis
etc.) is to Statistical
anonymize the texts. However, determ
A machine learning solution to assess privacy
Costante,
policy
Elisa;
completeness:
Sun,
A privacy
Yuanhao;
policy
(short
Petković,
is paper)
a legalMilan;
document,
den 2012
Hartog,
usedGoogle
by
Jerry;
websites
Scholar
to communicate
Law how the personal
Privacy Policies
data that they collect will
Paying
be managed.
more Attention
By accepting
to Privacy
it, the
Policies
user
Complexity
agrees of
to Privacy
release PoliciesAutomatic
his data under theAssessment
conditions stated
of Privacy
by
Semantic
thePoliciy
policy.
Analysis
Completness
Privacy policies should
Statistical
provide enough information to e
Evaluating current automatic de-identification
Ferrandez,
methods
O; South,
with
Background:
Veteran's
BR; Shen,The
health
SY;increased
Friedlin,
administration
FJ;
useSamore,
and
2012
clinical
adoption
MH;
Web
documents
Meystre,
of Electronic
Science
SM Health
Medicine
Records (EHR)patient
causeshealth
a tremendous
information
growth
(PHI)in
protecting
/ Electronic
digital information
patient
Healthprivacy
Records
usefulinfor
unstructured
(EHR)
clinicians,
Secondary
researchers
clinicaluse
textofand
unstructured
many other
De-identification
clinical
operational
data (Research)
purposes. However,
Semantic
this Analysis
information is rich inStatistical
Protected Health Information (PHI), w
Using Profiling Techniques to Protect the
Alexandre
User’s Privacy
ViejoDavid
The
in Twitter
emergence
SánchezJordi
of microblogging-based
Castellà-Roca 2012 social
Springer
networks shows
Social
how Network
important it isUser
for common
Generated
people
Content
to share
(UGC)information
Privacy Preserving
worldwide.
Information
In this environment,
Sharing
Profiling
Twitter has set it apart from
Designing
the resta of
scheme
competitors.
for a privacy
Users
Semantic
preserving
publish
Analysis
texttext
messages
&analysis
Morphological
containing
Statistical
Analysis
opinions and information abo
Detecting Sensitive Information from Textual
David Documents:
SánchezMontserrat
Whenever
An Information-Theoretic
BatetAlexandre
a document containing
Viejo
Approach
sensitive
2012 Springer
information needsGeneral
to be made public, User
privacy-preserving
Generated Content
measures
(UGC)should
Privacy
bePreserving
implemented.
Information
DocumentSharing
sanitization
Inflexibilityaims
of annonymization
at detecting sensitive
tools
Detecting
pieces
privacy-sensitive
of informationinformation
inSemantic
text, which
/ Analysis
activities
are removed
& Morphological
or hidden
Statistical
prior
Analysis
publication. Even though m
P2P Watch: Personal Health Information
Sokolova,
Detection
M;inElPeer-to-Peer
Emam,
Background:
K; Arbuckle,
File-Sharing
Users L;
of Neri,
peer-to-peer
Networks
E; Rose,
2012
(P2P)
S; Jonker,
Web
file-sharing
ofEScience
networks
Medicine
risk the inadvertentpatient
disclosure
healthofinformation
personal health
(PHI)Privacy
information
/ Electronic
Preserving
(PHI).
HealthInInformation
Records
addition (EHR)
toSharing
potentially
Unintended
causing
data
harm
disclosure Detection
to the affected of Personal
individuals, Health
this can Information
heighten
Semantic inofPeer-to-Peer
the risk
Analysis File-Sharing
data breaches Rule Networks
for health
based information custodians. Au
iDASH: integrating data for analysis, anonymization,
Ohno-Machado, andLucila;
iDASH
sharingBafna,
(integrating
Vineet;
data
Boxwala,
for analysis,
Aziz2012
A;anonymization,
Chapman,
Google Scholar
Brian
andE;sharing)
Chapman,
Medicine
is the
Wendy
newest
W; National
Chaudhuri,
patient health
Center
Kamalika;
information
for Biomedical
Day, (PHI)
Michele
Computing
Privacy
/ Electronic
E; Farcas,
Preserving
funded
Health
Claudiu;
byInformation
Records
theHeintzman,
NIH.(EHR)
It Sharing
focuses
Nathaniel
Handling
on algorithms
of
D;sensitive
Jiang,and
Xiaoqian;
data
toolsbyfor
applications
Overview
sharing data
of algorithms
in a privacy-preserving
and tools
Semantic
for sharing
manner.
Analysis
data
Foundational
in a privacy-preserving
privacy
Rule based
technology
&manner
Statistical
research performe
Improved de-identification of physician McMurry,
notes through
AJ; Fitch,
integrative
Background:
B; Savova,
modeling
Physician
G; Kohane,
of both
notes
public
IS; Reis,
routinely
and2013
BY
private
recorded
Webmedical
of during
Science
textpatientMedicine
care represent a vast
patient
and health
underutilized
information
resource
(PHI)for
Privacy
/ Physician
human
Preserving
disease
Notes studies
Information
on a Sharing
population
Secondary
scale.use
Their
of unstructured
use in research
De-identification
clinical
is primarily
data (Research)
limited by the Morphological
need to separate
Analysis
confidential patient
Statistical
information from clinical annot
A framework for privacy-preserving medical
Li, Xiao-Bai;
documentQin,
sharing
Health
Jialun; information systems have greatly
2013 increased
Google Scholar
availabilityMedicine
of medical documents
patient
andhealth
benefited
information
healthcare
(PHI)
management
Privacy
/ unstructured
Preserving
andmedical
research.
Information
documents
However,
Sharing
Secondary
there are growing
use of unstructured
concerns about
De-identification
clinical
privacy
datain(Research)
sharing medicalSemantic
documents.
Analysis
Existing approaches
Statistical
for privacy preserving data sharin
Anonymouth revamped: Getting closer McDonald,
to stylometric
AW;anonymity
Ulman, Jeffrey; Barrowclift, Marc; Greenstadt,
2013 Google
Rachel;
Scholar Other User Generated Content (UGC)protecting sensitive Information in unstructured
Identification
text De-identification / ObfuscatingSemantic
Authorship
Analysis Rule based & Statistical
A Versatile Tool for Privacy-Enhanced Web
Avi ArampatzisGeorge
Search We consider
DrosatosPavlos
the problem
S. Efraimidis
of privacy2013
leaksSpringer
suffered by InternetOther
users when they perform
User Generated
web searches,
Content
and(UGC)
propose
Searching
/ Query
a framework
Data
in a private
to mitigate
mannerthem. Our
Profiling
approach, which builds uponDesigning
and improves
a privacy-preserving
recent work onSemantic
search
web search
privacy,
Analysis
engine
approximatesStatistical
the target search results by replacing
Gateway to the Cloud - Case Study: A Privacy-Aware
R. Smith; J. Xu;Environment
S.We
Hima;
describe
O. for
Johnson
aElectronic
study in the
Health
domain
Records
2013
of health
Research
IEEEinformatics whichMedicine
includes some novel
patient
requirements
health information
for patient(PHI)
confidentiality
Privacy
/ Electronic
Preserving
in the
Health
context
Information
Records
of medical
(EHR)
Sharing
health
Secondary
research.
use of
Weunstructured
present a prototype
Designing
clinical data
which
a tailor-made
(Research)
takes health
data records
annonymity
Semantic
from
Analysis
approach
a commercial dataRule
provider,
basedanonymises
& Statisticalthem in an inn
Knowledge-based scheme to create privacy-preserving
SáNchez, David; but
Web
Castellà-Roca,
semantically-related
search engines
Jordi; (WSEs)
Viejo,
queries
Alexandre;
arefor
basic
2013
webtools
search
Google
for engines
finding
Scholar
and accessing
Other data in the User
Internet.
Generated
However,
Content
they also
(UGC)
put
Searching
/the
Query
privacy
Data
in aofprivate
their users
manner
at risk. This
Profiling
happens because users frequently
Generation
reveal
of Synthetic
private information
Data Semantic
in theirAnalysis
queries.&WSEs
Morphological
gather
Statistical
this
Analysis
personal data and build user
A taxonomy of privacy-preserving record
Vatsalan,
linkage Dinusha;
techniques
The
Christen,
processPeter;
of identifying
Verykios,which
Vassilios
records
2013
S; in
Google
two orScholar
more databases
Othercorrespond to complex
the samesensitive
entity is an
information
important/Privacy
unstructured
aspect of
Preserving
datadata
quality
Information
activities Sharing
suchData
as data
Linkage
pre-processing and data
Overview
integration.
of techniques
Known as
forrecord
privacy
Semantic
linkage,
preserving
Analysis
data data
matching
linkage
or entity
Rule resolution,
based this process has attra
Minimizing the disclosure risk of semantic
Sánchez,
correlations
David;inBatet,
Text
document
sanitization
Montserrat;
sanitization
isViejo,
crucialAlexandre
to enable2013
privacy-preserving
ScienceDirect declassification
General of confidential
complex
documents.
sensitiveMoreover,
information
considering
/Privacy
unstructured
Preserving
the advent
data Information
of new information
Sharing
Inflexibility
sharing technologies
of annonymization
that enable
tools
Semantic
the daily
correlations
publication
in document
of thousands
Semantic
sanitization
of
Analysis
textual
to documents,
MinimizingRule
risk
automatic
disclosure
based &and
Statistical
semi-automatic
& NN sani
Utility preserving query log anonymization
Batet,
viaMontserrat;
semantic microaggregation
Query
Erola,logs
Arnau;
areSánchez,
of great interest
David; for
Castellà-Roca,
2013
scientists
ScienceDirect
and
Jordi
companies for
Other
research, statistical
Query
and Data
commercial purposes. However,
Searchingthe
in availability
a private manner
of query logsProfiling
for secondary uses raises privacy
Semantic
issues
Microaggregation
since they allowfor
the
Semantic
annonymization
identification
Analysis
and/or
of&query
Morphological
revelation
logsStatistical
to of
preserve
Analysis
sensitive
utility
information about indi
A discussion of privacy challenges in user
Hasan,
profiling
Omar;with
Habegger,
big
User
data
profiling
techniques:
Benjamin;
is theBrunie,
process
The EEXCESS
Lionel;
of collecting
Bennani,
2013
use case
Google
information
Nadia;
Scholar
Damiani,
about aErnesto;
user
Otherin order to construct
complex
theirsensitive
profile. The
information
information
/Mining
unstructured
in a according
user profile
datatomay
privacy
include
policies
various
Profiling
attributes of a user such asSuggestion
geographical
of security
location,and
academic
privacy
Semantic
and
requirements
professional
Analysis for
& Morphological
background,
text mining
Rulemembership
based
Analysis
& Statistical
in groups, interests,
Privee: An architecture for automaticallyZimmeck,
analyzingSebastian;
web privacy
Privacy
Bellovin,
policies
policies
Steven
on websites
M; are based
2014onGoogle
the notice-and-choice
Scholar Lawprinciple. They notify
Privacy
Web
Policies
users of their privacyPaying
choices.
more
However,
Attention
many
to Privacy
users doPolicies
not
Complexity
read
& Ease
privacy
ofthe
Privacy
policies
comprehension
PoliciesAutomatic
or have difficulties
of PrivacyAnalysis
understanding
Policies
of Privacy
them.
Semantic
Policies
In order
Analysis
to increase
& Morphological
privacyRule
transparency
based
Analysis &
weSyntactic
proposeAnalysis
Privee—a
Preparing an annotated gold standard corpus
Deleger,
toL;
share
Lingren,
with
Objective:
T;
extramural
Ni, YZ;
TheKaiser,
investigators
current
M;study
Stoutenborough,
for
aims
de-identification
to
2014
fill the
L;
Webgap
Marsolo,
ofresearch
inScience
available
K; Kouril,
healthcare
Medicine
M; Molnar,
de-identification
K; Solti,patient
I resources
health information
by creating
(PHI)
a new
Annotations
/ Electronic
sharablewith
Health
dataset
Gold
Records
with
Standard
realistic
(EHR)Protected
Annotations
Health
andInformation
their Quality
(PHI)
Collect
without
annotated
reducing
corpus
the value Morphological
of the data for Analysis
de-identificationRule
research.
basedBy
& Statistical
releasing the annotate
Evaluating the effects of machine pre-annotation
South, BR;and
Mowery,
an The
interactive
D;Health
Suo, Y;
annotation
Insurance
Leng, JW;Portability
interface
Ferrandez,
on
and
2014
manual
O;
Accountability
Meystre,
Web
de-identification
of SM;
Science
Act
Chapman,
(HIPAA)
ofMedicine
clinical
Safe
WW text
Harbor method
patient
requires
healthremoval
information
of 18(PHI)
types
Privacy
/ of
clinical
protected
Preserving
text health
Information
information
Sharing
(PHI)
Secondary
from clinical
use documents
of unstructured
to be
De-identification
clinical
considered
data (Research)
de-identified prior
Semantic
to use for
Analysis
research purposes.
Rule
Human
based
review
& Statistical
of PHI elements fro
De-identification in natural language processing
V. Vincze; R. Farkas
Natural language processing (NLP)2014
systems
IEEE
usually require aMedicine
huge amount of textual
patient
data
health
but the
information
publication
(PHI)
of such
Privacy
/ clinical
datasets
Preserving
notesis often
Information
hindered Sharing
by privacy
Secondary
and data
use of
protection
unstructured
issues.
De-identification
clinical
Here,data
we discuss
(Research)
the questions
Semantic
of de-identification
Analysis related
Rule
to three
basedNLP
& Statistical
areas, namely,
& NN clinica
De-identification of unstructured paper-based
Fenz, Stefan;
health records
Heurix,
Whenever
Johannes;
for privacy-preserving
personal
Neubauer,
data isThomas;
secondary
processed,
2014
Rella,
use
privacy
Google
Antonio;
is Scholar
a serious issue.
Medicine
Especially in thepatient
document-centric
health information
e-health
(PHI)
area,
Privacy
/ clinical
the patients’
Preserving
notes privacy
Information
must beSharing
preserved
Secondary
in order
usetoofprevent
unstructured
any negative
De-identification
clinical data
repercussions
(Research)
for the Semantic
patient. Clinical
Analysisresearch, for example,
Statisticaldemands structured health
Text de-identification for privacy protection:
Meystre,
A study
SM;ofFerrandez,
itsAs
impact
moreon
O;
and
clinical
Friedlin,
moretext
electronic
FJ;information
South,clinical
BR;content
2014
Shen,
information
Web
SY; Samore,
of is
Science
becoming
MH Medicine
easier to access for patient
secondary
health
uses
information
such as clinical
(PHI)Privacy
/ research,
clinicalPreserving
notes
approaches
Information
that enable
Sharing
faster
Secondary
and more
use of
collaborative
unstructuredresearch
De-identification
clinical data
while(Research)
protecting patient
Semantic
privacy Analysis
and confidentiality are
Statistical
becoming more important. Clinical
De-identification of clinical narratives through
Li, MQ;writing
Carrell,
complexity
D;Purpose:
Aberdeen,
measures
Electronic
J; Hirschman,
health L;
records
Malin,2014
contain
BA Web a substantial
of Sciencequantity
Medicine
of clinical narrative,
patient
which
health
is increasingly
informationreused
(PHI)Privacy
/for
clinical
research
Preserving
narratives
purposes.
Information
To share
Sharing
data
Secondary
on a large
use
scale
of unstructured
and respectDe-identification
clinical
privacy,data
it is (Research)
critical to removeSemantic
patient identifiers.
Analysis &De-identification
Morphological
Statistical
Analysis
tools based on machine learni
Can Physicians Recognize Their Own Patients
Meystre,inS;De-identified
Shen,The
SY;adoption
Hofmann,
Notes?ofD;
Electronic
Gundlapalli,
Health
A Records
2014 Webis growing
of Science
at a fastMedicine
pace, and this growth
patient
results
health
in very
information
large quantities
(PHI)Privacy
/ clinical
of patient
Preserving
notes
clinicalInformation
informationSharing
becoming
Unintended
available
dataindisclosure
electronic format,
De-identification
with tremendous potentials,
Semantic
but also
Analysis
equally&growing
Morphological
concern
Rule based
Analysis
for patient
& Statistical
confidentiality brea
Profiling social networks to provide useful
Viejo,
andAlexandre;
privacy-preserving
Web
Sánchez,
search
web
David;
engines
search (WSEs) use search
2014 Google
queries Scholar
to profile users
Social
andNetwork
to provide personalized
User Generated
services
Content
like query
(UGC)Searching
disambiguation
/ Query Data
in a private
or refinement.
manner TheseProfiling
services are valuable because
Designing
users geta an
privacy-preserving
enhanced search
Semantic
web
experience.
search
Analysis
engine
However, the compiled
Statisticaluser profiles may contain se
Privacy detective: Detecting private information
Caliskan and
Islam,
collective
Aylin;
Detecting
Walsh,
privacy
the
Jonathan;
behavior
presenceGreenstadt,
inand
a large
amount
social
2014
Rachel;
of private
network
Googleinformation
Scholar being
Social
shared
Network
in online media
User Generated
is the first Content
step towards
(UGC)analyzing
Protectinginformation
Sensitive Information
revealing habits
on Social
Profiling
of users
Media
in social
Platforms
networks and
Detecting
a usefulprivacy-sensitive
method for researchers
information
Semantic
to study
/ Analysis
activities
aggregate privacy
Statistical
behavior.& In
NNthis work, we aim to fin
A platform for developing privacy preserving
S. Ucan;
diagnosis
H. Gu mobile
Healthcare
applications
Information Technology2014
has been
IEEEin great vogue inMedicine
the world today due
patient
to thehealth
dominant
information
need of (PHI)
computational
protecting
/ free textintelligence
patient privacy
for processing,
in unstructured
Disclosure
retrieval,
clinical
and
oftext
sensitive
the use of
data
health
forInformation
analysis
care information.
Extraction
This
from
paper
Health
Semantic
presents
Records
Analysis
a platform system forStatistical
developing self-diagnosis mobile ap
Filtering Personal Queries from Mixed-Use
Ary Fagundes
Query Logs
Bressane
QueriesNetoPhilippe
performed against
DesaulniersPablo
the open
2014
Web
Ariel
Springer
during
DuboueAlexis
working hours
Smirnov
Other
reveal missing content
QueryinData
the internal documentation
Mining
within
according
an organization.
to privacyMining
policiessuch
Compliance
queries iswith
thusRequirements
advantageous
Information
but it must
Extraction
strictly adhere
from privacy
to
Morphological
privacy
policies
policyAnalysis
and meet privacy
Statistical
expectations of the employees. P
AutoCog: Measuring the Description-to-permission
Qu, ZY; Rastogi,
Fidelity
V;
The Zhang,
inbooming
Android
XY;Applications
popularity
Chen, Y; Zhu,
of smartphones
TT; Chen,
2014Z isWeb
partly
of Science
a result of application
Law markets where
complex
users
sensitive
can easily
information
download/Ease
unstructured
widethe
range
comprehension
ofdata
third-partyof
applications.
Privacy Policies
Complexity
However,ofdue
Privacy
to thePoliciesMapping
open nature of markets,
of Description-to-permission
especially on
Semantic
Android,
Fidelity
Analysis
thereon
have
a
& scale
Morphological
been several
Rule based
Analysis
privacy&and
Statistical
& Syntactic
security & concerns
Analysis
NN w
Latent semantic analysis for privacy preserving
Selmi, Mouna;
peer feedback
Hage,
Today’s
Hicham;
e-learning
Aïmeur,
systems
Esma;enable 2014
students
Google
to communicate
Scholar with
Otherpeers (or co-learners)
User Generated
to ask or provide
Contentfeedback,
(UGC)protecting
leadingsensitive
to more efficient
Information
learning.
in unstructured
Unintended
Unfortunately,
text
datathis
disclosure
new optionSemantic
comes with
analysis
significantly
for privacy
increased
preserving
Semantic
risksAnalysis
peer
to thefeedback
privacy of theStatistical
feedback requester as well as the p
Document Title Authors Abstract Year Source Generalized Domain
Generalized
1 Data Category (RQ1)Classification of Use Case Generalized Privacy Issue(RQ1)
Generalized Privacy Issue Solution
Generalized
(RQ2)Category of Applied
NLP Method
NLP Concept(RQ2)
The Impact of Anonymization for Automated
Shermis,
Essay
Mark
Scoring
D.;This
Lottridge,
study investigated
Sue; Mayfield,
theElijah
impact of
2015
anonymizing
Wiley text on predicted
Generalscores made User
by two
Generated
kinds of automated
Content (UGC)
scoring
protecting
engines:
sensitive
one thatInformation
incorporates
in unstructured
elements
Handlingofofnatural
text
sensitive
language
data byprocessing
applications
Automated(NLP)
Essayand
Scoring
one that does
Semantic
not. Eight
Analysis
data sets (N = Rule
22,029)
based
were
& used
Statistical
to form
& NN
both traini
Annotating longitudinal clinical narratives
Stubbs,
for de-identification:
A; Uzuner,TheO 2014
The 2014
i2b2/UTHealth
i2b2/UTHealth
natural
corpus
language
2015 Web
processing
of Science
shared Medicine
task featured a trackpatient
focusedhealth
on the
information
de-identification
(PHI)Annotations
/ clinical
of longitudinal
narratives
with medical
Gold Standard
records. For
Complexity
this track,of
wePrivacy
de-identified
PoliciesCollect
a set of 1304
annotated
longitudinal
corpusmedical
Morphological
records describing
Analysis296 patients.
Rule based
This corpus was de-identified u
Iterative Classification for Sanitizing Large-Scale
B. Li; Y. Vorobeychik;
Datasets
CheapM. ubiquitous
Li; B. Malincomputing enables
2015
the IEEE
collection of massiveGeneral
amounts of personal
complex
data insensitive
a wide variety
information
of domains.
/Privacy
unstructured
Many
Preserving
organizations
data Information
aim toSharing
share
Unintended
such datadata
while
disclosure
obscuring features
De-identification
that could disclose identities
Semantic or other
Analysis
sensitive information.
Rule based
Much&of
Statistical
the data now collecte
Secure Obfuscation of Authoring Style Hoi LeReihaneh Safavi-NainiAsadullah
Anonymous authoring Galib
includes writing
2015reviews,
Springer
comments andSocial
blogs,Network
using pseudonyms
User Generated
with the general
Contentassumption
(UGC)protecting
that using
sensitive
these
Information
pseudonyms
in unstructured
will
Profiling
protect the
textreal identity of authors
De-identification
and allows / Obfuscating
them to freely
Semantic
Authorship
expressAnalysis
their views. It has been
Statistical
shown, however, that writing sty
MonkeyDroid: Detecting UnreasonableMa,
Privacy
K; Liu,
Leakages
MY; Guo,
Static
of SQ;
Android
and
Ban,
dynamic
Applications
T taint-analysis approaches
2015 Web have
of Science
been developed
Mobile to
Application
detect the complex
processing
sensitive
of sensitive
information
information.
/Investigating
unstructured
Unfortunately,
the
datapotential
faced with
Privacy
the result
Impact
Handling
ofofanalysis
smart
of sensitive
phone
about apps
data
operations
by applications
Detecting
of sensitive
privacy-sensitive
information, people
information
Semantic
have/ Analysis
no
activities
idea of which operation
Statistical
is legitimate operation and wh
A Novel Approach to Prevent Personal N.
Data
A. Patil;
on a Social
A. S. Manekar
Online
Networksocial
Usingnetwork
Graph such
Theory
as Twitter,
2015 LinkedIn
IEEE are widely used
Social
now Network
a days. In these
Userpeople
Generated
lists their
Content
personal
(UGC)information
Protecting Sensitive
and favorite
Information
activities on
which
Social
Unintended
meant
Media
to data
be
Platforms
secure.
disclosure
PrivateDetecting
information
privacy-sensitive
leakage becomesinformation
Morphological
key issues
/ activities
with
Analysis
social network
Rule
users.
based
Apart
& Statistical
from this user are ham
Oblivion: Mitigating Privacy Leaks by Controlling
Milivoj SimeonovskiFabian
the Discoverability
Search engines
BendunMuhammad
of Online
are the
Information
prevalently
Rizwan
2015
used
AsgharMichael
Springer
tools to collect
BackesNinja
information
OtherMarnauPeter
about individuals
Druschel
Query
on Data
the Internet. Search results
Searching
typically
in comprise
a private a
manner
variety of sources
Unintended
that contain
data disclosure
personal information
Detecting —
privacy-sensitive
either intentionally
information
Semantic
released/ Analysis
activities
by the person herself,
Rule or based
unintentionally
& Statistical
leaked
& NN
or publish
Autoppg: Towards automatic generationYu,
of Le;
privacy
Zhang,
policy
Tao;
Afor
privacy
Luo,
android
Xiapu;
policy
applications
Xue,
is a Lei;
statement informing
2015 users
Google
how
Scholar
their information
Law will be collected,
Privacy
used,
Policies
and disclosed. Failing
Ease
to provide
the writing
a correct
process
privacy
of privacy
policypolicies
may
Complexity
result
based
inofaon
Privacy
fine.
code
However,
PoliciesGeneration
writing privacy
of Artefacts
policy is based
tediousSemantic
on
and
Privacy
error-prone,
Analysis
Policy Analysis
because theStatistical
author may not well understand the
Towards an information type lexicon forJ.privacy
Bhatia;policies
T. D. Breaux
Privacy policies serve to inform consumers
2015 IEEE
about a company'sLaw
data practices, andPrivacy
to protect
Policies
the company from legal
Easerisk
thedue
comprehension
to undisclosed
of Privacy
uses of Policies
Complexity
consumer data.
of Privacy
In addition,
PoliciesMapping
US and EU Privacy
regulators
Policy
require
Content
companies
Morphological
to selected
to accurately
Dictionary
Analysisdescribe
Rule
their
based
practices in these policies, and
Ontology-Enabled Access Control and Marcel
PrivacyHeupelLars
Recommendations
Recent
FischerMohamed
trends in ubiquitous
BourimiSimon
computing
2015
Scerri
target
Springer
to provide user-controlled
Social Network
servers, providing
User Generated
a single Content
point of (UGC)
access
Protecting
for managing
Sensitive
different
Information
personalon
data
Social
Profiling
in different
Media Platforms
Online Social Networks
Ontology-Enabled
(OSNs), i.e.Access
profile Control
data
Semantic
andand
resources
Privacy
Analysis
Recommendations
from
& Morphological
various social
Rule based
Analysis
interaction services (e.g., Linke
Privacy-driven access control in social Imran-Daud,
networks by means
Malik;
In Sánchez,
online
of automatic
social
David;
networks
semantic
Viejo, (OSN),
annotation
Alexandre
users
2016quite
ScienceDirect
usually disclose sensitive
Social Network
informationUser
about
Generated
themselves
Content
by publishing
(UGC)Protecting
messages.
Sensitive
At the same
Information
time, they
on Social
Unintended
are (inMedia
manydata
Platforms
cases)
disclosure
unable toAutomatic
properly manage
semantic
the
annotation
access to
Semantic
mechanism
this sensitive
Analysis
information
& Morphological
due
Statistical
to the
Analysis
following issues: (i) the rigidn
Automatic Extraction of Metrics from SLAs
S. Mittal;
for Cloud
K. P.Service
Joshi;
To effectively
C.
Management
Pearce;manage
A. Joshicloud based 2016
services,
IEEEorganizations need
Lawto continuously monitor
Privacythe
Policies
performance metricsEase
listedthe
in the
automation
Cloud service
processcontracts.
of privacy
Complexity
However,
regulating
of
these
Privacy
documents
legalPoliciesAutomation
documents, like Service
process
Level
of SLAs
Agreements
Semantic
(SLA)
Analysis
or privacy
& Morphological
policy documents,
Rule based
Analysis
are currently managed as
Is the Juice Worth the Squeeze? CostsCarrell,
and Benefits
DS; Cronkite,
of Multiple
Background:
DJ;Human
Malin,
Clinical
BA;
Annotators
Aberdeen,
text contains
for Clinical
JS; Hirschman,
valuable
2016Text
Web
De-identification
information
of
L Science but must
Medicine
be de-identified before
patientithealth
can beinformation
used for secondary
(PHI)Annotations
/ clinical
purposes.
notes
with
Accurate
Gold Standard
annotation of
Annotations
personally and
identifiable
their Quality
information
Collect(PII)
annotated
is essential
corpusto the development
Morphological
ofAnalysis
automated de-identification
Rule based systems and to manual
The Creation and Analysis of a WebsiteWilson,
PrivacyS;Policy
Schaub,
Corpus
Website
F; Dara,
privacy
AA; Liu,
policies
F; Cherivirala,
are often S;
ignored
2016
Leon,
Web
by
PG;
Internet
ofAndersen,
Science
users,
MS;because
Law
Zimmeck,
these
S; documents
Sathyendra,
Privacy
tend
KM;
Policies
toRussell,
be longNC;
and Norton,
difficult
Paying
TB;
to understand.
Hovy,
more Attention
E; Reidenberg,
However,
to Privacy
the
J; Sadeh,
significance
Policies
Complexity
N of privacy
of Privacy
policies
PoliciesCollect
greatly exceeds
annotated
the attention
corpus paidMorphological
to them: theseAnalysis
documents areRule
binding
based
legal agreements between w
Optimizing annotation resources for natural
Li, MQ;
language
Carrell,de-identification
D;Objective:
Aberdeen,Electronic
J;via
Hirschman,
a game
medical
theoretic
L; Kirby,
records2016
framework
J; (EMRs)
Li, B;
WebVorobeychik,
are
of Science
increasingly
Y; Malin,
Medicine
repurposed
BA for activities
patientbeyond
health clinical
information
care,(PHI)
suchAnnotations
/as
Electronic
to support
with
Health
translational
Gold
Records
Standard
research
(EHR) Annotations
and public policy
and their
analysis.
QualityToDe-identification
mitigate privacy risks, healthcare
Morphological
organizations
Analysis
(HCOs) aimStatistical
to remove potentially identifying pati
Data sanitization for privacy preservation
P. Tambe;
on Social
D.Network
VoraOnline Social Networks (OSNs) are2016
becoming
IEEE increasingly important
Social Network
in our day toUser
day lives.
Generated
Statistics
Content
show(UGC)
that 74%
Protecting
of theSensitive
Internet users
Information
are involved
on Social
Unintended
in social
Media networking.
data
Platforms
disclosure
Unfortunately
De-identification
many of us are unaware
Semantic
of the threats
Analysis
and&vulnerabilities
Morphological
Rulethat
based
Analysis
come
& Statistical
with OSNs. & NN
These iss
New Challenge of Protecting Privacy due
L. Xu;
to Stained
Y. Wu Recognition
k-Anonymity is a good way to protect
2016
people's
IEEE privacy. It is efficient
Other to find which part
Usershould
Generated
be covered
Content
up(UGC)
by using
protecting
/ car
mosaic.
plates
sensitive
Traditional
Information
OCR could
in unstructured
lose
Unintended
efficiency
text
data
when
disclosure
the part covered
De-identification
by mosaic./ Obfuscating
ConsideringMorphological
Authorship
that words in text
Analysis
can be connected
Rule based
by logical
& Statistical
relationship, with the
Obfuscating gender in social media writing. Reddy, Sravana; Knight, Kevin. 2016, Google Scholar.
Domain: Social Network; data category: User Generated Content (UGC) / tweets; use case: Protecting Sensitive Information on Social Media Platforms; privacy issue: Profiling; solution: Designing a gender obfuscation tool; NLP method: Semantic Analysis; NLP concept: Rule based & Statistical & NN.
Abstract: The vast availability of textual data on social media has led to an interest in algorithms to predict user attributes such as gender based on the user's writing. These methods are valuable for social science research as well as targeted advertising and profiling, but also compromis...

Working at the web search engine side to generate privacy-preserving user profiles. Pàmies-Estrems, David; Castellà-Roca, Jordi; Viejo, Alexandre. 2016, Google Scholar.
Domain: Other; data category: User Generated Content (UGC) / user profiles; use case: Privacy Preserving Information Sharing; privacy issue: Unintended data disclosure; solution: Designing a web search engine side to generate privacy-preserving user profiles; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: The popularity of Web Search Engines (WSEs) enables them to generate a lot of data in form of query logs. These files contain all search queries submitted by users. Economical benefits could be earned by means of selling or releasing those logs to third parties. Nevertheless...

Enforcing transparent access to private content in social networks by means of automatic sanitization. Viejo, Alexandre; Sánchez, David. 2016, ScienceDirect.
Domain: Social Network; data category: User Generated Content (UGC); use case: Protecting Sensitive Information on Social Media Platforms; privacy issue: Profiling; solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical.
Abstract: Social networks have become an essential meeting point for millions of individuals willing to publish and consume huge quantities of heterogeneous information. Some studies have shown that the data published in these platforms may contain sensitive personal information an...

PPMark: An Architecture to Generate Privacy Labels Using TF-IDF Techniques and the Rabin Karp Algorithm. de Pontes, DRG; Zorzo, SD. 2016, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Generation of Artefacts based on Privacy Policy Analysis; NLP method: Morphological Analysis; NLP concept: Statistical.
Abstract: Layman and non-layman users often have difficulties to understand privacy policy texts. The amount of time spent on reading and comprehending a policy poses a challenge to the user, who rarely pays attention to what he or she is agreeing to. Given this scenario, this paper a...

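The TF-IDF weighting that PPMark builds on can be illustrated with a short sketch. The snippet below uses scikit-learn as an assumed stand-in and invented toy policy sentences; PPMark's own pipeline additionally applies Rabin-Karp string matching, which is not shown here.

    # Minimal TF-IDF sketch over toy privacy-policy snippets; the
    # highest-weighted term per snippet is a crude proxy for a privacy label.
    from sklearn.feature_extraction.text import TfidfVectorizer

    policies = [
        "we collect your email address and share it with third parties",
        "we collect location data to personalize advertising",
        "cookies are used to remember your preferences",
    ]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(policies)

    terms = vectorizer.get_feature_names_out()
    for i, row in enumerate(tfidf.toarray()):
        print(i, terms[row.argmax()])
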
If You Can't Measure It, You Can't Manage It: Towards Quantification of Privacy Policies. M. Alohaly; H. Takabi. 2016, IEEE.
Domain: Law; data category: Privacy Policies; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Handling of sensitive data by applications; solution: Generation of Artefacts based on Privacy Policy Analysis; NLP method: Semantic Analysis; NLP concept: Rule based & Statistical.
Abstract: Despite the best efforts of usable privacy community to ease the burden of understanding privacy policies, there is still a lack in having a notice display that helps users to compare different application alternatives based on their data gathering practices. In this work, we propose...

Extracting keyword and keyphrase from online privacy policies. Audich, Dhiren A; Dara, Rozita; Nonnecke, Blair. 2016, Google Scholar.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Information Extraction from privacy policies; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: One of the key components of constructing an ontology is a taxonomy. Creating a comprehensive taxonomy involves extracting keywords and keyphrases from the domain corpus. It is a time consuming endeavour that involves domain expertise and syntactic and structural kno...

Automatic Summarization of Privacy Policies using Ensemble Learning. Tomuro, N; Lytinen, S; Hornsburg, K. 2016, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Paying more Attention to Privacy Policies; privacy issue: Complexity & Ease the comprehension of Privacy Policies; solution: Information Extraction from privacy policies; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: When customers purchase a product or sign up for service from a company, they often are required to agree to a Privacy Policy or Terms of Service agreement. Many of these policies are lengthy, and a typical customer agrees to them without reading them carefully if at all. To...

Ambiguity in Privacy Policies and the Impact of Regulation. Reidenberg, JR; Bhatia, J; Breaux, TD; Norton, TB. 2016, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Mapping of Ambiguity on a score board; NLP method: Semantic Analysis; NLP concept: Rule based & Statistical.
Abstract: Website privacy policies often contain ambiguous language that undermines the purpose and value of privacy notices for site users. This paper compares the impact of different regulatory models on the ambiguity of privacy policies in multiple online sectors. First, the paper dev...

Personal privacy protection in time of big data. Sokolova, Marina; Matwin, Stan. 2016, Google Scholar.
Domain: Other; data category: patient health information (PHI); use case: protecting sensitive Information in unstructured text; privacy issue: Data Linkage; solution: Overview of Challenges and applied methods for protection of personal health information (PHI); NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Personal privacy protection increasingly becomes a story of privacy protection in electronic data format. Personal privacy protection also becomes a showcase of advantages and challenges of Big Data phenomenon. Accumulation of massive data volumes combined with deve...

Privacy-preserving sound to degrade automatic speaker verification performance. Hashimoto, Kei; Yamagishi, Junichi; Echizen, Isao. 2016, Google Scholar.
Domain: General; data category: Speech Data; use case: Voice protection in public; privacy issue: Identification; solution: Speech De-Identification; NLP method: Speech Processing; NLP concept: Statistical.
Abstract: In this paper, a privacy protection method to prevent speaker identification from recorded speech is proposed and evaluated. Although many techniques for preserving various private information included in speech have been proposed, their impacts on human speech commun...

A Privacy Guard Service. X. Ye; F. Wang. 2017, IEEE.
Domain: Other; data category: complex sensitive information / unstructured data; use case: protecting sensitive Information in unstructured text data; privacy issue: Unintended disclosure; solution: Automatic Conversion of sensitive information in text; NLP method: Speech Processing & Semantic Analysis; NLP concept: Statistical.
Abstract: Mobile devices are changing the way that people conduct their daily businesses and carry out their works. More and more people are using mobile devices to view personal or work-related information in public places, e.g. café, train, etc. This type of information might contain s...

A cascaded approach for Chinese clinical text de-identification with less annotation effort. Jian, Z; Guo, XS; Liu, SJ; Ma, HD; Zhang, SD; Zhang, R; Lei, JB. 2017, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / free clinical text; use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis; NLP concept: Rule based & Statistical.
Abstract: With rapid adoption of Electronic Health Records (EHR) in China, an increasing amount of clinical data has been available to support clinical research. Clinical data secondary use usually requires de-identification of personal information to protect patient privacy. Since manually...

I. Calapodescu;
Discharge D.
Summaries
Patient
Rozier;
medical
S.with
Artemova;
Natural
recordsJ.Language
represent
Bosson aProcessing:
2017
very IEEE
rich and
A Case-Study
important source
of
Medicine
Performance
of information
andpatient
for
Real-World
clinical
health
research.
Usability
information
Still, (PHI)
this data
Privacy
/ discharge
cannot
Preserving
be
summaries
usedInformation
directly for research
Sharing
Secondary
purposes,
use as
of these
unstructured
documents
De-identification
clinical
contain
data (Research)
highly-sensitive personal
Semanticinformation
Analysis protectedStatistical
by the law. In this paper, we evaluate
De-Identification of Medical Narrative Data. Foufi, V; Gaudet-Blavignac, C; Chevrier, R; Lovis, C. 2017, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / clinical narratives; use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Morphological Analysis; NLP concept: Rule based.
Abstract: Maintaining data security and privacy in an era of cybersecurity is a challenge. The enormous and rapidly growing amount of health-related data available today raises numerous questions about data collection, storage, analysis, comparability and interoperability but also about...

The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge. Bui, DDA; Wyatt, M; Cimino, JJ. 2017, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / clinical narratives; use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Morphological Analysis; NLP concept: Rule based.
Abstract: Clinical narratives (the text notes found in patients' medical records) are important information sources for secondary use in research. However, in order to protect patient privacy, they must be de-identified prior to use. Manual de-identification is considered to be the gold standa...

De-identification of medical records using conditional random fields and long short-term memory networks. Jiang, Zhipeng; Zhao, Chao; He, Bin; Guan, Yi; Jiang, Jingchi. 2017, ScienceDirect.
Domain: Medicine; data category: patient health information (PHI) / Psychiatry Notes; use case: protecting patient privacy in unstructured clinical text; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis; NLP concept: Statistical & NN.
Abstract: The CEGS N-GRID 2016 Shared Task 1 in Clinical Natural Language Processing focuses on the de-identification of psychiatric evaluation records. This paper describes two participating systems of our team, based on conditional random fields (CRFs) and long short-term mem...

A natural language processing challenge for clinical records: Research Domains Criteria (RDoC) for psychiatry. Uzuner, Ozlem; Stubbs, Amber; Filannino, Michele. 2017, Google Scholar.
Domain: Medicine; data category: patient health information (PHI); use case: protecting patient privacy in unstructured clinical text; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Morphological Analysis; NLP concept: Rule based.

PHIs (Protected Health Information) identification from free text clinical records based on machine learning. K. Rajput; G. Chetty; R. Davey. 2017, IEEE.
Domain: Medicine; data category: patient health information (PHI) / Electronic Health Records (EHR); use case: protecting patient privacy in unstructured clinical text; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Rule based & Statistical & NN.
Abstract: To preserve patient confidentiality, there is a need to identify PHIs (Protected Health Information) from free text clinical records, and such sensitive information must either be removed or replaced. Identification of the PHI's are normally performed manually on large sets of s...

Privacy-Aware Data Analysis Middleware for Data-Driven EHR Systems. Thien-An Nguyen; Nhien-An Le-Khac; M-Tahar Kechadi. 2017, Springer.
Domain: Medicine; data category: patient health information (PHI) / Electronic Health Records (EHR); use case: Mining according to privacy policies; privacy issue: Disclosure of sensitive data for analysis; solution: Designing Secure Views for privacy preserving data analysis; NLP method: Morphological Analysis; NLP concept: Rule based.
Abstract: Privacy preservation is an essential requirement for information systems and also it is regulated by law. However, existing solutions for privacy protection during data analysis have some limitations when applied to data-driven electronic health record (EHR) systems such as da...

Detecting Protected Health Information in Heterogeneous Clinical Notes. Henriksson, A; Kvist, M; Dalianis, H. 2017, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / clinical notes; use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: To enable secondary use of healthcare data in a privacy-preserving manner, there is a need for methods capable of automatically identifying protected health information (PHI) in clinical text. To that end, learning predictive models from labeled examples has emerged as a prom...

Protecting privacy in the archives: Preliminary explorations of topic modeling for born-digital collections. T. Hutchinson. 2017, IEEE.
Domain: General; data category: complex sensitive information / unstructured data; use case: Protecting Sensitive Information on Social Media Platforms; privacy issue: Sensitive Information in unstructured Data; solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Natural language processing (NLP) is an area of increased interest for digital archivists, although most research to date has focused on digitized rather than born-digital collections. This study in progress explores whether NLP techniques can be used effectively to surface docu...

Social media data sensitivity and privacy scanning an experimental analysis with hadoop. A. Lokhande; S. Bansal. 2017, IEEE.
Domain: Social Network; data category: User Generated Content (UGC) / tweets; use case: Protecting Sensitive Information on Social Media Platforms; privacy issue: Unintended data disclosure; solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical.
Abstract: Now in these days the social network has becomes a daily habit for all. Most of the young and teenager are consuming their time on social media. Due to frequent reachability of users the different marketing companies are also usages this platform for publishing advertises. Bu...

An Evaluation of Constituency-Based Hyponymy Extraction from Privacy Policies. M. C. Evans; J. Bhatia; S. Wadkar; T. D. Breaux. 2017, IEEE.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Detection of vagueness; NLP method: Morphological Analysis; NLP concept: Rule based.
Abstract: Requirements analysts can model regulated data practices to identify and reason about risks of non-compliance. If terminology is inconsistent or ambiguous, however, these models and their conclusions will be unreliable. To study this problem, we investigated an approach to a...

Learning Data Privacy and Terms of Service from Different Cloud Service Providers. M. Bahrami; M. Singhal; W. Chen. 2017, IEEE.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Information Extraction from privacy policies; NLP method: Semantic Analysis; NLP concept: Rule based.
Abstract: People are using on daily basis websites, mobile applications, and online software nowadays. Major mobile and desktop software applications have been moved to cloud computing environment that allows users to interact with a variety of applications on the move and pay only...

Large-scale readability analysis of privacy policies. Fabian, Benjamin; Ermakova, Tatiana; Lentz, Tino. 2017, Google Scholar.
Domain: Law; data category: Privacy Policies; use case: Paying more Attention to Privacy Policies; privacy issue: Complexity & Ease the comprehension of Privacy Policies; solution: Information Extraction from privacy policies; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical.
Abstract: Online privacy policies notify users of a Website how their personal information is collected, processed and stored. Against the background of rising privacy concerns, privacy policies seem to represent an influential instrument for increasing customer trust and loyalty. However,...

Enhancing sensitivity classification with semantic features using word embeddings. McDonald, Graham; Macdonald, Craig; Ounis, Iadh. 2017, Google Scholar.
Domain: Other; data category: Government Data; use case: Privacy Preserving Information Sharing; privacy issue: Sensitive Information in unstructured Data; solution: Semantic features using word embeddings for classification; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Government documents must be reviewed to identify any sensitive information they may contain, before they can be released to the public. However, traditional paper-based sensitivity review processes are not practical for reviewing born-digital documents. Therefore, there is a...

How Well Can WordNet Measure Privacy: A Comparative Study? N. Zhu; M. Zhang; D. Feng; J. He. 2017, IEEE.
Domain: General; data category: User Generated Content (UGC); use case: Ease the automation process of privacy regulating documents; privacy issue: Compliance with Requirements; solution: Semantic Similarities Tool Comparison; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Privacy is a fundamental issue in big data. Meanwhile, determining semantic relationships between words and phrases in privacy is required for effective privacy protection to the data that originates from a variety of sources, a main characteristic of big data. WordNet has been...

Semantic Information Retrieval from Patient Summaries. Sicuranza, M; Esposito, A; Ciampi, M. 2017, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / Patient Summary; use case: protecting patient privacy in unstructured clinical text; privacy issue: Disclosure of sensitive data for analysis; solution: Semantic-centered Rules; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: This paper presents a novel architectural approach for the extraction of strictly necessary information from free clinical text documents. These clinical documents, synthesizing the medical history of a patient, contain heterogeneous information, characterized by differen...

Influence of speaker de-identification in depression detection. Lopez-Otero, Paula; Magariños, Carmen; Docio-Fernandez, Laura; Rodriguez-Banga, Eduardo; Erro, Daniel; Garcia-Mateo, Carmen. 2017, Wiley.
Domain: Medicine; data category: Speech Data; use case: Voice-Based Healthcare in a private manner; privacy issue: Identification; solution: Speech De-Identification; NLP method: Speech Processing & Morphological Analysis; NLP concept: Statistical.
Abstract: Depression is a common mental disorder that is usually addressed by outpatient treatments that favour patients' inclusion in the society. This raises the need for tools to remotely monitor the emotional state of the patients, which can be carried out via telephone or the Internet u...

Voicemask: Anonymize and sanitize voice input on mobile devices. Qian, Jianwei; Du, Haohua; Hou, Jiahui; Chen, Linlin; Jung, Taeho; Li, Xiang-Yang; Wang, Yu; Deng, Yanbo. 2017, Google Scholar.
Domain: Other; data category: Speech Data; use case: Voice-based Service in a private manner; privacy issue: Identification; solution: Speech de-identification; NLP method: Speech Processing; NLP concept: NN.
Abstract: Voice input has been tremendously improving the user experience of mobile devices by freeing our hands from typing on the small screen. Speech recognition is the key technology that powers voice input, and it is usually outsourced to the cloud for the best performance. How...

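Voice de-identification of this kind perturbs speaker characteristics such as pitch before the audio leaves the device. A minimal sketch using librosa follows; the library choice, file names, and shift amount are assumptions for illustration, and Voicemask's actual transformation is more elaborate and designed to resist reversal.

    # Pitch-shifting sketch of the voice-perturbation idea (librosa >= 0.10
    # keyword-argument API assumed; "input.wav" is a placeholder path).
    import librosa
    import soundfile as sf

    y, sr = librosa.load("input.wav", sr=None)
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=3.5)  # ~3.5 semitones up
    sf.write("anonymized.wav", y_shifted, sr)
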
A Data Purpose Case Study of Privacy Policies. J. Bhatia; T. D. Breaux. 2017, IEEE.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Misusage of Sensitive Information by Data Collector or Adversary; solution: Suggestion of security and privacy requirements; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Privacy laws and international privacy standards require that companies collect only the data they have a stated purpose for, called collection limitation. Furthermore, these regimes prescribe that companies will not use data for purposes other than the purposes for which they w...

Privacy Matters: Detecting Nocuous Patient Data Exposure in Online Physician Reviews. Baumer, FS; Grote, N; Kersting, J; Geierhos, M. 2017, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / Physician Notes; use case: protecting patient privacy in unstructured clinical text; privacy issue: Unintended data disclosure; solution: Identification; NLP method: Semantic Analysis; NLP concept: Rule based.
Abstract: Consulting a physician was long regarded as an intimate and private matter. The physician-patient relationship was perceived as sensitive and trustful. Nowadays, there is a change, as medical procedures and physicians consultations are reviewed like other services on the Int...

Privacy Preserving Information Sharing in Modern and Emerging Platforms. Tian, Yuan. 2018, Google Scholar.
Domain: Law; data category: Privacy Policies; use case: Privacy Preserving Information Sharing; privacy issue: Complexity of Privacy Policies; solution: Automated Access Control; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Rule based & Statistical & NN.
Abstract: Users share a large amount of information with modern platforms such as web platforms and social platforms for various services. However, they face the risk of information leakage because modern platforms still lack proper security policies. Existing security policies, such as...

Privacy and Security Issues Due to Permissions Glut in Android System. N. Chiluka; A. K. Singh; R. Eswarawaka. 2018, IEEE.
Domain: Mobile Application; data category: User Generated Content (UGC) / Code; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Handling of sensitive data by applications; solution: Code-based privacy analysis of Smart phone apps; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Rule based & Statistical & NN.
Abstract: The security and privacy issues are very paramount of any kind these days. Android operating system supported with the broad range of applications which deals with data which holds personal data to useless data within the devices, where privacy and security issues raise for...

FlowCog: Context-aware Semantics Extraction and Analysis of Information Flow Leaks in Android Apps. Pan, X; Cao, YZ; Du, XC; He, BY; Fang, G; Chen, Y. 2018, Web of Science.
Domain: Mobile Application; data category: complex sensitive information / unstructured data; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Handling of sensitive data by applications; solution: Code-based privacy analysis of Smart phone apps; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Rule based & Statistical & NN.
Abstract: Android apps having access to private information may be legitimate, depending on whether the app provides users enough semantics to justify the access. Existing works analyzing app semantics are coarse-grained, staying on the app-level. That is, they can only identify whe...

Finding Clues for Your Secrets: Semantics-Driven, Learning-Based Privacy Discovery in Mobile Apps. Nan, Yuhong; Yang, Zhemin; Wang, Xiaofeng; Zhang, Yuan; Zhu, Donglai; Yang, Min. 2018, Google Scholar.
Domain: Mobile Application; data category: User Generated Content (UGC) / user profiles; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Handling of sensitive data by applications; solution: Code-based privacy analysis of web applications; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical.
Abstract: A long-standing challenge in analyzing information leaks within mobile apps is to automatically identify the code operating on sensitive data. With all existing solutions relying on System APIs (e.g., IMEI, GPS location) or features of user interfaces (UI), the content from app serv...

Protecting Privacy in the Archives: Supervised Machine Learning and Born-Digital Records. T. Hutchinson. 2018, IEEE.
Domain: General; data category: complex sensitive information / unstructured data; use case: Annotations with Gold Standard Annotations; privacy issue: Sensitive Information in unstructured Data; solution: Collect annotated corpus; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: This paper documents the iterations attempted in developing training sets for supervised machine learning relating to identification of documents relating to human resources and containing personal information. Overall, these results show promise, although we have so far bee...

Query-Based Linked Data Anonymization. Remy Delanaux; Angela Bonifati; Marie-Christine Rousset; Romuald Thion. 2018, Springer.
Domain: Law; data category: Privacy Policies; use case: Privacy Preserving Information Sharing; privacy issue: Data Linkage; solution: De-identification; NLP method: Semantic Analysis; NLP concept: Rule based.
Abstract: We introduce and develop a declarative framework for privacy-preserving Linked Data publishing in which privacy and utility policies are specified as SPARQL queries. Our approach is data-independent and leads to inspect only the privacy and utility policies in order to determi...

Redaction of Protected Health Information in EHRs using CRFs and Bi-directional LSTMs. A. Madan; A. M. George; A. Singh; M. P. S. Bhatia. 2018, IEEE.
Domain: Medicine; data category: patient health information (PHI) / Electronic Health Records (EHR); use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis; NLP concept: Statistical & NN.
Abstract: This paper describes the de-identification of personally identifiable information (PIIs) in electronic health records (EHRs) using two models of conditional random fields (CRFs) and bidirectional long short term memory networks (LSTMs). Most medical records store private inform...

Numerical Age Variations within Clinical Notes: The Potential Impact on De-Identification and Information Extraction. D. Hanauer; Q. Mei; V. G. V. Vydiswaran; K. Singh; Z. Landis-Lewis; C. Weng. 2018, IEEE.
Domain: Medicine; data category: patient health information (PHI) / Electronic Health Records (EHR); use case: protecting patient privacy in unstructured clinical text; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Many kinds of numbers and numerical concepts appear frequently in free text clinical notes from electronic health records, including patient ages. The variability in how ages are described may impact the success of information extraction strategies as well as the accuracy of de...

Processing Text for Privacy: An Information Flow Perspective. Natasha Fernandes; Mark Dras; Annabelle McIver. 2018, Springer.
Domain: Other; data category: User Generated Content (UGC); use case: protecting sensitive Information in unstructured text; privacy issue: Identification; solution: De-identification / Obfuscating Authorship; NLP method: Speech Processing; NLP concept: Statistical & NN.
Abstract: The problem of text document obfuscation is to provide an automated mechanism which is able to make accessible the content of a text document without revealing the identity of its writer. This is more challenging than it seems, because an adversary equipped with powerful m...

Privacy preserving, crowd sourced crime Hawkes processes. Mohler, George; Brantingham, P Jeffrey. 2018, Google Scholar.
Domain: Social Network; data category: User Generated Content (UGC); use case: Privacy Preserving Information Sharing; privacy issue: Identification; solution: Designing a framework for sharing of sensitive information; NLP method: Semantic Analysis; NLP concept: Rule based & Statistical.
Abstract: Social sensing plays an important role in crime analytics and predictive policing. When humans play the role of sensor, several issues around privacy and trust emerge that must be carefully handled. We provide a framework for deploying predictive crime models based upon cr...

Privacy preserving network security data analytics: Architectures and system design. DeYoung, Mark E; Kobezak, Philip; Raymond, David Richard; Marchany, Randolph C; Tront, Joseph G. 2018, Google Scholar.
Domain: General; data category: complex sensitive information / network data; use case: Privacy Preserving Information Sharing; privacy issue: Disclosure of sensitive Data; solution: Designing a system that supports privacy preserving data publication of network security data; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Rule based & Statistical.
Abstract: An incessant rhythm of data breaches, data leaks, and privacy exposure highlights the need to improve control over potentially sensitive data. History has shown that neither public nor private sector organizations are immune. Lax data handling, incidental leakage, and adversa...

Detecting Complex Sensitive Information via Phrase Structure in Recursive Neural Networks. Jan Neerbek; Ira Assent; Peter Dolog. 2018, Springer.
Domain: General; data category: complex sensitive information / unstructured data; use case: Protecting sensitive Information in unstructured text; privacy issue: Sensitive Information in unstructured Data; solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: State-of-the-art sensitive information detection in unstructured data relies on the frequency of co-occurrence of keywords with sensitive seed words. In practice, however, this may fail to detect more complex patterns of sensitive information. In this work, we propose learning ph...

Anonymization of Unstructured Data via Named-Entity Recognition. Fadi Hassan; Josep Domingo-Ferrer; Jordi Soria-Comas. 2018, Springer.
Domain: General; data category: complex sensitive information / unstructured data; use case: protecting sensitive Information in unstructured text data; privacy issue: Unintended data disclosure; solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: The anonymization of structured data has been widely studied in recent years. However, anonymizing unstructured data (typically text documents) remains a highly manual task and needs more attention from researchers. The main difficulty when dealing with unstructured data...

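Named-entity-recognition-driven anonymization can be sketched in a few lines, assuming spaCy and its small English model as the tooling; the cited approach goes further and generalizes detected entities instead of merely masking them.

    # NER-driven anonymization sketch (assumes spaCy and en_core_web_sm are
    # installed). Detected entity spans are replaced by their type labels.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def anonymize(text: str) -> str:
        doc = nlp(text)
        out = text
        for ent in reversed(doc.ents):  # right-to-left keeps char offsets valid
            out = out[:ent.start_char] + f"[{ent.label_}]" + out[ent.end_char:]
        return out

    print(anonymize("John Smith moved from Berlin to Paris in March 2019."))
    # e.g. "[PERSON] moved from [GPE] to [GPE] in [DATE]."
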
P-GENT: Privacy-preserving geocoding of non-geotagged tweets. Wang, Shuo; Sinnott, Richard; Nepal, Surya. 2018, Google Scholar.
Domain: General; data category: User-generated content (UGC) / Tweets; use case: Protecting Sensitive Information on Social Media Platforms; privacy issue: Disclosure of sensitive data for analysis; solution: Detection of Locations without user involvement; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical.
Abstract: With the widespread proliferation of location-aware devices and social media applications, more and more people share information on location-based social networks such as Twitter. Such data can be beneficial to better plan and manage individual's activities and other social...

Towards an Application-Based Notion of Anomalous Privacy Behavior in Android Applications. M. M. Al Sobeihy. 2018, IEEE.
Domain: Mobile Application; data category: Privacy Policies & Code; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Complexity of Privacy Policies; solution: Detection of potential privacy leaks in the privacy policies of companies; NLP method: Semantic Analysis; NLP concept: Statistical & NN.
Abstract: Detecting privacy leakage in Android applications has gained much focus in recent years. Many solutions have been introduced to address this problem. Though solutions differ, they generally identify privacy pitfalls merely on the existence of sensitive data transmissions out of u...

An Investigation Into Android Run-Time Permissions from the End Users' Perspective. G. L. Scoccia; S. Ruberto; I. Malavolta; M. Autili; P. Inverardi. 2018, IEEE.
Domain: Mobile Application; data category: User Generated Content (UGC) / user reviews; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Handling of sensitive data by applications; solution: Detection of the Correlation between User Reviews and Privacy Issues; NLP method: Morphological Analysis; NLP concept: Statistical.
Abstract: To protect the privacy of end users from intended or unintended malicious behaviour, the Android operating system provides a permissions-based security model that restricts access to privacy relevant parts of the platform. Starting with Android 6, the permission system has be...

Modeling Security and Privacy Requirements: a Use Case-Driven Approach. Mai, PX; Goknil, A; Shar, LK; Pastore, F; Briand, LC; Shaame, S. 2018, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Definition of Privacy Requirements; privacy issue: Compliance with Requirements; solution: Generation of Artefacts based on Privacy Policy Analysis; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Modern internet-based services, ranging from food-delivery to home-caring, leverage the availability of multiple programmable devices to provide handy services tailored to end-user needs. These services are delivered through an ecosystem of device-specific software...

SynthNotes: A Generator Framework for High-volume, High-fidelity Synthetic Mental Health Notes. E. Begoli; K. Brown; S. Srinivas; S. Tamang. 2018, IEEE.
Domain: Medicine; data category: Synthetic Data Generation; use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: Generation of Synthetic Data; NLP method: Morphological Analysis; NLP concept: Statistical.
Abstract: One of the key, emerging challenges that connects the "Big Data" and the AI domain is the availability of sufficient volumes of training data for AI/Machine Learning tasks. SynthNotes is a framework for generating standards-compliant, realistic mental health progress report note...

Email security level classification of imbalanced data using artificial neural network: The real case in a world-leading enterprise. Huang, Jen-Wei; Chiang, Chia-Wen; Chang, Jia-Wei. 2018, ScienceDirect.
Domain: Other; data category: User Generated Content (UGC) / emails; use case: protecting sensitive Information in unstructured text data; privacy issue: Unintended data disclosure; solution: Information Extraction from Emails; NLP method: Morphological Analysis; NLP concept: Statistical & NN.
Abstract: Email is far more convenient than traditional mail in the delivery of messages. However, it is susceptible to information leakage in business. This problem can be alleviated by classifying emails into different security levels using text mining and machine learning technology. In th...

PrivacyGuide: Towards an Implementation of the EU GDPR on Internet Privacy Policy Evaluation. Tesfay, WB; Hofmann, P; Nakamura, T; Kiyomoto, S; Serna, J. 2018, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Information Extraction from privacy policies; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Nowadays Internet services have dramatically changed the way people interact with each other and many of our daily activities are supported by those services. Statistical indicators show that more than half of the world's population uses the Internet generating about 2.5 quinti...

How Dangerous Permissions are Described in Android Apps' Privacy Policies? Baalous, R; Poet, R. 2018, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Complexity of Privacy Policies; solution: Information Extraction from privacy policies; NLP method: Morphological Analysis; NLP concept: Statistical.
Abstract: Google requires Android apps which handle users' personal data such as photos and contacts information to post a privacy policy which describes comprehensively how the app collects, uses and shares users' information. Unfortunately, while knowing why the app wants to ac...

RECIPE: Applying Open Domain Question Answering to Privacy Policies. Shvartzshanider, Y; Balashankar, A; Wies, T; Subramanian, L. 2018, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Paying more Attention to Privacy Policies; privacy issue: Compliance with Requirements & Ease the comprehension of Privacy Policies; solution: Mapping of Privacy Policies to Contextual Integrity (CI); NLP method: Semantic Analysis; NLP concept: NN.
Abstract: We describe our experiences in using an open domain question answering model (Chen et al., 2017) to evaluate an out-of-domain QA task of assisting in analyzing privacy policies of companies. Specifically, Relevant CI Parameters Extractor (RECIPE) seeks to answer questio...

KnIGHT: Mapping Privacy Policies to GDPR. Najmeh Mousavi Nejad; Simon Scerri; Jens Lehmann. 2018, Springer.
Domain: Law; data category: Privacy Policies; use case: Paying more Attention to Privacy Policies; privacy issue: Complexity & Ease of the comprehension of Privacy Policies; solution: Mapping of Privacy Policies to GDPR; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Although the use of apps and online services comes with accompanying privacy policies, a majority of end-users ignore them due to their length, complexity and unappealing presentation, despite the potential risks. In light of the, now enforced EU-wide, General Data Protection Re...

Stories unfold: Issues and challenges on data privacy using latent dirichlet allocation algorithm. Cinco, Jeffrey C. 2018, Google Scholar.
Domain: Other; data category: complex sensitive information; use case: protecting sensitive Information in unstructured text; privacy issue: Unintended data disclosure; solution: Overview over the research field; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: An emerging trend in the cyberspace is called data privacy that is often unattended and ignored by Netizens. This paper looks into available documents found in journals, news articles and published reports to understand and uncover hidden issues and challenges about data p...

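Latent Dirichlet Allocation, as used in the study above, can be sketched with scikit-learn as an assumed stand-in for the authors' tooling; the four documents below are invented toys.

    # LDA sketch: fit two topics over toy documents and print each topic's
    # three strongest words.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "data breach exposed user passwords",
        "breach leaked email passwords",
        "new privacy law regulates consent",
        "regulators enforce consent under the law",
    ]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    terms = vec.get_feature_names_out()
    for k, comp in enumerate(lda.components_):
        top = [terms[i] for i in comp.argsort()[-3:][::-1]]
        print(f"topic {k}:", ", ".join(top))
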
PrivOnto: A semantic framework for the analysis of privacy policies. Oltramari, A; Piraviperumal, D; Schaub, F; Wilson, S; Cherivirala, S; Norton, TB; Russell, NC; Story, P; Reidenberg, J; Sadeh, N. 2018, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Paying more Attention to Privacy Policies; privacy issue: Complexity & Ease of the comprehension of Privacy Policies; solution: Semantic Framework for the Analysis of Privacy Policies; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Privacy policies are intended to inform users about the collection and use of their data by websites, mobile apps and other services or appliances they interact with. This also includes informing users about any choices they might have regarding such data practices. However, fe...

Semantic Incompleteness in Privacy Policy Goals. J. Bhatia; T. D. Breaux. 2018, IEEE.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Semantic Incompleteness Detection in Privacy Policy Goals; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Companies that collect personal information online often maintain privacy policies that are required to accurately reflect their data practices and privacy goals. To be comprehensive and flexible for future practices, policies contain ambiguity that summarizes practices over multip...

Semantic Inference from Natural Language Privacy Policies and Android Code. Hosseini, MB. 2018, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Semantic Inference from Privacy Policies; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Mobile apps collect different categories of personal information to provide users with various services. Companies use the privacy policies containing critical requirements to inform users about their data practices. With the growing access to personal information and the scale of m...

Towards privacy-preserving speech data publishing. Qian, Jianwei; Han, Feng; Hou, Jiahui; Zhang, Chunhong; Wang, Yu; Li, Xiang-Yang. 2018, Google Scholar.
Domain: General; data category: Speech Data; use case: Privacy Preserving Information Sharing; privacy issue: Identification; solution: Speech de-identification; NLP method: Speech Processing & Morphological Analysis & Semantic Analysis; NLP concept: Rule based & Statistical & NN.
Abstract: Privacy-preserving data publishing has been a heated research topic in the last decade. Numerous ingenious attacks on users' privacy and defensive measures have been proposed for the sharing of various data, varying from relational data, social network data, spatiotempora...

Towards Smart Citizen Security Based on Speech Recognition. J. Roa; G. Jacob; L. Gallino; P. C. K. Hung. 2018, IEEE.
Domain: General; data category: Speech Data; use case: protecting sensitive Information in unstructured text; privacy issue: Identification; solution: Speech de-identification; NLP method: Speech Processing; NLP concept: Statistical & NN.
Abstract: Citizen security is a fundamental determinant of the well-being of households and communities. Robberies, rapes, domestic violence, gender violence, kidnapping, among others, are examples of situations of insecurity having a negative effect on citizens and governments worl...

Extraction of Natural Language Requirements from Breach Reports Using Event Inference. H. Guo; Ö. Kafali; M. Singh. 2018, IEEE.
Domain: Other; data category: complex sensitive information / unstructured data; use case: Definition of Privacy Requirements; privacy issue: Sensitive Information in unstructured Data; solution: Suggestion of security and privacy requirements; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: We address the problem of extracting useful information contained in security and privacy breach reports. A breach report tells a short story describing how a breach happened and the follow-up remedial actions taken by the responsible parties. By predicting sentences that ma...

VACCINE: Using Contextual Integrity For Data Leakage Detection. Shvartzshnaider, Y; Pavlinovic, Z; Balashankar, A; Wies, T; Subramanian, L; Nissenbaum, H; Mittal, P. 2019, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Automated and Flexible Rule Enforcement; privacy issue: Compliance with Requirements; solution: Automated Compliance Checker; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Modern enterprises rely on Data Leakage Prevention (DLP) systems to enforce privacy policies that prevent unintentional flow of sensitive information to unauthorized entities. However, these systems operate based on rule sets that are limited to syntactic analysis and therefor...

Automatic Anonymization of Textual Documents: Detecting Sensitive Information via Word Embeddings. F. Hassan; D. Sánchez; J. Soria-Comas; J. Domingo-Ferrer. 2019, IEEE.
Domain: General; data category: complex sensitive information / unstructured data; use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: Automatic Anonymization of Textual Documents; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Data sharing is key in a wide range of activities but raises serious privacy concerns when the data contain personal information. Anonymization mechanisms provide ways to transform the data so that identities and/or sensitive data are not disclosed (i.e., data are no longer per...

Revealing the unrevealed: Mining smartphone users privacy perception on app markets. Hatamian, M; Serna, J; Rannenberg, K. 2019, Web of Science.
Domain: Mobile Application; data category: User Generated Content (UGC) / app reviews; use case: Ease the comprehension of privacy related user reviews for developers; privacy issue: Complexity of Privacy Policies; solution: Automatic Summarization of User Reviews; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Popular smartphone apps may receive several thousands of user reviews containing statements about apps' functionality, interface, user-friendliness, etc. They sometimes also comprise privacy relevant information that can be extremely helpful for app developers to better und...

A Multilateral Privacy Impact Analysis Method for Android Apps. Majid Hatamian; Nurul Momen; Lothar Fritsch; Kai Rannenberg. 2019, Springer.
Domain: Law; data category: Privacy Policies; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Complexity of Privacy Policies; solution: Code-based privacy analysis of Smart phone apps; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Smartphone apps have the power to monitor most of people's private lives. Apps can permeate private spaces, access and map social relationships, monitor whereabouts and chart people's activities in digital and/or real world. We are therefore interested in how much informat...

ESARA: A Framework for Enterprise Smartphone Apps Risk Assessment. Hatamian, M; Pape, S; Rannenberg, K. 2019, Web of Science.
Domain: Mobile Application; data category: complex sensitive information / unstructured data; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Handling of sensitive data by applications; solution: Code-based privacy analysis of Smart phone apps; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Protecting enterprise's confidential data and infrastructure against adversaries and unauthorized accesses has been always challenging. This gets even more critical when it comes to smartphones due to their mobile nature which enables them to have access to a wide range o...

The case for a GDPR-specific annotated dataset of privacy policies. Gallé, Matthias; Christofi, Athena; Elsahar, Hady. 2019, Google Scholar.
Domain: Law; data category: Privacy Policies; use case: Annotation Revision; privacy issue: Complexity of Privacy Policies; solution: Collect annotated corpus; NLP method: Semantic Analysis & Morphological Analysis & Syntactic Analysis; NLP concept: Rule based.
Abstract: In this position paper we analyse the pros and contras for the need of a dataset of privacy policies, annotated with GDPR-specific elements. We revise existing related data-sets and provide an analysis of how they could be augmented in order to facilitate machine learning tech...

On GDPR Compliance of Companies' Privacy Policies. Nicolas M. Müller; Daniel Kowatsch; Pascal Debus; Donika Mirdita; Konstantin Böttinger. 2019, Springer.
Domain: Law; data category: Privacy Policies; use case: Collection of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Collect annotated corpus; NLP method: Semantic Analysis; NLP concept: Statistical & NN.
Abstract: We introduce a data set of privacy policies containing more than 18,300 sentence snippets, labeled in accordance to five General Data Protection Regulation (GDPR) privacy policy core requirements. We hope that this data set will enable practitioners to analyze and detect pol...

Towards privacy preserving unstructured big data publishing. Mehta, B; Rao, UP; Gupta, R; Conti, M. 2019, Web of Science.
Domain: General; data category: complex sensitive information / unstructured data; use case: Privacy Preserving Information Sharing; privacy issue: Data Linkage; solution: De-identification; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Various sources and sophisticated tools are used to gather and process the comparatively large volume of data or big data that sometimes leads to privacy disclosure (at broader or finer level) for the data owner. Privacy preserving data publishing approaches such as k-anonym...

Named Entity Recognition in Clinical Text Based on Capsule-LSTM for Privacy Protection. Changjian Liu; Jiaming Li; Yuhan Liu; Jiachen Du; Buzhou Tang; Ruifeng Xu. 2019, Springer.
Domain: Medicine; data category: patient health information (PHI) / free clinical text; use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Clinical Named Entity Recognition for identifying sensitive information in clinical text, also known as Clinical De-identification, has long been a critical task in medical intelligence. It aims at identifying various types of protected health information (PHI) from clinical text and then rep...

Impact of De-Identification on Clinical Text Classification Using Traditional and Deep Learning Classifiers. Obeid, JS; Heider, PM; Weeda, ER; Matuskowitz, AJ; Carr, CM; Gagnon, K; Crawford, T; Meystre, SM. 2019, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / Electronic Health Records (EHR); use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical & NN.
Abstract: Clinical text de-identification enables collaborative research while protecting patient privacy and confidentiality; however, concerns persist about the reduction in the utility of the de-identified text for information extraction and machine learning tasks. In the context of a deep learn...

An Extensible De-Identification Framework for Privacy Protection of Unstructured Health Information: Creating Sustainable Privacy Infrastructures. Braghin, S; Bettencourt-Silva, JH; Levacher, K; Antonatos, S. 2019, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / free clinical text; use case: Privacy Preserving Information Sharing; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical & NN.
Abstract: The volume of unstructured health records has increased exponentially across healthcare settings. Similarly, the number of healthcare providers that wish to exchange records has also increased and, as a result, de-identification and the preservation of privacy features have be...

A study of deep learning methods for de-identification of clinical notes in cross-institute settings. Yang, X; Lyu, TC; Li, Q; Lee, CY; Bian, J; Hogan, WR; Wu, YH. 2019, Web of Science.
Domain: Medicine; data category: patient health information (PHI) / free clinical text; use case: protecting patient privacy in unstructured clinical text; privacy issue: Secondary use of unstructured clinical data (Research); solution: De-identification; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: NN.
Abstract: Background: De-identification is a critical technology to facilitate the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great efforts in developing methods and corpora for d...

Privacy- and utility-preserving textual analysis via calibrated multivariate perturbations. Feyisetan O., Balle B., Drake T., Diethe T. 2019, SCOPUS.
Domain: General; data category: text location data; use case: Mining according to privacy policies; privacy issue: Profiling; solution: Designing a geo-indistinguishability text perturbation algorithm; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Accurately learning from user data while providing quantifiable privacy guarantees provides an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to carrying out privacy preserving text perturbation using the notion of dχ-p...

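The core perturb-and-project idea behind such calibrated text perturbation is: add noise to a word's embedding and emit the vocabulary word nearest to the noisy point. In the toy sketch below, plain Gaussian noise and invented two-dimensional vectors stand in for the paper's calibrated multivariate dχ-privacy mechanism, so the privacy guarantee of the sketch is not the one proved in the paper.

    # Toy perturb-and-project sketch: noise a word vector, then snap to the
    # nearest vocabulary vector. Gaussian noise is a simplifying stand-in.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = {"london": np.array([1.0, 0.0]),
             "paris":  np.array([0.9, 0.3]),
             "apple":  np.array([0.0, 1.0])}

    def perturb(word: str, scale: float = 0.3) -> str:
        noisy = vocab[word] + rng.normal(0.0, scale, size=2)
        return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - noisy))

    print([perturb("london") for _ in range(5)])  # mostly 'london', sometimes 'paris'

Larger noise scales yield stronger obfuscation (distant words become plausible outputs) at the cost of utility, which is exactly the trade-off the paper calibrates.
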
Privacy Disclosures Detection in Natural-Language Text Through Linguistically-Motivated Artificial Neural Networks. Mehdy, Nuhil; Kennington, Casey; Mehrpouyan, Hoda. 2019, Google Scholar.
Domain: Social Network; data category: User Generated Content (UGC) / emails; use case: Privacy Preserving Information Sharing; privacy issue: Unintended data disclosure; solution: Designing a methodology for providing extra safety precautions; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: NN.
Abstract: An increasing number of people are sharing information through text messages, emails, and social media without proper privacy checks. In many situations, this could lead to serious privacy threats. This paper presents a methodology for providing extra safety precautions without being intrusive to users with...

iCAT: An Interactive Customizable Anonymization Tool. Oqaily, M; Jarraya, Y; Zhang, MY; Wang, LY; Pourzandi, M; Debbabi, M. 2019, Web of Science.
Domain: General; data category: complex sensitive information / unstructured data; use case: Ease the automation process of privacy regulating documents; privacy issue: Compliance with Requirements; solution: Designing a tailor-made data anonymity approach; NLP method: Morphological Analysis; NLP concept: Rule based & Statistical & NN.
Abstract: Today's data owners usually resort to data anonymization tools to ease their privacy and confidentiality concerns. However, those tools are typically ready-made and inflexible, leaving a gap both between the data owner and data users' requirements, and between those require...

Privacy in Text Documents. Dias, M; Ferreira, JC; Maia, R; Santos, P; Ribeiro, R. 2019, Web of Science.
Domain: General; data category: complex sensitive information / unstructured data; use case: protecting sensitive Information in unstructured text; privacy issue: Sensitive Information in unstructured Data; solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: The process of sensitive data preservation is a manual and a semi-automatic procedure. Sensitive data preservation suffers various problems that, in particular, affect the handling of confidential, privacy-sensitive and personal information, such as the identification of sensitive data in docum...

A DNS Tunneling Detection Method Based on Deep Learning Models to Prevent Data Exfiltration. Jiacheng Zhang; Li Yang; Shui Yu; Jianfeng Ma. 2019, Springer.
Domain: Other; data category: complex sensitive information / network data; use case: Protecting sensitive Information in unstructured text; privacy issue: Profiling; solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: DNS tunneling is a typical DNS attack that has been used for stealing information for many years. The stolen data is encoded and encapsulated into the DNS request to evade intrusion detection. The popular detection methods of machine learning use features, such as networ...

A Nlp-based Solution to Prevent from Privacy Leaks in Social Network Posts. Canfora, G; Di Sorbo, A; Emanuele, E; Forootani, S; Visaggio, CA. 2019, Web of Science.
Domain: Social Network; data category: User Generated Content (UGC); use case: Protecting Sensitive Information on Social Media Platforms; privacy issue: Unintended data disclosure; solution: Detecting privacy-sensitive information / activities; NLP method: Morphological Analysis; NLP concept: Rule based & Statistical & NN.
Abstract: Private and sensitive information is often revealed in posts appearing in Social Networks (SN). This is due to the users' willingness to increase their interactions within specific social groups, but also to a poor knowledge about the risks for privacy. We argue that technologies ab...

Detecting Private Information in Large Social Network using mixed Machine Learning Techniques. De Luca, Pasquale. 2019, Google Scholar.
Domain: Social Network; data category: User Generated Content (UGC); use case: Protecting Sensitive Information on Social Media Platforms; privacy issue: Unintended data disclosure; solution: Detecting privacy-sensitive information / activities; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: The violation of privacy, others people or personal, is a very current problem, which concerns not only on the web but also in private life. In the years 1990 it was expected that nowadays, that any routine operation was carried out "manually", and it would be performed through...

Short Text, Large Effect: Measuring the Impact of User Reviews on Android App Security & Privacy. D. C. Nguyen; E. Derr; M. Backes; S. Bugiel. 2019, IEEE.
Domain: Mobile Application; data category: User Generated Content (UGC) / app reviews; use case: Investigating the potential Privacy Impact of smart phone apps; privacy issue: Handling of sensitive data by applications; solution: Detection of the Correlation between User Reviews and Privacy Issues; NLP method: Morphological Analysis; NLP concept: Statistical.
Abstract: Application markets streamline the end-users' task of finding and installing applications. They also form an immediate communication channel between app developers and their end-users in form of app reviews, which allow users to provide developers feedback on their apps.

PolicyLint: Investigating Internal Privacy Policy Contradictions on Google Play. Andow, B; Mahmud, SY; Wang, WY; Whitaker, J; Enck, W; Reaves, B; Singh, K; Xie, T. 2019, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Generation of Artefacts based on Privacy Policy Analysis; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical & NN.
Abstract: Privacy policies are the primary mechanism by which companies inform users about data collection and sharing practices. To help users better understand these long and complex legal documents, recent research has proposed tools that summarize collection and sharing. How...

Automatic Detection and Analysis of DPP Entities in Legal Contract Documents. S. P. Nayak; S. Pasumarthi. 2019, IEEE.
Domain: Law; data category: Privacy Policies; use case: Automated and Flexible Rule Enforcement; privacy issue: Compliance with Requirements; solution: Information Extraction from privacy policies; NLP method: Semantic Analysis; NLP concept: Rule based.
Abstract: Due to introduction of General Data Protection Regulation (GDPR) in EU; all the cloud hosted applications (span across multi geo location) wherein captures the personal data need to first identify data privacy protection (DPP) entities and handle it as per EU norms and regulat...

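Rule-based extraction of data-privacy-protection entities of this kind typically starts from curated lexicons and patterns. The sketch below uses invented toy patterns to show the mechanism; production rule sets are far richer and aligned with GDPR terminology.

    # Toy rule-based pass for data-protection (DPP) entity mentions in a
    # contract clause. The patterns are invented for illustration only.
    import re

    DPP_RULES = {
        "PERSONAL_DATA": r"\b(email address|phone number|ip address)\b",
        "DATA_SUBJECT": r"\b(user|customer|employee)s?\b",
        "RETENTION": r"\bretained? for \d+ (days|months|years)\b",
    }

    clause = "Customer email address is retained for 12 months."
    for label, pat in DPP_RULES.items():
        for m in re.finditer(pat, clause, flags=re.IGNORECASE):
            print(label, "->", m.group(0))
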
Natural language processing for mobile app privacy compliance. Story, Peter; Zimmeck, Sebastian; Ravichander, Abhilasha; Smullen, Daniel; Wang, Ziqi; Reidenberg, Joel; Russell, N Cameron; Sadeh, Norman. 2019, Google Scholar.
Domain: Law; data category: Privacy Policies; use case: Ease the comparison of Privacy Policies; privacy issue: Compliance with Requirements; solution: Information Extraction from privacy policies; NLP method: Semantic Analysis; NLP concept: Statistical.
Abstract: Many Internet services collect a flurry of data from their users. Privacy policies are intended to describe the services' privacy practices. However, due to their length and complexity, reading privacy policies is a challenge for end users, government regulators, and companies. Na...

Automated and Personalized Privacy Policy Extraction Under GDPR Consideration. Cheng Chang; Huaxin Li; Yichi Zhang; Suguo Du; Hui Cao; Haojin Zhu. 2019, Springer.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Information Extraction from privacy policies; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Along with the popularity of mobile devices, people share a growing amount of personal data to a variety of mobile applications for personalized services. In most cases, users can learn their data usage from the privacy policy along with the application. However, current privacy...

Quantifying the effect of in-domain distributed word representations: A study of privacy policies. Kumar, Vinayshekhar Bannihatti; Ravichander, Abhilasha; Story, Peter; Sadeh, Norman. 2019, Google Scholar.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Information Extraction out of Privacy Policies with Word Embeddings; NLP method: Semantic Analysis; NLP concept: Statistical & NN.
Abstract: Privacy policies are documents that describe what data is collected by a website or an app and how that data is handled. Privacy policies are often long and difficult to understand. Recently people have started to turn to Natural Language Processing (NLP) to automatically extr...

Adapting State-of-the-Art Deep Language Models to Clinical Information Extraction Systems: Potentials, Challenges, and Solutions. Zhou, LY; Suominen, H; Gedeon, T. 2019, Web of Science.
Domain: Medicine; data category: patient health information (PHI); use case: protecting patient privacy in unstructured clinical text; privacy issue: Secondary use of unstructured clinical data (Research); solution: Overview of Challenges and applied methods for protection of personal health information (PHI); NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Background: Deep learning (DL) has been widely used to solve problems with success in speech recognition, visual object recognition, and object detection for drug discovery and genomics. Natural language processing has achieved noticeable progress in artificial intelligence...

Towards a Privacy Compliant Cloud Architecture for Natural Language Processing Platforms. Blohm, M; Dukino, C; Kintz, M; Kochanowski, M; Koetter, F; Renner, T. 2019, Web of Science.
Domain: General; data category: complex sensitive information / unstructured data; use case: Automated and Flexible Rule Enforcement; privacy issue: Compliance with Requirements; solution: Overview of challenges of working with personal and particularly sensitive data in practice; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: Natural language processing in combination with advances in artificial intelligence is on the rise. However, compliance constraints while handling personal data in many types of documents hinder various application scenarios. We describe the challenges of working with person...

Privacy-preserving social media forensic analysis for preventive policing of online activities. Naqvi, Syed; Enderby, Sean; Williams, Ian; Asif, Waqar; Rajarajan, Muttukrishnan; Potlog, Cristi; Florea, Monica. 2019, Google Scholar.
Domain: Social Network; data category: User Generated Content (UGC); use case: Protecting Sensitive Information on Social Media Platforms; privacy issue: Disclosure of sensitive data for analysis; solution: Social media forensic analysis for preventive policing of online activities; NLP method: Speech Processing & Semantic Analysis; NLP concept: Statistical.
Abstract: Social media is extensively used nowadays and is gaining popularity among the users with the increasing growth in the network capacity, connectivity, and speed. Moreover, affordable prices of data plans, especially mobile data packages, have considerably increased the use o...

Identifying incompleteness in privacy policy goals using semantic frames. Bhatia, J; Evans, MC; Breaux, TD. 2019, Web of Science.
Domain: Law; data category: Privacy Policies; use case: Ease the comprehension of Privacy Policies; privacy issue: Complexity of Privacy Policies; solution: Semantic Incompleteness Detection in Privacy Policy Goals; NLP method: Semantic Analysis & Morphological Analysis; NLP concept: Statistical.
Abstract: Companies that collect personal information online often maintain privacy policies that are required to accurately reflect their data practices and privacy goals. To be comprehensive and flexible for future practices, policies contain ambiguity that summarizes practices over multip...

Deep Reinforcement Learning-based Text Anonymization against Private-Attribute Inference. Mosallanezhad, Ahmadreza; Beigi, Ghazaleh; Liu, Huan. 2019, Google Scholar.
Domain: General; data category: Use of third party / sharing sensitive information; use case: Privacy Preserving Information Sharing; privacy issue: Unintended data disclosure; solution: Semantic-based text perturbation approach; NLP method: Semantic Analysis; NLP concept: NN.
Abstract: User-generated textual data is rich in content and has been used in many user behavioral modeling tasks. However, it could also leak user private-attribute information that they may not want to disclose such as age and location. User's privacy concerns mandate data publishe...

"Speech sanitizer: Speech content desensitization and voice anonymization". Qian, Jianwei; Du, Haohua; Hou, Jiahui; Chen, Linlin; Jung, Taeho; Li, Xiangyang (2019, Google Scholar). Domain: General. Data category: Speech Data. Use case: Voice-based Service in a private manner. Privacy issue: Profiling. Solution: Speech De-Identification. NLP method: Speech Processing & Morphological Analysis. NLP concept: Rule based & Statistical. Abstract (excerpt): "Voice input users' speech recordings are being collected by service providers and shared with third parties, who may abuse users' voiceprints, identify them by voice, and learn their sensitive speech content. In this work, we design Speech Sanitizer to perturb users' speech rec…"

"An AI-assisted Approach for Checking the Completeness of Privacy Policies Against GDPR". D. Torre; S. Abualhaija; M. Sabetzadeh; L. Briand; K. Baetens; P. Goes; S. Forastier (2020, IEEE). Domain: Law. Data category: Privacy Policies. Use case: Ease the compliance process through automation. Privacy issue: Compliance with Requirements. Solution: Automated Compliance Checker. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Privacy policies are critical for helping individuals make informed decisions about their personal data. In Europe, privacy policies are subject to compliance with the General Data Protection Regulation (GDPR). If done entirely manually, checking whether a given privacy policy c…"
"Establishing a Strong Baseline for Privacy Policy Classification". Najmeh Mousavi Nejad, Pablo Jabat, Rostislav Nedelchev, Simon Scerri, Damien Graux (2020, Springer). Domain: Law. Data category: Privacy Policies. Use case: Paying more Attention to Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Automated Privacy Policy Classification. NLP method: Semantic Analysis. NLP concept: Statistical. Abstract (excerpt): "Digital service users are routinely exposed to Privacy Policy consent forms, through which they enter contractual agreements consenting to the specifics of how their personal data is managed and used. Nevertheless, despite renewed importance following legislation such as th…"

"A Comparative Study of Sequence Classification Models for Privacy Policy Coverage Analysis". Lindner, Zachary (2020, Google Scholar). Domain: Law. Data category: Privacy Policies. Use case: Ease the comprehension of Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Automated Privacy Policy Coverage Check with Sequence Classification Models. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Privacy policies are legal documents that describe how a website will collect, use, and distribute a user's data. Unfortunately, such documents are often overly complicated and filled with legal jargon; making it difficult for users to fully grasp what exactly is being collected and w…"
"Code Element Vector Representations Through the Application of Natural Language Processing Techniques for Automation of Software Privacy Analysis". Heaps, John (2020, Google Scholar). Domain: Law. Data category: User Generated Content (UGC) / Code. Use case: Automated and Flexible Rule Enforcement. Privacy issue: Compliance with Requirements. Solution: Code-based privacy analysis of smart phone apps. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "The increasing popularity of mobile and web apps has prompted an increase in the collection and storage of personally identifiable information of app users, causing users to be at continually greater privacy risk if that information is misused or mishandled. In order to reduce su…"

"Correlating UI Contexts with Sensitive API Calls: Dynamic Semantic Extraction and Analysis". J. Liu; D. He; D. Wu; J. Xue (2020, IEEE). Domain: Mobile Application. Data category: complex sensitive information / unstructured data. Use case: Investigating the legitimacy of a sensitive API call in smart phone apps. Privacy issue: Complexity of Privacy Policies. Solution: Code-based privacy analysis of smart phone apps. NLP method: Semantic Analysis. NLP concept: Statistical. Abstract (excerpt): "The Android framework provides sensitive APIs for Android apps to access the user's private information, e.g., SMS, call logs and locations. Whether a sensitive API call in an app is legitimate or not depends on whether the app has provided enough natural-language semantics…"
"Enhanced Privacy and Data Protection using Natural Language Processing and Artificial Intelligence". F. Martinelli; F. Marulli; F. Mercaldo; S. Marrone; A. Santone (2020, IEEE). Domain: Law. Data category: complex sensitive information / unstructured data. Use case: Annotation Collection. Privacy issue: Sensitive Information in unstructured Data. Solution: Collect annotated corpus. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: NN. Abstract (excerpt): "Artificial Intelligence systems have enabled significant benefits for users and society, but whilst the data for their feeding are always increasing, a side to privacy and security leaks is offered. The severe vulnerabilities to the right to privacy obliged governments to enact specific…"

"From Prescription to Description: Mapping the GDPR to a Privacy Policy Corpus Annotation Scheme". Poplavska, Ellen; Norton, Thomas B; Wilson, Shomir; Sadeh, Norman (2020, Google Scholar). Domain: Law. Data category: Privacy Policies. Use case: Annotations for research acceleration. Privacy issue: Complexity of Privacy Policies. Solution: Collect annotated corpus. NLP method: Morphological Analysis. NLP concept: Rule based. Abstract (excerpt): "The European Union's General Data Protection Regulation (GDPR) has compelled businesses and other organizations to update their privacy policies to state specific information about their data practices. Simultaneously, researchers in natural language processing (NLP) hav…"
"Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset". Amos, Ryan; Acar, Gunes; Lucherini, Elena; Kshirsagar, Mihir; Narayanan, Arvind; Mayer, Jonathan (2020, Google Scholar). Domain: Law. Data category: Privacy Policies. Use case: Collection of Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Collect annotated corpus. NLP method: Morphological Analysis. NLP concept: Rule based. Abstract (excerpt): "Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. So far, prior research has been limited to analysis of privacy policies from a sing…"

"Crosslingual named entity recognition for clinical de-identification applied to a COVID-19 Italian data set". Catelli, Rosario; Gargiulo, Francesco; Casola, Valentina; De Pietro, Giuseppe; Fujita, Hamido; Esposito, Massimo (2020, ScienceDirect). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "The COrona VIrus Disease 19 (COVID-19) pandemic required the work of all global experts to tackle it. Despite the abundance of new studies, privacy laws prevent their dissemination for medical investigations: through clinical de-identification, the Protected Health Information…"
"Detecting Protected Health Information with an Incremental Learning Ensemble: A Case Study on New Zealand Clinical Text". B. Singh; Q. Sun; Y. S. Koh; J. Lee; E. Zhang (2020, IEEE). Domain: Medicine. Data category: patient health information (PHI) / clinical narratives. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Clinical narratives host vast accumulations of patient data pivotal for research and development of health related products. In order for this data to be utilized, the underlying protected health information needs to be de-identified to ensure medical confidentiality. Given the volum…"

"N-Sanitization: A semantic privacy-preserving framework for unstructured medical datasets". Iwendi, Celestine; Moqurrab, Syed Atif; Anjum, Adeel; Khan, Sangeen; Mohan, Senthilkumar; Srivastava, Gautam (2020, Google Scholar). Domain: Medicine. Data category: patient health information (PHI) / free clinical text. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: Rule based & Statistical & NN. Abstract (excerpt): "The introduction and rapid growth of the Internet of Medical Things (IoMT), a subset of the Internet of Things (IoT) in the medical and healthcare systems, has brought numerous changes and challenges to current medical and healthcare systems. Healthcare organizations shar…"
"Using word embeddings to improve the privacy of clinical notes". Abdalla, M; Abdalla, M; Rudzicz, F; Hirst, G (2020, Web of Science). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Morphological Analysis. NLP concept: Rule based & NN. Abstract (excerpt): "Objective: In this work, we introduce a privacy technique for anonymizing clinical notes that guarantees all private health information is secured (including sensitive data, such as family history, that are not adequately covered by current techniques). Materials and Methods: We…"

"An Embedding-based Medical Note De-identification Approach with Minimal Annotation". H. Zhou; D. Ruan (2020, IEEE). Domain: Medicine. Data category: patient health information (PHI) / free clinical text. Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Medical note de-identification is critical for protecting private information. The task demands the complete removal of all patient names and other sensitive information such as addresses and phone numbers from medical records. Accomplishing this goal is challenging, with ma…"
"Privacy guarantees for de-identifying text transformations". Adelani, David Ifeoluwa; Davody, Ali; Kleinbauer, Thomas; Klakow, Dietrich (2020, Google Scholar). Domain: Medicine. Data category: complex sensitive information / unstructured data. Use case: protecting sensitive Information in unstructured text. Privacy issue: Misusage of Sensitive Information by Data Collector or Adversary. Solution: De-identification. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: NN. Abstract (excerpt): "Machine Learning approaches to Natural Language Processing tasks benefit from a comprehensive collection of real-life user data. At the same time, there is a clear need for protecting the privacy of the users whose data is collected and processed. For text collections, such a…"

"On divergence-based author obfuscation: An attack on the state of the art in statistical authorship verification". Bevendorff, Janek; Wenzel, Tobias; Potthast, Martin; Hagen, Matthias; Stein, Benno (2020, Google Scholar). Domain: Other. Data category: User Generated Content (UGC). Use case: protecting sensitive Information in unstructured text. Privacy issue: Profiling. Solution: De-identification / Obfuscating Authorship. NLP method: Semantic Analysis. NLP concept: Statistical. Abstract (excerpt): "Authorship verification is the task of determining whether two texts were written by the same author based on a writing style analysis. Author obfuscation is the adversarial task of preventing a successful verification by altering a text's style so that it does not resemble that of its…"
"Understanding the Potential Risks of Sharing Elevation Information on Fitness Applications". Ü. Meteriz; N. Fazıl Yıldıran; J. Kim; D. Mohaisen (2020, IEEE). Domain: Law. Data category: complex sensitive information. Use case: Investigating the potential Privacy Impact of smart phone apps. Privacy issue: Complexity of Privacy Policies. Solution: Demonstration of Threats. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "The extensive use of smartphones and wearable devices has facilitated many useful applications. For example, with Global Positioning System (GPS)-equipped smart and wearable devices, many applications can gather, process, and share rich metadata, such as geolocation…"

"DatashareNetwork: A Decentralized Privacy-Preserving Search Engine for Investigative Journalists". Edalatnejad, Kasra; Lueks, Wouter; Martin, Julien Pierre; Ledésert, Soline; L'Hôte, Anne; Thomas, Bruno; Girod, Laurent; Troncoso, Carmela (2020, Google Scholar). Domain: Other. Data category: User-Centric Privacy Guidance. Use case: Searching in a private manner. Privacy issue: Identification. Solution: Designing a decentralized peer-to-peer private document search engine. NLP method: Morphological Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Investigative journalists collect large numbers of digital documents during their investigations. These documents can greatly benefit other journalists' work. However, many of these documents contain sensitive information. Hence, possessing such documents can endanger repo…"
"Assuring privacy-preservation in mining medical text materials for COVID-19 cases: a natural language processing perspective". Ma, Bo; Wu, Jinsong; Song, Shuang; Liu, William (2020, Google Scholar). Domain: Medicine. Data category: patient health information (PHI) / free clinical text. Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: Designing a privacy preserved medical text data analysis. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Currently, there is a very large volume of Covid-19 related medical data that have been stored in cloud based systems and made available for studying the disease dynamics, without any privacy-preservation. In order to reduce possible privacy leakage and also accommodate m…"
"TIPRDC: task-independent privacy-respecting data crowdsourcing framework for deep learning with anonymized intermediate representations". Li, Ang; Duan, Yixiao; Yang, Huanrui; Chen, Yiran; Yang, Jianlei (2020, Google Scholar). Domain: Social Network. Data category: User Generated Content (UGC). Use case: Privacy Preserving Information Sharing. Privacy issue: Misusage of Sensitive Information by Data Collector or Adversary. Solution: Designing a task-independent privacy-respecting data crowdsourcing framework. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "The success of deep learning partially benefits from the availability of various large-scale datasets. These datasets are often crowdsourced from individual users and contain private information like gender, age, etc. The emerging privacy concerns from users on data sharing hin…"

"Preserving Privacy in Analyses of Textual Data". Diethe, Tom; Feyisetan, Oluwaseyi; Balle, Borja; Drake, Thomas (2020, Google Scholar). Domain: General. Data category: Use of third party / storing/sending/encoded sensitive information. Use case: Mining according to privacy policies. Privacy issue: Disclosure of sensitive data for analysis. Solution: Designing a text perturbation mechanism for a privacy preserving text analysis. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Amazon prides itself on being the most customer-centric company on earth. That means maintaining the highest possible standards of both security and privacy when dealing with customer data. This month, at the ACM Web Search and Data Mining (WSDM) Conference, my c…"
"Privacy and Utility Trade-Off for Textual Analysis via Calibrated Multivariate Perturbations". Jingye Tang, Tianqing Zhu, Ping Xiong, Yu Wang, Wei Ren (2020, Springer). Domain: Other. Data category: User Generated Content (UGC). Use case: protecting sensitive Information in unstructured text. Privacy issue: Disclosure of sensitive data for analysis. Solution: Designing a text perturbation mechanism for a privacy preserving text analysis. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "In recent years, the problem of data leakage often appears in our lives. As of today, a number of enterprises have been fined heavily for the leakage of user data, including Facebook, Uber and Equifax. This paper makes deep research on the privacy protection of the text. We p…"

"A User-Centric and Sentiment Aware Privacy-Disclosure Detection Framework Based on Multi-Input Neural Network". Nuhil Mehdy, AKM; Mehrpouyan, Hoda (2020, Google Scholar). Domain: Social Network. Data category: User Generated Content (UGC) / user posts. Use case: Privacy Preserving Information Sharing. Privacy issue: Unintended data disclosure. Solution: Designing a User-Centric Privacy-Disclosure Detection Framework. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Data and information privacy is a major concern of today's world. More specifically, users' digital privacy has become one of the most important issues to deal with, as advancements are being made in information sharing technology. An increasing number of users are sharing i…"
"Research on Intelligent Security Protection of Privacy Data in Government Cyberspace". H. Yang; L. Huang; C. Luo; Q. Yu (2020, IEEE). Domain: Other. Data category: Government Data. Use case: Privacy Preserving Information Sharing. Privacy issue: Unintended data disclosure. Solution: Detecting privacy-sensitive information / activities. NLP method: Semantic Analysis. NLP concept: Rule based & Statistical & NN. Abstract (excerpt): "Based on the analysis of the difficulties and pain points of privacy protection in the opening and sharing of government data, this paper proposes a new method for intelligent discovery and protection of structured and unstructured privacy data. Based on the improvement of the…"

"Using Natural Language Processing to Detect Privacy Violations in Online Contracts". Silva, P; Goncalves, C; Godinho, C; Antunes, N; Curado, M (2020, Web of Science). Domain: Law. Data category: Privacy Policies. Use case: Protecting sensitive Information in unstructured text. Privacy issue: Complexity of Privacy Policies. Solution: Detecting privacy-sensitive information / activities. NLP method: Semantic Analysis. NLP concept: Rule based & Statistical & NN. Abstract (excerpt): "As information systems deal with contracts and documents in essential services, there is a lack of mechanisms to help organizations in protecting the involved data subjects. In this paper, we evaluate the use of named entity recognition as a way to identify, monitor and validate…"
R. GeethaS.
ApproachKarthikaS.
to Predict
TwitterSensitive
Mohanavalli
is one of Personal
the most successful
Data 2020
online
Springer
social networks that
Social
present
Network
user’s opinions,
User Generated
personal experience,
Content (UGC)
daily
Protecting
activities
Sensitive
and ideas
Information
to the world.
on Social
The
Unintended
analysis
Mediadata
of
Platforms
user
disclosure
tweets gives
Detecting
variousprivacy-sensitive
interesting perspectives
information
Semantic
of vulnerable
/ Analysis
activities cyber-crimes
Statistical
and information
& NN losses in the micro
Using NLP and Machine Learning to Detect
Silva,Data
Paulo;
Privacy
Gonçalves,
Privacy
Violations
Carolina;
concerns Godinho,
are constantly
Carolina;
increasing
2020
Antunes,
Google
in different
Nuno;
Scholar
Curado,
sectors.Law
Marilia;
Regulations such as
complex
the EU's
sensitive
Generalinformation
Data Protection
/Protecting
Data Regulation
Sets Sensitive
(GDPR)
Information
are pressuring
on Social
Sensitive
organizations
Media
Information
Platforms
to handle
in unstructured
the
Detection
individual's
Data
of Data
dataPrivacy
with reinforced
Violations
Semantic
caution.
Analysis
As information
& Morphological
systems
Rule based
Analysis
deal with
& Statistical
increasingly
& NNlarge am
"On the Blockchain-Based Decentralized Data Sharing for Event Based Encryption to Combat Adversarial Attacks". R. Doku; D. B. Rawat; C. Liu (2020, IEEE). Domain: General. Data category: Data set containing sensitive information. Use case: Privacy Preserving Information Sharing. Privacy issue: Disclosure of sensitive Data. Solution: Encryption based on Events. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: NN. Abstract (excerpt): "It is human nature to anticipate events in the future and plan accordingly. As such, the possibility of letting future events trigger the decryption of a message is a form of an encryption mechanism we deem to be significant in today's information age. In this paper, we propose a v…"

"Privacy Policy Classification with XLNet (Short Paper)". Mustapha, Majd; Krasnashchok, Katsiaryna; Al Bassit, Anas; Skhiri, Sabri (2020, Google Scholar). Domain: Law. Data category: Privacy Policies. Use case: Ease the comprehension of Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Generation of Artefacts based on Privacy Policy Analysis. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: NN. Abstract (excerpt): "Popularization of privacy policies has become an attractive subject of research in recent years, notably after General Data Protection Regulation came into force in the European Union. While GDPR gives Data Subjects more rights and control over the use of their personal data…"
"Privacy-Preserving Data Generation and Sharing Using Identification Sanitizer". Shuo Wang, Lingjuan Lyu, Tianle Chen, Shangyu Chen, Surya Nepal, Carsten Rudolph, Marthie Grobler (2020, Springer). Domain: General. Data category: User Generated Content (UGC). Use case: Privacy Preserving Information Sharing. Privacy issue: Identification. Solution: Generation of Synthetic Data. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: NN. Abstract (excerpt): "In this paper, we propose a practical privacy-preserving generative model for data sanitization and sharing, called Sanitizer-Variational Autoencoder (SVAE). We assume that the data consists of identification-relevant and irrelevant components. A variational autoencoder (VAE)…"

"Generation and evaluation of artificial mental health records for Natural Language Processing". Ive, Julia; Viani, Natalia; Kam, Joyce; Yin, Lucia; Verma, Somain; Puntis, Stephen; Cardinal, Rudolf N; Roberts, Angus; Stewart, Robert; Velupillai, Sumithra (2020, Google Scholar). Domain: Medicine. Data category: patient health information (PHI) / Psychiatry Notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: Generation of Synthetic Data. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "A serious obstacle to the development of Natural Language Processing (NLP) methods in the clinical domain is the accessibility of textual data. The mental health domain is particularly challenging, partly because clinical documentation relies heavily on free text that is difficult to…"
"Differential Privacy and Natural Language Processing to Generate Contextually Similar Decoy Messages in Honey Encryption Scheme". Panchal, Kunjal (2020, Google Scholar). Domain: Other. Data category: User Generated Content (UGC) / messages. Use case: Searching in a private manner. Privacy issue: Profiling. Solution: Generation of Synthetic Data. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Honey Encryption is an approach to encrypt the messages using low min-entropy keys, such as weak passwords, OTPs, PINs, credit card numbers. The ciphertext, when decrypted with any number of incorrect keys, produces plausible-looking but bogus plaintext c…"
"PolicyQA: A Reading Comprehension Dataset for Privacy Policies". Ahmad, Wasi Uddin; Chi, Jianfeng; Tian, Yuan; Chang, Kai-Wei (2020, Google Scholar). Domain: Law. Data category: Privacy Policies. Use case: Ease the comprehension of Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Information Extraction from privacy policies. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Privacy policy documents are long and verbose. A question answering (QA) system can assist users in finding the information that is relevant and important to them. Prior studies in this domain frame the QA task as retrieving the most relevant text segment or a list of sentences…"

"Surfacing Privacy Settings Using Semantic Matching". Khandelwal, Rishabh; Nayak, Asmit; Yao, Yao; Fawaz, Kassem (2020, Google Scholar). Domain: Other. Data category: complex sensitive information. Use case: Paying more Attention to Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Information Extraction from privacy policies. NLP method: Semantic Analysis. NLP concept: Statistical. Abstract (excerpt): "Online services utilize privacy settings to provide users with control over their data. However, these privacy settings are often hard to locate, causing the user to rely on provider-chosen default values. In this work, we train privacy-settings-centric encoders and leverage them to…"
"Lucene-P2: A Distributed Platform for Privacy-Preserving Information Search". Uplavikar, Nitish; Malin, Bradley A; Jiang, Wei (2020, Google Scholar). Domain: Other. Data category: Query Data. Use case: Searching in a private manner. Privacy issue: Profiling. Solution: Information Extraction if information is harboured by a second party. NLP method: Semantic Analysis. NLP concept: Rule based. Abstract (excerpt): "Text-based retrieval (IR) plays an essential role in daily life. However, currently deployed IR technologies, e.g., Apache Lucene open-source search software, are insufficient when the information is protected or deemed to be private. For example, submitting a query to a public…"

"Chatbot Security and Privacy in the Age of Personal Assistants". W. Ye; Q. Li (2020, IEEE). Domain: Other. Data category: Speech Data. Use case: Voice-based Assistance in a privacy preserving manner. Privacy issue: Unintended data disclosure. Solution: Overview of existing dialogue system vulnerabilities. NLP method: None. NLP concept: None. Abstract (excerpt): "The rise of personal assistants serves as a testament to the growing popularity of chatbots. However, as the field advances, it is important for the conversational AI community to keep in mind any potential vulnerabilities in existing architectures in security and privacy and how attackers could take adv…"
"A Survey on Privacy-Preserving Data Mining Methods". Yang, Yunlu; Zhou, Yajian (2020, Google Scholar). Domain: Other. Data category: complex sensitive information / unstructured data. Use case: Mining according to privacy policies. Privacy issue: Profiling. Solution: Overview over the research field. NLP method: None. NLP concept: None. Abstract (excerpt): "In recent years, the increase in massive data in various fields has promoted the development of data mining. Still, the storage and mining of user data bring about the threat of privacy leakage. Researches on privacy-preserving data mining have become an increasingly signific…"

"Data Leakage Detection and Prevention: Review and Research Directions". Suvendu Kumar Nayak, Ananta Charan Ojha (2020, Springer). Domain: Other. Data category: complex sensitive information / unstructured data. Use case: protecting sensitive Information in unstructured text. Privacy issue: Data Linkage. Solution: Overview over the research field. NLP method: None. NLP concept: None. Abstract (excerpt): "Disclosure of confidential data to an unauthorized person, internal or external to the organization is termed as data leakage. It may happen inadvertently or deliberately by a person. Data leakage inflicts huge financial and nonfinancial losses to the organization. Whereas data is…"
"Enhancing Privacy Preservation in Speech Data Publishing". Zhang, Guanglin; Ni, Sifan; Zhao, Ping (2020, Google Scholar). Domain: General. Data category: Speech Data. Use case: Privacy Preserving Information Sharing. Privacy issue: Profiling. Solution: Speech De-Identification. NLP method: Speech Processing. NLP concept: Statistical & NN. Abstract (excerpt): "In speech data publishing, users' data privacy is disclosed and thereby more privacy of users is breached since speech data contains a large amount of information about speakers. Existing work focused on sanitization in speech content, speakers' voice, and data descriptions…"

"Privacy-Enhanced Emotion Recognition Approach for Remote Health Advisory System". M. Thenmozhi, K. Narmadha (2020, Springer). Domain: Medicine. Data category: Speech Data. Use case: Voice-Based Healthcare in a private manner. Privacy issue: Identification. Solution: Speech De-Identification. NLP method: Speech Processing. NLP concept: NN. Abstract (excerpt): "Speech recognition, also known as voice recognition, involves identifying the words in the speech and translates it into machine-readable inform… Speech recognition applications are currently being used for health care and special needs' assistance such as transcription of med…"
"Exploiting Privacy-preserving Voice Query in Healthcare-based Voice Assistant Systems". T. Altuwaiyan; M. Hadian; S. Rubel; X. Liang (2020, IEEE). Domain: Medicine. Data category: Speech Data. Use case: Voice-Based Healthcare in a private manner. Privacy issue: Profiling. Solution: Speech De-Identification. NLP method: Speech Processing & Morphological Analysis. NLP concept: Statistical. Abstract (excerpt): "Voice Assistant System (VAS) such as Amazon Echo and Google Home are becoming a popular technology for medical health systems among patients and caregivers. The VAS devices allow patients and caregivers to interact with them via voice commands. In most cases, th…"

"Speaker Anonymization for Personal Information Protection Using Voice Conversion Techniques". Yoo, IC; Lee, K; Leem, S; Oh, H; Ko, B; Yook, D (2020, Web of Science). Domain: General. Data category: Speech Data. Use case: Voice-based Service in a private manner. Privacy issue: Identification. Solution: Speech De-Identification. NLP method: Speech Processing. NLP concept: NN. Abstract (excerpt): "As speech-based user interfaces integrated in the devices such as AI speakers become ubiquitous, a large amount of user voice data is being collected to enhance the accuracy of speech recognition systems. Since such voice data contain personal information that can endan…"
"Voice-indistinguishability: Protecting voiceprint in privacy-preserving speech data release". Han, Yaowei; Li, Sheng; Cao, Yang; Ma, Qiang; Yoshikawa, Masatoshi (2020, Google Scholar). Domain: Other. Data category: Speech Data. Use case: Voice-based Service in a private manner. Privacy issue: Identification. Solution: Speech De-Identification. NLP method: Speech Processing. NLP concept: Statistical & NN. Abstract (excerpt): "With the development of smart devices, such as the Amazon Echo and Apple's HomePod, speech data have become a new dimension of big data. However, privacy and security concerns may hinder the collection and sharing of real-world speech data, which contain the speak…"

"Privacy-Enabled Smart Home Framework with Voice Assistant". Singh, Deepika; Psychoula, Ismini; Merdivan, Erinc; Kropf, Johannes; Hanke, Sten; Sandner, Emanuel; Chen, Liming; Holzinger, Andreas (2020, Google Scholar). Domain: Other. Data category: Speech Data. Use case: Voice-based Service in a private manner. Privacy issue: Profiling. Solution: Speech De-Identification. NLP method: Speech Processing. NLP concept: NN. Abstract (excerpt): "Smart home environment plays a prominent role in improving the quality of life of the residents by enabling home automation, health care and safety through various Internet of Things (IoT) devices. However, a large amount of data generated by sensors in a smart home enviro…"
"Anonymization for the GDPR in the Context of Citizen and Customer Relationship Management and NLP". Francopoulo, Gil; Schaub, Léon-Paul (2020, Google Scholar). Domain: Law. Data category: Privacy Policies. Use case: Ease the automation process for privacy regulating documents. Privacy issue: Compliance with Requirements. Solution: Suggestion of security and privacy requirements for text mining. NLP method: None. NLP concept: None. Abstract (excerpt): "The General Data Protection Regulation (GDPR) is the regulation in the European Economic Area (EEA) law on data protection and privacy for all citizens. There is a dilemma between sharing data and their subjects' confidentiality to respect GDPR in the commercial, legal and…"

"A Machine Learning Based Methodology for Automatic Annotation and Anonymisation of Privacy-Related Items in Textual Documents for Justice Domain". Beniamino Di Martino, Fiammetta Marulli, Pietro Lupi, Alessandra Cataldi (2021, Springer). Domain: Law. Data category: complex sensitive information / unstructured data. Use case: Annotation Automatization. Privacy issue: Sensitive Information in unstructured Data. Solution: Collect annotated corpus. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: NN. Abstract (excerpt): "Textual resources annotation is currently performed both manually by human experts selecting hand-crafted features and automatically by any trained systems. Human experts annotations are very accurate but require heavy effort in compilation and most often are not publicly…"
"A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records". R. Catelli; F. Gargiulo; V. Casola; G. De Pietro; H. Fujita; M. Esposito (2021, IEEE). Domain: Medicine. Data category: patient health information (PHI) / electronic health records (EHRs). Use case: Annotation Collection. Privacy issue: Annotations and their Quality. Solution: Collect annotated corpus. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "In the last years, the need to de-identify privacy-sensitive information within Electronic Health Records (EHRs) has become increasingly felt and extremely relevant to encourage the sharing and publication of their content in accordance with the restrictions imposed by both nati…"
"Combining contextualized word representation and sub-document level analysis through Bi-LSTM+CRF architecture for clinical de-identification". Catelli, Rosario; Casola, Valentina; De Pietro, Giuseppe; Fujita, Hamido; Esposito, Massimo (2021, ScienceDirect). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Clinical de-identification aims to identify Protected Health Information in clinical data, enabling data sharing and publication. First automatic de-identification systems were based on rules or on machine learning methods, limited by language changes, lack of context awareness…"

"A deep reinforcement learning model for data sanitization in IoT networks". Ahmed, Usman; Srivastava, Gautam; Lin, Jerry Chun-Wei (2021, Google Scholar). Domain: Other. Data category: User Generated Content (UGC). Use case: protecting sensitive Information in unstructured text. Privacy issue: Misusage of Sensitive Information by Data Collector or Adversary. Solution: De-identification / Obfuscating Authorship. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "User-generated textual data is rich in content and has been used in many user behavioral modeling tasks. However, it could also leak user private-attribute information that they may not want to disclose such as age and location. User's privacy concerns mandate data publishe…"
"A Conceptual Framework for Sensitive Big Data Publishing". Nancy Victor, Daphne Lopez (2021, Springer). Domain: Social Network. Data category: User Generated Content (UGC). Use case: Privacy Preserving Information Sharing. Privacy issue: Unintended data disclosure. Solution: Designing a Big Data Framework for Publishing Social Media Data. NLP method: Semantic Analysis. NLP concept: Statistical. Abstract (excerpt): "Big data can be referred to as the massive quantity of data which is collected from numerous sources at high velocities in different data formats. Social network data, medical data and sensor data comprise the major sources of big data. One of the major concerns with data inc…"

"Artificial Empathy for Clinical Companion Robots with Privacy-By-Design". Miguel Vargas Martin, Eduardo Pérez Valle, Sheri Horsburgh (2021, Springer). Domain: Medicine. Data category: Speech Data. Use case: Voice-Based Healthcare in a private manner. Privacy issue: Unintended data disclosure. Solution: Designing a locally deployed system. NLP method: Speech Processing. NLP concept: NN. Abstract (excerpt): "We present a prototype whereby we enabled a humanoid robot to be used to assist mental health patients and their families. Our approach removes the need for Cloud-based automatic speech recognition systems to address healthcare privacy expectations. Furthermore, we d…"
"Toward a Privacy Guard for Cloud-Based Home Assistants and IoT Devices". Radja Boukharrou, Ahmed-Chawki Chaouche, Khaoula Mahdjar (2021, Springer). Domain: Other. Data category: Speech Data. Use case: Voice-based Assistance in a privacy preserving manner. Privacy issue: Disclosure of sensitive Data. Solution: Designing a Privacy Guard for Cloud-Based Home Assistants and IoT Devices. NLP method: Speech Processing & Morphological Analysis. NLP concept: NN. Abstract (excerpt): "The Internet of Things is a technology which is dominating our lives, making it more comfortable, safer and smarter. Using smart speakers as voice assistants for smart homes provides valuable service and an easy control of IoT devices, despite allowing the emergence of priva…"

"Context-Rich Privacy Leakage Analysis Through Inferring Apps in Smart Home IoT". Y. Luo; L. Cheng; H. Hu; G. Peng; D. Yao (2021, IEEE). Domain: Other. Data category: complex sensitive information / unstructured data. Use case: Investigating the potential Privacy Impact of smart phone apps. Privacy issue: Handling of sensitive data by applications. Solution: Detecting privacy-sensitive information / activities. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Internet of Things (IoT) systems leverage connected devices to enable intelligent and automated functionalities. Despite the potential benefits, there exist privacy risks of network traffic, which have been studied by the previous research. However, with the current privacy infere…"
"An OOV Recognition Based Approach to Detecting Sensitive Information in Dialogue Texts of Electric Power Customer Services". Xiao Liang, Ningyu An, Ning Wu, Yunfeng Zou, Lijiao Zhao (2021, Springer). Domain: Other. Data category: complex sensitive information / unstructured data. Use case: Protecting sensitive Information in unstructured text. Privacy issue: Unintended data disclosure. Solution: Detecting privacy-sensitive information / activities. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Sensitive word recognition technology is of great significance to the protection of enterprise privacy data. In electric power custom services systems, the dialogue texts recording the conversational information between electric power customers and the customer services staffs…"

"A Method for Generating Synthetic Electronic Medical Record Text". J. Guan; R. Li; S. Yu; X. Zhang (2021, IEEE). Domain: Medicine. Data category: Synthetic Data Generation. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: Generation of Synthetic Data. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Machine learning (ML) and Natural Language Processing (NLP) have achieved remarkable success in many fields and have brought new opportunities and high expectation in the analyses of medical data, of which the most common type is the massive free-text electronic med…"
"Exsense: Extract sensitive information from unstructured data". Guo, Yongyan; Liu, Jiayong; Tang, Wenwu; Huang, Cheng (2021, ScienceDirect). Domain: General. Data category: complex sensitive information / unstructured data. Use case: protecting sensitive Information in unstructured text. Privacy issue: Sensitive Information in unstructured Data. Solution: Information Extraction from large-scale unstructured textual data. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Large-scale sensitive information leakage incidents are frequently reported in recent years. Once sensitive information is leaked, it may lead to serious effects. In this context, sensitive information leakage has long been a question of great interest in the field of cybersecurity. Ho…"

"Analysis of gender and identity issues in depression detection on de-identified speech". Lopez-Otero, Paula; Docio-Fernandez, Laura (2021, ScienceDirect). Domain: Medicine. Data category: Speech Data. Use case: Voice-Based Healthcare in a private manner. Privacy issue: Identification. Solution: Speech De-Identification. NLP method: Speech Processing. NLP concept: NN. Abstract (excerpt): "Research in the area of automatic monitoring of emotional state from speech permits envisaging future novel applications for the remote monitoring of some common mental disorders, such as depression. However, these tools raise some privacy concerns since speech is sent…"
"Semantic knowledge and privacy in the physical web". Das P.K., Kashyap A., Singh G., Matuszek C., Finin T., Joshi A. (2016, SCOPUS). Domain: Law. Data category: Privacy Policies. Use case: Automated and Flexible Rule Enforcement. Privacy issue: Compliance with Requirements. Solution: Automated Compliance Checker. NLP method: Morphological Analysis. NLP concept: Rule based & Statistical & NN. Abstract (excerpt): "In the past few years, the Internet of Things has started to become a reality; however, its growth has been hampered by privacy and security concerns. One promising approach is to use Semantic Web technologies to mitigate privacy concerns in an informed, flexible way. We p…"

"VerHealth: Vetting Medical Voice Applications through Policy Enforcement". Shezan, Faysal Hossain; Hu, Hang; Wang, Gang; Tian, Yuan (2020, ACM). Domain: Medicine. Data category: Speech Data. Use case: Voice-Based Healthcare in a private manner. Privacy issue: Compliance with Requirements. Solution: Automated Privacy Assessment of health Related Alexa Application. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: Rule based & NN. Abstract (excerpt): "Healthcare applications on Voice Personal Assistant System (e.g., Amazon Alexa), have shown a great promise to deliver personalized health services via a conversational interface. However, concerns are also raised about privacy, safety, and service quality. In this paper, we p…"
"Privacy-aware text rewriting". Xu Q., Qu L., Xu C., Cui R. (2019, SCOPUS). Domain: Other. Data category: User Generated Content (UGC). Use case: protecting sensitive Information in unstructured text. Privacy issue: Handling of sensitive data by applications. Solution: Automated Rewriting. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Biased decisions made by automatic systems have led to growing concerns in research communities. Recent work from the NLP community focuses on building systems that make fair decisions based on text. Instead of relying on unknown decision systems or human decision…"

"Towards integrating the FLG framework with the NLP combinatory framework". Rabinia A., Dragoni M., Ghanavati S. (SCOPUS; year not given in the extraction). Domain: Law. Data category: Privacy Policies. Use case: Automatic modeling of privacy regulations. Privacy issue: Complexity of Privacy Policies. Solution: Automatic modeling of privacy regulations. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: Rule based & Statistical & NN. Abstract (excerpt): "Automatic modeling of privacy regulations is a highly demanded goal in requirements engineering. The FOL-based Legal-GRL (FLG) is a semi-automated goal-oriented modeling framework for extracting and representing the legal requirements of IT systems. One limitation of t…"
"Usable privacy and security for personal information management". Karat C.-M., Brodie C., Karat J. (2006, SCOPUS). Domain: Law. Data category: Privacy Policies. Use case: Automated and Flexible Rule Enforcement. Privacy issue: Compliance with Requirements. Solution: Automatic policy enforcement. NLP method: Semantic Analysis. NLP concept: Rule based. Abstract (excerpt): "IBM T.J. Watson Research Center in Hawthorne, NY, focuses on the design and development of SPARCLE, policy authoring and transformation tools that enable organizations to create machine-readable policies for real-time enforcement decisions. Usable privacy and security…"

"Automatic policy enforcement on semantic social data". Nguyen T.-V.T., Fornara N., Marfia F. (2015, SCOPUS). Domain: Law. Data category: Privacy Policies. Use case: Automated and Flexible Rule Enforcement. Privacy issue: Compliance with Requirements. Solution: Automatic policy enforcement on semantic social data. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Web-based data collection of non-reactive data is becoming increasingly important for many social science fields. Being able to introduce and automatically enforce policies that regulate the collection and the use of those data is crucial for taking into account the privacy and co…"
"On privacy preservation in text and document-based active learning for named entity recognition". Olsson F. (2009, SCOPUS). Domain: General. Data category: complex sensitive information / unstructured data. Use case: Annotation Process Support. Privacy issue: Complexity of Privacy Policies. Solution: Collect annotated corpus. NLP method: Semantic Analysis. NLP concept: Rule based. Abstract (excerpt): "The preservation of the privacy of persons mentioned in text requires the ability to automatically recognize and identify names. Named entity recognition is a mature field and most current approaches are based on supervised machine learning techniques. Such learning require…"

"A framenet for cancer information in clinical narratives: Schema and annotation". Roberts K., Si Y., Gandhi A., Bernstam E.V. (2018, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / free clinical text. Use case: Annotations for research acceleration. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: Collect annotated corpus. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "This paper presents a pilot project named Cancer FrameNet. The project's goal is a general-purpose natural language processing (NLP) resource for cancer-related information in clinical notes (i.e., patient records in an electronic health record system). While previous cancer N…"
"Analyzing Privacy Policies at Scale: From Crowdsourcing to Automated Annotations". Wilson, Shomir; Schaub, Florian; Liu, Frederick; Sathyendra, Kanthashree Mysore; Smullen, Daniel; Zimmeck, Sebastian; Ramanath, Rohan; Story, Peter; Liu, Fei; Sadeh, Norman; Smith, Noah A. (2018, ACM). Domain: Law. Data category: Privacy Policies. Use case: Ease the comprehension of Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Collect annotated corpus. NLP method: Morphological Analysis. NLP concept: Statistical. Abstract (excerpt): "Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have the time to understand the policies of every website they visit, and most users hardly ever read privacy policies. Some recent e…"

"A Large Publicly Available Corpus of Website Privacy Policies Based on DMOZ". Zaeem, Razieh Nokhbeh; Barber, K Suzanne (2021, Google Scholar). Domain: Law. Data category: Privacy Policies. Use case: Ease the comprehension of Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Collect annotated corpus. NLP method: Semantic Analysis. NLP concept: Statistical. Abstract (excerpt): "Studies have shown website privacy policies are too long and hard to comprehend for their target audience. These studies and a more recent body of research that utilizes machine learning and natural language processing to automatically summarize privacy policies greatly be…"
"Question answering for privacy policies: Combining computational and legal perspectives". Ravichander A., Black A., Wilson S., Norton T., Sadeh N. (2019, SCOPUS). Domain: Law. Data category: Privacy Policies. Use case: Paying more Attention to Privacy Policies & Ease the comprehension of Privacy Policies. Privacy issue: Complexity of Privacy Policies. Solution: Collect annotated corpus. NLP method: Morphological Analysis. NLP concept: Rule based & NN. Abstract (excerpt): "Privacy policies are long and complex documents that are difficult for users to read and understand, and yet, they have legal effects on how user data is collected, managed and used. Ideally, we would like to empower users to inform themselves about issues that matter to them…"

"An emerging strategy for privacy preserving databases: Differential privacy". Rashid F., Miri A. (2020, SCOPUS). Domain: Medicine. Data category: complex sensitive information / unstructured data. Use case: Mining according to privacy policies. Privacy issue: Disclosure of sensitive data for analysis. Solution: De-identification. NLP method: Morphological Analysis. NLP concept: Statistical. Abstract (excerpt): "Data de-identification and Differential Privacy are two possible approaches for providing data security and user privacy. Data de-identification is the process where the personal identifiable information of individuals is extracted to create anonymized databases. Data de-identifica…"
"Anonimytext: Anonimization of unstructured documents". Perez-Lainez R., Iglesias A., De Pablo-Sanchez C. (2009, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: Mining according to privacy policies. Privacy issue: Sensitive Information in unstructured Data. Solution: De-identification. NLP method: Morphological Analysis. NLP concept: Rule based. Abstract (excerpt): "The anonymization of unstructured texts is nowadays a task of great importance in several text mining applications. Medical records anonymization is needed both to preserve personal health information privacy and enable further data mining efforts. The described ANONYMIT…"

"Secure secondary use of clinical data with cloud-based NLP services: Towards a highly scalable research infrastructure". Christoph J., Griebel L., Leb I., Engel I., Köpcke F., Toddenroth D., Prokosch H.-U., Laufer J., Marquardt K., Sedlmayr M. (2015, SCOPUS). Domain: Medicine. Data category: patient health information (PHI). Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Morphological Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Objectives: The secondary use of clinical data provides large opportunities for clinical and translational research as well as quality assurance projects. For such purposes, it is necessary to provide a flexible and scalable infrastructure that is compliant with privacy requirements…"
"Anonymization of sensitive information in medical health records". Saluja B., Kumar G., Sedoc J., Callison-Burch C. (2019, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Due to privacy constraints, clinical records with protected health information (PHI) cannot be directly shared. De-identification, i.e., the exhaustive removal, or replacement, of all mentioned PHI phrases has to be performed before making the clinical records available outside of…"

"De-identification of free-text medical records using pre-trained bidirectional transformers". Johnson A.E.W., Bulgarelli L., Pollard T.J. (2020, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. De…"
"NLNDE: The neither-language-nor-domain-experts' way of Spanish medical document de-identification". Lange L., Adel H., Strötgen J. (2020, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / patient notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Natural language processing has huge potential in the medical domain which recently led to a lot of research in this field. However, a prerequisite of secure processing of medical documents, e.g., patient notes and clinical trials, is the proper de-identification of privacy-sensitive…"

"Sensitive data detection and classification in Spanish clinical text: Experiments with BERT". García-Pablos A., Perez N., Cuadros M. (2020, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / free clinical text. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Massive digital data processing provides a wide range of opportunities and benefits, but at the cost of endangering personal data privacy. Anonymisation consists in removing or replacing sensitive information from data, enabling its exploitation for different purposes while prese…"
"Developing a standard for de-identifying electronic patient records written in Swedish: Precision, recall and F-measure in a manual and computerized annotation trial". Velupillai S., Dalianis H., Hassel M., Nilsson G.H. (2009, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: Rule based & Statistical. Abstract (excerpt): "Background: Electronic patient records (EPRs) contain a large amount of information written in free text. This information is considered very valuable for research but is also very sensitive since the free text parts may contain information that could reveal the identity of a patient…"

"Evaluation of PHI Hunter in Natural Language Processing Research". Redd A., Pickard S., Meystre S., Scehnet J., Bolton D., Heavirland J., Weaver A.L., Hope C., Garvin J.H. (2015, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / free clinical text. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: Statistical. Abstract (excerpt): "OBJECTIVES: We introduce and evaluate a new, easily accessible tool using a common statistical analysis and business analytics software suite, SAS, which can be programmed to remove specific protected health information (PHI) from a text document. Removal of PHI is im…"
"Generalizability and comparison of automatic clinical text de-identification methods and resources". Ferrández O., South B.R., Shen S., Friedlin F.J., Samore M.H., Meystre S.M. (2012, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: Statistical. Abstract (excerpt): "In this paper, we present an evaluation of the hybrid best-of-breed automated VHA (Veteran's Health Administration) clinical text de-identification system, nicknamed BoB, developed within the VHA Consortium for Healthcare Informatics Research. We also evaluate two availabl…"

"A deep learning-based system for the meddocan task". Jiang D., Shen Y., Chen S., Tang B., Wang X., Chen Q., Xu R., Yan J., Zhou Y. (2019, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / free clinical text. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Due to privacy constraints, de-identification, identifying and removing all PHI mentions, is a prerequisite for accessing and sharing clinical records outside of hospitals. Large quantities of studies on de-identification have been conducted in recent years, especially with the effort…"
"Automatic de-identification of medical texts in Spanish: The Meddocan track, corpus, guidelines, methods and evaluation of results". Marimon M., Gonzalez-Agirre A., Intxaurrondo A., Rodríguez H., Martin J.A.L., Villegas M., Krallinger M. (2019, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / nursing document. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "There is an increasing interest in exploiting the content of electronic health records by means of natural language processing and text-mining technologies, as they can result in resources for improving patient health/safety, aid in clinical decision making, facilitate drug re-purpos…"

"Technical Note: An embedding-based medical note de-identification approach with sparse annotation". Zhou, Hanyue; Ruan, Dan (2021, Wiley). Domain: Medicine. Data category: patient health information (PHI) / Electronic Health Records (EHR). Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: Rule based & Statistical & NN. Abstract (excerpt): "Purpose: Medical note de-identification is critical for the protection of private information and the security of data sharing in collaborative research. The task demands the complete removal of all patient names and other sensitive information such as addresses and phone numbe…"
"Protected health information recognition by BiLSTM-CRF". Colón-Ruiz C., Segura-Bedmar I. (2019, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / Electronic Health Records (EHR). Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Medical records contain relevant information about patients, which can be beneficial to improving healthcare and research in the clinical domain. Due to this, there is a growing interest in developing automatic methods to extract and exploit the information from medical records…"

"Utility-preserving privacy protection of textual healthcare documents". Sánchez D., Batet M., Viejo A. (2014, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: Statistical. Abstract (excerpt): "The adoption of ITs by medical organisations makes possible the compilation of large amounts of healthcare data, which are quite often needed to be released to third parties for research or business purposes. Many of this data are of sensitive nature, because they may includ…"
"A machine learning based approach to identify protected health information in Chinese clinical text". Du L., Xia C., Deng Z., Lu G., Xia S., Ma J. (2018, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / electronic health records (EHRs). Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: Statistical. Abstract (excerpt): "Background: With the increasing application of electronic health records (EHRs) in the world, protecting private information in clinical text has drawn extensive attention from healthcare providers to researchers. De-identification, the process of identifying and removing protected…"

"A patient like me - An algorithm-based program to inform patients on the likely conditions people with symptoms like theirs have". Koren G., Souroujon D., Shaul R., Bloch A., Leventhal A., Lockett J., Shalev V. (2019, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / Physician Notes. Use case: Privacy Preserving Information Sharing. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: Statistical. Abstract (excerpt): "To date, consumer health tools available over the web suffer from serious limitations that lead to low quality health-related information. While health data in our world are abundant, access to it is limited because of liability and privacy constraints. The objective of the present stu…"
"De-identifying Swedish EHR text using public resources in the general domain". Chomutare T., Yigzaw K.Y., Budrionis A., Makhlysheva A., Godtliebsen F., Dalianis H. (2020, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical notes. Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "Sensitive data is normally required to develop rule-based or train machine learning-based models for de-identifying electronic health record (EHR) clinical notes; and this presents important problems for patient privacy. In this study, we add non-sensitive public datasets to EHR…"

"Data Science and Natural Language Processing to Extract Information from Clinical Narratives". Vydiswaran V.G.V., Zhao X., Yu D. (2021, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical narratives. Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "In the past decade, a growing adoption of electronic health record (EHR) systems have made massive amounts of narrative data available in computable form. However, extracting relevant information from clinical narratives remains challenging. Clinical notes often cont…"
"Deidentification of free-text medical records using pre-trained bidirectional transformers". Johnson A.E.W., Bulgarelli L., Pollard T.J. (2020, SCOPUS). Domain: Medicine. Data category: patient health information (PHI). Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis. NLP concept: NN. Abstract (excerpt): "The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. De…"

"De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks Track 1". Stubbs A., Filannino M., Uzuner Ö. (2017, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / Psychiatry Notes. Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: Rule based & Statistical & NN. Abstract (excerpt): "The 2016 CEGS N-GRID shared tasks for clinical records contained three tracks. Track 1 focused on de-identification of a new corpus of 1000 psychiatric intake records. This track tackled de-identification in two sub-tracks: Track 1.A was a 'sight unseen' task, where nine team…"
"A Privacy-Preserving Natural Language Clinical Information Extraction pipeline". Pinto, Jeffrey Agnel (2018, Google Scholar). Domain: Medicine. Data category: patient health information (PHI) / Psychiatry Notes. Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: Statistical & NN. Abstract (excerpt): "Automated clinical information extraction (IE) from scanned, unstructured clinical psychiatric notes can be improved beyond the state-of-the-art, rule-based tools by using Natural Language Processing (NLP) Machine Learning (ML) models, transfer learning and features derived…"

"The pattern of name tokens in narrative clinical text and a comparison of five systems for redacting them". Kayaalp M., Browne A.C., Callaghan F.M., Dodd Z.A., Divita G., Ozturk S., McDonald C.J. (2014, SCOPUS). Domain: Medicine. Data category: patient health information (PHI) / clinical narratives. Use case: protecting patient privacy in unstructured clinical text. Privacy issue: Secondary use of unstructured clinical data (Research). Solution: De-identification. NLP method: Semantic Analysis & Morphological Analysis. NLP concept: Rule based. Abstract (excerpt): "Objective: To understand the factors that influence success in scrubbing personal names from narrative text. Materials and methods: We developed a scrubber, the NLM Name Scrubber (NLM-NS), to redact personal names from narrative clinical reports, hand tagged words in a…"
De-identification of emails: Pseudonymizing
Eder E.,
privacy-sensitive
Krieg-Holz
WeU.,
deal
data
Hahn
with
inU.
athe
German
pseudonymization
email corpus
2019
of those
SCOPUS
stretches of text in
Social
emails
Network
that might User
allowGenerated
to identify real
Content
individual
(UGC)persons.
protectingThis
patient
taskprivacy
is decomposed
in unstructured
into
Identification
twoclinical
steps.text
First, named entities
De-identification
carrying privacy-sensitive Semantic
information
Analysis
(e.g., names of persons,
Statistical
locations, phone numbers or d
Personal information privacy: What’s next?
Hammoud K., Benbernou
In recentS.,
events,
Ouziriuser-privacy
M., Saygin Y.,
hasHaque
been
2019 a
R.,
SCOPUS
main
Taher
focus
Y. for all technological
Medicine and data-holding
patient health
companies,
information
due (PHI)protecting
to the global interest
sensitive
in protecting
Information
personal
in unstructured
Misusage
information.
oftext
Sensitive
Regulations
Information
likeDe-identification
theby
General
Data Collector
Data Protection
or Adversary
Regulation
Semantic Analysis
(GDPR) set firm laws
Rule
and
based
penalties
& Statistical
around &the
NNhandling
Code Alltag 2.0 - A pseudonymized German-language
Eder E., Krieg-Holz
email
TheU.,corpus
vast
Hahn
amount
U. of social communication
2020 SCOPUS
distributed over various
Social
electronic
Networkmedia
User
channels
Generated
(tweets,
Content
blogs,(UGC)
emails,
Protecting
etc.), so-called
Sensitiveuser-generated
Information oncontent
Social
Unintended
Media
(UGC),
data
Platforms
creates
disclosure
entirely De-identification
new opportunities for today's NLP
Semantic
research.
Analysis
Yet, data privacy
Rule
concerns
based implied by the unauthoriz
Novel location de-identification for machine
Taguchi
andK.,
human
Aramaki
In recent
E. years, the protection of personal
2018 SCOPUS
information has drawn
Social
muchNetwork
attention, User
requiring
Generated
an advanced
Content
technology
(UGC)Protecting
/ tweets
on de-identification
Sensitive Information
to removeon
personal
Social
Unintended
Media
information
data
Platforms
disclosure
from data. Among
De-identification
various personal information
Semantic
suchAnalysis
as personal
& Morphological
names,Statistical
phoneAnalysis
numbers, and so forth, this s
Information leakage through document Lopresti
redaction:
D.,Attacks
Lawrence
It and
hasSpitz
countermeasures
beenA.recently demonstrated,2005
in dramatic
SCOPUSfashion, that sensitive
Generalinformation complex
thought tosensitive
be obliterated
information
through
/protecting
unstructured
the process
sensitive
data
of redaction
Information
can be
in unstructured
successfully
Unintendedrecovered
text
data disclosure
via a combination
De-identification
of manual
/ Obfuscating
effort, document
Morphological
Authorship
image analysis,
Analysis and natural
Statistical
language processing technique
Don’t Let Google Know I’m Lonely Aonghusa, P\'{o}l From
Mac; buying
Leith, Douglas
books toJ.finding the perfect
2016 partner,
ACM we share ourOther
most intimate wants
User
andGenerated
needs withContent
our favourite
(UGC)Searching
online
/ Query
systems.
Data
in a private
But howmanner
far should we
Profiling
accept promises of privacy De-identification
in the face of personalized
/ Obfuscating
profiling?
Morphological
Authorship
In particular,
Analysis
we ask how
Statistical
we can improve detection of sensit
LN-Annote: An alternative approach to Jung
information
Y.H., Stratos
extraction
Personal
K., Carloni
frommobile
emails
L.P. devices
using locally-customized
offer a growing
2015 variety
SCOPUS
named-entity
of person-Alized
recognition
Mobile
services
Application
that enrich
Userconsiderably
Generated Content
the user(UGC)
experience.
protecting
/ emails
Thissensitive
is madeInformation
possible byinincreased
unstructured
Handling
access
oftext
sensitive
to personal
data infor-mation,
by applications
Designingwhich
a locally
to adeployed
large extent
system
Semantic
is extracted
Analysis
from user emailNN
messages and archives. There are,
Distortion Search–A Web Search Privacy
Mivule,
Heuristic
Kato; Hopkinson,
Search engines
Kenneth;have vast technical2017
capabilities
GoogletoScholar
retain Internet
Other
search logs for each
Useruser
Generated
and thus
Content
present
(UGC)
major
Searching
/ privacy
Query Data
vulnerabilities
in a private manner
to both individuals
Profilingand organizations in revealing
Designing
user
a privacy-preserving
intent. Additionally,
Semantic
web
manysearch
ofAnalysis
theengine
web search privacy
Statistical
enhancing
& NNtools available today r
Document Title Authors Abstract Year Source Generalized Domain
Generalized
1 Data Category (RQ1)Classification of Use Case Generalized Privacy Issue(RQ1)
Generalized Privacy Issue Solution
Generalized
(RQ2)Category of Applied
NLP Method
NLP Concept(RQ2)
Towards query logs for privacy studies:Biega
On deriving
A.J., Schmidt
search
Detailed
J.,
queries
Royquery
R.S.
fromhistories
questionsoften contain
2020
a precise
SCOPUS picture of a person’s
Other life, includingQuery
sensitive
Dataand personally identifiable
Searching
information.
in a private
As sanitization
manner of Profiling
such logs is an unsolved research
Designing
problem,
a scheme
commercial
for a Web
privacy
Morphological
search
preserving
enginesAnalysis
search
that possess
log analysis
Statistical
large datasets of this kind at their dis
Interactive Topic Search System BasedChen,
on Topic
LC Cluster In
Technology
this paper, we develop an interactive
2020hierarchical
Web of Science
topic search
General
system. In our system,
User Generated
the generation
Content
of topic
(UGC)names
Searching
/ userisbinary
mainly
in aoperations
private
based onmanner
the N-gramProfiling Designing
statistical language model. The a search
construction tree without
of hierarchical tracking userAnalysis
Morphological
tree interaction
relationships between topics
Statistical
is mainly based on the concept o
Design of an Inclusive Financial PrivacyAkanfe,
Index (INF-PIE):
Oluwafemi;
Financial
A Valecha,
Financial
privacy
Rohit;
Privacy
isRao,
an
and
important
H.Digital
Raghav Financial
part
2020
of an
ACM
Inclusion
individual's
Perspective
privacy,Other
but efforts to enhance
Privacy
financial
Policies
privacy have oftenInvestigating
not been given
theenough
potentialprominence
Privacy Impact
Complexity
by some
of smart
countries
of phone
Privacy
when
apps
PoliciesDesigning
advancing financial
an Inclusive
inclusion.
Financial
This Semantic
impedes
Privacyunder-served
Analysis
Index communities
Statistical
from utilizing financial services
A Framework for Estimating Privacy Risk
Chang
Scores
K.C.,
of Mobile
Zaeem
WithApps
R.N.,
the rapidly
Barber growing
K.S. popularity 2020
of smart
SCOPUS
mobile devices, the
Lawnumber of mobilePrivacy
applications
Policies
available has surged
Investigating
in the pastthe
few
potential
years. Such
Privacy
mobile
Impact
Handling
applications
of smart
of sensitive
phone
collectapps
data
a treasure
by applications
Detecting
trove ofprivacy-sensitive
Personally Identifiable
information
Morphological
Information
/ activities
Analysis
(PII) attributes Statistical
(such as age, gender, location, and f
Aquilis: Using Contextual Integrity for Privacy
Kumar,Protection
Abhishek;Smartphones
on
Braud,
Mobile
Tristan;
Devices
areKwon,
nowadays
Youngthe
D.;dominant
Hui,
2020PanACM
end-user device. AsMobile
a result,
Application
they havecomplex
become sensitive
gatewaysinformation
to all users'/Investigating
unstructured
communications,
the
datapotential
includingPrivacy
sensitive
Impact
Handling
personal
of smart
ofdata.
sensitive
phone
In thisapps
data
paper,
by applications
we
Detecting
presentprivacy-sensitive
Aquilis, a privacy-preserving
information
Semantic/system
Analysis
activities
for mobile platforms
Statistical
following the principles of con
Detecting privacy-sensitive events in medical
Jindal P.,
textGunter C.A.,
In thisRoth
paper,
D. we present a novel semi-supervised
2014 SCOPUStechnique for
Medicine
finding privacy-sensitive
patient events
health information
in clinical text.
(PHI)
Unlike
protecting
/ clinical
traditional
notes
patient
semi-supervised
privacy in unstructured
methods,
Secondary
clinical
we dousetext
notofrequire
unstructured
large Detecting
amounts
clinical data
of
privacy-sensitive
unannotated
(Research) data.
information
Morphological
Instead, /our
activities
approach
Analysis relies on
Rule
information
based contained in the hierarc
Clustering Help-Seeking Behaviors in LGBT
Liang Online
C., Abbott
Communities:
D.,
Online
HongLesbian,
Y.A.,
A Prospective
Madadi
Gay, Bisexual,
M.,Trial
Whiteand
A. 2019
Transgender
SCOPUS (LGBT) support
Medicine
communities have
User
emerged
Generated
as aContent
major social
(UGC)media
Protecting
platform
Sensitive
for sexual
Information
and gender
on Social
Identification
minorities
Media (SGM).
Platforms
These communities
Detectingplay
privacy-sensitive
a crucial role in
information
providing
Morphological
/LGBT
activities
Analysis
individuals a private
Statistical
and safe space for networking
Detecting and editing privacy policy pitfalls
Santos
on the
C., Gangemi
web Privacy
A., Alam
policies
M. are the locus where2017
consequences
SCOPUS concerning
Law
privacy and personal
Privacy
dataPolicies
are produced, but content
Paying
features
more explain
Attention
why
to Privacy
they arePolicies
largely
Complexity
ignoredofby
Privacy
its addressees.
PoliciesDetection
To abridge
of Inconsistencies
users with policies,
within
Semantic
wePrivacy
propose
Analysis
Policies
a policybased system
Statistical
that identifies potential pitfalls
Towards a multi-stage approach to detect
Bäumer
privacy
F.S.,
breaches
Kersting
Physician
inJ.,
physician
Orlikowski
Reviewreviews
Websites
M., Geierhos
allowM.
users
2018 toSCOPUS
evaluate their experiences
Medicinewith health services.
patient health
As these
information
evaluations
(PHI)are
protecting
/ Physician
regularly
patient
Notes
contextualized
privacy in with
unstructured
facts
Profiling
fromclinical
users'text
private lives, theyDetection
often accidentally
of Privacydisclose
Breachespersonal
Semantic
in Physican
information
Analysis
Reviews&onMorphological
the Web.
Statistical
ThisAnalysis
poses a serious threat to user
Automated Detection of Privacy Sensitive
Bouhaddou
Conditions
O.,inDavis
C-CDAs:
CareM.,
coordination
Donahue
Security Labeling
M.,
across
Mallia
healthcare
Services
A., Griffin
2016
at
organizations
S.,
theSCOPUS
Teal
Department
J., Nebeker
depends
of Veterans
J.upon
Law health
Affairsinformation
Privacy
exchange.
Policies
Various policies and
Privacy
lawsPreserving
govern permissible
Information
exchange,
Sharing
Compliance
particularly
withwhen
Requirements
the information
Detection
includes
of Privacy
privacy
Sensitive
sensitive
Semantic
Conditions
conditions.
Analysis
in C-CDAs
The DepartmentRule
of Veterans
based Affairs (VA) privacy polic
Automatic detection of vague words and
Lebanoff
sentences
L., Liu
in privacy
F.Website
policies
privacy policies represent 2018
the single
SCOPUS
most important source
Law of informationPrivacy
for users
Policies
to gauge how their personal
Ease thedata
comprehension
are collected,ofused
Privacy
andPolicies
Complexity
shared by companies.
of Privacy PoliciesDetection
However, privacyofpolicies
vagueness
are often vague
Semantic
and Analysis
people struggle to understand
NN the content. Their opaquen
Mitigating file-injection attacks with natural
Liu H.,
language
Wang B.
processing
Searchable Encryption can bridge the
2020gapSCOPUS
between privacy protection
Other and data utilization.
Use of third
As party
it leaks
/ storing/sending/encoded
access pattern
Searching
to attain
in practical
asensitive
private search
manner
information
performance,
Profilingit is vulnerable under advanced
Encryption-based
attacks. While
searchthese
system
advanced
Semantic
mitigating
Analysis
attacks
file-Injection
show
& Morphological
signi.cant
Attacks
Ruleprivacy
based
Analysis
leakage,
& Statistical
the &
assumptions
NN
Automated Understanding Of Cloud Terms
Papanikolaou,
Of ServiceNick;
And
WeSLAs
Pearson,
argue in favour
Siani; Mont,
of a set
Marco
of particular
Casassa;
2011 tools
Google
andScholar
approachesLaw
to help achieve accountability
Privacy Policies
in cloud computing. Automated
Our concern
and
is helping
Flexiblecloud
Rule Enforcement
providers
Compliance
achieve their
with security
Requirements
goalsGand
eneration
meeting
of Artefacts
their customers’
based Semantic
on
security
Privacy
and
Analysis
Policy
privacy
Analysis
requirements.
Rule based
The techniques
& Statistical
we propose in p
Natural language processing of rules and
Papanikolaou
regulations N.
forWe
compliance
discuss ongoing
in the cloud
work on developing
2012 tools
SCOPUS
and techniquesLaw
for understanding natural-language
Privacy Policies descriptions ofAutomated
security and
and
privacy
Flexible
rules,
Ruleparticularly
Enforcement
Compliance
in the context
with Requirements
of cloud computing
Generation
services.
of Artefacts
In particular,
basedweSemantic
onpresent
PrivacyAnalysis
aPolicy
three-part
Analysis
toolkit for
Statistical
analyzing and processing texts, an
Mining Privacy Goals from Privacy Policies
Bhatia,
Using
Jaspreet;
Hybridized
Privacy
Breaux,
Task
policies
Travis
Recomposition
D.;
describe
Schaub,
high-level
Florian 2016
goalsACM
for corporate data practices;
Law regulators require
Privacy industries
Policies to make available
Definition
conspicuous,
of Privacyaccurate
Requirements
privacy policies
Compliance
to their
withcustomers.
Requirements
Consequently,
Generation of
software
Artefacts
requirements
based Semantic
on Privacy
must Analysis
conform
Policy Analysis
&
toMorphological
those privacy
Statistical
Analysis
policies. To help stakeholders e
An empirical study of natural language Brodie
parsingC.A.,
of privacy
KaratToday
policy
C.-M.,organizations
rules
Karatusing
J. thedoSPARCLE
not have good
policy
2006ways
workbench
SCOPUS
of linking their written
Law privacy policiesPrivacy
with thePolicies
implementation of those
Ease
policies.
the implementation
To assist organizations
of written privacy
inComplexity
addressing
policies
ofthis
Privacy
issue,PoliciesGeneration
our human-centeredof Artefacts
researchbased
has focused
Semantic
on Privacy
on Analysis
understanding
Policy Analysis
organizational
Rule based privacy management needs
Design of a Compliance Index for Privacy
Akanfe
Policies:
O., Valecha
A Study
Many
R.,
of Mobile
Rao
nations
H.R.
Wallet
have adopted
and Remittance
comprehensive
2020
Services
SCOPUS
data privacy lawsLaw
to protect customers’
Privacy Policies
data. However, privacy
Measuring
policiesthe
of mobile
Compliance
walletofdigital
apps payment
Compliance
systems
with (DPS),
Requirements
and particularly
Generationthe
of Artefacts
mobile wallet
based
and
Semantic
onremittance
PrivacyAnalysis
Policy
services
Analysis
that are Statistical
part of DPS, are often not compliant
Sharing copies of synthetic clinical corpora
Lohrwithout
C., Buechel
physical
The
S., Hahn
distribution
legal U.
culture
- Aincase
the European
study to get
Union
2018
around
imposes
SCOPUS
iPRS almost
and privacy
unsurmountable
Medicine
constraints featuring
hurdles patient
tothe
exploit
German
health
copyright
JSYNCC
information
protected
corpus
(PHI)
language
Privacy
/ free clinical
Preserving
data (in
textterms
Information
of intellectual
Sharing
Secondary
property rights
use of
(IPRs)
unstructured
of media
Generation
clinical
contents)
data
of
and(Research)
Synthetic
privacy Data
protected
Morphological
medical health
Analysis
data (in terms
Ruleofbased
the notion of informational self-
Resilience of clinical text de-identified with
Carrell
“hiding
D.S.,inMalin
plain
Objective:
B.A.,
sight”Cronkite
to hostile
Effective,
D.J.,
reidentification
scalable
Aberdeen de-identification
J.S.,
attacks
2020
Clark
by
SCOPUS
C.,
human
of
Li personally
M.,readers
Bastakoty
identifying
Medicine
D., Nyemba
information
S., Hirschman
Synthetic
(PII) for information-rich
L.Data GenerationclinicalPrivacy
text is Preserving
critical to support
Information
secondary
Sharing
Secondary
use, but nouse
method
of unstructured
is 100% effective.
Generation
clinical data
Theofhiding-in-plain-sight
(Research)
Synthetic Data Semantic
(HIPS)Analysis
approach attempts Statistical
to solve this “residual PII problem.” H
Challenges and opportunities beyond structured
Tayefi, Maryam;
data inNgo,
analysis
Abstract
Phuong;
of
Electronic
electronic
Chomutare,
health
health
Taridzo;
records
records(EHR)
Dalianis,
2021 Wiley
contain
Hercules;
a lot Salvi,
of valuable
Elisa;
Medicine
information
Budrionis, Andrius;
about
patient
individual
Godtliebsen,
healthpatients
information
Fredand the
(PHI)
whole
protecting
/ clinical
population.
notes
patientBesides
privacy structured
in unstructured
data,
Secondary
clinical
unstructured
use
textof unstructured
data in EHRs
Generation
clinical
can provide
data
of(Research)
Synthetic
extra, valuable
Data information
Semantic Analysis
but the analytics processes
NN are complex, time-consumin
Applying language technology to nursing
Suominen
documents:
H., Lehtikunnas
Pros
Objectives:
and cons
T.,The
with
Backpresent
aB.,
focus
Karsten
study
on ethics
H.,
discusses
Salakoski
2007 ethics
SCOPUS
T., Salanterä
in buildingS.andMedicine
using applications patient
based onhealth
natural
information
language(PHI)
processing
protecting
/ nursing
in document
electronic
patient privacy
nursing
in unstructured
documentation.
Secondary
clinical
Specifically,
use
textof unstructured
we first focus
Information
clinical
on the
data
question
Extraction
(Research)
offrom
how nursing
patient
Semantic
confidentiality
documents
Analysis &can
Morphological
be ensured
Rule based
Analysis
in developing
& Statistical
language techno
Chatbot for IT security training: Using motivational
Gulenko I. interviewing
We conduct
to improve
a pre-study
security
with
behaviour
25 participants
2014 SCOPUSon Mechanical Turk
General
to find out which complex
security behavioural
sensitive information
problems/Browsing
are
unstructured
most important
in a data
private
formanner
online users. Complexity
These questions
of Privacy
are based
PoliciesInformation
on motivationalExtraction
interviewing
from(MI),
privacy
Semantic
an evidence-based
policies
Analysis treatmentNN
methodology that enables to train pe
<i>MyAdChoices</i>: Bringing Transparency and
Parra-Arnau, Control
Javier; to Online
TheAchara,
intrusiveness Advertising
Jagdish
and
Prasad;
the increasing
Castelluccia,
2017
invasiveness
ACM
Claude of online advertising
Other have, in the
complex
last few
sensitive
years, raised
information
serious
/Browsing
unstructured
concernsinregarding
a data
private user
manner
privacy and
Profiling
Web usability. As a reactionInformation
to these concerns,
Extraction
wefrom
haveprivacy
witnessed
Semantic
policies
the
Analysis
emergence of a myriad
Statistical
of ad-blocking and antitracking
Pre-processing legal text: Policy parsing
Waterman
and isomorphic
K.K. One
intermediate
of the most
representation
significant challenges
2010in SCOPUS
achieving digital privacy
Law is incorporating Privacy
privacy Policies
policy directly in computer
Easesystems.
the automation
While rule
process
systems
of privacy
haveComplexity
long
regulating
existed,
of Privacy
documents
translating
PoliciesInformation
privacy laws, regulations,
Extraction policies,
from privacy
Semantic
and policies
contracts
Analysis
into&processor
Morphological
Rule
amenable
based
Analysis
forms
& Statistical
is slow and
& NNdifficult b
Identifying the provision of choices in privacy
Sathyendra
policyK.M.,
text Websites’
Wilson S.,and
Schaub
mobile
F.,apps’
Zimmeck
privacy
S.,2017
policies,
Sadeh SCOPUS
N.written in natural Law
language, tend to bePrivacy
long and
Policies
difficult to understand.
Ease
Information
the comprehension
privacy revolves
of Privacy
around
Policies
Complexity
the fundamental
of Privacy
principle
PoliciesInformation
of notice and Extraction
choice, namely
from privacy
theMorphological
ideapolicies
that users
Analysis
should be able
Rule
to make
basedinformed
& Statistical
decisions
& NN about
Polisis: Automated analysis and presentation
Harkousof H.,
privacy
Fawaz
policies
Privacy
K., Lebret
using
policies
R.,
deep
are
Schaub
learning
the primary
F., Shinchannel
K.G.,
2018Aberer
through
SCOPUS
K. which companies
Law inform users about
Privacy
their
Policies
data collection and sharing
Ease the
practices.
comprehension
These policies
of Privacy
are often
Policies
Complexity
long and
of Privacy
difficult to
PoliciesInformation
comprehend. ShortExtraction
notices based
from privacy
onSemantic
information
policies
Analysis
extracted from privacy
Statistical
policies
& NNhave been shown to b
Understanding Privacy Awareness in Android
Feichtner
AppJ.,Descriptions
Gruber
Permissions
S. Using are
Deepa key
Learning
factor in Android
2020toSCOPUS
protect users' privacy.
Mobile
As itApplication
is often not complex
obvious why
sensitive
applications
information
require
/Investigating
unstructured
certain permissions,
the
datapermission
developer-provided
usage coverage
Complexity
descriptions
of app
of Privacy
descriptions
in Google
PoliciesInformation
Play and third-party
Extraction
markets
from privacy
should
Semantic
policies
explain
Analysis
to users how sensitive
NN data is processed. Reliably rec
PrivacyCheck: Automatic Summarization
Zaeem,
of Privacy
Razieh
Policies
Nokhbeh;
PriorUsing
research
German,
Datashows
Mining
Rachel
that L.;
onlyBarber,
a tiny
2018
K.
percentage
Suzanne
ACM of users actually
Law read the online Privacy
privacy Policies
policies they implicitly agree
Paying
tomore
whileAttention
using a website.
to Privacy
Prior
Policies
Complexity
research
& Ease
also
ofthe
suggests
Privacy
comprehension
PoliciesInformation
that usersofignore
Privacy
privacy
Extraction
Policies
policies
from because
privacy
Semantic
policies
these
Analysis
policies are lengthy
NNand, on average, require 2 years o
Automatic extraction of opt-out choicesSathyendra
from privacyK.M.,
policies
Online
Schaub"notice
F., Wilson
and choice"
S., Sadeh
is an
N. essential
2016 SCOPUS
concept in the US FTC's
Law Fair Information
Privacy
Practice
Policies
Principles. Privacy laws
Paying
based
more
onAttention
these principles
to Privacy
include
Policies
Complexity
requirements
& Easeofthe
Privacy
for
comprehension
providing
PoliciesInformation
notice
of about
Privacy
data
Extraction
Policies
practices
fromand
privacy
Semantic
allowing
policies
individuals
Analysis to exercise
Statistical
control over those practices. Intern
A step towards usable privacy policy: Automatic
Liu F., Ramanath
alignment
With
R.,ofSadeh
privacy
the rapid
N.,
statements
Smith
development
N.A. of web-based
2014 SCOPUS
services, concerns Law
about user privacy have
Privacy
heightened.
Policies The privacy policies
Ease theof online
comprehension
websites,of
which
Privacy
serve
Policies
Complexity
as a legalofagreement
Privacy PoliciesMapping
between serviceofproviders
Privacy Policies
and users,
toMorphological
Privacy
are notIssues
easyAnalysis
for people to understand
Statistical and therefore offer an oppo
The Effect of the GDPR on Privacy Policies:
Zaeem, Recent
Razieh
Progress
Nokhbeh;
The General
andBarber,
Future
DataK.
Promise
Protection
Suzanne Regulation
2020 (GDPR)
ACM is considered
Law
by some to be thePrivacy
most important
Policies change in dataInvestigating
privacy regulation
the Impact
in 20 of
years.
GDPREffective
with
Complexity
Privacy
May 2018,
Policy
of Privacy
the
Analysis
European
PoliciesMapping
Union GDPR
Privacy
privacy
Policies
law applies
to GDPR
Semantic
to any organization
Analysis that collects
Rule and
based
processes
& Statistical
the personal info
The role of vocabulary mediation to discover
Leoneand
V., Di
represent
CaroTo
L. date,
relevant
the information
effort madein
byprivacy
existingpolicies
2020
vocabularies
SCOPUS to provide a Law
shared representation
Privacy
of thePolicies
data protection domain
Ease
is not
thefully
comprehension
exploited. Different
of Privacy
natural
Policies
Complexity
languageofprocessing
Privacy PoliciesMapping
(NLP) techniques
Privacy
havePolicy
been applied
Content
Semantic
to
tothe
selected
text
Analysis
ofDictionary
privacy
& Morphological
policiesNN
without,
Analysis
however, taking advantage o
P2Onto: Making Privacy Policies Transparent
Novikova E., Doynikova
The privacy
E., Kotenko
issue isI.highly relevant2020
for modern
SCOPUSinformation systems.
Law Both particular
Privacy
usersPolicies
and organizations usually
Ease dothe
notcomprehension
understand risks
of related
Privacywith
Policies
Complexity
personalofdata
Privacy
processing.
PoliciesOntology
The waysused
an organization
for transparency
gathers,
Semantic
in Privacy
uses,Analysis
discloses,
Policies and manages
Statistical
a customer’s or client’s data sh
De-identification of unstructured clinicalMeystre
data forS.M.
patient privacy
The adoption
protection
of Electronic Health Record
2015 SCOPUS
(EHR) systems is growing
Medicine
at a fast pacepatient
in the health
Unitedinformation
States and (PHI)protecting
in Europe, and patient
this growth
privacy
results
in unstructured
in verySecondary
largeclinical
quantities
use
textofofunstructured
patient clinical
Overview
clinical
information
data
of Challenges
(Research)
becomingand
available
applied
Noneinmethods
electronic
forformat,
protection
withof
None
tremendous
personal health
potential,
information
but also(PHI)
equal
Challenges in detecting privacy revealing
Tesfay
information
W.B., Serna
in unstructured
This
J., paper
Pape discusses
S.text the challenges
2016
in detecting
SCOPUS privacy revealing
Medicine
information using
patient
ontologies,
health information
natural language
(PHI)protecting
/processing
clinical notes
sensitive
and machine
Information
learning
in unstructured
techniques.
Sensitive Information
text
It reviews current
in unstructured
definitions,
OverviewDataof
andchallenges
sketches in
problem
detecting
None
levels
privacy
towards
revealing
identifying
information
the
None
main
in unstructured
open challenges.
text Furthermor
GDPR privacy policies in CLAUDETTE:Liepin
Challenges
R., Contissa
of omission,
The
G.,latest
Drazewski
context
developments
and
K., Lagioia
multilingualism
in natural
F., Lippi
2019
language
M., Micklitz
SCOPUS
processing
H.-W., Pałka
and machine
Law
P., Sartorlearning
G., Torroni
have
Privacy
P.created
Policies
new opportunities inInvestigating
legal text analysis.
the Impact
In particular,
of GDPRwe with
Complexity
look
Privacy
at thePolicy
texts
of Privacy
of
Analysis
online
PoliciesOverview
privacy policies
ofafter
Challenges
the implementation
of omission,
Semanticof
context
the
Analysis
European
and multilingualism
GeneralRule
Data
based
Protection Regulation (GDPR
Data processing and text mining technologies
Sun W.,on
Cai
electronic
Z., LiCurrently,
Y., medical
Liu F., medical
Fang
records:
S.,institutes
Wang
A review
G.generally
2018useSCOPUS
EMR to record patient's
Medicine
condition, including
patient
diagnostic
health information
information,
(PHI)
procedures
protecting
/ clinical performed,
notes
patient privacy
and treatment
in unstructured
results.
Disclosure
clinical
EMRoftext
has
sensitive
been recognized
data forOverview
analysis
as a valuable
over theresource
researchfor
field
None
large-scale analysis. However,
None
EMR has the characteristics of dive
Deriving semantic models from privacyBreaux
policiesT.D., Antón
Natural
A.I. language policies describe 2005
interactions
SCOPUS between and across
Law organizations,Privacy
third-parties
Policies
and individuals. However,
Ease the current
comparison
policyoflanguages
Privacy Policies
are
Complexity
limited in of
their
Privacy
abilityPoliciesSemantic
to collectively describe
Model interactions
Creation based
across
Semantic
on Privacy
these
Analysis
parties.
Policies
&Goals
Morphological
fromStatistical
requirements
Analysis engineering are usefu
Modeling language vagueness in privacy
Liupolicies
F., Fellausing
N.L.,deep
Website
Liao K.
neural
privacy
networks
policies are too long
2016
to read
SCOPUS
and difficult to understand.
Law The over-sophisticated
Privacy Policies
language undermines
Ease the effectiveness
comprehension of of
privacy
Privacy
notices.
Policies
Complexity
Peopleofbecome
Privacy less
PoliciesSemantic
willing to share
Modeling
their personal
of language
information
Semantic
vagueness
when
Analysis
they perceive the
NNprivacy policy as vague. The goal o
Opening public deliberations: Transparency,
Bassi privacy,
E., Leonianonymisation
D.,
The
Leucci
openS.,
data
Pane
movement
J., Vaccari
is demanding
L. 2013 publication
SCOPUS of data withheld
Other by public institutions.
GovernmentWideData
access to government
Privacy
data
Preserving
improvesInformation
transparency
Sharing
and
Unintended
also fosters
data
economic
disclosure
growth.
Semantic
Still, careless
open source
publication
stack ofSemantic
personal Analysis
data can&easily
Morphological
leadStatistical
to privacy
Analysis
violations. Due to these co
On the Suitability of Applying WordNet Zhu
to Privacy
N., Wang
Measurement
S., Privacy
He J., Teng
protection
D., HeisP.,
a Zhang
fundamental
Y. 2018
issueSCOPUS
in the era of big data.
General
For personalized privacy
User Generated
protection,
Content
it is necessary
(UGC)Ease
to measure
the automation
the amount
process
or the
of privacy
degree
Compliance
regulating
of privacy
with
documents
leakage.
Requirements
To facilitate
SemanticsuchSimilarities
measurement,Tool semantic
Comparison
Semantic
similarities
Analysis and relationships
Statistical
of words should be determined s
Enforcing vocabulary k-anonymity by semantic
Liu J., Wang
similarity
K. Webbased
query
clustering
logs provide a rich wealth
2010of information,
SCOPUS but also present
Generalserious privacy
Query
risks.
Data
We consider publishing
Searching
vocabularies,
in a private
bags of
manner
query-termsProfiling
extracted from web query logs,
Semantic
which has
similarity
a variety
based
of applications.
on clustering
Semantic We
and
Analysis
aim
k-annonymity
at preventing identity
Statistical
disclosure of such bag-valued
Mining rule semantics to understand legislative
Breaux T.D.,
compliance
AntónOrganizations
A.I. in privacy-regulated 2005
industries
SCOPUS
(e.g. healthcare Law
and financial institutions)
Privacy
face
Policies
significant challengesEase
whenthe
developing
automation
policies
processand
of systems
privacy
Complexity
regulating
that areofproperly
Privacy
documents
aligned
PoliciesSemantics
with relevantofprivacy
Rules Mining
legislation.
Semantic
We analyze
Analysis
privacy regulations
Statistical
derived from the Health Insurance
She Knows Too Much-Voice CommandFurey
Devices
E., and
BluePrivacy
J. Voice controlled Internet of Things (IoT)
2018 devices
SCOPUS have becomeOther
ubiquitous in homes
Speech
and offer
Dataindividuals many convenient
Voice-based
andService
entertaining
in a private
features.
manner
The
Profiling
Amazon Echo and its intelligent
Speech personal
de-identification
assistant, Alexa,
Speech
is a leading
Processing
innovation in thisRule
area.
based
This novel research examines
Partakable technology Osman N. This paper proposes a shift in how 2018
technology
SCOPUSis currently being
Social
developed
Networkby giving
Userpeople,
Generated
the users,
Content
control
(UGC) over
Definition
/ user
theirproposals
technology.
of Privacy Requirements
We argue that users
Sensitive
should
Information
have a say
in unstructured
in the Suggestion
behaviour
Dataofofthe
security
technologies
and privacy
Morphological
that mediate
requirements
their
Analysis
online
by involving
interactions
Rule
usersbased
and control their private data.
A. General Addenda

A.1.2. NLP as a Privacy Threat Analysis Sheet

Document Title | Authors | Abstract | Year | Source | Domain | Data Type | Use Cases | Generalized Issue/Vulnerability solved (RQ1) | PETs for NLP (RQ2)
A Framework for Secure Speech Recognition | P. Smaragdis; M. V. S. Shashanka | We present an algorithm that enables privacy-preserving speech recognition transactions between multiple parties. We assume two commonplace scenarios. One being the case where one of two parties has private spe… | 2007 | IEEE | General | Speech | Speech Recognition without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Secure Multiparty Computation
Identity verification using voice and its use in a privacy preserving system | Çamlıkaya, Eren | Since security has been a growing concern in recent years, the field of biometrics has gained popularity and became an active research area. Beside new identity authentication and recognition methods, protection again… | 2008 | Google Scholar | Medicine | Speech | Speech Verification without Disclosure of all Voice Features | memorizability of NN | Obfuscation
Slice-based architecture for biometrics: Prototype illustration on privacy preserving voice verification | B. K. Sy | This research investigates slice-based architecture for biometrics. A service slice is an aggregation of resources for a specific biometric objective; e.g., speaker verification. Slice-based architecture is attractive as a frame… | 2009 | IEEE | Medicine | Speech | Speech Verification without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | Secure Multiparty Computation
Privacy Preserving Techniques for Speech Processing | Pathak, M. | Speech is perhaps the most private form of personal communication but current speech processing techniques are not designed to preserve the privacy of the speaker and require complete access to the speech recordi… | 2010 | Google Scholar | General | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Differential Privacy (DP) & Secure Multiparty Computation
Privacy-preserving document similarity detection | Khelik, Ksenia | The document similarity detection is an important technique used in many applications. The existence of the tool that guarantees the privacy protection of the documents during the comparison will expand the area where… | 2011 | Google Scholar | General | Written | Similarity detection without Data Exposure | Information Disclosure by Statistical Language Models | Homomorphic Encryption (HE)
A Strategy for Deploying Secure Cloud-Based Natural Language Processing Systems for Applied Research Involving Clinical Text | D. Carrell | Natural language processing (NLP) of clinical text offers great potential to expand secondary use of high-value electronic health record (EHR) data, but a barrier to adopting NLP is the high total cost of operation, driven m… | 2011 | IEEE | Medicine | Written | Classification without Data Disclosure | Direct Disclosure of sensitive data for NLP model training | None
Privacy-preserving machine learning for speech processing | Pathak, Manas A. | Speech is one of the most private forms of personal communication, as a speech sample contains information about the gender, accent, ethnicity, and the emotional state of the speaker apart from the message content. | 2012 | Google Scholar | General | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
An Efficient and Secure Nonlinear Programming Outsourcing in Cloud Computing | Madhura, M; Santosh, R | Cloud Computing provides a appropriate on-demand network access to a shared pool of configurable computing resources which could be rapidly deployed with much more great efficiency and with minimal overhead to… | 2012 | Google Scholar | Cloud Computing | Written | Model Training without Sharing Data | memorizability of NN | Homomorphic Encryption (HE)
Privacy-preserving speaker authentication | Pathak, Manas; Portelo, Jose; Raj, Bhiksha; Trancoso, Isabel | Speaker authentication systems require access to the voice of the user. A person's voice carries information about their gender, nationality etc., all of which become accessible to the system, which could abuse this know… | 2012 | Google Scholar | General | Speech | Speech Verification without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | Obfuscation
HMM Based Privacy Preserving Approach for VOIP Communication | Prashantini, S; Jeyaprabha, TJ | Pre-processing of Speech Signal serves various purposes in any speech processing application. It includes Noise Removal, Endpoint Detection, Pre-emphasis, Framing, Windowing, Echo Canceling etc. Out of these, sil… | 2013 | Google Scholar | General | Speech | Speech Communication without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Homomorphic Encryption (HE)
Privacy-preserving speech processing: cryptographic and string-matching frameworks show promise | Pathak, Manas A; Raj, Bhiksha; Rane, Shantanu D; Smaragdis, Paris | Speech is one of the most private forms of communication. People do not like to be eavesdropped on. They will frequently even object to being recorded; in fact, in many places it is illegal to record people speaking in pu… | 2013 | Google Scholar | General | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
Collaborative search log sanitization: Toward differential privacy and boosted utility | Hong, Yuan; Vaidya, Jaideep; Lu, Haibing; Karras, Panagiotis; Goel, Sanjay | Severe privacy leakage in the AOL search log incident has attracted considerable worldwide attention. However, all the web users' daily search intents and behavior are collected in such data, which can be invaluable for… | 2014 | Google Scholar | Other | Written | Model Training without Sharing Data | Direct Disclosure of sensitive data for NLP model training | Federated Learning & Differential Privacy (DP)
Privacy-preserving speaker verification using garbled GMMs | Portêlo, José; Raj, Bhiksha; Abad, Alberto; Trancoso, Isabel | In this paper we present a privacy-preserving speaker verification system using a UBM-GMM technique. Remote speaker verification services rely on the system having access to the user's recordings, or features derive… | 2014 | Google Scholar | General | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Obfuscation
Privacy-preserving important passage retrieval | Marujo, Luís; Portêlo, José; De Matos, David Martins; Neto, João P; Gershman, Anatole; Carbonell, Jaime; Trancoso, Isabel; Raj, Bhiksha | State-of-the-art important passage retrieval methods obtain very good results, but do not take into account privacy issues. In this paper, we present a privacy preserving method that relies on creating secure representati… | 2014 | Google Scholar | General | Written | Similarity detection without Data Exposure | exploitability of word embeddings | Obfuscation
A Federated Network for Translational Cancer Research Using Clinical Data and Biospecimens | Jacobson, RS; Becich, MJ; Bollag, RJ; Chavan, G; Corrigan, J; Dhir, R; Feldman, MD; Gaudioso, C; Legowski, E; Maihle, NJ; Mitchell, K; Murphy, M; Sakthivel, M; Tseytlin, E; Weaver, J | Advances in cancer research and personalized medicine will require significant new bridging infrastructures, including more robust biorepositories that link human tissue to clinical phenotypes and outcomes. In order to m… | 2015 | Web of Science | Medicine | Written | Model Training without Sharing Data | Direct Disclosure of sensitive data for NLP model training | Federated Learning
A Full-Text Retrieval Algorithm for Encrypted Data in Cloud Storage Applications | Wei Song, Yihui Cui, Zhiyong Peng | Nowadays, more and more Internet users use the cloud storage services to store their personal data, especially when the mobile devices which have limited storage capacity popularize. With the cloud storage services,… | 2015 | Springer | Cloud Computing | Written | Storing and Searching Data without Direct Disclosure | Data Exposure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
Privacy-preserving multi-document summarization | Marujo, Luís; Portêlo, José; Ling, Wang; de Matos, David Martins; Neto, João P; Gershman, Anatole; Carbonell, Jaime; Trancoso, Isabel; Raj, Bhiksha | State-of-the-art extractive multi-document summarization systems are usually designed without any concern about privacy issues, meaning that all documents are open to third parties. In this paper we propose a privacy… | 2015 | Google Scholar | General | Written | Summarization without Document Disclosure | Information Disclosure by Statistical Language Models | Obfuscation
Privacy-preserving frameworks for speech mining | Portêlo, José Miguel Ladeira | Security agencies often desire to monitor telephonic conversations to determine if any of them are important to national security, but this results in an intolerable invasion of the privacy of thousands of regular citizens, wh… | 2015 | Google Scholar | General | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Secure Multiparty Computation
Preserving privacy of encrypted data stored in cloud and enabling efficient retrieval of encrypted data through blind storage | Dharini, M; Sasikumar, R | In cloud computing, an important application is to outsource the data to external cloud servers for scalable data storage. Privacy and confidentiality of the outsourced data needed to be ensured, for this the outsourced da… | 2016 | Google Scholar | Cloud Computing | Written | Storing and Searching Data without Direct Disclosure | Data Exposure of sensitive data for NLP model training | Homomorphic Encryption (HE)
Edit Distance Based Encryption and Its Application | Tran Viet Xuan Phuong, Guomin Yang, Willy Susilo, Kaitai Liang | Edit distance, also known as Levenshtein distance, is a very useful tool to measure the similarity between two strings. It has been widely used in many applications such as natural language processing and bioinformatic… | 2016 | Springer | Medicine | Written | Similarity detection without Data Exposure | Direct Disclosure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
Encrypted domain cloud-based speech noise reduction with comb filter | M. A. Yakubu; N. C. Maddage; P. K. Atrey | During the acquisition process of speech by recording, noise often contaminates the signal which degrades its quality (1) and makes it unpleasant for human perception and (2) causes inaccuracies in speech processing… | 2016 | IEEE | Cloud Computing | Speech | Storing and Searching Data without Direct Disclosure | Data Exposure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
Learning low-dimensional representations of medical concepts | Choi, Youngduck; Chiu, Chill Yi-I; Sontag, David | We show how to learn low-dimensional representations (embeddings) of a wide range of concepts in medicine, including diseases (e.g., ICD9 codes), medications, procedures, and laboratory tests. We expect that these… | 2016 | Google Scholar | General | Written | Privacy-Utility-Trade-off for Word Embeddings | exploitability of word embeddings | Obfuscation
Efficient and Privacy-Preserving Voice-Based Search over mHealth Data | M. Hadian; T. Altuwaiyan; X. Liang; W. Li | In-home IoT devices play a major role in healthcare systems as smart personal assistants. They usually come with a voice-enabled feature to add an extra level of usability and convenience to elderly, disabled people, a… | 2017 | IEEE | Medicine | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
Privacy Preserving Vector Quantization Based Speaker Recognition System | Ene, Andrei; Togan, Mihai; Toma, Stefan-Adrian | In the recent years we have witnessed an emergence of cloud computing services, in all important fields. These services provide space for keeping huge amount of data, and also great computational power. But togethe… | 2017 | Google Scholar | Cloud Computing | Speech | Speech Recognition without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
Privacy preserving encrypted phonetic search of speech data | C. Glackin; G. Chollet; N. Dugan; N. Cannings; J. Wall; S. Tahir; I. G. Ray; M. Rajarajan | This paper presents a strategy for enabling speech recognition to be performed in the cloud whilst preserving the privacy of users. The approach advocates a demarcation of responsibilities between the client and server… | 2017 | IEEE | Cloud Computing | Speech | Storing and Searching Data without Direct Disclosure | Data Exposure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
Privacy-Preserving Speech Emotion Recognition | Gareta, Alberto Abad | Machine learning classification and data mining tasks are used in numerous situations nowadays, for instance, quality control, eHealth, banking, and homeland security. Due to lack of large training sets and machine lear… | 2017 | Google Scholar | General | Speech | Speech Emotion Recognition without Disclosure of all Voice Features | Information Disclosure by Statistical Language Models | Homomorphic Encryption (HE)
Beyond Big Data: What Can We Learn from AI Models? Invited Keynote | Caliskan, Aylin | My research involves the heavy use of machine learning and natural language processing in novel ways to interpret big data, develop privacy and security attacks, and gain insights about humans and society through the… | 2017 | Google Scholar | Other | Written | Investigating the Impact of NLP on Privacy | exploitability of word embeddings | None
Deep models under the GAN: information leakage from collaborative deep learning | Hitaj, Briland; Ateniese, Giuseppe; Perez-Cruz, Fernando | Deep Learning has recently become hugely popular in machine learning for its ability to solve end-to-end learning systems, in which the features and the classifiers are learned simultaneously, providing significant improv… | 2017 | Google Scholar | General | Written | Investigating the Impact of Collaborative Deep Learning on Privacy | memorizability of NN | None
Author obfuscation using generalised differential privacy | Fernandes, Natasha; Dras, Mark; McIver, Annabelle | The problem of obfuscating the authorship of a text document has received little attention in the literature to date. Current approaches are ad-hoc and rely on assumptions about an adversary's auxiliary knowledge which… | 2018 | Google Scholar | General | Written | Privacy-Utility-Trade-off for Training Data | Direct Disclosure of sensitive data for NLP model training | Differential Privacy (DP)
SynTF: Synthetic and differentially private term frequency vectors for privacy-preserving text mining | Weggenmann, Benjamin; Kerschbaum, Florian | Text mining and information retrieval techniques have been developed to assist us with analyzing, organizing and retrieving documents with the help of computers. In many cases, it is desirable that the authors of such d… | 2018 | Google Scholar | General | Written | Privacy-Utility-Trade-off for Training Data | Information Disclosure by Statistical Language Models | Differential Privacy (DP)
Privacy-preserving collaborative model learning: The case of word vector training | Wang, Qian; Du, Minxin; Chen, Xiuying; Chen, Yanjiao; Zhou, Pan; Chen, Xiaofeng; Huang, Xinyi | Nowadays, machine learning is becoming a new paradigm for mining hidden knowledge in big data. The collection and manipulation of big data not only create considerable values, but also raise serious privacy concern… | 2018 | Google Scholar | General | Written | Privacy-Utility-Trade-off for Training Data | exploitability of word embeddings | Federated Learning & Homomorphic Encryption (HE)
Exploring Hashing and Cryptonet Based Approaches for Privacy-Preserving Speech Emotion Recognition | M. Dias; A. Abad; I. Trancoso | The outsourcing of machine learning classification and data mining tasks can be an effective solution for those parties that need machine learning services, but lack the appropriate resources, knowledge and/or tools to c… | 2018 | IEEE | General | Speech | Speech Emotion Recognition without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Homomorphic Encryption (HE)
Toward practical privacy-preserving analytics for IoT and cloud-based healthcare systems | Sharma, Sagar; Chen, Keke; Sheth, Amit | Modern healthcare systems now rely on advanced computing methods and technologies, such as Internet of Things (IoT) devices and clouds, to collect and analyze personal health data at an unprecedented scale and d… | 2018 | Google Scholar | Medicine | Written | Privacy-Utility-Trade-off for Training Data | Direct Disclosure of sensitive data for NLP Tasks | None
Smart Bears don't talk to strangers: analysing privacy concerns and technical solutions in smart toys for children | Demetzou, Katerina; Böck, Leon; Hanteer, Obaida | The "Smart Bear" is a hypothetical connected-smart toy for children. While the functionalities it presents are appealing to both children and their parents, the privacy concerns that are raised should be taken into serious… | 2018 | Google Scholar | Other | Speech | Private Communication | Direct Disclosure of sensitive data for NLP Tasks | None
Privacy-preserving neural representations of text | Coavoux, Maximin; Narayan, Shashi; Cohen, Shay B | This article deals with adversarial attacks towards deep learning systems for Natural Language Processing (NLP), in the context of privacy protection. We study a specific type of attack: an attacker eavesdrops on the hid… | 2018 | Google Scholar | Cloud Computing | Written | Privacy-Utility-Trade-off for Neural Text Representations | exploitability of word embeddings | None
VoiceGuard: Secure and Private Speech Processing | Brasser, Ferdinand; Frassetto, Tommaso; Riedhammer, Korbinian; Sadeghi, Ahmad-Reza; Schneider, Thomas; Weinert, Christian | With the advent of smart-home devices providing voice-based interfaces, such as Amazon Alexa or Apple Siri, voice data is constantly transferred to cloud services for automated speech recognition or speaker verificatio… | 2018 | Google Scholar | Home Automation | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | Obfuscation
Towards robust and privacy-preserving text representations | Li, Yitong; Baldwin, Timothy; Cohn, Trevor | Written text often provides sufficient clues to identify the author, their gender, age, and other important attributes. Consequently, the authorship of training and evaluation corpora can have unforeseen impacts, including d… | 2018 | Google Scholar | General | Written | Privacy-Utility-Trade-off for Training Data | exploitability of word embeddings | Obfuscation
Privacy-preserving active learning on sensitive data for user intent classification | Feyisetan, Oluwaseyi; Drake, Thomas; Balle, Borja; Diethe, Tom | Active learning holds promise of significantly reducing data annotation costs while maintaining reasonable model performance. However, it requires sending data to annotators for labeling. This presents a possible privac… | 2019 | Google Scholar | General | Written | Model Training without Sharing Data | Direct Disclosure of sensitive data for NLP model training | Differential Privacy (DP)
Data Anonymization for Privacy Aware Machine Learning | David Nizar Jaidan, Maxime Carrere, Zakaria Chemli, Rémi Poisvert | The increase of data leaks, attacks, and other ransom-ware in the last few years have pointed out concerns about data security and privacy. All this has negatively affected the sharing and publication of data. To address… | 2019 | Springer | General | Written | Privacy-Utility-Trade-off for Training Data | Direct Disclosure of sensitive data for NLP Tasks | Differential Privacy (DP)
Generalised differential privacy for text document processing | Fernandes, Natasha; Dras, Mark; McIver, Annabelle | We address the problem of how to "obfuscate" texts by removing stylistic clues which can identify authorship, whilst preserving (as much as possible) the content of the text. In this paper we combine ideas from "generali… | 2019 | Google Scholar | General | Written | Privacy-Utility-Trade-off for Training Data | Information Disclosure by Statistical Language Models | Differential Privacy (DP)
I am not what I write: Privacy preserving text representation learning | Beigi, Ghazaleh; Shu, Kai; Guo, Ruocheng; Wang, Suhang; Liu, Huan | Online users generate tremendous amounts of textual information by participating in different activities, such as writing reviews and sharing tweets. This textual data provides opportunities for researchers and business p… | 2019 | Google Scholar | Other | Written | Privacy-Utility-Trade-off for Training Data | Information Disclosure by Statistical Language Models | Differential Privacy (DP)
Learning Private Neural Language Modeling with Attentive Aggregation | S. Ji; S. Pan; G. Long; X. Li; J. Jiang; Z. Huang | Mobile keyboard suggestion is typically regarded as a word-level language modeling problem. Centralized machine learning techniques require the collection of massive user data for training purposes, which may raise p… | 2019 | IEEE | General | Written | Model Training without Sharing Data | Direct Disclosure of sensitive data for NLP model training | Federated Learning
Generative models for effective ML on private, decentralized datasets | Augenstein, Sean; McMahan, H Brendan; Ramage, Daniel; Ramaswamy, Swaroop; Kairouz, Peter; Chen, Mingqing; Mathews, Rajiv | To improve real-world applications of machine learning, experienced modelers develop intuition about their datasets, their models, and how the two interact. Manual inspection of raw data - of representative samples, of… | 2019 | Google Scholar | General | Written | Model Training without Sharing Data | Direct Disclosure of sensitive data for NLP model training | Federated Learning & Differential Privacy (DP)
A privacy-preserving distributed filtering framework for NLP artifacts | Sadat, MN; Aziz, MMA; Mohammed, N; Pakhomov, S; Liu, HF; Jiang, XQ | Background: Medical data sharing is a big challenge in biomedicine, which often hinders collaborative research. Due to privacy concerns, clinical notes cannot be directly shared. A lot of efforts have been dedicated to de… | 2019 | Web of Science | Medicine | Written | Privacy-Utility-Trade-off for Training Data | Direct Disclosure of sensitive data for NLP model training | Homomorphic Encryption (HE)
Encrypted Speech Recognition Using Deep Polynomial Networks | S. Zhang; Y. Gong; D. Yu | The cloud-based speech recognition/API provides developers or enterprises an easy way to create speech-enabled features in their applications. However, sending audios about personal or company internal information… | 2019 | IEEE | Cloud Computing | Speech | Speech Recognition without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Homomorphic Encryption (HE)
Privacy-preserving voice-based search over mHealth data | Hadian, Mohammad; Altuwaiyan, Thamer; Liang, Xiaohui; Li, Wei | Voice-enabled devices have a potential to significantly improve the healthcare systems as smart personal assistants. They usually come with a hands-free feature to add an extra level of usability and convenience to elde… | 2019 | Google Scholar | Medicine | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | Homomorphic Encryption (HE)
Obfuscation for privacy-preserving syntactic parsing | Hu, Zhifeng; Havrylov, Serhii; Titov, Ivan; Cohen, Shay B | The goal of homomorphic encryption is to encrypt data such that another party can operate on it without being explicitly exposed to the content of the original data. We introduce an idea for a privacy-preserving transform… | 2019 | Google Scholar | General | Written | Privacy-Utility-Trade-off for Training Data | exploitability of word embeddings | Homomorphic Encryption (HE)
An Efficient and Dynamic Semantic-Aware Multikeyword Ranked Search Scheme Over Encrypted Cloud Data | X. Dai; H. Dai; G. Yang; X. Yi; H. Huang | Traditional searchable encryption schemes adopting the bag-of-words model occupy massive space to store the document set's index, where the dimension of the document vector is equal to the scale of the dictionary. T… | 2019 | IEEE | Cloud Computing | Written | Storing and Searching Data without Data Exposure | exploitability of word embeddings | Homomorphic Encryption (HE)
Semantic-aware multi-keyword ranked search scheme over encrypted cloud data | Dai, Hua; Dai, Xuelong; Yi, Xun; Yang, Geng; Huang, Haiping | Traditional searchable encryption schemes mostly adopt the TF-IDF (term frequency - inverse document frequency) model which ignores the semantic association between keywords and documents. It is a challenge to d… | 2019 | ScienceDirect | Cloud Computing | Written | Storing and Searching Data without Data Exposure | memorizability of NN | Homomorphic Encryption (HE)
Compromising Speech Privacy under Continuous Masking in Personal Spaces | S. A. Anand; P. Walker; N. Saxena | This paper explores the effectiveness of common sound masking solutions deployed for preserving speech privacy in workplace environment such as hospitals, financial institutions, lawyers offices, nursing homes and g… | 2019 | IEEE | Other | Speech | Speech Processing without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP Tasks | None
Adversarial Training for Privacy-Preserving Deep Learning Model Distribution | M. Alawad; S. Gao; X. Wu; E. B. Durbin; L. Coyle; L. Penberthy; G. Tourassi | Collaboration among cancer registries is essential to develop accurate, robust, and generalizable deep learning models for automated information extraction from cancer pathology reports. Sharing data presents a seriou… | 2019 | IEEE | Medicine | Written | Model Training without Sharing Data | Direct Disclosure of sensitive data for NLP model training | Obfuscation
Adversarial Learning of Privacy-Preserving Text Representations for De-Identification of Medical Records | Friedrich, Max; Köhn, Arne; Wiedemann, Gregor; Biemann, Chris | De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHRs) to be shared for research. Automatic de-identification classifiers c… | 2019 | Google Scholar | Medicine | Written | Privacy-Utility-Trade-off for Training Data | Direct Disclosure of sensitive data for NLP model training | Obfuscation
Privacy-aware Document Ranking with Neural Signals | Shao, JJ; Ji, SY; Yang, T | The recent work on neural ranking has achieved solid relevance improvement, by exploring similarities between documents and queries using word embeddings. It is an open problem how to leverage such an advancem… | 2019 | Web of Science | Cloud Computing | Written | Similarity detection without Data Exposure | memorizability of NN | Obfuscation
Privacy-preserving outsourced speech recognition for smart IoT devices | Ma, Zhuo; Liu, Yang; Liu, Ximeng; Ma, Jianfeng; Li, Feifei | Most of the current intelligent Internet of Things (IoT) products take neural network-based speech recognition as the standard human-machine interaction interface. However, the traditional speech recognition framework… | 2019 | Google Scholar | Other | Speech | Speech Recognition without Disclosure of all Voice Features | memorizability of NN | Secure Multiparty Computation
Privacy-preserving adversarial representation learning in ASR: Reality or illusion? | Srivastava, Brij Mohan Lal; Bellet, Aurélien; Tommasi, Marc; Vincent, Emmanuel | Automatic speech recognition (ASR) is a key technology in many services and applications. This typically requires user devices to send their speech data to the cloud for ASR decoding. As the speech signal carries a lot… | 2019 | Google Scholar | General | Speech | Speech Recognition without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Synthetic Data Generation
You talk too much: Limiting privacy exposure via voice input | Vaidya, Tavish; Sherr, Micah | Voice synthesis uses a voice model to synthesize arbitrary phrases. Advances in voice synthesis have made it possible to create an accurate voice model of a targeted individual, which can then in turn be used to genera… | 2019 | Google Scholar | Home Automation | Speech | Speech Recognition without Disclosure of all Voice Features | Direct Disclosure of sensitive data for NLP model training | Synthetic Data Generation
Towards differentially private text representations | Lyu, Lingjuan; Li, Yitong; He, Xuanli; Xiao, Tong | Most deep learning frameworks require users to pool their local data or model updates to a trusted server to train or maintain a global model. The assumption of a trusted server… | 2020 | Google Scholar | General | Written | Privacy-Utility-Trade-off for Word Embeddings | exploitability of word embeddings | …
who has access
Differential
to userPrivacy
information
(DP) is ill-s
Calibrating Mechanisms for Privacy Preserving Text
Feyisetan,
AnalysisOluwaseyi;
This talk
Balle,
presents
Borja;a Diethe,
formal approach
Tom;
2020
Drake,
Google
to carrying
Thomas;
Scholar
out privacy
General
preserving text perturbation using a variant
Writtenof Differential Privacy-Utility-Trade-off
Privacy (DP) known as Metric
for Word
DPexploitability
Embeddings
(mDP). Ourofapproach
word embeddings
applies carefully
Differential
calibrated
Privacy
noise to
(DP)
vector re
Hyperbolic Embeddings for Preserving Privacy and
Feyisetan,
Utility inOluwaseyi;
TextUser’s Diethe,
goal: meet
Tom;
some
Drake,
specific
Thomas;
2020
need
Google
with respect
Scholarto aGeneral
query x • Agent’s goal: satisfy the user’s request
Written• Question: what
Privacy-Utility-Trade-off
occurs when x is used for to
Word
make
exploitability
Embeddings
other inferences
of word• embeddings
Mechanism: Modify
Differential
the queryPrivacy
to protect
(DP)privacy
Differentially private set union Gopi, Sivakanth;We study the
Gulhane, basic Kulkarni,
Pankaj; operationJanardhan;
of set union
2020 Googlein Scholar
theJudy
Shen, global model of
Hanwen;
General differential
Shokouhi, privacy.
Milad; In this problem,
Yekhanin, we are given a universe
Sergey;Written 𝑈 of
Similarity items, possibly
detection without of infinite
Data size, and
Information
Exposure Disclosure by 𝐷
a database of users.Language
Statistical Each 𝑖 Privacy
userModels
Differential contributes
(DP)a subse
Differentially Private Representation for NLP: Formal
Lyu, Guarantee
Lingjuan; He,
Itand
has
Xuanli;
An
been
Empirical
Li,
demonstrated
Yitong;
Study onthat
Privacy
2020
hidden
and
Google
representation
Fairness
Scholar learned
Generalby a deep model can encode privateWritten
information of the input,
Privacy-Utility-Trade-off
hence can be exploited
for Neural
to recover
memorizability
Text Representations
such information
of NN with reasonable
Differential
accuracy.Privacy
To address
(DP) this is
Building a Personally Identifiable Information Recognizer
Hathurusinghe,
in a Privacy
This
Rajitha;
thesis
Preserved
explores
Manner
the training
Using2020
Automated
of aGoogle
deep Annotation
neural
Scholar
network
and
General
based
Federated
named
Learning
entity recognizer in an end-to-end
Written privacy preserved
Model Training
setting where
withoutdataset
Sharingcreation
Data
Direct and
Disclosure
model training
of sensitive
happen
datainfor
anFederated
NLP
environment
modelLearning
training
with minimal ma
Federated learning for healthcare informatics Xu, Jie; Glicksberg,
WithBenjamin
the rapid S;
development
Su, Chang;ofWalker,
2020
computer
Google
Peter;
software
Bian,
Scholar
Jiang;
and Medicine
hardware
Wang, Fei;
technologies, more and more healthcare
Written data are becoming
Model Training
readily available
without Sharing
from clinical
Data
Direct
institutions,
Disclosurepatients,
of sensitive
insurance
data for
companies,
Federated
NLP modeland
Learning
training
pharmaceutical
Federated Acoustic Model Optimization for Automatic
Conghui
Speech
TanDi
Recognition
JiangHuaxiao
Traditional Automatic
MoJinhua
Speech
PengYongxin
2020
Recognition
Springer
TongWeiwei
(ASR) systems
ZhaoChaotao
Other
are usually
ChenRongzhong
trained with speech
LianYuanfeng
recordsSpeech
SongQian
centralizedXu
on the ASR
Model
vendor’s
Trainingmachines.
without Sharing
However,
Data
Direct
with Disclosure
data regulations
of sensitive
such asdata
General
for Federated
NLP
Data
model
Protection
Learning
trainingRegulation
Privacy-Preserving Deep Learning NLP Models for
M. Cancer
Alawad;Registries
H. Yoon;
Population
S. Gao;
cancer
B. Mumphrey;
registries X.
can
2020
Wu;
benefit
E.
IEEE
B.from
Durbin;
deepJ. learning
C. Medicine
Jeong;
(DL)
I. Hands;
to automatically
D. Rust; L.extract
Coyle;cancer
L. Penberthy;
characteristics
WrittenG. Tourassi
from pathology
Model Training
reports.
without
The success
Sharingof
Data
exploitability
DL is proportional
of wordtoembeddings
the availability ofFederated
large labeled
Learning
datasets for mo
Texthide: Tackling data privacy in language understanding
Huang, Yangsibo;
tasks AnSong,
unsolved
Zhao;
challenge
Chen, Danqi;
in distributed
Li,
2020
Kai;Google
or
Arora,
federated
Sanjeev;
Scholar
learning
General
is to effectively mitigate privacy risks without
Written
slowing down Model
trainingTraining
or reducing
without
accuracy.
SharingInData
memorizability
this paper, we propose
of NN TextHide aiming
Federated
at addressing
Learning
this challenge
Task-Agnostic Privacy-Preserving RepresentationLi,Learning
Ang; Yang,
via Huanrui;
Federated
The availability
Chen,
Learning
Yiran;
of various large-scale
2020 Google
datasets
Scholar
benefitsGeneral
the advancement of deep learning. These Written
datasets are often crowdsourced
Model Trainingfrom
without
individual
Sharingusers
Data
memorizability
and contain of
private
NN information like Federated
gender, age,
Learning
etc. Due to rich p
Decentralizing feature extraction with quantum convolutional
Yang, Chao-Han
neural
We
Huck;
network
propose
Qi, Jun;
for
a novel
automatic
Chen,
decentralized
Samuel
speech
2020
Yen-Chi;
recognition
feature
Google
Chen,
extraction
Scholar
Pin-Yu;
approach
General
Siniscalchi,
in federated
Sabato Marco;
learningMa,
to address
Xiaoli; Lee,
privacy-preservation
Speech
Chin-Hui; issues
Model for
Training
speechwithout
recognition.
SharingIt is
Data
memorizability
built upon a quantum
of NN convolutional neural
Federated
network
Learning
(QCNN) compos
Fold-stratified cross-validation for unbiased and privacy-preserving
Bey, Romain; Goussault,
Objective:
federated
Romain;
Welearning
introduce
Grolleau,
fold-stratified
François;
2020 Google
cross-validation,
Benchoufi,
Scholar
Mehdi;
Medicine
a validation
Porcher, Raphaël;
methodology that is compatibleWritten
with privacy-preserving
Modelfederated
Training without
learningSharing
and thatData
memorizability
prevents data leakage
of NN caused by duplicates
Federated
of electronic
Learning health reco
Empirical Studies of Institutional Federated Learning
Zhu,For
Xinghua;
NaturalWang,
Federated
Language
Jianzong;
learning
Processing
Hong,
hasZhenhou;
sparkled
2020
new
Xiao,
Google
interests
Jing; Scholar
in the deep
General
learning society to make use of isolatedWritten
data sources fromModel
independent
Traininginstitutes.
without Sharing
With theData
Direct
development
Disclosure
of novel
of sensitive
trainingdata
tools,
forwe
Federated
NLP
have
model
successfully
Learning
training &
deployed
Differen
Privacy-Preserving Collaborative Deep Learning L.
With
Zhao;
Unreliable
Q. Wang;
With
Participants
Q.powerful
Zou; Y. Zhang;
parallelY.
computing
Chen 2020
GPUs
IEEE
and massive user
General
data, neural-network-based deep learning
Written
can well exert itsModel
strongTraining
power inwithout
problemSharing
modeling
Data
memorizability
and solving,ofand
NNhas archived greatFederated
success in
Learning
many applications
& Differen
Generic cost optimized and secured sensitive attribute
Sumathi,
storage
M; Sangeetha,
model
Cloudfor
computing
template
S; Thomas,
stands
based
A astextthe
document
2020
mostWeb
powerful
onofcloud
Science
technology
Cloud
for Computing
providing and managing resources as
Written
a pay-perusage system.
Model Training
Nowadays,
without
userSharing
documents
Data
Direct
are Disclosure
stocked in of
thesensitive
cloud fordata
easyforaccess,
Homomorphic
NLP model
less maintenance
training
Encryptioncost,
(HE)
Cross-lingual multi-keyword rank search with semantic
Guan, extension
Zhitao; Liu,
The
over
Xueyan;
emergence
encrypted
Wu, Longfei;
of
data
searchable
Wu, Jun;
2020
encryption
Xu,
ScienceDirect
Ruzhi;
technology
Zhang, sheds
Jinhu;
General
new
Li, Yuanzhang
light on the problem of secure search
Written
over encryptedStoring
data, which
and Searching
has been Data
a research
without
Direct
hotspot
Data
Disclosure
Exposure
in the past
of sensitive
few years.
data
However,
for Homomorphic
NLPmost
Tasks
existing
Encryption
search schem
(HE)
Privacy Preserving Chatbot Conversations D. Biswas With chatbots gaining traction 2020
and their
IEEEadoption growing
General
in different verticals, e.g. Health, Banking,
Written
Dating; and usersStoring
sharingand
more
Searching
and moreData
private
without
information
Direct
Data
Disclosure
Exposure
with chatbots
of sensitive
- studies
data for
haveHomomorphic
NLP
started
Tasksto highlight
Encryption
the privac
(HE)
Multi-user searchable encryption voice in home IoT
Li, Wei;
system
Xiao, Yazhou;
With the
Tang,
development
Chao; Huang,
of home
Xujing;
2020
IoT voice
Xue,
ScienceDirect
Jianwu
systems that Home
can provide
Automation
interfaces, such as Amazon Alex,
Speech
it is easy to disclose
Storing
personal
and Searching
privacy when
Datapeople
without
Direct
search
Data
Disclosure
Exposure
voice conveniently.
of sensitiveItdata
becomes
for Homomorphic
NLP
a challenge
Tasks Encryption
how to search
(HE)v
PrivFT: Private and Fast Text Classification With A.
Homomorphic
A. Badawi; L.
Encryption
We
Hoang;
present
C. F.
anMun;
efficient
K. Laine;
and non-interactive
K.
2020
M. M.
IEEE
Aungmethod forGeneral
Text Classification while preserving the privacy
Written
of the content using
Classification
Fully Homomorphic
on encrypted
Encryption
Dataexploitability
(FHE). Ourof word
solution
embeddings
(named Private
Homomorphic
Fast Text (PrivFT))
Encryption
provides
(HE)
PAIGE: towards a hybrid-edge design for privacy-preserving
Liang, Yilei; intelligent
O'Keeffe,
Intelligent
Dan;
personal
Personal
Sastry,
assistants
Nishanth;
Assistants2020
(IPAs)Google
such asScholar
Apple's Siri,
Home Google
Automation
Now, and Amazon Alexa are becoming
Speech an increasingly
Speech
important
Processing
class of
without
web application.
Disclosure
Direct Disclosure
of
In all
contrast
Voice of
to
Features
sensitive
previous keyword-oriented
data for None
NLP Tasks
search applications,
Privacy Risks of General-Purpose Language Models
X. Pan; M. Zhang;
Recently,
S. Ji; M.
a new
Yangparadigm of building
2020 IEEE
general-purpose General
language models (e.g., Google's Bert and OpenAI's
Written GPT-2) in Natural
Investigating
Language
the Impact
Processing
of Word
(NLP)
exploitability
Embeddings
for text feature
of
onword
Privacy
extraction,
embeddings a standard
Noneprocedure in NLP systems
Exploring the Privacy-Preserving Properties of Word
Abdalla,
Embeddings:
M; Abdalla,
Background:
Algorithmic
M; Hirst,Word
G;
Validation
Rudzicz,
embeddings
Study
F 2020
are dense
Web ofnumeric
Sciencevectors
Medicine
used to represent language in neural networks.
Written Until recently,
Investigating
there hadthe
been
Impact
no publicly
of Wordreleased
exploitability
Embeddings
embeddings
of
onword
Privacy
embeddings
trained on clinicalNone
data. Our work is the first to s
Investigating the Impact of Pre-trained Word Embeddings
Thomas, A;
onAdelani,
Memorization
The sensitive
DI; Davody,
in Neural
information
A; Mogadala,
Networks
present
2020
A; Klakow,
in Web
the training
ofD Science
data, poses
General
a privacy concern for applications as Written
their unintended memorization
Investigatingduring
the Impact
training
of can
Pre-trained
make
memorizability
models
Word Embeddings
susceptible
of NN toon
membership
NN Memorization
None
inference and attribute infere
Privacy Preserving Acoustic Model Training for Speech
Y. Tachioka
Recognition
In-domain speech data significantly
2020 improve
IEEE the speech
General
recognition performance of acoustic models.
Speech
However, the data
Model
mayTraining
contain without
confidential
Sharing
information
Data
Direct Disclosure
and exposure
of sensitive
of transcriptions
data for Obfuscation
NLP
may model
lead totraining
a breach in speak
Performance Evaluation of Voice Encryption Techniques
A. M. Raheema;
Based on
With
S.Modified
B.the
Sadkhan;
substantial
Chaotic
S. M.Systems
development
Abdul Sattar
2020of IEEE
communication systems,
Generalespecially in the public channels through
Speech
which the knowledge
Speechmoves,
Communication
channel problems
without Disclosure
Direct
are growing.
Disclosure
of all
TheVoice
ofprotection
sensitive
Features
of
data
privacy
for Obfuscation
NLP
andTasks
data is critical and shoul
Privacy-and utility-preserving textual analysis viaFeyisetan,
calibrated Oluwaseyi;
multivariate
Accurately
Balle,
perturbations
learning
Borja; from
Drake,
user
Thomas;
2020
data while
Google
Diethe,
providing
Scholar
Tom; quantifiable
General privacy guarantees provides an opportunity
Written to build better
Privacy-Utility-Trade-off
ML models while maintaining
for Training
exploitability
user
Data
trust. This
of word
paper
embeddings
presents a formal
Obfuscation
approach to carrying out pri
Preech: A system for privacy-preserving speech transcription
Ahmed, Shimaa;New
Chowdhury,
advancesAmrita
in machine
Roy; Fawaz,
learning
2020Kassem;
have
Google
made
Ramanathan,
Scholar
Automated
Cloud
Parmesh;
Speech
Computing
Recognition (ASR) systems practical
Speechand more scalable.
SpeechThese
Transcription
systems,without
however,
Disclosure
Information
pose serious
of Acoustic
Disclosure
privacy
Features
threats
by Statistical
as speech
Language
Obfuscation
is a richModels
source of sensitive
Privacy-preserving Voice Analysis via Disentangled
Aloufi,
Representations
Ranya; Haddadi,
Voice User
Hamed;
Interfaces
Boyle,
(VUIs)
David;
are
2020
increasingly
Google Scholar
popular and
Other
built into smartphones, home assistants,
Speech
and Internet of Things
Speech
(IoT)Processing
devices. Despite
withoutoffering
Disclosure
memorizability
an always-on
of all Voice
ofconvenient
NN
Features user experience,
Obfuscation
VUIs raise new securit
A novel privacy-preserving speech recognition framework
Wang, QR;
using
Feng,
bidirectional
Utilizing
CK; Xu,speech
Y;LSTM
Zhong,
as the
H;transmission
Sheng,
2020
VS Web
medium
of Science
in Internet
Other
of things (IoTs) is an effective way to reduce
Speech
latency while improving
Speechthe
Recognition
efficiency without
of human-machine
Disclosure
memorizability
of
interaction.
all Voice
of NN
Features
In the field of speech
Obfuscation
recognition, Recurrent Neu
Privacy-Aware Best-Balanced Multilingual Communication
Pituxcoosuvarn,InM;machine
Nakaguchi,
translation
T; Lin, DH;
(MT)Ishida,
mediated
2020T Web
human-to-human
of Science General
communication, it is not an easy task to select
Written
the languages Classification
and translationonservices
encrypted
to be
DataDirect
used as the
Disclosure
users have
of sensitive
various data
language
for Secure
NLP
backgrounds
Tasks
Multiparty
andComputation
skills. Our
Privacy-preserving Feature Extraction via Adversarial
Ding,Training
Xiaofeng; Deep
Fang,learning
Hongbiao;
is increasingly
Zhang, Zhilin;
popular,
2020
Choo,
Google
partly
Kim-Kwang
Scholar
due to Raymond;
its Cloud
widespread
Computing
Jin, Hai;
application potential, such as inWritten
civilian, government
Model
and military
Trainingdomains.
without Sharing
Given the
Data
memorizability
exacting computational
of NN requirements,
Secure
cloud Multiparty
computingComputation
has been u
Document Title Authors Abstract Year Source Domain Data Type Use Cases Generalized Issue/Vulnerability solved
PETs(RQ2)
for NLP(RQ1)
1
SecureNLP: A system for multi-party privacy-preserving
Feng, Qi;
natural
He, Debiao;
language
NaturalLiu,
language
processing
Zhe; Wang,
processing
Huaqun;
(NLP)
2020
Choo,
allows
Google
Kim-Kwang
aScholar
computer
Raymond;
General
program to understand human language asWritten
it is spoken, and has
Model
been
Training
increasingly
without
deployed
SharinginData
memorizability
a growing number
of NN
of applications, such
Secure
as machine
Multiparty
translation,
Computation
sen
Adversarial representation learning for private speech
Ericsson,
generation
David;As
Östberg,
more and
Adam;
moreZec,
dataEdvin
is collected
Listo;
2020Martinsson,
in
Google
various
Scholar
settings
John; Mogren,
General
across organizations,
Olof; companies, and countries,
Speechthere has been
Speech
an increase
Characteristics
in the demand
Obfuscation
of user
Direct
inprivacy.
a
Disclosure
dynamic
Developing
manner
of sensitive
privacy
data
preserving
for Synthetic
NLP model
methods
Data
training
for
Generation
data analytic
Privacy-Preserving Knowledge Transfer with Bootstrap
Hong-Jun
Aggregation
YoonHilda
There
ofB.
Teacher
is KlaskyEric
a needEnsembles
to transfer
B. DurbinXiao-Cheng
knowledge
2021 Springer
among
WuAntoinette
institutions
General
StroupJennifer
and organizations
DohertyLinda
to save effort
CoyleLynne
in annotation
Written
PenberthyChristopher
and labeling orModel
inStanleyJ.
enhancing
Training
Blair
task
without
ChristianGeorgia
performance.
Sharing Data
exploitability
However,
D. Tourassi
knowledge
of word embeddings
transfer is difficult
Differential
becausePrivacy
of restrictions
(DP) that
ADePT: Auto-encoder based Differentially PrivateKrishna,
Text Transformation
Satyapriya;
Privacy
Gupta,
is anRahul;
important
Dupuy,
concern
Christophe;
2021
when
Google
building
Scholar
statistical
General
models on data containing personal information.
Written Differential privacy
Privacy-Utility-Trade-off
offers a strong definition
for Training
of
exploitability
privacy
Data andofcan
word
beembeddings
used to solve several
Differential
privacyPrivacy
concerns
(DP)(Dwork
Privacy-Preserving Graph Convolutional Networks
Igamberdiev,
for Text Classification
Timour;
GraphHabernal,
convolutional
Ivan;networks 2021
(GCNs)
Google
are a Scholar
powerful architecture
General for representation learning and Written
making predictions on
Classification
documentswithout
that naturally
Data Disclosure
occur
memorizability
as graphs, e.g.,
of NN
citation or social networks.
Differential
DataPrivacy
containing
(DP)sensitiv
Exploiting peer-to-peer communications for queryTran,
privacy
Bang;
preservation
Liang,
Voice
Xiaohui;
assistant
in voice assistant
system (VAS)
systems
2021
is a popular
Googletechnology
Scholar for
Home
users
Automation
to interact with the Internet and theSpeech
Internet-of-Things Private
devices.Communication
In the VAS, voice queries
Direct
are linked
Disclosure
to users’
of sensitive
accounts,
data
resulting
for Obfuscation
NLPinTasks
long-term and continuous
Federated learning of N-gram language models Chen M., Suresh
WeA.T.,
propose
Mathews
algorithms
R., Wong
to train
A., Allauzen
2019
production-quality
SCOPUS
C., Beaufaysn-gram
F.,General
Riley
language
M. models using federated learning.
Written
Federated learning
Model
is aTraining
distributed
without
computation
Sharing platform
Data
Direct Disclosure
that can be
of used
sensitive
to train
dataglobal
for Federated
NLP
models
model
forLearning
training
portable devices s
Poster abstract: Federated learning for speech emotion
Latif S.,recognition
Khalifa Privacy
S., applications
Ranaconcerns
R., Jurdak
areR.considered
2020
oneSCOPUS
of the major challenges
Generalin the applications of speech emotion
Speech
recognition (SER)Model
as it involves
Trainingthe
without
complete
Sharing
sharing
Data
Direct
of speech
Disclosure
data,
of which
sensitive
candata
bringfor
threatening
Federated
NLP modelconsequences
Learning
training to pe
Industrial Federated Topic Modeling Jiang, Di; Tong, Probabilistic
Yongxin; Song,topic
Yuanfeng;
modelingWu,
has
2021
Xueyang;
beenACM
applied
Zhao,
in Weiwei;
a variety
Medicine
Peng,
of industrial
Jinhua;applications.
Lian, Rongzhong;
Training
Xu,
a high-quality
Qian;
Written
Yang, model
Qiang usually
Model
requires
Training
a massive
withoutamount
SharingofData
Direct
data toDisclosure
provide comprehensive
of sensitive data
co-occurrence
for Federated
NLP model
information
Learning
training for the mo
Improving on-device speaker verification using federated
Granqvistlearning
F., Seigel
Information
with
M.,
privacy
vanon
Dalen
speaker
R., Cahill
characteristics
Á.,
2020
Shum
SCOPUS
can
S., Paulik
be useful
M. as
General
side information in improving speaker recognition
Speech accuracy. However,
Model Training
such information
without Sharing
is often
Data
Direct
private.
Disclosure
This paper
of sensitive
investigates
datahow
for Federated
NLP
privacy-preserving
modelLearning
traininglearning
& Differen
ca
Secure computation for privacy preserving biometric
Sy B.data retrieval
Theand
goal
authentication
of this research is to 2008
develop
SCOPUS
provable secureGeneral
computation techniques for two biometric security
Written tasks; namely
Speech
biometric
Communication
data retrieval
without
and authentication.
Disclosure
Direct Disclosure
of all
WeVoice
first
of sensitive
present
Features
data
models
for Homomorphic
NLP
for privacy
modeland
training
Encryption
security that
(HE)
de
Privacy-preserving PLDA speaker verification using
Treiber
outsourced
A., Nautsch
secure
The A.,
usage
computation
Kolberg
of biometric
J., Schneider
recognition
T.,
2019Busch
has
SCOPUS
become
C. prevalent
General
in various verification processes, ranging
Speech
from unlocking mobile
Speechdevices
Verification
to verifying
without
bank
Disclosure
transactions.
Direct of
Disclosure
all Automatic
Voice of
Features
sensitive
speakerdata
verification
for Homomorphic
NLP(ASV)
modelallows
training
Encryption
an individual
(HE)
Achieving efficient similar document search over Aritomo
encrypted
D.,data
Watanabe
Cloud
on thecomputing,
C.
cloud which provides
2019scalable
SCOPUS computing resources
Cloud Computing
at an economical rate, is more important
Written than ever. More
Storing
andand
more
Searching
data owners
Dataare
without
storing
Direct
Data
their
Disclosure
Exposure
data on
ofthe
sensitive
cloud. data
To protect
for Homomorphic
NLP
private
modeland
training
sensitive
Encryption
data,
(HE)u
Privacy preserving speech, face and fingerprint based
Dineshbiometrie
A., BijoyBiometrics
authentication
K.E. represents
system the
using
identity
secure
2017of SCOPUS
signal
individuals.
processing
Physical
General
characteristics like voice, face, fingerprint,
Speech
etc. are used to recognize
Speech Processing
individuals.without
Biometrics
Disclosure
are
Direct
used
Disclosure
of all
as Voice
a promising
of
Features
sensitive
methoddata
forfor
authentication,
Homomorphic
NLP Tasks but Encryption
use of these
(HE)
Efficient and Privacy-Preserving Speaker Verification
Gao Scheme
X., Li K., for
Chen
Speaker
Home
W., Automation
Hu
recognition
W., Zhang
Devices
technology
Z., Li Q.2020
has SCOPUS
been widely used Home
in home
Automation
automation recently, which can provide
Speechidentification for
Speech
the user
Verification
efficiently.without
However,
Disclosure
speaker
Direct of
Disclosure
recognition
all Voice of
Features
technology
sensitive data
brings
for security
Homomorphic
NLP Tasksrisks as
Encryption
well as conven
(HE)
Encrypted Domain Mel-Frequency Cepstral Coefficient
Chen J.,and
Chen
Fragile
Z.,
Audio
Zheng
Audio
hasWatermarking
P.,
become
Guo J.,increasingly
Zhang W.,2018
Huang
important
SCOPUS
J. in modern social
Cloudcommunication
Computing and mobile Internet, e.g.,
Written
the employment Speech
of voice Watermarking
messaging in mobile social
Direct
Apps.
Disclosure
In the scenario
of sensitive
of cloud
datacomputing,
for Homomorphic
NLP Tasks
we need
Encryption
to consider
(HE)
th
Efficient searching with multiple keyword over encrypted
Santhi K.,
cloud
Deepa
data
In M.,
Cloud
byLawanya
blind
Computing,
storage
Shri M.,
a basic
Benjula
application
2016
AnbuSCOPUS
Malar
is toM.B.
source the
Cloud
knowledge
Computing
to outside cloud servers for climbable
Written knowledge Storing
storage.and
TheSearching
outsourced
Data
knowledge,however,have
without
Direct
Data
Disclosure
Exposureoftosensitive
be compelled
data for
to be
Homomorphic
NLPencrypted
Tasks because
Encryption
of the
(HE)
p
Toward Efficient Multi-Keyword Fuzzy Search over
FuEncrypted
Z., Wu X.,Outsourced
Guan
Keyword-based
C., Sun
Data
X.,with
Ren
search
Accuracy
K. over encrypted
Improvement
2016 SCOPUS
outsourced data
Cloud
has become
Computing
an important tool in the current
Written
cloud computing Storing
scenario.
and
The
Searching
majority Data
of thewithout
existing
Direct
Data
techniques
Disclosure
Exposureare
of sensitive
focusing data
on multi-keyword
for Homomorphic
NLP TasksexactEncryption
match or single
(HE)
Privacy preserving similarity based text retrieval through
Kumari P.,
blind
Jancy
storage
Cloud
S. computing is improving 2016
rapidlySCOPUS
due to their moreCloud
advantage
Computing
and more data owners give interest
Written
to outsource their
Storing
dataand
into Searching
cloud storage
Datafor
without
centralize
Direct
Data
Disclosure
their
Exposure
data.ofAssensitive
huge files
data
stored
for Homomorphic
NLP
in the
Tasks
cloud storage,
Encryption
there
(HE)
is n
Classification of Encrypted Word Embeddings using
Podschwadt
RecurrentR.,
Neural
Deep
Takabi
learning
Networks.
D. has made many2020
exciting
SCOPUS
applications possible
Generaland given the popularity of social networks
Writtenand user generated
Classification
contentoneveryday
encrypted
there
Dataexploitability
is no shortage of
of word
data embeddings
for these applications.
Homomorphic
The content
Encryption
generated
(HE)
b
Privacy-Preserving Character Language Modelling
Thaine, Patricia;Some
Penn,ofGerald;
the most sensitive information
Google
we generate
Scholar isGeneral
either written or spoken using natural language.
Written
Privacy-preserving
Classification
methodsonforencrypted
natural language
Dataexploitability
processingofare
word
therefore
embeddings
crucial, especially
Homomorphic
considering
Encryption
the ever-g
(HE)
Continuous improvement process (CIP)-based privacy-preserving
Yankson B. Advances
framework
within
for smart
the toyconnected
industry
2021and
toys
SCOPUS
interconnectedness
Other
have resulted in the rapid and pervasiveWritten
development of smart
Private
connected
Communication
toys (SCTs), built with
Direct
theDisclosure
capacity to
ofcollect
sensitive
terabytes
data forofNone
NLP
personal
Tasksidentifiable informati
Preserving privacy in speaker and speech characterisation
Nautsch A., Jiménez
Speech
A., recordings
Treiber A., are
Kolberg
a richJ.,source
2019
Jasserand
SCOPUS
of personal,
C., Kindt
sensitive
E.,Medicine
Delgado
data that
H., Todisco
can be used
M., Hmani
to support
M.A.,aMtibaa
plethora
Speech
A.,of
Abdelraheem
diverse applications,
Speech
M.A., Abad
Verification
from
A.,health
Teixeira
without
profiling
F., Matrouf
Disclosure
to biometric
Direct
D., Gomez-Barrero
of
Disclosure
all
recognition.
Voice of
Features
sensitive
ItM.,
is Petrovska-Delacrétaz
therefore
data for
essential
None
NLP Tasks
thatD.,
speech
Chollet
recordings
G., Evans
Privacy-aware personalized entity representations
Melnick
for improved
L., Elmessilhy
user
Representation
understanding
H., Polychronopoulos
learning has transformed
V.,
2020
Lopez
SCOPUS
G.,the
Tufield
Y., Khan
of machine
General
O.Z., Wang
learning.
Y.-Y.,
Advances
Quirk C. like ImageNet, word2vec,
Written and BERTPrivate
demonstrate
Communication
the power of pre-trained
Direct Disclosure
representations
of sensitive
to accelerate
data formodel
Obfuscation
NLP model
training.
training
The effectivenes
An analysis of blocking methods for private record
Etienne
linkageB., Cheatham
The field
M.,ofGrzebala
Big Data P.
Analytics2016
has led
SCOPUS
to many important
Other
advances; however, these analyses often
Written
involve linking databases
Similarity
that
detection
containwithout
personalData
identifiable
Direct
Exposure
Disclosure
information,
of sensitive
and this raises
data for
privacy
Obfuscation
NLP model
concerns.
training
The sensitive
Redesign of Gaussian mixture model for efficientRahulamathavan
and privacy-preserving
This
S., paper
Yaospeaker
X.,
proposes
Yogachandran
recognition
an algorithm
R.,
2018
Cumanan
toSCOPUS
perform
K., privacy-preserving
RajarajanGeneral
M. (PP) speaker recognition using Gaussian
Speech mixture models
Speech
(GMM).
Recognition
We consider
without
a scenario
Disclosure
Directwhere
Disclosure
of allthe
Voice
users
ofFeatures
sensitive
have to enrol
data for
their
Obfuscation
NLPvoice
model
biometric
trainingwith the third
A privacy preserving efficient protocol for semantic
Hawashin
similarityB.,
join
Fotouhi
During
using long
F.,
theTruta
similarity
string
T.M.
attributes
join process,
2011one
SCOPUS
or more sourcesGeneral
may not allow sharing the whole data with Written
other sources. In this
Similarity
case, privacy
detection
preserved
withoutsimilarity
Data Direct
Exposure
joinDisclosure
is required.ofWe
sensitive
showed data
in our
for Obfuscation
NLP
previous
Tasks
work [4] that using lon
Privacy-preserving speaker verification and identification
Pathak M.A.,
usingRaj
gaussian
Speech
B. being
mixture
a unique
modelscharacteristic
2012 SCOPUS
of an individual isGeneral
widely used in speaker verification and speaker
Speechidentification tasks
Speechin applications
Verification without
such asDisclosure
authentication
Direct of
Disclosure
alland
Voice
surveillance
of
Features
sensitiverespectively.
data for Obfuscation
NLPInTasks
this article, we present fr
Privacy-preserving speaker verification as password
Pathak
matching
M.A., Raj
We B.present a text-independent2012
privacy-preserving
SCOPUS speaker
General
verification system that functions similar
Speech
to conventional password-based
Speech Verification
authentication.
without Disclosure
Our
Direct
privacy
of
Disclosure
allconstraints
Voice of
Features
sensitive
require data
that the
for Obfuscation
NLP
system
Tasks
does not observe the s
Privacy-preserving ivector-based speaker verification
Rahulamathavan
This
Y., paper
Sutharsini
introduces
K.R., Ray
an efficient
I.G.,2018
Lu algorithm
R.,SCOPUS
Rajarajan
to develop
M. General
a privacy-preserving voice verification based
Speech
on iVector and linear
Speech
discriminant
Verification
analysis
without
techniques.
Disclosure
DirectThis
of
Disclosure
allresearch
Voice of
Features
considers
sensitive data
a scenario
for Obfuscation
NLP
in Tasks
which users enrol their vo
Secure binary embeddings of front-end factor analysis
PortêloforJ.,privacy
AbadRemote
A.,
preserving
Raj B.,
speaker
Trancoso
speaker
verification
verification
I. services
2013 SCOPUS
typically rely on the
General
system to have access to the users recordings,
Written or features derived
Speechfrom
Verification
them, and
without
also Disclosure
a model
Direct
of of
Disclosure
theallusers
Voicevoice.
of
Features
sensitive
This conventional
data for Obfuscation
NLPscheme
Tasks raises several priv
Biomedical Data Privacy Enhancement Architecture
Singh
Based
N., Lakhina
on Multi-Keyword
TheU.,Collaborative
Elamvazuthi
Searchhealthcare
I.,Technique
Jangra A.,
cloud
2018
Singh
environment
SCOPUS
A.K. offersMedicine
potential benefits, including cost reduction Written
and improvising theStoring
healthcare
and quality
Searching
delivered
Data without
to patients.
Direct
Data
Disclosure
Though,
Exposure
such
of sensitive
scenarios
data
experiences
for Obfuscation
NLP Tasks
substantial challenges
PIVOT: Privacy-preserving outsourcing of text data
Li Y.,
for Wang
word embedding
W.H.,
In Dong
this paper,
against
B. wefrequency
design PIVOT,
analysis
2019
a new
attack
SCOPUS
privacy-preserving
General
method that supports outsourcing of textWritten
data for word embedding.
Privacy-Utility-Trade-off
PIVOT includes fora 1-to-many
Wordexploitability
Embeddings
mappingoffunction
word embeddings
for text documents
Obfuscation
that can defend against the
Privacy-preserving speaker verification using secure
Portêlo
binary
J., Raj
embeddings
B.,
Remote
Abad A.,
speaker
Trancoso
verification
I. services
2014 SCOPUS
typically rely on the
General
system having access to the users recordings,
Speechor features derived
Speechfrom
Verification
them, and/or
without
a model
Disclosure
exploitability
for the
of users
all Voice
ofvoice.
word
Features
embeddings
This conventionalObfuscation
approach raises several priva
Privacy preserving speaker verification using adapted
Pathak
GMMs
M.A., Raj
InB.
this paper we present an adapted
2011 SCOPUS
UBM-GMM basedGeneral
privacy preserving speaker verification (PPSV)
Speech
system, whereSpeech
the system
Verification
is not able
without
to observe
Disclosure
memorizability
the speech
of all Voice
data
of NN
Features
provided by the user
Obfuscation
and the user does not observ
Privacy-preserving classification of personal text Reich
messages
D., Todoki
with Classification
secure
A., Dowsley
multi-party
R.,
of personal
decomputation:
Cock text
M., 2019
Nascimento
messages
An application
SCOPUS
has
A. many
to hate-speech
useful
General
applications
detection
in surveillance, e-commerce,
Written
and mental healthClassification
care, to name without
a few.Data
Giving
Disclosure
applications
Direct Disclosure
access to
of personal
sensitivetexts
data can
for Secure
NLP
easily
Tasks
lead
Multiparty
to (un)intentional
Computation p
Secure computation for biometric data security-application
Sy B.K. to speaker
The goal
verification
of this research is to 2009
develop
SCOPUS
provable secureGeneral
computation techniques for two biometric Speech
security tasks in complex
Speechdistributed
Verification
systems
without
involving
Disclosure
Direct
multiple
of
Disclosure
allparties;
Voice of
Features
namely
sensitive
biometric
data fordata
Secure
NLPretrieval
Tasks
Multiparty
and authentication
Computation
Attacking a privacy preserving music matching algorithm
Portêlo J., Raj B.,
Secure
Trancoso
multi-party
I. computation
2012
based
SCOPUS
techniques are Other
often used to perform audio database search
Speech
tasks, such as music
Storing
matching,
and Searching
with privacy.
DataHowever,
without
Direct
Data
inDisclosure
spite
Exposure
of theofsecurity
sensitive
of data
individual
for Secure
NLP
components
Tasks
Multiparty
of the
Computation
matching
Privacy preserving speech analysis using emotion
Aloufi
filtering
R., Haddadi
at theVoice
edge:
H., controlled
Boyle
PosterD.
abstract
devices and services
2019 SCOPUS
are commonplace
Cloud
in consumer
Computing
IoT. Cloud-based analysis services
Speech extract information
Speechfrom
Recognition
voice input
without
usingDisclosure
speech
Directrecognition
Disclosure
of all Voice
techniques.
ofFeatures
sensitive Services
data for Synthetic
NLP
providers
model Data
can
training
build
Generation
detailed pr
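
For orientation when reading the PET column: several of the "Differential Privacy (DP)" and "Obfuscation" rows (e.g., the works by Feyisetan et al.) perturb word embeddings with calibrated noise and map the noisy vector back to the nearest vocabulary word. The following minimal sketch illustrates that idea only; the two-dimensional toy vocabulary, the function names, and all parameter values are illustrative assumptions and are not code from any of the papers listed above.

```python
import numpy as np

# Toy 2-d embeddings standing in for real word vectors (e.g. 300-d GloVe);
# the vocabulary and all values here are purely illustrative.
VOCAB = {
    "ill":     np.array([0.90, 0.10]),
    "sick":    np.array([0.85, 0.15]),
    "healthy": np.array([0.10, 0.90]),
    "well":    np.array([0.15, 0.85]),
}

def sample_metric_laplace(dim, epsilon, rng):
    """Sample noise with density proportional to exp(-epsilon * ||z||):
    a uniformly random direction scaled by a Gamma(dim, 1/epsilon) radius."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return radius * direction

def perturb_word(word, epsilon, rng):
    """Add calibrated noise to the word's embedding and decode the noisy
    vector back to the nearest word in the vocabulary."""
    noisy = VOCAB[word] + sample_metric_laplace(2, epsilon, rng)
    return min(VOCAB, key=lambda w: np.linalg.norm(VOCAB[w] - noisy))

rng = np.random.default_rng(0)
for eps in (0.5, 5.0):
    print(eps, [perturb_word("ill", eps, rng) for _ in range(8)])
    # Smaller epsilon -> larger noise radius -> "ill" is replaced by a
    # semantic neighbour more often.
```

Lowering epsilon increases the noise radius, so the decoded word is more often a neighbour of the original; this is exactly the tension studied by the "Privacy-Utility-Trade-off for Word Embeddings" rows of the table.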

A.2. Detailed Category Tables with Aggregation Mapping


A.2.1. Privacy Issue Solutions for NLP as a Privacy Enabler
This list is taken from the “Privacy Issue Solution” column of the work sheet located in
subsection A.1.1.

Generalized Privacy Issue Solution (RQ2), grouped by aggregated category (category totals in parentheses, per-item amounts after the colon):

Automation (23)
  Automated Access Control: 1
  Automated Access Control based on notation: 1
  Automated Compliance Checker: 4
  Automated Essay Scoring: 1
  Automated Privacy Assessment of Health-Related Alexa Application: 1
  Automated Privacy Policy Classification: 1
  Automated Privacy Policy Coverage Check with Sequence Classification Models: 1
  Automated Rewriting: 1
  Automatic Analysis of Privacy Policies: 1
  Automatic Anonymization of Text Posts: 1
  Automatic Anonymization of Textual Documents: 1
  Automatic Assessment of Privacy Policy Completeness: 1
  Automatic Conversion of sensitive information in text: 1
  Automatic modeling of privacy regulations: 1
  Automatic policy enforcement: 2
  Automatic policy enforcement on semantic social data: 1
  Automatic semantic annotation mechanism: 1
  Automatic Summarization of User Reviews: 1
  Automation process of SLAs: 1

Code-based Analysis (7)
  Code-based privacy analysis of smartphone apps: 6
  Code-based privacy analysis of web applications: 1

Collect annotated corpus (17)
  Collect annotated corpus: 17

De-identification (74)
  De-identification: 66
  De-identification / Obfuscating Authorship: 8

Demonstration of Threats (1)
  Demonstration of Threats: 1

Designing (27)
  Designing a Big Data Framework for Publishing Social Media Data: 1
  Designing a decentralized peer-to-peer private document search engine: 1
  Designing a framework for sharing of sensitive information: 1
  Designing a gender obfuscation tool: 1
  Designing a geo-indistinguishability text perturbation algorithm: 1
  Designing a locally deployed system: 2
  Designing a methodology for providing extra safety precautions without being intrusive to users: 1
  Designing a multi-agent architecture coupled with privacy preserving techniques: 1
  Designing a Privacy Guard for Cloud-Based Home Assistants and IoT Devices: 1
  Designing a privacy preserved medical text data analysis: 1
  Designing a privacy-preserving web search engine: 3
  Designing a scheme for a privacy preserving search log analysis: 1
  Designing a scheme for a privacy preserving text analysis: 1
  Designing a search tree without tracking user interaction: 1
  Designing a tailor-made data anonymity approach: 2
  Designing a task-independent privacy-respecting data crowdsourcing framework: 1
  Designing a text perturbation mechanism for a privacy preserving text analysis: 2
  Designing a User-Centric Privacy-Disclosure Detection Framework: 1
  Designing a web search engine side to generate privacy-preserving user profiles: 1
  Designing an Inclusive Financial Privacy Index: 1
  Designing Secure Views for privacy preserving data analysis: 1
  Designing a system that supports privacy preserving data publication of network security data: 1

Detection (38)
  Detecting privacy-sensitive information / activities: 27
  Detection of Data Privacy Violations: 1
  Detection of Inconsistencies within Privacy Policies: 1
  Detection of Locations without user involvement: 1
  Detection of Personal Health Information in Peer-to-Peer File-Sharing Networks: 1
  Detection of potential pitfalls in the privacy policies of companies on the Web: 1
  Detection of Privacy Breaches in Physician Reviews: 1
  Detection of Privacy Sensitive Conditions in C-CDAs: 1
  Detection of the Correlation between User Reviews and Privacy Issues: 2
  Detection of vagueness: 2

Encryption based Solutions (2)
  Encryption based on Events: 1
  Encryption-based search system mitigating file-injection attacks: 1

Generation (22)
  Generation of Artefacts based on Privacy Policy Analysis: 12
  Generation of Synthetic Data: 10

Information Extraction (26)
  Information Extraction from Emails: 1
  Information Extraction from Health Records: 1
  Information Extraction from large-scale unstructured textual data: 1
  Information Extraction from nursing documents: 1
  Information Extraction from privacy policies: 20
  Information Extraction if information is harboured by a second party: 1
  Information Extraction out of Privacy Policies with Word Embeddings: 1

Mapping (8)
  Mapping of Ambiguity on a scoreboard: 1
  Mapping of Description-to-permission Fidelity on a scale: 1
  Mapping of Privacy Policies to Privacy Issues: 1
  Mapping Privacy Policies to Contextual Integrity (CI) with Q&A: 1
  Mapping Privacy Policies to GDPR: 2
  Mapping Privacy Policy Content to selected Dictionary: 2

Ontology based Solutions (4)
  Ontological semantics perspective for checking Privacy Policies: 1
  Ontology built for privacy: 1
  Ontology used for transparency in Privacy Policies: 1
  Ontology-Enabled Access Control and Privacy Recommendations: 1

Overviews (14)
  Overview of algorithms and tools for sharing data in a privacy-preserving manner: 1
  Overview of Challenges and applied methods for protection of personal health information (PHI): 3
  Overview of challenges in detecting privacy revealing information in unstructured text: 1
  Overview of Challenges of omission, context and multilingualism: 1
  Overview of challenges of working with personal and particularly sensitive data in practice: 1
  Overview of existing dialogue system vulnerabilities in security and privacy: 1
  Overview of techniques for privacy preserving data linkage: 1
  Overview of the current state of legal regulations and of different data protection and privacy-preserving techniques in the context of big data: 1
  Overview over the research field: 4

Semantic Solutions (19)
  Semantic analysis for privacy preserving peer feedback: 1
  Semantic Analysis of social media forensic analysis for preventive policing of online activities: 1
  Semantic correlations in document sanitization to minimize risk disclosure: 1
  Semantic features using word embeddings for classification: 1
  Semantic Framework for the Analysis of Privacy Policies: 1
  Semantic Incompleteness Detection in Privacy Policy Goals: 2
  Semantic Inference from Privacy Policies: 1
  Semantic Microaggregation for Anonymization of query logs to preserve utility: 1
  Semantic Model Creation based on Privacy Policies: 1
  Semantic Modeling of language vagueness: 1
  Semantic open source stack: 1
  Semantic Similarities Tool Comparison: 2
  Semantic similarity based on clustering and k-anonymity: 2
  Semantic-based text perturbation approach: 1
  Semantic-centered Rules: 1
  Semantics of Rules Mining: 1

Speech de-identification (16)
  Speech De-Identification: 15

Suggestions (7)
  Suggestion of security and privacy requirements: 3
  Suggestion of security and privacy requirements by involving users: 1
  Suggestion of security and privacy requirements for text mining: 3

Grand Total: 304
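
The item counts and category totals above follow a simple aggregation mapping from the underlying work sheet. A minimal sketch of how such a table can be reproduced, assuming a hypothetical CSV export of the work sheet in subsection A.1.1 with columns named "Privacy Issue Solution" and "Generalized Category" (the file name and both column labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical CSV export of the work sheet from subsection A.1.1.
sheet = pd.read_csv("nlp_privacy_enabler_worksheet.csv")

# Per-item amounts, as after the colons in the table above.
item_counts = sheet["Privacy Issue Solution"].value_counts()

# Aggregated category totals, as in parentheses above (e.g. every
# "Automated ..." / "Automatic ..." item rolled up into "Automation").
category_totals = sheet.groupby("Generalized Category").size().sort_values(ascending=False)

print(item_counts)
print(category_totals)
print("Grand Total:", len(sheet))
```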

A.2.2. Privacy Issue Solutions for NLP as a Privacy Threat


This list is taken from the “Use Cases” column of the work sheet located in
subsection A.1.2.

Use cases, grouped by aggregated use-case category (category totals in parentheses, per-item counts after the colon):

Classification without Data Disclosure (7)
  Classification on encrypted Data: 4
  Classification without Data Disclosure: 3

Investigating the Impact of NLP related Concept on Privacy (5)
  Investigating the Impact of Collaborative Deep Learning on Privacy: 1
  Investigating the Impact of NLP on Privacy: 1
  Investigating the Impact of Pre-trained Word Embeddings on NN Memorization: 1
  Investigating the Impact of Word Embeddings on Privacy: 2

Model Training without Sharing Data (26)
  Model Training without Sharing Data: 25

Privacy-Utility-Trade-off for NLP related Concept (20)
  Privacy-Utility-Trade-off for Neural Text Representations: 2
  Privacy-Utility-Trade-off for Training Data: 13
  Privacy-Utility-Trade-off for Word Embeddings: 5

Private Communication (4)
  Private Communication: 4

Similarity detection without Data Exposure (7)
  Similarity detection without Data Exposure: 7

Speech related Service without complete Voice Feature Disclosure (42)
  Speech Characteristics Obfuscation in a dynamic manner: 1
  Speech Communication without Disclosure of all Voice Features: 3
  Speech Emotion Recognition without Disclosure of all Voice Features: 2
  Speech Processing without Disclosure of all Voice Features: 12
  Speech Recognition without Disclosure of all Voice Features: 9
  Speech Transcription without Disclosure of Acoustic Features: 1
  Speech Verification without Disclosure of all Voice Features: 13
  Speech Watermarking: 1

Storing and Searching Data without Data Exposure (15)
  Storing and Searching Data without Data Exposure: 15

Summarization without Document Disclosure (1)
  Summarization without Document Disclosure: 1

Grand Total: 126
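
The largest aggregated category, Model Training without Sharing Data, corresponds in most of the mapped papers to federated learning in the sense of McMahan et al. [23]: clients train locally and only model parameters leave the device. A minimal sketch of one federated-averaging round, using plain least-squares regression as a stand-in for a real NLP model (all names and values are illustrative assumptions):

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: gradient descent on its own data;
    the raw (X, y) never leave the client."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(weights, clients):
    """One round of federated averaging: the server only ever sees the
    locally trained weights, aggregated in proportion to data size."""
    updates = [local_update(weights, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(updates, axis=0, weights=sizes / sizes.sum())

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):  # three institutions with private local datasets
    X = rng.normal(size=(20, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=20)))

w = np.zeros(2)
for _ in range(10):
    w = federated_average(w, clients)
print(w)  # approaches true_w although no client shared its raw data
```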

A.3. Figures for Mapping Results


A.3.1. Use Case Classification for NLP as a Privacy Threat
These values were taken from the work sheet located in subsection A.1.2.

List of Figures
1.1. Trends in Data Science and Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Overview of the Systematic Mapping Study (SMS) taken from [11] . . . . . . . 5
1.3. Amount of papers during the different SMS phases . . . . . . . . . . . . . . . . 5
1.4. Keywording abstracts to build classification schemes taken from [11] . . . . . . 6

3.1. Overview of privacy preservation methods in deep learning taken from [44] . 17
3.2. Overview of privacy preservation metrics in deep learning taken from [44] . . 18
3.3. Overview of privacy preservation challenges and weaknesses taken from [45] 18

4.1. Search Term Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


4.2. Search Query Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3. Filtered Search Query Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4. Proportion of remaining papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.1. Publications per year . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


5.2. Publications per electronic data source . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3. Mapping Results for the Domain category in the top category NLP as a Privacy
Enabler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4. Mapping Results for the Use Case Classification category in the top category
NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5. Mapping Results for the Generalized Privacy Issue category in the top category
NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6. Privacy Issues and Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.7. Privacy Issues and Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.8. Development of Privacy Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.9. Mapping Results for the Domain category in the top category NLP as a Privacy
Threat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.10. Mapping Results for the Data Type category in the top category NLP as a
Privacy Threat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.11. Aggregated Mapping Results for the Use Case Classification category in the
top category NLP as a Privacy Threat . . . . . . . . . . . . . . . . . . . . . . . . 70
5.12. Mapping Results for the Generalized Issue/Vulnerability solved for NLP cate-
gory in the top category NLP as a Privacy Threat . . . . . . . . . . . . . . . . . 70
5.13. NLP Privacy Issues and Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.14. NLP Privacy Issues and Data Type . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.15. NLP Privacy Issues and Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . 72


5.16. Mapping Results for the Generalized Privacy Issue Solution category in the top
category NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.17. Mapping Results for the Generalized Applied NLP Concept category in the
top category NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . . . . . . . 73
5.18. Aggregated mapping results for the Generalized Applied NLP Concept cate-
gory in the top category NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . 74
5.19. Mapping Results for the Generalized Applied NLP Concept category in the
top category NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . . . . . . . 75
5.20. Aggregated mapping results for the Generalized NLP Method category in the
top category NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . . . . . . . 75
5.21. Privacy solutions aiming to solve privacy issues . . . . . . . . . . . . . . . . . . 76
5.22. Privacy solutions and their analysis levels . . . . . . . . . . . . . . . . . . . . . . 77
5.23. Privacy solutions and their applied method category . . . . . . . . . . . . . . . 78
5.24. Mapping Results for the Privacy Enhancing Technologies (PETs) category in
the top category NLP as a Privacy Threat . . . . . . . . . . . . . . . . . . . . . . 79
5.25. Aggregated Mapping Results for the Privacy Enhancing Technologies (PETs)
category in the top category NLP as a Privacy Threat . . . . . . . . . . . . . . . 79
5.26. NLP Privacy Threats and their Solutions . . . . . . . . . . . . . . . . . . . . . . 80
5.27. NLP Privacy Threat Solutions and their Use Cases . . . . . . . . . . . . . . . . . 81
5.28. NLP Privacy Threat Solutions and their Development over time . . . . . . . . . 82

List of Tables
1.1. Table of Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

4.1. Search Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


4.2. Proposed Electronic Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3. Selected Electronic Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4. Amount of Papers after Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.5. Selection of Papers for the Validation Process . . . . . . . . . . . . . . . . . . . . 28
4.6. Criteria Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.7. Table with Advisor Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.8. Summary of Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.9. Subcategories within the two top categories . . . . . . . . . . . . . . . . . . . . . 34
4.10. Domains for NLP as a Privacy Enabler . . . . . . . . . . . . . . . . . . . . . . . 35
4.11. Classification of Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.12. Generalized Privacy Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.13. Terms of Generalized Privacy Issue Solution . . . . . . . . . . . . . . . . . . . . 45
4.14. All Terms in Generalized Category of Applied NLP Concept . . . . . . . . . . 46
4.15. NLP Method Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.16. Domains for NLP as a Privacy Threat . . . . . . . . . . . . . . . . . . . . . . . . 46
4.17. Use Cases for NLP as a Privacy Threat . . . . . . . . . . . . . . . . . . . . . . . 47
4.18. Generalized Issue/Vulnerability solved for NLP . . . . . . . . . . . . . . . . . . 47
4.19. Privacy Enhancing Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Acronyms
NLP Natural Language Processing. 2–4, 6, 9–23, 28–30, 32–35, 37–43, 48, 50–52, 55–62, 64,
83–90

PET Privacy Enhancing Technology. 4, 42, 58, 61–63, 86, 87

PP NLP Privacy-Preserving Natural Language Processing. 3–5, 10, 18, 23, 27, 48

SMS Systematic Mapping Study. 4, 5, 22, 57, 83, 84, 110

Bibliography
[1] The History of the General Data Protection Regulation. url: https://edps.europa.eu/data-protection/data-protection/legislation/history-general-data-protection-regulation_en.
[2] L. Jarvis. “The Age of Big Data Analytics: A cross-national comparison of the imple-
mentation of Article 23 of the GDPR in the United Kingdom, France, Germany and
Italy”. In: (2019).
[3] J. Wolff and N. Atallah. “Early GDPR Penalties: Analysis of Implementation and Fines
Through May 2020”. In: Available at SSRN 3748837 (2020).
[4] GDPR Enforcement Tracker. url: https://www.enforcementtracker.com/.
[5] R. Chan. “The Cambridge Analytica whistleblower explains how the firm used Face-
book data to sway elections”. In: Business Insider. Retrieved 12 (2020).
[6] D. R. Raban and A. Gordon. “The evolution of data science and big data research: A
bibliometric analysis”. In: Scientometrics 122.3 (2020), pp. 1563–1581.
[7] W. B. Tesfay, J. Serna, and S. Pape. “Challenges in Detecting Privacy Revealing Infor-
mation in Unstructured Text.” In: PrivOn@ ISWC. 2016.
[8] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song. “The secret sharer: Evaluating
and testing unintended memorization in neural networks”. In: 28th {USENIX} Security
Symposium ({USENIX} Security 19). 2019, pp. 267–284.
[9] X. Pan, M. Zhang, S. Ji, and M. Yang. “Privacy risks of general-purpose language
models”. In: 2020 IEEE Symposium on Security and Privacy (SP). IEEE. 2020, pp. 1314–
1331.
[10] M. Abdalla, M. Abdalla, G. Hirst, and F. Rudzicz. “Exploring the Privacy-Preserving
Properties of Word Embeddings: Algorithmic Validation Study”. In: Journal of medical
Internet research 22.7 (2020), e18055.
[11] K. Petersen, R. Feldt, S. Mujtaba, and M. Mattsson. “Systematic mapping studies in
software engineering”. In: 12th International Conference on Evaluation and Assessment in
Software Engineering (EASE) 12. 2008, pp. 1–10.
[12] B. A. Kitchenham, D. Budgen, and P. Brereton. Evidence-based software engineering and
systematic reviews. Vol. 4. CRC press, 2015.
[13] K. Petersen, S. Vakkalanka, and L. Kuzniarz. “Guidelines for conducting systematic
mapping studies in software engineering: An update”. In: Information and Software
Technology 64 (2015), pp. 1–18.


[14] A. Lukács. “What is privacy? The history and definition of privacy”. In: (2016).
[15] A. F. Westin. “Social and political dimensions of privacy”. In: Journal of social issues
59.2 (2003), pp. 431–453.
[16] Art. 4 GDPR – Definitions. Mar. 2018. url: https://gdpr-info.eu/art-4-gdpr/.
[17] P. M. Schwartz and D. J. Solove. “The PII problem: Privacy and a new concept of
personally identifiable information”. In: NYUL rev. 86 (2011), p. 1814.
[18] Y. Shen and S. Pearson. “Privacy enhancing technologies: A review”. In: Hewlett Packard
Development Company. Available at https://bit.ly/3cfpAKz (2011).
[19] S. L. Garfinkel et al. “De-identification of personal information”. In: National institute
of standards and technology (2015).
[20] C. Dwork. “The differential privacy frontier”. In: Theory of Cryptography Conference.
Springer. 2009, pp. 496–502.
[21] H. H. Pang, X. Xiao, and J. Shen. “Obfuscating the topical intention in enterprise text
search”. In: 2012 IEEE 28th International Conference on Data Engineering. IEEE. 2012,
pp. 1168–1179.
[22] K. El Emam. “Seven ways to evaluate the utility of synthetic data”. In: IEEE Security &
Privacy 18.4 (2020), pp. 56–59.
[23] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. “Communication-
efficient learning of deep networks from decentralized data”. In: Artificial Intelligence
and Statistics. PMLR. 2017, pp. 1273–1282.
[24] R. Cramer, I. B. Damgård, et al. Secure multiparty computation. Cambridge University
Press, 2015.
[25] C. Fontaine and F. Galand. “A survey of homomorphic encryption for nonspecialists”.
In: EURASIP Journal on Information Security 2007 (2007), pp. 1–10.
[26] E. D. Liddy. “Natural language processing”. In: (2001).
[27] S. Wu, E. M. Ponti, and R. Cotterell. “Differentiable Generative Phonology”. In: arXiv
preprint arXiv:2102.05717 (2021).
[28] A. Voutilainen. “Part-of-speech tagging”. In: The Oxford handbook of computational
linguistics (2003), pp. 219–232.
[29] A. R. Hippisley. “Lexical analysis”. In: (2010).
[30] M. Marrero, J. Urbano, S. Sánchez-Cuadrado, J. Morato, and J. M. Gómez-Berbís.
“Named entity recognition: fallacies, challenges and opportunities”. In: Computer
Standards & Interfaces 35.5 (2013), pp. 482–489.
[31] url: https://sites.google.com/view/privatenlp/home/emnlp-2020?authuser=0.
[32] url: https://sites.google.com/view/privatenlp/home?authuser=0.
[33] PrivateNLP 2020. url: https://sites.google.com/view/wsdm-privatenlp-2020.

[34] Q. Feng, D. He, Z. Liu, H. Wang, and K.-K. R. Choo. “SecureNLP: A system for
multi-party privacy-preserving natural language processing”. In: IEEE Transactions on
Information Forensics and Security 15 (2020), pp. 3709–3721.
[35] P. Ram, J. Markkula, V. Friman, and A. Raz. “Security and privacy concerns in
connected cars: a systematic mapping study”. In: 2018 44th Euromicro Conference on
Software Engineering and Advanced Applications (SEAA). IEEE. 2018, pp. 124–131.
[36] L. H. Iwaya, A. Ahmad, and M. A. Babar. “Security and Privacy for mHealth and
uHealth Systems: a Systematic Mapping Study”. In: IEEE Access 8 (2020), pp. 150081–
150112.
[37] N. Papanikolaou, S. Pearson, and M. C. Mont. “Towards natural-language understand-
ing and automated enforcement of privacy rules and regulations in the cloud: survey
and bibliography”. In: FTRA International Conference on Secure and Trust Computing,
Data Management, and Application. Springer. 2011, pp. 166–173.
[38] P. Silva, C. Gonçalves, C. Godinho, N. Antunes, and M. Curado. “Using natural
language processing to detect privacy violations in online contracts”. In: Proceedings of
the 35th Annual ACM Symposium on Applied Computing. 2020, pp. 1305–1307.
[39] L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, and R. T.
Batista-Navarro. “Natural Language Processing (NLP) for Requirements Engineering:
A Systematic Mapping Study”. In: arXiv preprint arXiv:2004.01099 (2020).
[40] W. Khan, A. Daud, J. A. Nasir, and T. Amjad. “A survey on the state-of-the-art machine
learning models in the context of NLP”. In: Kuwait journal of Science 43.4 (2016).
[41] Z. Ji, Z. C. Lipton, and C. Elkan. “Differential privacy and machine learning: a survey
and review”. In: arXiv preprint arXiv:1412.7584 (2014).
[42] M. Gong, Y. Xie, K. Pan, K. Feng, and A. K. Qin. “A survey on differentially private
machine learning”. In: IEEE Computational Intelligence Magazine 15.2 (2020), pp. 49–64.
[43] D. Zhang, X. Chen, D. Wang, and J. Shi. “A survey on collaborative deep learning
and privacy-preserving”. In: 2018 IEEE Third International Conference on Data Science in
Cyberspace (DSC). IEEE. 2018, pp. 652–658.
[44] H. C. Tanuwidjaja, R. Choi, and K. Kim. “A survey on deep learning techniques for
privacy-preserving”. In: International Conference on Machine Learning for Cyber Security.
Springer. 2019, pp. 29–46.
[45] H. C. Tanuwidjaja, R. Choi, S. Baek, and K. Kim. “Privacy-Preserving Deep Learning
on Machine Learning as a Service—a Comprehensive Survey”. In: IEEE Access 8 (2020),
pp. 167425–167447.
[46] T. Bolton, T. Dargahi, S. Belguith, M. S. Al-Rakhami, and A. H. Sodhro. “On the
security and privacy challenges of virtual assistants”. In: Sensors 21.7 (2021), p. 2312.
[47] L. M. H. E. V. Polychronopoulos, G. Lopez, Y. T. O. Z. K. Ye, and Y. W. C. Quirk.
“Privacy-Aware Personalized Entity Representations for Improved User Understand-
ing”. In: (2020).

[48] A. N. Mehdy and H. Mehrpouyan. “A User-Centric and Sentiment Aware Privacy-Disclosure
Detection Framework based on Multi-input Neural Network.” In: PrivateNLP@ WSDM. 2020, pp. 21–26.
[49] R. Podschwadt and D. Takabi. “Classification of Encrypted Word Embeddings using
Recurrent Neural Networks.” In: PrivateNLP@ WSDM. 2020, pp. 27–31.
[50] V. Jain and S. Ghanavati. “Is It Possible to Preserve Privacy in the Age of AI?” In:
PrivateNLP@ WSDM. 2020, pp. 32–36.
[51] Y. Huang, Z. Song, D. Chen, K. Li, and S. Arora. “Texthide: Tackling data privacy in
language understanding tasks”. In: arXiv preprint arXiv:2010.06053 (2020).
[52] M. B. Hosseini, K. Pragyan, I. Reyes, and S. Egelman. “Identifying and Classifying
Third-party Entities in Natural Language Privacy Policies”. In: Proceedings of the Second
Workshop on Privacy in NLP. 2020, pp. 18–27.
[53] C. Akiti, A. Squicciarini, and S. Rajtmajer. “A Semantics-based Approach to Disclosure
Classification in User-Generated Online Content”. In: Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing: Findings. 2020, pp. 3490–3499.
[54] R. Khandelwal, A. Nayak, Y. Yao, and K. Fawaz. “Surfacing Privacy Settings Using
Semantic Matching”. In: Proceedings of the Second Workshop on Privacy in NLP. 2020,
pp. 28–38.
[55] G. Kerrigan, D. Slack, and J. Tuyls. “Differentially Private Language Models Benefit
from Public Pre-training”. In: arXiv preprint arXiv:2009.05886 (2020).
[56] Z. Xu, A. Aggarwal, O. Feyisetan, and N. Teissier. “A Differentially Private Text
Perturbation Method Using a Regularized Mahalanobis Metric”. In: arXiv preprint
arXiv:2010.11947 (2020).
[57] M. Kuhrmann, D. M. Fernández, and M. Daneva. On the Pragmatic Design of Literature
Studies in Software Engineering: An Experience-based Guideline. 2016. arXiv: 1612.03583
[cs.SE].
[58] url: https://dl.acm.org/search/advanced.
[59] Operators - Data Studio Help. url: https://support.google.com/datastudio/answer/
10468382?hl=en.
[60] url: https://ieeexplore.ieee.org/Xplorehelp/searching-ieee-xplore/search-
examples.
[61] Elsevier. How do I use the advanced search? url: https://service.elsevier.com/app/
answers/detail/a_id/25974/supporthub/sciencedirect/.
[62] Elsevier. How can I best use the Advanced search? url: https://service.elsevier.com/
app/answers/detail/a_id/11365/c/10545/supporthub/scopus/.
[63] Search Tips. url: https://link.springer.com/searchhelp.
[64] index. url: https://images.webofknowledge.com/WOKRS535R111/help/WOS/hp_search.html.

[65] Search Tips (Video). url: https://www.wiley.com/customer-success/wiley-digital-archives-training-hub/search-tips-video.
[66] K. Pandey. “Natural Language Processing: An Overview”. In: Advances in Science &
Technology (), p. 24.
[67] L. Chen, M. A. Babar, and H. Zhang. “Towards an evidence-based understanding of
electronic data sources”. In: 14th International Conference on Evaluation and Assessment
in Software Engineering (EASE). 2010, pp. 1–4.
[68] Elsevier. Ei Compendex: Most complete Engineering Database. url: https://www.elsevier.
com/solutions/engineering-village/content/compendex.
[69] url: https://www.google.com/sheets/about/.
[70] Free Reference Manager - Stay on top of your Literature. url: https://www.jabref.org/.
[71] COUNTUNIQUE - Docs Editors Help. url: https://support.google.com/docs/answer/3093405?hl=en.
[72] QUERY function - Docs Editors Help. url: https://support.google.com/docs/answer/
3093343?hl=en.
[73] SORT function - Docs Editors Help. url: https://support.google.com/docs/answer/
3093150?hl=en.
[74] Prashanth. SORTN Tie Modes in Google Sheets - The Four Tiebreakers. Oct. 2018. url:
https://infoinspired.com/google-docs/spreadsheet/sortn-tie-modes-in-google-sheets/.
[75] T. Angeland and R. Colomo-Palacios. “Deep Learning for Single Image Super-Resolution:
A Systematic Mapping Study”. In: ().
[76] J. Zhang, L. Yang, S. Yu, and J. Ma. “A DNS tunneling detection method based on deep
learning models to prevent data exfiltration”. In: International Conference on Network
and System Security. Springer. 2019, pp. 520–535.
[77] W. Song, Y. Cui, and Z. Peng. “A full-text retrieval algorithm for encrypted data in
cloud storage applications”. In: Natural Language Processing and Chinese Computing.
Springer, 2015, pp. 229–241.
[78] L. Li, Y. Fan, M. Tse, and K.-Y. Lin. “A review of applications in federated learning”.
In: Computers & Industrial Engineering (2020), p. 106854.
[79] K. Meenakshi and G. Maragatham. “A review on security attacks and protective
strategies of machine learning”. In: International Conference on Emerging Current Trends
in Computing and Expert Technology. Springer. 2019, pp. 1076–1087.
[80] Q. Liu, P. Li, W. Zhao, W. Cai, S. Yu, and V. C. Leung. “A survey on security threats
and defensive techniques of machine learning: A data driven view”. In: IEEE access 6
(2018), pp. 12103–12117.

[81] A. Lutskiv and N. Popovych. “Adaptable Text Corpus Development for Specific
Linguistic Research”. In: 2019 IEEE International Scientific-Practical Conference Problems
of Infocommunications, Science and Technology (PIC S&T). IEEE. 2019, pp. 217–223.
[82] F. Rashid and A. Miri. “An Emerging Strategy for Privacy Preserving Databases: Dif-
ferential Privacy”. In: International Conference on Human-Computer Interaction. Springer.
2020, pp. 487–498.
[83] Y.-C. Tsai, S.-L. Wang, I.-H. Ting, and T.-P. Hong. “Flexible Anonymization of Transac-
tions with Sensitive Items”. In: 2018 5th International Conference on Behavioral, Economic,
and Socio-Cultural Computing (BESC). IEEE. 2018, pp. 201–206.
[84] J. Ji, B. Chen, and H. Jiang. “Fully-connected LSTM–CRF on medical concept extrac-
tion”. In: International Journal of Machine Learning and Cybernetics 11.9 (2020), pp. 1971–
1979.
[85] R. Delanaux, A. Bonifati, M.-C. Rousset, and R. Thion. “Query-based linked data
anonymization”. In: International Semantic Web Conference. Springer. 2018, pp. 530–546.
[86] Q. Li, Z. Yang, L. Luo, L. Wang, Y. Zhang, H. Lin, J. Wang, L. Yang, K. Xu, and Y. Zhang.
“A multi-task learning based approach to biomedical entity relation extraction”. In:
2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE. 2018,
pp. 680–682.
[87] M. Pikuliak, M. Simko, and M. Bielikova. “Towards combining multitask and multilin-
gual learning”. In: International Conference on Current Trends in Theory and Practice of
Informatics. Springer. 2019, pp. 435–446.
[88] H. Zhou, X. Li, W. Yao, C. Lang, and S. Ning. “Dut-nlp at mediqa 2019: an adversarial
multi-task network to jointly model recognizing question entailment and question
answering”. In: Proceedings of the 18th BioNLP Workshop and Shared Task. 2019, pp. 437–
445.
[89] K. Roberts, Y. Si, A. Gandhi, and E. Bernstam. “A framenet for cancer information in
clinical narratives: schema and annotation”. In: Proceedings of the Eleventh International
Conference on Language Resources and Evaluation (LREC 2018). 2018.
[90] N. Niknami, M. Abadi, and F. Deldar. “SpatialPDP: A personalized differentially
private mechanism for range counting queries over spatial databases”. In: 2014 4th
International Conference on Computer and Knowledge Engineering (ICCKE). IEEE. 2014,
pp. 709–715.
[91] J. Friginal, S. Gambs, J. Guiochet, and M.-O. Killijian. “Towards privacy-driven design
of a dynamic carpooling system”. In: Pervasive and mobile computing 14 (2014), pp. 71–
82.
[92] R. Shokri, G. Theodorakopoulos, J.-Y. Le Boudec, and J.-P. Hubaux. “Quantifying
location privacy”. In: 2011 IEEE symposium on security and privacy. IEEE. 2011, pp. 247–
262.

[93] M. Gong, K. Pan, Y. Xie, A. K. Qin, and Z. Tang. “Preserving differential privacy in
deep neural networks with relevance-based adaptive noise imposition”. In: Neural
Networks 125 (2020), pp. 131–141.
[94] S. Al-Fedaghi. “Perceived privacy”. In: 2012 Ninth International Conference on Information
Technology-New Generations. IEEE. 2012, pp. 355–360.
[95] H. Wang, Z. Li, and Y. Cheng. “Distributed Latent Dirichlet allocation for objects-
distributed cluster ensemble”. In: 2008 International Conference on Natural Language
Processing and Knowledge Engineering. IEEE. 2008, pp. 1–7.
[96] P. Sun, Z. Wang, Y. Feng, L. Wu, Y. Li, H. Qi, and Z. Wang. “Towards personalized
privacy-preserving incentive for truth discovery in crowdsourced binary-choice ques-
tion answering”. In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications.
IEEE. 2020, pp. 1133–1142.
[97] D. Z. Hakkani-Tur, Y. Saygin, M. Tang, and G. Tur. Preserving privacy in natural language
databases. US Patent 8,473,451. June 2013.
[98] M. Ç. Demir and Ş. Ertekin. “Identifying textual personal information using bidirec-
tional LSTM networks”. In: 2018 26th Signal Processing and Communications Applications
Conference (SIU). IEEE. 2018, pp. 1–4.
[99] P. Thaine and G. Penn. “Reasoning about unstructured data de-identification”. In:
Journal of Data Protection & Privacy 3.3 (2020), pp. 299–309.
[100] H. Wang, T. Feng, Z. Ren, L. Gao, and J. Zheng. “Towards Efficient Privacy-Preserving
Personal Information in User Daily Life”. In: International Conference on Internet of
Things as a Service. Springer. 2019, pp. 503–513.
[101] C. Dilmegani. Top 10 Privacy Enhancing Technologies (PETs) in 2021. July 2021. url:
https://research.aimultiple.com/privacy-enhancing-technologies/.
[102] Domain. url: https://www.merriam-webster.com/dictionary/domain.
[103] N. M. Nejad, P. Jabat, R. Nedelchev, S. Scerri, and D. Graux. “Establishing a strong
baseline for privacy policy classification”. In: IFIP International Conference on ICT
Systems Security and Privacy Protection. Springer. 2020, pp. 370–383.
[104] E. Poplavska, T. B. Norton, S. Wilson, and N. Sadeh. “From Prescription to Descrip-
tion: Mapping the GDPR to a Privacy Policy Corpus Annotation Scheme”. In: Legal
Knowledge and Information Systems. IOS Press, 2020, pp. 243–246.
[105] M. Bahrami, M. Singhal, and W.-P. Chen. “Learning Data Privacy and Terms of Service
from Different Cloud Service Providers”. In: 2017 IEEE International Conference on Smart
Cloud (SmartCloud). IEEE. 2017, pp. 250–257.
[106] Z. Qu, V. Rastogi, X. Zhang, Y. Chen, T. Zhu, and Z. Chen. “Autocog: Measuring the
description-to-permission fidelity in android applications”. In: Proceedings of the 2014
ACM SIGSAC Conference on Computer and Communications Security. 2014, pp. 1354–1365.

[107] R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita, and M. Esposito. “Crosslingual
named entity recognition for clinical de-identification applied to a COVID-19 Italian
data set”. In: Applied Soft Computing 97 (2020), p. 106779.
data set”. In: Applied Soft Computing 97 (2020), p. 106779.
[108] A. E. Johnson, L. Bulgarelli, and T. J. Pollard. “Deidentification of free-text medical
records using pre-trained bidirectional transformers”. In: Proceedings of the ACM
Conference on Health, Inference, and Learning. 2020, pp. 214–221.
[109] B. Singh, Q. Sun, Y. S. Koh, J. Lee, and E. Zhang. “Detecting Protected Health Informa-
tion with an Incremental Learning Ensemble: A Case Study on New Zealand Clinical
Text”. In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics
(DSAA). IEEE. 2020, pp. 719–728.
[110] L. Lange, H. Adel, and J. Strötgen. “NLNDE: The Neither-Language-Nor-Domain-
Experts’ Way of Spanish Medical Document De-Identification”. In: arXiv preprint
arXiv:2007.01030 (2020).
[111] D. C. Nguyen, E. Derr, M. Backes, and S. Bugiel. “Short text, large effect: Measuring the
impact of user reviews on android app security & privacy”. In: 2019 IEEE Symposium
on Security and Privacy (SP). IEEE. 2019, pp. 555–569.
[112] G. L. Scoccia, S. Ruberto, I. Malavolta, M. Autili, and P. Inverardi. “An investigation
into Android run-time permissions from the end users’ perspective”. In: Proceedings of
the 5th international conference on mobile software engineering and systems. 2018, pp. 45–55.
[113] N. Victor and D. Lopez. “A Conceptual Framework for Sensitive Big Data Publish-
ing”. In: Proceedings of International Conference on Communication and Computational
Technologies. Springer. 2021, pp. 523–533.
[114] G. Canfora, A. Di Sorbo, E. Emanuele, S. Forootani, and C. A. Visaggio. “A NLP-based
solution to prevent from privacy leaks in social network posts”. In: Proceedings of the
13th International Conference on Availability, Reliability and Security. 2018, pp. 1–6.
[115] A. F. B. Neto, P. Desaulniers, P. A. Duboue, and A. Smirnov. “Filtering Personal Queries
from Mixed-Use Query Logs”. In: Canadian Conference on Artificial Intelligence. Springer.
2014, pp. 47–58.
[116] M. Sokolova and S. Matwin. “Personal privacy protection in time of big data”. In:
Challenges in computational statistics and data mining. Springer, 2016, pp. 365–380.
[117] O. Akanfe, R. Valecha, and H. R. Rao. “Design of an inclusive financial privacy index
(INF-PIE): a financial privacy and digital financial inclusion perspective”. In: ACM
Transactions on Management Information Systems (TMIS) 12.1 (2020), pp. 1–21.
[118] F. Olsson. “On privacy preservation in text and document-based active learning for
named entity recognition”. In: Proceedings of the ACM first international workshop on
Privacy and anonymity for very large databases. 2009, pp. 53–60.
[119] M. Blohm, C. Dukino, M. Kintz, M. Kochanowski, F. Koetter, and T. Renner. “Towards
a Privacy Compliant Cloud Architecture for Natural Language Processing Platforms.”
In: ICEIS (1). 2019, pp. 454–461.

[120] M. Gallé, A. Christofi, and H. Elsahar. “The case for a GDPR-specific annotated dataset
of privacy policies”. In: AAAI Symposium on Privacy-Enhancing AI and HLT Technologies.
2019.
[121] R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita, and M. Esposito. “A novel
COVID-19 data set and an effective deep learning approach for the de-identification of
Italian medical records”. In: IEEE Access 9 (2021), pp. 19097–19110.
[122] D. S. Carrell, D. J. Cronkite, B. A. Malin, J. S. Aberdeen, and L. Hirschman. “Is the
juice worth the squeeze? Costs and benefits of multiple human annotators for clinical
text de-identification”. In: Methods of information in medicine 55.04 (2016), pp. 356–364.
[123] T.-V. T. Nguyen, N. Fornara, and F. Marfia. “Automatic policy enforcement on semantic
social data”. In: Multiagent and Grid Systems 11.3 (2015), pp. 121–146.
[124] N. Papanikolaou, S. Pearson, and M. C. Mont. “Automated Understanding Of Cloud
Terms Of Service And SLAs”. In: ().
[125] V. Keselj and D. Jutla. “QTIP: multi-agent NLP and privacy architecture for information
retrieval in usable Web privacy software”. In: The 2005 IEEE/WIC/ACM International
Conference on Web Intelligence (WI’05). IEEE. 2005, pp. 718–724.
[126] J. Parra-Arnau, J. P. Achara, and C. Castelluccia. “Myadchoices: Bringing transparency
and control to online advertising”. In: ACM Transactions on the Web (TWEB) 11.1 (2017),
pp. 1–47.
[127] P. X. Mai, A. Goknil, L. K. Shar, F. Pastore, L. C. Briand, and S. Shaame. “Modeling
security and privacy requirements: a use case-driven approach”. In: Information and
Software Technology 100 (2018), pp. 165–182.
[128] J. H. Weber-Jahnke and A. Onabajo. “Finding defects in natural language confidential-
ity requirements”. In: 2009 17th IEEE International Requirements Engineering Conference.
IEEE. 2009, pp. 213–222.
[129] Z. Lindner. “A Comparative Study of Sequence Classification Models for Privacy
Policy Coverage Analysis”. In: arXiv preprint arXiv:2003.04972 (2020).
[130] R. N. Zaeem and K. S. Barber. “A Large Publicly Available Corpus of Website Privacy
Policies Based on DMOZ.” In: CODASPY. 2021, pp. 143–148.
[131] T. D. Breaux and A. I. Antón. “Deriving semantic models from privacy policies”.
In: Sixth IEEE International Workshop on Policies for Distributed Systems and Networks
(POLICY’05). IEEE. 2005, pp. 67–76.
[132] D. Torre, S. Abualhaija, M. Sabetzadeh, L. Briand, K. Baetens, P. Goes, and S. Forastier.
“An ai-assisted approach for checking the completeness of privacy policies against
gdpr”. In: 2020 IEEE 28th International Requirements Engineering Conference (RE). IEEE.
2020, pp. 136–146.
[133] L. Yu, T. Zhang, X. Luo, and L. Xue. “Autoppg: Towards automatic generation of
privacy policy for android applications”. In: Proceedings of the 5th Annual ACM CCS
Workshop on Security and Privacy in Smartphones and Mobile Devices. 2015, pp. 39–50.

[134] C. A. Brodie, C.-M. Karat, and J. Karat. “An empirical study of natural language
parsing of privacy policy rules using the SPARCLE policy workbench”. In: Proceedings
of the second symposium on Usable privacy and security. 2006, pp. 8–19.
[135] Y. Luo, L. Cheng, H. Hu, G. Peng, and D. Yao. “Context-Rich Privacy Leakage Analysis
Through Inferring Apps in Smart Home IoT”. In: IEEE Internet of Things Journal 8.4
(2020), pp. 2736–2750.
[136] J. Liu and K. Wang. “Enforcing vocabulary k-anonymity by semantic similarity based
clustering”. In: 2010 IEEE International Conference on Data Mining. IEEE. 2010, pp. 899–
904.
[137] T. Diethe and O. Feyisetan. “Preserving privacy in analyses of textual data”. In:
PrivateNLP@ WSDM. 2020.
[138] Y. Shvartzshanider, A. Balashankar, T. Wies, and L. Subramanian. “RECIPE: Applying
open domain question answering to privacy policies”. In: Proceedings of the Workshop
on Machine Reading for Question Answering. 2018, pp. 71–77.
[139] R. Doku, D. B. Rawat, and C. Liu. “On the Blockchain-Based Decentralized Data Shar-
ing for Event Based Encryption to Combat Adversarial Attacks”. In: IEEE Transactions
on Network Science and Engineering (2020).
[140] J. Qian, F. Han, J. Hou, C. Zhang, Y. Wang, and X.-Y. Li. “Towards privacy-preserving
speech data publishing”. In: IEEE INFOCOM 2018-IEEE Conference on Computer Com-
munications. IEEE. 2018, pp. 1079–1087.
[141] Y. Tian. “Privacy Preserving Information Sharing in Modern and Emerging Platforms”.
PhD thesis. Carnegie Mellon University, 2018.
[142] M. D. Shermis, S. Lottridge, and E. Mayfield. “The impact of anonymization for
automated essay scoring”. In: Journal of Educational Measurement 52.4 (2015), pp. 419–
436.
[143] S. Ucan and H. Gu. “A platform for developing privacy preserving diagnosis mo-
bile applications”. In: IEEE-EMBS International Conference on Biomedical and Health
Informatics (BHI). IEEE. 2014, pp. 509–512.
[144] S. Wang, R. Sinnott, and S. Nepal. “P-GENT: Privacy-preserving geocoding of non-
geotagged tweets”. In: 2018 17th IEEE International Conference On Trust, Security And
Privacy In Computing And Communications/12th IEEE International Conference On Big
Data Science And Engineering (TrustCom/BigDataSE). IEEE. 2018, pp. 972–983.
[145] K. Hashimoto, J. Yamagishi, and I. Echizen. “Privacy-preserving sound to degrade
automatic speaker verification performance”. In: 2016 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2016, pp. 5500–5504.
[146] R. Boukharrou, A.-C. Chaouche, and K. Mahdjar. “Toward a Privacy Guard for Cloud-
Based Home Assistants and IoT Devices”. In: International Conference on Mobile, Secure,
and Programmable Networking. Springer. 2020, pp. 177–194.

[147] F. H. Shezan, H. Hu, G. Wang, and Y. Tian. “VerHealth: Vetting Medical Voice Applica-
tions through Policy Enforcement”. In: Proceedings of the ACM on Interactive, Mobile,
Wearable and Ubiquitous Technologies 4.4 (2020), pp. 1–21.
[148] P. Lopez-Otero and L. Docio-Fernandez. “Analysis of gender and identity issues in
depression detection on de-identified speech”. In: Computer Speech & Language 65
(2021), p. 101118.
[149] J. Tang, T. Zhu, P. Xiong, Y. Wang, and W. Ren. “Privacy and Utility Trade-Off for
Textual Analysis via Calibrated Multivariate Perturbations”. In: International Conference
on Network and System Security. Springer. 2020, pp. 342–353.
[150] Ö. Uzuner, Y. Luo, and P. Szolovits. “Evaluating the state-of-the-art in automatic
de-identification”. In: Journal of the American Medical Informatics Association 14.5 (2007),
pp. 550–563.
[151] X. Ye and F. Wang. “A Privacy Guard Service”. In: 2017 IEEE International Conference
on Cognitive Computing (ICCC). IEEE. 2017, pp. 48–55.
[152] E. Eder, U. Krieg-Holz, and U. Hahn. “CODE ALLTAG 2.0—A Pseudonymized
German-Language Email Corpus”. In: Proceedings of the 12th Language Resources and
Evaluation Conference. 2020, pp. 4466–4477.
[153] K. Fujita and Y. Tsukada. “A notation for policies using feature structures”. In: Data
Privacy Management and Autonomous Spontaneous Security. Springer, 2010, pp. 140–154.
[154] P. K. Das, A. L. Kashyap, G. Singh, C. Matuszek, T. Finin, A. Joshi, et al. “Semantic
knowledge and privacy in the physical web”. In: Proceedings of the 4th Workshop on
Society, Privacy and the Semantic Web-Policy and Technology (PrivOn2016) co-located with
15th International Semantic Web Conference (ISWC 2016). 2016.
[155] J. Heaps. “Code Element Vector Representations Through the Application of Natural
Language Processing Techniques for Automation of Software Privacy Analysis”. PhD
thesis. The University of Texas at San Antonio, 2020.
[156] N. Papanikolaou. “Natural language processing of rules and regulations for com-
pliance in the cloud”. In: OTM Confederated International Conferences “On the Move to
Meaningful Internet Systems”. Springer. 2012, pp. 620–627.
[157] C. Iwendi, S. A. Moqurrab, A. Anjum, S. Khan, S. Mohan, and G. Srivastava. “N-
sanitization: A semantic privacy-preserving framework for unstructured medical
datasets”. In: Computer Communications 161 (2020), pp. 160–171.
[158] Z. Jian, X. Guo, S. Liu, H. Ma, S. Zhang, R. Zhang, and J. Lei. “A cascaded approach
for Chinese clinical text de-identification with less annotation effort”. In: Journal of
biomedical informatics 73 (2017), pp. 76–83.
[159] Q. Xu, L. Qu, C. Xu, and R. Cui. “Privacy-aware text rewriting”. In: Proceedings of the
12th International Conference on Natural Language Generation. 2019, pp. 247–257.
[160] K. Hammoud, S. Benbernou, M. Ouziri, Y. Saygın, R. Haque, and Y. Taher. “Personal
information privacy: what’s next?” In: CEUR Workshop Proceedings. 2019.

[161] D. I. Adelani, A. Davody, T. Kleinbauer, and D. Klakow. “Privacy guarantees for
de-identifying text transformations”. In: arXiv preprint arXiv:2008.03101 (2020).
[162] D. Sánchez, M. Batet, and A. Viejo. “Minimizing the disclosure risk of semantic
correlations in document sanitization”. In: Information Sciences 249 (2013), pp. 110–123.
[163] L. Deleger, T. Lingren, Y. Ni, M. Kaiser, L. Stoutenborough, K. Marsolo, M. Kouril,
K. Molnar, and I. Solti. “Preparing an annotated gold standard corpus to share
with extramural investigators for de-identification research”. In: Journal of biomedical
informatics 50 (2014), pp. 173–183.
[164] F. Martinelli, F. Marulli, F. Mercaldo, S. Marrone, and A. Santone. “Enhanced Privacy
and Data Protection using Natural Language Processing and Artificial Intelligence”.
In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE. 2020, pp. 1–8.
[165] D. Abril, G. Navarro-Arribas, and V. Torra. “On the declassification of confidential
documents”. In: International Conference on Modeling Decisions for Artificial Intelligence.
Springer. 2011, pp. 235–246.
[166] J. Neerbek, I. Assent, and P. Dolog. “Detecting complex sensitive information via
phrase structure in recursive neural networks”. In: Pacific-Asia Conference on Knowledge
Discovery and Data Mining. Springer. 2018, pp. 373–385.
[167] C.-M. Karat, C. Brodie, and J. Karat. “Usable privacy and security for personal infor-
mation management”. In: Communications of the ACM 49.1 (2006), pp. 56–57.
[168] S. Zimmeck and S. M. Bellovin. “Privee: An architecture for automatically analyzing
web privacy policies”. In: 23rd USENIX Security Symposium (USENIX Security 14).
2014, pp. 1–16.
[169] Y. Shvartzshnaider, Z. Pavlinovic, A. Balashankar, T. Wies, L. Subramanian, H. Nis-
senbaum, and P. Mittal. “Vaccine: Using contextual integrity for data leakage detection”.
In: The World Wide Web Conference. 2019, pp. 1702–1712.
[170] A. Stubbs and Ö. Uzuner. “Annotating longitudinal clinical narratives for de-identification:
The 2014 i2b2/UTHealth corpus”. In: Journal of biomedical informatics 58 (2015), S20–S29.
[171] J. Liu, D. He, D. Wu, and J. Xue. “Correlating UI Contexts with Sensitive API Calls:
Dynamic Semantic Extraction and Analysis”. In: 2020 IEEE 31st International Symposium
on Software Reliability Engineering (ISSRE). IEEE. 2020, pp. 241–252.
[172] M. Hatamian, N. Momen, L. Fritsch, and K. Rannenberg. “A multilateral privacy
impact analysis method for android apps”. In: Annual Privacy Forum. Springer. 2019,
pp. 87–106.
[173] P. M. Aonghusa and D. J. Leith. “Don’t let Google know I’m lonely”. In: ACM Transac-
tions on Privacy and Security (TOPS) 19.1 (2016), pp. 1–25.
[174] D. P. Lopresti and A. L. Spitz. “Information leakage through document redaction:
attacks and countermeasures”. In: Document recognition and retrieval XII. Vol. 5676.
International Society for Optics and Photonics. 2005, pp. 183–190.

[175] Ü. Meteriz, N. F. Yıldıran, J. Kim, and D. Mohaisen. “Understanding the Potential
Risks of Sharing Elevation Information on Fitness Applications”. In: 2020 IEEE 40th
International Conference on Distributed Computing Systems (ICDCS). IEEE. 2020, pp. 464–473.
[176] D. Pàmies-Estrems, J. Castellà-Roca, and A. Viejo. “Working at the web search engine
side to generate privacy-preserving user profiles”. In: Expert Systems with Applications
64 (2016), pp. 523–535.
[177] K. Mivule and K. Hopkinson. “Distortion Search–A Web Search Privacy Heuristic”. In:
().
[178] A. Li, Y. Duan, H. Yang, Y. Chen, and J. Yang. “TIPRDC: task-independent privacy-
respecting data crowdsourcing framework for deep learning with anonymized in-
termediate representations”. In: Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining. 2020, pp. 824–832.
[179] S. Reddy and K. Knight. “Obfuscating gender in social media writing”. In: Proceedings
of the First Workshop on NLP and Computational Social Science. 2016, pp. 17–26.
[180] P. Silva, C. Gonçalves, C. Godinho, N. Antunes, and M. Curado. “Using NLP and
machine learning to detect data privacy violations”. In: IEEE INFOCOM 2020-IEEE
Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE. 2020,
pp. 972–977.
[181] C. Santos, A. Gangemi, and M. Alam. “Detecting and Editing Privacy Policy Pitfalls
on the Web.” In: TERECOM@ JURIX. 2017, pp. 87–99.
[182] P. Jindal, C. A. Gunter, and D. Roth. “Detecting privacy-sensitive events in medical
text”. In: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology,
and Health Informatics. 2014, pp. 617–620.
[183] C. Liang, D. Abbott, Y. A. Hong, M. Madadi, and A. White. “Clustering Help-Seeking
Behaviors in LGBT Online Communities: A Prospective Trial”. In: International Confer-
ence on Human-Computer Interaction. Springer. 2019, pp. 345–355.
[184] H. Liu and B. Wang. “Mitigating File-Injection Attacks with Natural Language Process-
ing”. In: Proceedings of the Sixth International Workshop on Security and Privacy Analytics.
2020, pp. 3–13.
[185] D. R. G. de Pontes and S. D. Zorzo. “PPMark: An Architecture to Generate Privacy
Labels Using TF-IDF Techniques and the Rabin Karp Algorithm”. In: Information
Technology: New Generations. Springer, 2016, pp. 1029–1040.
[186] M. Mustapha, K. Krasnashchok, A. Al Bassit, and S. Skhiri. “Privacy Policy Classifi-
cation with XLNet (Short Paper)”. In: Data Privacy Management, Cryptocurrencies and
Blockchain Technology. Springer, 2020, pp. 250–257.
[187] E. Begoli, K. Brown, S. Srinivas, and S. Tamang. “SynthNotes: A Generator Frame-
work for High-volume, High-fidelity Synthetic Mental Health Notes”. In: 2018 IEEE
International Conference on Big Data (Big Data). IEEE. 2018, pp. 951–958.

[188] J. Guan, R. Li, S. Yu, and X. Zhang. “A method for generating synthetic electronic med-
ical record text”. In: IEEE/ACM transactions on computational biology and bioinformatics
(2019).
[189] K. M. Sathyendra, S. Wilson, F. Schaub, S. Zimmeck, and N. Sadeh. “Identifying the
provision of choices in privacy policy text”. In: Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing. 2017, pp. 2774–2779.
[190] R. Baalous and R. Poet. “How Dangerous Permissions are Described in Android
Apps’ Privacy Policies?” In: Proceedings of the 11th International Conference on Security of
Information and Networks. 2018, pp. 1–2.
[191] J. R. Reidenberg, J. Bhatia, T. D. Breaux, and T. B. Norton. “Ambiguity in privacy
policies and the impact of regulation”. In: The Journal of Legal Studies 45.S2 (2016),
S163–S190.
[192] N. M. Nejad, S. Scerri, and J. Lehmann. “Knight: Mapping privacy policies to GDPR”.
In: European Knowledge Acquisition Workshop. Springer. 2018, pp. 258–272.
[193] R. N. Zaeem and K. S. Barber. “The effect of the GDPR on privacy policies: Recent
progress and future promise”. In: ACM Transactions on Management Information Systems
(TMIS) 12.1 (2020), pp. 1–20.
[194] O. Krachina, V. Raskin, and K. Triezenberg. “Reconciling privacy policies and regula-
tions: Ontological semantics perspective”. In: Symposium on Human Interface and the
Management of Information. Springer. 2007, pp. 730–739.
[195] L. Cai, C. Lu, and C. Zhang. “Privacy domain-specific ontology building and consis-
tency analysis”. In: 2010 international conference on internet technology and applications.
IEEE. 2010, pp. 1–6.
[196] S. M. Meystre. “De-identification of unstructured clinical data for patient privacy
protection”. In: Medical Data Privacy Handbook. Springer, 2015, pp. 697–716.
[197] L. Zhou, H. Suominen, T. Gedeon, et al. “Adapting state-of-the-art deep language
models to clinical information extraction systems: Potentials, challenges, and solutions”.
In: JMIR medical informatics 7.2 (2019), e11499.
[198] S. K. Nayak and A. C. Ojha. “Data Leakage Detection and Prevention: Review and
Research Directions”. In: Machine Learning and Information Processing; Springer: Singapore
(2020), pp. 203–212.
[199] W. Sun, Z. Cai, Y. Li, F. Liu, S. Fang, and G. Wang. “Data processing and text mining
technologies on electronic medical records: a review”. In: Journal of healthcare engineering
2018 (2018).
[200] N. Zhu, M. Zhang, D. Feng, and J. He. “How Well Can WordNet Measure Privacy: A
Comparative Study?” In: 2017 13th International Conference on Semantics, Knowledge and
Grids (SKG). IEEE. 2017, pp. 45–49.

[201] I.-C. Yoo, K. Lee, S. Leem, H. Oh, B. Ko, and D. Yook. “Speaker Anonymization for
Personal Information Protection Using Voice Conversion Techniques”. In: IEEE Access
8 (2020), pp. 198637–198645.
[202] G. Francopoulo and L. Schaub. “Anonymization for the GDPR in the Context of
Citizen and Customer Relationship Management and NLP”. In: Workshop on Legal and
Ethical Issues (Legal2020). ELRA. 2020, pp. 9–14.
[203] H. Suominen, T. Lehtikunnas, B. Back, H. Karsten, T. Salakoski, and S. Salantera.
“Theoretical considerations of ethics in text mining of nursing documents”. In: Studies
in health technology and informatics 122 (2006), p. 359.
[204] O. Hasan, B. Habegger, L. Brunie, N. Bennani, and E. Damiani. “A discussion of
privacy challenges in user profiling with big data techniques: The EEXCESS use case”.
In: 2013 IEEE International Congress on Big Data. IEEE. 2013, pp. 25–30.
[205] W. Ye and Q. Li. “Chatbot Security and Privacy in the Age of Personal Assistants”. In:
2020 IEEE/ACM Symposium on Edge Computing (SEC). IEEE. 2020, pp. 388–393.
[206] F. Hassan, D. Sánchez, J. Soria-Comas, and J. Domingo-Ferrer. “Automatic anonymiza-
tion of textual documents: Detecting sensitive information via word embeddings”. In:
2019 18th IEEE International Conference On Trust, Security And Privacy In Computing And
Communications/13th IEEE International Conference On Big Data Science And Engineering
(TrustCom/BigDataSE). IEEE. 2019, pp. 358–365.
[207] B. Di Martino, F. Marulli, P. Lupi, and A. Cataldi. “A Machine Learning Based Method-
ology for Automatic Annotation and Anonymisation of Privacy-Related Items in
Textual Documents for Justice Domain”. In: Conference on Complex, Intelligent, and
Software Intensive Systems. Springer. 2020, pp. 530–539.
[208] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen,
S. Zimmeck, K. M. Sathyendra, N. C. Russell, et al. “The creation and analysis of
a website privacy policy corpus”. In: Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers). 2016, pp. 1330–1340.
[209] A. Ravichander, A. W. Black, S. Wilson, T. Norton, and N. Sadeh. “Question answering
for privacy policies: Combining computational and legal perspectives”. In: arXiv
preprint arXiv:1911.00841 (2019).
[210] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky.
“The Stanford CoreNLP natural language processing toolkit”. In: Proceedings of 52nd
annual meeting of the association for computational linguistics: system demonstrations. 2014,
pp. 55–60.
[211] X. Pan, Y. Cao, X. Du, B. He, G. Fang, R. Shao, and Y. Chen. “Flowcog: context-aware
semantics extraction and analysis of information flow leaks in android apps”. In: 27th
USENIX Security Symposium (USENIX Security 18). 2018, pp. 1669–1685.
[212] M. Dharini and R. Sasikumar. “Preserving privacy of encrypted data stored in cloud
and enabling efficient retrieval of encrypted data through blind storage”. In: Advances
in Natural and Applied Sciences 10.10 SE (2016), pp. 72–77.

[213] X. Dai, H. Dai, G. Yang, X. Yi, and H. Huang. “An efficient and dynamic semantic-
aware multikeyword ranked search scheme over encrypted cloud data”. In: IEEE
Access 7 (2019), pp. 142855–142865.
[214] H. Dai, X. Dai, X. Yi, G. Yang, and H. Huang. “Semantic-aware multi-keyword ranked
search scheme over encrypted cloud data”. In: Journal of Network and Computer Applica-
tions 147 (2019), p. 102442.
[215] W. Li, Y. Xiao, C. Tang, X. Huang, and J. Xue. “Multi-user searchable encryption voice
in home IoT system”. In: Internet of Things 11 (2020), p. 100180.
[216] Y. Liang, D. O’Keeffe, and N. Sastry. “PAIGE: towards a hybrid-edge design for
privacy-preserving intelligent personal assistants”. In: Proceedings of the Third ACM
International Workshop on Edge Systems, Analytics and Networking. 2020, pp. 55–60.
[217] M. Hadian, T. Altuwaiyan, X. Liang, and W. Li. “Privacy-preserving voice-based search
over mhealth data”. In: Smart Health 12 (2019), pp. 24–34.
[218] M. Friedrich, A. Köhn, G. Wiedemann, and C. Biemann. “Adversarial learning of
privacy-preserving text representations for de-identification of medical records”. In:
arXiv preprint arXiv:1906.05000 (2019).
[219] S. Krishna, R. Gupta, and C. Dupuy. “ADePT: Auto-encoder based Differentially
Private Text Transformation”. In: arXiv preprint arXiv:2102.01502 (2021).
[220] B. Weggenmann and F. Kerschbaum. “Syntf: Synthetic and differentially private term
frequency vectors for privacy-preserving text mining”. In: The 41st International ACM
SIGIR Conference on Research & Development in Information Retrieval. 2018, pp. 305–314.
[221] G. Beigi, K. Shu, R. Guo, S. Wang, and H. Liu. “I am not what i write: Privacy
preserving text representation learning”. In: arXiv preprint arXiv:1907.03189 (2019).
[222] C. Tan, D. Jiang, H. Mo, J. Peng, Y. Tong, W. Zhao, C. Chen, R. Lian, Y. Song, and
Q. Xu. “Federated acoustic model optimization for automatic speech recognition”. In:
International Conference on Database Systems for Advanced Applications. Springer. 2020,
pp. 771–774.
[223] D. Reich, A. Todoki, R. Dowsley, M. De Cock, and A. C. Nascimento. “Privacy-
preserving classification of personal text messages with secure multi-party compu-
tation: An application to hate-speech detection”. In: arXiv preprint arXiv:1906.02325
(2019).
[224] B. Hitaj, G. Ateniese, and F. Perez-Cruz. “Deep models under the GAN: information
leakage from collaborative deep learning”. In: Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security. 2017, pp. 603–618.
[225] X. Ding, H. Fang, Z. Zhang, K.-K. R. Choo, and H. Jin. “Privacy-preserving Feature
Extraction via Adversarial Training”. In: IEEE Transactions on Knowledge and Data
Engineering (2020).

[226] X. Zhu, J. Wang, Z. Hong, and J. Xiao. “Empirical studies of institutional federated
learning for natural language processing”. In: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing: Findings. 2020, pp. 625–634.
[227] M. Chen, A. T. Suresh, R. Mathews, A. Wong, C. Allauzen, F. Beaufays, and M. Riley.
“Federated learning of n-gram language models”. In: arXiv preprint arXiv:1910.03432
(2019).
[228] M. Coavoux, S. Narayan, and S. B. Cohen. “Privacy-preserving neural representations
of text”. In: arXiv preprint arXiv:1808.09408 (2018).
[229] L. Lyu, Y. Li, X. He, and T. Xiao. “Towards differentially private text representa-
tions”. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and
Development in Information Retrieval. 2020, pp. 1813–1816.
[230] B. Tran and X. Liang. “Exploiting peer-to-peer communications for query privacy
preservation in voice assistant systems”. In: Peer-to-Peer Networking and Applications
14.3 (2021), pp. 1475–1487.
[231] J. Shao, S. Ji, and T. Yang. “Privacy-aware document ranking with neural signals”. In:
Proceedings of the 42nd International ACM SIGIR Conference on Research and Development
in Information Retrieval. 2019, pp. 305–314.
[232] K. Khelik. “Privacy-preserving document similarity detection”. MA thesis. Univer-
sitetet i Agder/University of Agder, 2011.
[233] A. A. Gareta. “Privacy-Preserving Speech Emotion Recognition”. In: (2017).
[234] S. Ahmed, A. R. Chowdhury, K. Fawaz, and P. Ramanathan. “Preech: a system
for privacy-preserving speech transcription”. In: 29th USENIX Security Symposium
(USENIX Security 20). 2020, pp. 2703–2720.
[235] A. Treiber, A. Nautsch, J. Kolberg, T. Schneider, and C. Busch. “Privacy-preserving
PLDA speaker verification using outsourced secure computation”. In: Speech Communi-
cation 114 (2019), pp. 60–71.
[236] D. Aritomo and C. Watanabe. “Achieving efficient similar document search over
encrypted data on the cloud”. In: 2019 IEEE International Conference on Smart Computing
(SMARTCOMP). IEEE. 2019, pp. 1–6.
[237] Z. Fu, X. Wu, C. Guan, X. Sun, and K. Ren. “Toward efficient multi-keyword fuzzy
search over encrypted outsourced data with accuracy improvement”. In: IEEE Transac-
tions on Information Forensics and Security 11.12 (2016), pp. 2706–2716.
[238] L. Marujo, J. Portêlo, W. Ling, D. M. de Matos, J. P. Neto, A. Gershman, J. Carbonell,
I. Trancoso, and B. Raj. “Privacy-preserving multi-document summarization”. In: arXiv
preprint arXiv:1508.01420 (2015).
[239] M. Koppel, S. Argamon, and A. R. Shimoni. “Automatically categorizing written texts
by author gender”. In: Literary and linguistic computing 17.4 (2002), pp. 401–412.
[240] O. Feyisetan, T. Drake, B. Balle, and T. Diethe. “Privacy-preserving active learning on
sensitive data for user intent classification”. In: arXiv preprint arXiv:1903.11112 (2019).

[241] M. Pathak. “Privacy Preserving Techniques for Speech Processing”. Dec. 2010, pp. 1–54.
[242] S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen,
R. Mathews, et al. “Generative models for effective ML on private, decentralized
datasets”. In: arXiv preprint arXiv:1911.06679 (2019).
[243] F. Granqvist, M. Seigel, R. van Dalen, Á. Cahill, S. Shum, and M. Paulik. “Improving
on-device speaker verification using federated learning with privacy”. In: arXiv preprint
arXiv:2008.02651 (2020).
[244] Q. Wang, M. Du, X. Chen, Y. Chen, P. Zhou, X. Chen, and X. Huang. “Privacy-
preserving collaborative model learning: The case of word vector training”. In: IEEE
Transactions on Knowledge and Data Engineering 30.12 (2018), pp. 2381–2393.
[245] S. A. Anand, P. Walker, and N. Saxena. “Compromising Speech Privacy under Con-
tinuous Masking in Personal Spaces”. In: 2019 17th International Conference on Privacy,
Security and Trust (PST). IEEE. 2019, pp. 1–10.
[246] D. Carrell. “A strategy for deploying secure cloud-based natural language processing
systems for applied research involving clinical text”. In: 2011 44th Hawaii International
Conference on System Sciences. IEEE. 2011, pp. 1–11.
[247] K. Demetzou, L. Böck, and O. Hanteer. “Smart Bears don’t talk to strangers: analysing
privacy concerns and technical solutions in smart toys for children”. In: (2018).
[248] A. Rabinia, S. Ghanavati, and M. Dragoni. “Towards Integrating the FLG Framework
with the NLP Combinatory Framework”. In: ().
[249] P. Thaine and G. Penn. “Privacy-Preserving Character Language Modelling”. In: ().
[250] E. D. Lopez-Cozar, E. Orduna-Malea, and A. Martin-Martin. “Google Scholar as a data
source for research assessment”. In: Springer handbook of science and technology indicators.
Springer, 2019, pp. 95–127.
[251] R. Nosowsky and T. J. Giordano. “The Health Insurance Portability and Accountability
Act of 1996 (HIPAA) privacy rule: implications for clinical research”. In: Annu. Rev.
Med. 57 (2006), pp. 575–590.
[252] L. Zhao, Q. Wang, Q. Zou, Y. Zhang, and Y. Chen. “Privacy-preserving collabora-
tive deep learning with unreliable participants”. In: IEEE Transactions on Information
Forensics and Security 15 (2019), pp. 1486–1500.
[253] C.-H. H. Yang, J. Qi, S. Y.-C. Chen, P.-Y. Chen, S. M. Siniscalchi, X. Ma, and C.-H. Lee.
“Decentralizing feature extraction with quantum convolutional neural network for
automatic speech recognition”. In: ICASSP 2021-2021 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6523–6527.
[254] O. Feyisetan, S. Ghanavati, S. Malmasi, and P. Thaine. “Proceedings of the Second
Workshop on Privacy in NLP”. In: Proceedings of the Second Workshop on Privacy in NLP.
2020.

[255] A. Aggarwal, Z. Xu, O. Feyisetan, and N. Teissier. “On Log-Loss Scores and (No)
Privacy”. In: Proceedings of the Second Workshop on Privacy in NLP. 2020, pp. 1–6.
