
Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning

Hamza Harkous1, Kassem Fawaz2, Rémi Lebret1, Florian Schaub3, Kang G. Shin3, and Karl Aberer1

1 École Polytechnique Fédérale de Lausanne (EPFL)
2 University of Wisconsin-Madison
3 University of Michigan

Abstract

Privacy policies are the primary channel through which companies inform users about their data collection and sharing practices. These policies are often long and difficult to comprehend. Short notices based on information extracted from privacy policies have been shown to be useful but face a significant scalability hurdle, given the number of policies and their evolution over time. Companies, users, researchers, and regulators still lack usable and scalable tools to cope with the breadth and depth of privacy policies. To address these hurdles, we propose an automated framework for privacy policy analysis (Polisis). It enables scalable, dynamic, and multi-dimensional queries on natural language privacy policies. At the core of Polisis is a privacy-centric language model, built with 130K privacy policies, and a novel hierarchy of neural-network classifiers that accounts for both high-level aspects and fine-grained details of privacy practices. We demonstrate Polisis’ modularity and utility with two applications supporting structured and free-form querying. The structured querying application is the automated assignment of privacy icons from privacy policies. With Polisis, we can achieve an accuracy of 88.4% on this task. The second application, PriBot, is the first free-form question-answering system for privacy policies. We show that PriBot can produce a correct answer among its top-3 results for 82% of the test questions. Using an MTurk user study with 700 participants, we show that at least one of PriBot’s top-3 answers is relevant to users for 89% of the test questions.

1 Introduction

Privacy policies are one of the most common ways of providing notice and choice online. They aim to inform users how companies collect, store and manage their personal information. Although some service providers have improved the comprehensibility and readability of their privacy policies, these policies remain excessively long and difficult to follow [1, 2, 3, 4, 5]. In 2008, McDonald and Cranor [4] estimated that it would take an average user 201 hours to read all the privacy policies encountered in a year. Since then, we have witnessed a smartphone revolution and the rise of the Internet of Things (IoT), which led to the proliferation of services and associated policies [6]. In addition, emerging technologies brought along new forms of user interfaces (UIs), such as voice-controlled devices or wearables, for which existing techniques for presenting privacy policies are not suitable [3, 6, 7, 8].

Problem Description. Users, researchers, and regulators are not well-equipped to process or understand the content of privacy policies, especially at scale. Users are surprised by data practices that do not meet their expectations [9], hidden in long, vague, and ambiguous policies. Researchers employ expert annotators to analyze and reason about a subset of the available privacy policies [10, 11]. Regulators, such as the U.S. Department of Commerce, rely on companies to self-certify their compliance with privacy practices (e.g., the Privacy Shield Framework [12]). The problem lies in stakeholders lacking the usable and scalable tools to deal with the breadth and depth of privacy policies.

Several proposals have aimed at alternative methods and UIs for presenting privacy notices [8], including machine-readable formats [13], nutrition labels [14], privacy icons (recently recommended by the EU [15]), and short notices [16]. Unfortunately, these approaches have faced a significant scalability hurdle: the human effort needed to retrofit the new notices to existing policies and maintain them over time is tremendous. The existing research towards automating this process has been limited in scope to a handful of “queries,” e.g., whether the policy mentions data encryption or whether it provides an opt-out choice from third-party tracking [16, 17].

Our Framework. We overcome this scalability hurdle by proposing an automatic and comprehensive framework for privacy policy analysis (Polisis). It divides a privacy policy into smaller and self-contained fragments
of text, referred to as segments. Polisis automatically annotates, with high accuracy, each segment with a set of labels describing its data practices. Unlike prior research in automatic labeling/analysis of privacy policies, Polisis does not just predict a handful of classes given the entire policy document. Instead, Polisis annotates the privacy policy at a much finer-grained scale. It predicts for each segment the set of classes that account for both the high-level aspects and the fine-grained classes of embedded privacy information. Polisis uses these classes to enable scalable, dynamic, and multi-dimensional queries on privacy policies, in a way not possible with prior approaches.

At the core of Polisis is a novel hierarchy of neural-network classifiers that involve 10 high-level and 122 fine-grained privacy classes for privacy-policy segments. To build these fine-grained classifiers, we leverage techniques such as subword embeddings and multi-label classification. We further seed these classifiers with a custom, privacy-specific language model that we generated using our corpus of more than 130,000 privacy policies from websites and mobile apps.

Polisis provides the underlying intelligence for researchers and regulators to focus their efforts on merely designing a set of queries that power their applications. We stress, however, that Polisis is not intended to replace the privacy policy – as a legal document – with an automated interpretation. Similar to existing approaches on privacy policies’ analysis and presentation, it decouples the legally binding functionality of these policies from their informational utility.

Applications. We demonstrate and evaluate the modularity and utility of Polisis with two robust applications that support structured and free-form querying of privacy policies.

The structured querying application involves extracting short notices in the form of privacy icons from privacy policies. As a case study, we investigate the Disconnect privacy icons [18]. By composing a set of simple rules on top of Polisis, we show a solution that can automatically select appropriate privacy icons from a privacy policy. We further study the practice of companies assigning icons to privacy policies at scale. We empirically demonstrate that existing privacy-compliance companies, such as TRUSTe (now rebranded as TrustArc), might be adopting permissive policies when assigning such privacy icons. Our findings are consistent with anecdotal controversies and manually investigated issues in privacy certification and compliance processes [19, 20, 21].

The second application illustrates the power of free-form querying in Polisis. We design, implement and evaluate PriBot, the first automated Question-Answering (QA) system for privacy policies. PriBot extracts the relevant privacy policy segments to answer the user’s free-form questions. To build PriBot, we overcame the non-existence of a public, privacy-specific QA dataset by casting the problem as a ranking problem that could be solved using the classification results of Polisis. PriBot matches user questions with answers from a previously unseen privacy policy, in real time and with high accuracy – demonstrating a more intuitive and user-friendly way to present privacy notices and controls. We evaluate PriBot using a new test dataset, based on real-world questions that have been asked by consumers on Twitter.

Contributions. With this paper we make the following contributions:

• We design and implement Polisis, an approach for automatically annotating previously unseen privacy policies with high-level and fine-grained labels from a pre-specified taxonomy (Sec. 2, 3, 4, and 5).

• We demonstrate how Polisis can be used to assign privacy icons to a privacy policy with an average accuracy of 88.4%. This accuracy is computed by comparing icons assigned with Polisis’ automatic labels to icons assigned based on manual annotations by three legal experts from the OPP-115 dataset [11] (Sec. 6).

• We design, implement and evaluate PriBot, a QA system that answers free-form user questions from privacy policies (Sec. 7). Our accuracy evaluation shows that PriBot produces at least one correct answer (as indicated by privacy experts) in its top three for 82% of the test questions and as the top one for 68% of the test questions. Our evaluation of the perceived utility with 700 MTurk crowdworkers shows that users find a relevant answer in PriBot’s top-3 for 89% of the questions (Sec. 8).

• We make Polisis publicly available by providing three web services demonstrating our applications: a service giving a visual overview of the different aspects of each privacy policy, a chatbot for answering user questions in real time, and a privacy-labels interface for privacy policies. These services are available at https://pribot.org. We provide screenshots of these applications in Appendix B.

2 Framework Overview

Fig. 1 shows a high-level overview of Polisis. It comprises three layers: Application Layer, Data Layer, and Machine Learning (ML) Layer. Polisis treats a privacy policy as a list of semantically coherent segments (i.e., groups of consecutive sentences). It also utilizes a taxonomy of privacy data practices. One example of such a taxonomy was introduced by Wilson et al. [11] (see also Fig. 3 in Sec. 4).
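To make this segment-plus-taxonomy view concrete, the sketch below shows one way annotated segments could be matched against the privacy classes of a query. It is a minimal illustration, not Polisis’ implementation: the data structures, names, and example labels are invented here, while the idea of per-segment class probabilities and a presence threshold of 0.5 comes from the paper.

```python
# Illustrative sketch (not the paper's implementation): each policy segment
# carries class probabilities; a query's classes are matched against the
# classes whose probability clears a presence threshold.

THRESHOLD = 0.5  # a class counts as "present" when its probability >= 0.5

def matching_segments(query_classes, segment_predictions):
    """Return (index, predictions) pairs for segments whose present
    classes cover every class extracted from the user's query."""
    matches = []
    for idx, probs in enumerate(segment_predictions):
        present = {cls for cls, p in probs.items() if p >= THRESHOLD}
        if query_classes <= present:
            matches.append((idx, probs))
    return matches

# Hypothetical per-segment predictions for a two-segment policy.
segments = [
    {"information-type=location": 0.65, "purpose=advertising": 0.58},
    {"information-type=contact": 0.71, "purpose=advertising": 0.20},
]
print(matching_segments({"information-type=location"}, segments))
```

A query for location data thus returns only the first segment; the second segment’s low advertising probability keeps that label out of its “present” set.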
Fig. 1: A high-level overview of Polisis.

Fig. 2: List merging during the policy segmentation.

Application Layer (Sec. 5, 6 & 7): The Application Layer provides fine-grained information about the privacy policy, thus providing the users with high modularity in posing their queries. In this layer, a Query Module receives the User Query about a privacy policy (Step 1 in Fig. 1). These inputs are forwarded to lower layers, which then extract the privacy classes embedded within the query and the policy’s segments. To resolve the user query, the Class-Comparison module identifies the segments with privacy classes matching those of the query. Then, it passes the matched segments (with their predicted classes) back to the application.

Data Layer (Sec. 3): The Data Layer first scrapes the policy’s webpage. Then, it partitions the policy into semantically coherent and adequately sized segments (using the Segmenter component in Step 2 of Fig. 1). Each of the resulting segments can be independently consumed by both the humans and programming interfaces.

Machine Learning Layer (Sec. 4): In order to enable a multitude of applications to be built around Polisis, the ML layer is responsible for producing rich and fine-grained annotations of the data segments. This layer takes as an input the privacy-policy segments from the Data Layer (Step 2) and the user query (Step 1) from the Application Layer. The Segment Classifier probabilistically assigns each segment a set of class–value pairs describing its data practices. For example, an element in this set can be information-type=location with probability p = 0.65. Similarly, the Query Analyzer extracts the privacy classes from the user’s query. Finally, the class–value pairs of both the segments and the query are passed back to the Class Comparison module of the Application Layer (Steps 3 and 4).

3 Data Layer

To pre-process the privacy policy, the Data Layer employs a Segmenter module in three stages: extraction, list handling, and segmentation. The Data Layer requires no information other than the link to the privacy policy.

Policy Extraction: Given the URL of a privacy policy, the segmenter employs Google Chrome in headless mode (without UI) to scrape the policy’s webpage. It waits for the page to fully load, which happens after all the JavaScript has been downloaded and executed. Then, the segmenter removes all irrelevant HTML elements including the scripts, header, footer, side/navigation menus, comments, and CSS.

Although several online privacy policies contain dynamically viewable content (e.g., accordion toggles and collapsible/expandable paragraphs), the “dynamic” content is already part of the loaded webpage in almost all cases. For example, when the user expands a collapsible paragraph, a local JavaScript exposes an offline HTML snippet; no further downloading takes place.

We confirmed this with the privacy policies of the top 200 global websites from Alexa.com. For each privacy-policy link, we compared the segmenter’s scraped content to that extracted from our manual navigation of the same policy (while accounting for all the dynamically viewable elements of the webpage). Using a fuzzy string matching library,[1] we found that the segmenter’s scraped policy covers, on average, 99.08% of the content of the manually fetched policy.

[1] https://pypi.python.org/pypi/fuzzywuzzy

List Aggregation: Second, the segmenter handles any ordered/unordered lists inside the policy. Lists require a special treatment since counting an entire lengthy list, possibly covering diverse data practices, as a single segment could result in noisy annotations. On the other hand, treating each list item as an independent segment is problematic as list elements are typically not self-contained, resulting in missed annotations. See Fig. 2 from Google’s privacy policy as an example.[2]

[2] https://www.google.com/intl/en_US/policies/privacy/archive/20160829/, last modified on Aug. 29, 2016, retrieved on Jun. 27, 2018

Our handling of the lists involves two techniques: one for short list items (e.g., the inner list of Fig. 2) and another for longer list items (e.g., the outer list of Fig. 2). For short list items (maximum of 20 words per element), the segmenter combines the elements with the introductory statement of the list into a single paragraph element (with <p> tag). The rest of the lists with long items are transformed into a set of paragraphs. Each paragraph is a
distinct list element prepended by the list’s introductory statement (Step 3 in Fig. 2).

Policy Segmentation: The segmenter performs an initial coarse segmentation by breaking down the policy according to the HTML <div> and <p> tags. The output of this step is an initial set of policy segments. As some of the resulting segments might still be long, we subdivide them further with another technique. We use GraphSeg [22], an unsupervised algorithm that generates semantically coherent segments. It relies on word embeddings to generate segments as cliques of related (semantically similar) sentences. For that purpose, we use custom, domain-specific word embeddings that we generated using our corpus of 130K privacy policies (cf. Sec. 4). Finally, the segmenter outputs a series of fine-grained segments to the Machine Learning Layer, where they are automatically analyzed.

4 Machine Learning Layer

This section describes the components of Polisis’ Machine Learning Layer in two stages: (1) an unsupervised stage, in which we build domain-specific word vectors (i.e., word embeddings) for privacy policies from unlabeled data, and (2) a supervised stage, in which we train a novel hierarchy of privacy-text classifiers, based on neural networks, that leverages the word vectors. These classifiers power the Segment Classifier and Query Analyzer modules of Fig. 1. We use word embeddings and neural networks thanks to their proven advantages in text classification [23] over traditional techniques.

4.1 Privacy-Specific Word Embeddings

Traditional text classifiers use the words and their frequencies as the building block for their features. They, however, have limited generalization power, especially when the training datasets are limited in size and scope. For example, replacing the word “erase” by the word “delete” can significantly change the classification result if “delete” was not in the classifier’s training set.

Word embeddings solve this issue by extracting generic word vectors from a large corpus, in an unsupervised manner, and enabling their use in new classification problems (a technique termed Transfer Learning). The features in the classifiers become the word vectors instead of the words themselves. Hence, two text segments composed of semantically similar words would be represented by two groups of word vectors (i.e., features) that are close in the vector space. This allows the text classifier to account for words outside the training set, as long as they are part of the large corpus used to train the word vectors.

While general-purpose pre-trained embeddings, such as Word2vec [24] and GloVe [25] do exist, domain-specific embeddings result in better classification accuracy [26]. Thus, we trained custom word embeddings for the privacy-policy domain. To that end, we created a corpus of 130K privacy policies collected from apps on the Google Play Store. These policies typically describe the overall data practices of the apps’ companies.

We crawled the metadata of more than 1.4 million Android apps available via the PlayDrone project [27] to find the links to 199,186 privacy policies. We crawled the web pages for these policies, retrieving 130,326 policies which returned an HTTP status code of 200. Then, we extracted the textual content from their HTML using the policy crawler described in Sec. 3. We will refer to this corpus as the Policies Corpus. Using this corpus, we trained a word-embeddings model using fastText [28]. We henceforth call this model the Policies Embeddings. A major advantage of using fastText is that it allows training vectors for subwords (or character n-grams of sizes 3 to 6) in addition to words. Hence, even if we have words outside our corpus, we can assign them vectors by combining the vectors of their constituent subwords. This is very useful in accounting for spelling mistakes that occur in applications that involve free-form user queries.

4.2 Classification Dataset

Our Policies Embeddings provides a solid starting point to build robust classifiers. However, training the classifiers to detect fine-grained labels of privacy policies’ segments requires a labeled dataset. For that purpose, we leverage the Online Privacy Policies (OPP-115) dataset, introduced by Wilson et al. [11]. This dataset contains 115 privacy policies manually annotated by skilled annotators (law school students). In total, the dataset has 23K annotated data practices. The annotations were at two levels. First, paragraph-sized segments were annotated according to one or more of the 10 high-level categories in Fig. 3 (e.g., First Party Collection, Data Retention). Then, annotators selected parts of the segment and annotated them using attribute–value pairs, e.g., information type: location, purpose: advertising, etc. In total, there were 20 distinct attributes and 138 distinct values across all attributes. Of these, 122 values had more than 20 labels. In Fig. 3, we only show the mandatory attributes that should be present in all segments. Due to space limitation, we only show samples of the values for selected attributes in Fig. 3.

4.3 Hierarchical Multi-label Classification

To account for the multiple granularity levels in the policies’ text, we build a hierarchy of classifiers that are individually trained on handling specific parts of the problem.

At the top level, a classifier predicts one or more high-level categories of the input segment x (categories are the top-level, shaded boxes of Fig. 3). We train a multi-label classifier that provides us with the probability p(ci|x) of the occurrence of each high-level category ci, taken from
Fig. 3: The privacy taxonomy of Wilson et al. [11]. The top level of the hierarchy (shaded blocks) defines high-level privacy categories. The lower level defines a set of privacy attributes, each assuming a set of values. We show examples of values for some of the attributes.

the set of all categories C. In addition to allowing multiple categories per segment, using a multi-label classifier makes it possible to determine whether a category is present in a segment by simply comparing its classification probability to a threshold of 0.5.

Fig. 4: Components of the CNN-based classifier used.

At the lower level, a set of classifiers predicts one or more values for each privacy attribute (the leaves in the taxonomy of Fig. 3). We train a set of multi-label classifiers on the attribute-level. Each classifier produces the probabilities p(vj|x) for the values vj ∈ V(b) of a single attribute b. For example, given the attribute b=information type, the corresponding classifier outputs the probabilities for elements in V(b): {financial, location, user profile, health, demographics, cookies, contact information, generic personal information, unspecified, . . . }.

An important consequence of this hierarchy is that interpreting the output of the attribute-level classifier depends on the categories’ probabilities. For example, the values’ probabilities of the attribute “retention period” are irrelevant when the dominant high-level category is “policy change.” Hence, for a category ci, one would only consider the attributes descending from it in the hierarchy. We denote these attributes as A(ci) and the set of all values across these attributes as V(ci).

We use Convolutional Neural Networks (CNNs) internally within all the classifiers for two main reasons, which are also common in similar classification tasks. First, CNNs enable us to integrate pre-trained word embeddings that provide the classifiers with better generalization capabilities. Second, CNNs recognize when a certain set of tokens is a good indicator of the class, in a way that is invariant to their position within the input segment.

We use a similar CNN architecture for classifiers on both levels, as shown in Fig. 4. Segments are split into tokens, using PENN Treebank tokenization in NLTK [29]. The embeddings layer outputs the word vectors of these tokens. We froze that layer, preventing its weights from being updated, in order to preserve the learnt semantic similarity between all the words present in our Policies Embeddings. Next, the word vectors pass through a Convolutional layer, whose main role is applying a non-linear function (a Rectified Linear Unit (ReLU)) over windows of k words. Then, a max-pooling layer combines the vectors resulting from the different windows into a single vector. This vector then passes through the first dense (i.e., fully-connected) layer with a ReLU activation function, and finally through the second dense layer. A sigmoid operation is applied to the output of the last layer to obtain the probabilities for the possible output classes. We used multi-label cross-entropy loss as the classifier’s objective function. We refer interested readers to [30] for further elaborations on how CNNs are used in such contexts.

Models’ Training. In total, we trained 20 classifiers at the attribute level (including the optional attributes). We also trained two classifiers at the category level: one for classifying segments and the other for classifying free-form queries. For the former, we include all the classes in Fig. 3. For the latter, we ignore the “Other” category as it is mainly for introductory sentences or uncovered practices [11], which are not applicable to users’ queries. For training the classifiers, we used the data from 65 policies in the OPP-115 dataset, and we kept 50 policies as a testing set. The hyper-parameters for each classifier were obtained by running a randomized grid-search. In Table 1, we present the evaluation metrics on the testing set for the category classifier intended for free-form queries. In addition to the precision, recall and F1 scores (macro-averaged per label[3]), we also

[3] A successful multilabel classifier should not only predict the presence of a label, but also its absence. Otherwise, a model that predicts that all labels are present would have 100% precision and recall. For that, the precision in the table represents the macro-average of the precision in predicting the presence of each label and predicting its absence (similarly for recall and F1 metrics).
Table 1: Classification results for user queries at the category sible application is privacy-centered comparative shop-
level. Hyperparameters: Embeddings size: 300, Number of ping [33]. A user can build on Polisis’ output to auto-
filters: 200, Filter Size: 3, Dense Layer Size: 100, Batch Size: matically quantify the privacy utility of a certain policy.
40 For example, such a privacy metric could be a combi-
nation of positive scores describing privacy-protecting
Category Prec. Recall F1
Top-1
Support features (e.g., policy containing a segment with the la-
Prec.
bel: retention period: stated period ) and negative scores
1st Party Collection 0.80 0.80 0.80 0.80 1267 describing privacy-infringing features (e.g., policy con-
3rd Party Sharing 0.81 0.81 0.81 0.86 963
User Choice/Control 0.76 0.73 0.75 0.81 455
taining a segment with the label: retention period: un-
Data Security 0.87 0.86 0.87 0.77 202 limited ). A major advantage of automatically generat-
Specific Audiences 0.95 0.94 0.95 0.91 156 ing short notices is that they can be seamlessly refreshed
Access, Edit, Delete 0.94 0.75 0.82 0.97 134
Policy Change 0.96 0.89 0.92 0.93 120 when policies are updated or when the rules to generate
Data Retention 0.79 0.67 0.71 0.60 93 these notices are modified. Otherwise, discrepancies be-
Do Not Track 0.97 0.97 0.97 0.94 16
tween policies and notices might arise over time, which
Average 0.87 0.83 0.84 0.84
deters companies from adopting the short notices in the
first place.
By answering free-form queries with relevant policy
show the top-1 precision metric, representing the fraction segments, Polisis can remove the interface barrier be-
of segments where the top predicted category label oc- tween the policy and the users, especially in conver-
curs in the annotators’ ground-truth labels. As evident in sational interfaces (e.g., voice assistants and chatbots).
the table, our classifiers can predict the top-level privacy Taking a step further, Polisis’ output can be potentially
category with high accuracy. Although we consider the used to automatically rephrase the answer segments to a
problem in the multi-label setting, these metrics are sig- simpler language. A rule engine can generate text based
nificantly higher than the models presented in the orig- on the combination of predicted classes of an answer seg-
inal OPP-115 paper [11]. The full results for the rest ment (e.g., “We share data with third parties. This con-
of classifiers are presented in Appendix A. The efficacy cerns our users’ information, like your online activities. We
of these classifiers is further highlighted through queries need this to respond to requests from legal authorities”).
that directly leverage their output in the applications de-
Researchers: The difficultly of analyzing the data-
scribed next.
collection claims by companies at scale has often been
5 Application Layer cited as a limitation in ecosystem studies (e.g., [34]).
Polisis can provide the means to overcome that. For in-
Leveraging the power of the ML Layer’s classifiers, stance, researchers interested in analyzing apps that ad-
Polisis supports both structured and free-from queries mit collecting health data [35, 36] could utilize Polisis to
about a privacy policy’s content. A structured query query a dataset of app policies. One example query can
is a combination of first-order logic predicates over be formed by joining the label information type: health
the predicted privacy classes and the policy segments, with the category of First Party Collection or Third Party
such as: ∃s (s ∈ policy ∧ information type(s)=location ∧ Sharing.
purpose(s) = marketing ∧ user choice(s)=opt-out). On Regulators: Numerous studies from regulators and
the other hand, a free-form query is simply a natural lan- law and public policy researchers have manually ana-
guage question posed directly by the users, such as “do lyzed the permissiveness of compliance checks [21, 37].
you share my location with third parties?”. The response
The number of assessed privacy policies in these stud-
to a query is the set of segments satisfying the predicates ies is typically in the range of tens of policies. For in-
in the case of a structured query or matching the user’s stance, the Norwegian Consumer Council has investi-
question in the case of a free-form query. The Appli- gated the level of ambiguity in defining personal infor-
cation Layer builds on these query types to enable an ar- mation within only 20 privacy policies [37]. Polisis can
ray of applications for different privacy stakeholders. We scale such studies by processing a regulator’s queries on
take an exemplification approach to give the reader a bet- large datasets. For example, with Polisis, policies can
ter intuition on these applications, before delving deeper be ranked according to an automated ambiguity met-
into two of them in the next sections. ric by using the information type attribute and differ-
Users: Polisis can automatically populate several of the entiating between the label generic personal information
previously-proposed short notices for privacy policies, and other labels specifying the type of data collected.
such as nutrition tables and privacy icons [3, 18, 31, 32]. Similarly, this applies to frameworks such as Privacy
This task can be achieved by mapping the notices to Shield [12] and the GDPR [15], where issues such as
a set of structured queries (cf. Sec. 6). Another pos- limiting the data usage purposes should be investigated.
6 Privacy Icons

Our first application shows the efficacy of Polisis in resolving structured queries to privacy policies. As a case study, we investigate the Disconnect privacy icons [18], described in the first three columns of Table 2. These icons evolved from a Mozilla-led working group that included the Electronic Frontier Foundation, Center for Democracy and Technology, and the W3C. The database powering these icons originated from TRUSTe (re-branded later as TrustArc), a privacy compliance company, which carried out the task of manually analyzing and labeling privacy policies.

In what follows, we first establish the accuracy of Polisis' automatic assignment of privacy icons, using the Disconnect icons as a proof-of-concept. We perform a direct comparison between assigning these icons via Polisis and assigning them based on annotations by law students [11]. Second, we leverage Polisis to investigate the level of permissiveness of the icons that Disconnect assigns based on the TRUSTe dataset. Our findings are consistent with the series of concerns raised around compliance-checking companies over the years [21, 38, 39]. This demonstrates the power of Polisis in scalable, automated auditing of privacy compliance checks.

6.1 Predicting Privacy Icons

Given that the rules behind the Disconnect icons are not precisely defined, we translated their description into explicit first-order logic queries to enable automatic processing. Table 2 shows the original description and color assignment provided by Disconnect. We also show our interpretation of each icon in terms of labels present in the OPP-115 dataset and the automated assignment of colors based on these labels. Our goal is not to reverse-engineer the logic behind the creation of these icons but to show that we can automatically assign such icons with high accuracy, given a plausible interpretation. Hence, this represents our best effort to reproduce the icons, but these rules could easily be adapted as needed.

To evaluate the efficacy of automatically selecting appropriate privacy icons, we compare the icons produced with Polisis' automatic labels to the icons produced based on the law students' annotations from the OPP-115 dataset [11]. We perform the evaluation over the same set of 50 privacy policies which we did not use to train Polisis (i.e., kept aside as a testing set). Each segment in the OPP-115 dataset has been labeled by three experts. Hence, we take the union of the experts' labels on one hand and the predicted labels from Polisis on the other hand. Then, we run the logic presented in Table 2 (Columns 4 and 5) to assign icons to each policy based on each set of labels.

Table 3 shows the accuracy obtained per icon, measured as the fraction of policies where the icon based on the automatic labels matched the icon based on the experts' labels. The average accuracy across icons is 88.4%, showing the efficacy of our approach in matching the experts' aggregated annotations. This result is significant in view of Miyazaki and Krishnamurthy's finding [21]: the level of agreement among 3 trained human judges assessing privacy policies ranged from 88.3% to 98.3%, with an average of 92.7% agreement overall. We also show Cohen's κ, an agreement measure that accounts for agreement due to random chance (footnote 4). In our case, the values indicate substantial to almost perfect agreement [40]. Finally, we show the distribution of icons based on the experts' labels alongside the Hellinger distance (footnote 5), which measures the difference between that distribution and the one produced using the automatic labels. This distance assumes small values, illustrating that the distributions are very close. Overall, these results support the potential of automatically assigning privacy icons with Polisis.

6.2 Auditing Compliance Metrics

Given that we achieve a high accuracy in assigning privacy icons, it is intuitive to investigate how they compare to the icons assigned by Disconnect and TRUSTe. An important consideration in this regard is that several concerns have been raised earlier around the level of leniency of TRUSTe and other compliance companies [19, 20, 38, 39]. In 2000, the FTC conducted a study on privacy seals, including those of TRUSTe, and found that, of the 27 sites with a privacy seal, approximately only half implemented, at least in part, all four of the fair information practice principles, and that only 63% implemented Notice and Choice. Hence, we pose the following question: Can we automatically provide evidence of the level of leniency of the Disconnect icons using Polisis? To answer this question, we designed an experiment to compare the icons extracted by Polisis' automatic labels to the icons assigned by Disconnect on real policies.

One obstacle we faced is that the Disconnect icons were announced in June 2014 [41]; many privacy policies have likely been updated since then. To ensure that the privacy policies we consider are within a close time frame to those used by Disconnect, we make use of Ramanath et al.'s ACL/COLING 2014 dataset [42]. This dataset contains the body of 1,010 privacy policies extracted between December 2013 and January 2014. We obtained the icons for the same set of sites using the Disconnect privacy icons extension [18]. Of these, 354 policies had been (at least partially) annotated in the Disconnect dataset. We automatically assign the icons for these sites by passing their policy contents into Polisis and applying the rules in Table 2 on the generated automatic la-

Footnote 4: https://en.wikipedia.org/wiki/Cohen%27s_kappa
Footnote 5: https://en.wikipedia.org/wiki/Hellinger_distance
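The two measures referenced in footnotes 4 and 5 can be sketched as follows; the inputs below are toy values, not the paper's data:

```python
import math
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two paired label sequences, corrected for chance."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[l] * cb[l] for l in ca) / (n * n)              # by chance
    return (p_o - p_e) / (1 - p_e)

def hellinger(p, q):
    """Distance between two discrete distributions (dicts: color -> prob)."""
    colors = set(p) | set(q)
    return math.sqrt(sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in colors)) / math.sqrt(2)

# Toy checks: partial agreement yields 0 < kappa < 1; identical
# distributions are at distance 0.
print(cohens_kappa(["red", "green", "red", "yellow"],
                   ["red", "green", "red", "red"]))
print(hellinger({"red": 0.8, "yellow": 0.2}, {"red": 0.8, "yellow": 0.2}))
```

Both functions operate on per-policy icon colors, which matches how Table 3 reports κ and the Hellinger distance per icon.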
Table 2: The list of Disconnect icons with their description, our interpretation, and Polisis' queries.

Expected Use
- Disconnect Description: Discloses whether data it collects about you is used in ways other than you would reasonably expect given the site's service?
- Disconnect Color Assignment: Red: Yes, w/o choice to opt-out. Or, undisclosed. Yellow: Yes, with choice to opt-out. Green: No.
- Interpretation as Labels: Let S be the segments with category: first-party-collection-use and purpose: advertising.

Expected Collection
- Disconnect Description: Discloses whether it allows other companies like ad providers and analytics firms to track users on the site?
- Disconnect Color Assignment: Red: Yes, w/o choice to opt-out. Or, undisclosed. Yellow: Yes, with choice to opt-out. Green: No.
- Interpretation as Labels: Let S be the segments with category: third-party-sharing-collection, purpose ∈ [advertising, analytics-research], and action-third-party ∈ [track-on-first-party-website-app, collect-on-first-party-website-app].

Precise Location
- Disconnect Description: Discloses whether the site or service tracks a user's actual geolocation?
- Disconnect Color Assignment: Red: Yes, possibly w/o choice. Yellow: Yes, with choice. Green: No.
- Interpretation as Labels: Let S be the segments with personal-information-type: location.

Automated Color Assignment (for the three icons above): Yellow: All segments in S have category: user-choice-control and choice-type ∈ [opt-in, opt-out-link, opt-out-via-contacting-company]. Green: S = φ. Red: Otherwise.

Data Retention
- Disconnect Description: Discloses how long they retain your personal data?
- Disconnect Color Assignment: Red: No data retention policy. Yellow: 12+ months. Green: 0–12 months.
- Interpretation as Labels: Let S be the segments with category: data-retention.
- Automated Color Assignment: Green: All segments in S have retention-period ∈ [stated-period, limited]. Red: S = φ. Yellow: Otherwise.

Children Privacy
- Disconnect Description: Has this website received TrustArc's Children's Privacy Certification?
- Disconnect Color Assignment: Green: Yes. Gray: No.
- Interpretation as Labels: Let S be the segments with category: international-and-specific-audiences and audience-type: children.
- Automated Color Assignment: Green: length(S) > 0. Red: Otherwise.
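The automated color assignment for the Expected Use icon (first row of Table 2) can be sketched as a rule over the predicted labels. The flat per-segment encoding below is an illustrative assumption; in OPP-115, a segment can carry several labels:

```python
# Rule from Table 2 for the "Expected Use" icon. Each segment is a dict
# of predicted labels; this flat encoding is an illustrative assumption.

CHOICE_TYPES = {"opt-in", "opt-out-link", "opt-out-via-contacting-company"}

def expected_use_icon(segments):
    # S: segments disclosing first-party collection for advertising.
    S = [s for s in segments
         if s.get("category") == "first-party-collection-use"
         and s.get("purpose") == "advertising"]
    if not S:
        return "green"      # S = phi: no such use disclosed
    if all(s.get("choice_category") == "user-choice-control"
           and s.get("choice_type") in CHOICE_TYPES for s in S):
        return "yellow"     # every segment in S comes with a user choice
    return "red"            # otherwise

print(expected_use_icon([]))  # green
print(expected_use_icon([{"category": "first-party-collection-use",
                          "purpose": "advertising",
                          "choice_category": "user-choice-control",
                          "choice_type": "opt-in"}]))  # yellow
```

The other icons follow the same pattern with a different definition of S and, for Data Retention and Children Privacy, a different color rule.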
Table 3: Prediction accuracy and κ for icon prediction, with the distribution of icons per color based on OPP-115 labels.

Icon | Accuracy | Cohen κ | Hellinger distance | N(R) | N(G) | N(Y)
Exp. Use | 92% | 0.76 | 0.12 | 41 | 8 | 1
Exp. Collection | 88% | 0.69 | 0.19 | 35 | 12 | 3
Precise Location | 84% | 0.68 | 0.21 | 32 | 14 | 4
Data Retention | 80% | 0.63 | 0.13 | 29 | 16 | 5
Children Privacy | 98% | 0.95 | 0.02 | 12 | 38 | NA

bels. We report the results for the Expected Use and Expected Collection icons as they are directly interpretable by Polisis. We do not report the rest of the icons because the location information label in the OPP-115 taxonomy included non-precise location (e.g., zip codes), and there was no label that distinguishes the exact retention period. Moreover, the Children privacy icon is assigned through a certification process that does not solely rely on the privacy policy.

Fig. 5 shows the distribution of automatically extracted icons vs. the distribution of icons from Disconnect, when they were available. The discrepancy between the two distributions is obvious: the vast majority of the Disconnect icons have a yellow label, indicating that the policies offer the user an opt-out choice (from unexpected use or collection). The Hellinger distances between those distributions are 0.71 and 0.61 for Expected Use and Expected Collection, respectively (i.e., 3–5x the distance in Table 3).

This discrepancy might stem from our icon-assignment strategy in Table 2, where we assign a yellow label only when "All segments in S (the concerned subset)" include the opt-in/opt-out choice, which could be considered as conservative. In Fig. 6, we show the icon distributions when relaxing the yellow-icon condition to become: "At least one segment in S" includes the opt-in/opt-out choice. Intuitively, this means that the choice segment, when present, should explicitly mention advertising/analytics (depending on the icon type). Although the number of yellow icons increases slightly, the icons with the new permissive strategy are significantly red-dominated. The Hellinger distances between those distributions drop to 0.47 and 0.50 for Expected Use and Expected Collection, respectively. This result indicates that the majority of policies do not provide users a choice within the same segments describing data usage for advertising or data collection by third parties.

We go one step further to follow an even more permissive strategy where we assign the yellow label to any policy with S ≠ φ, given that there is at least one segment in the whole policy (i.e., even outside S) with an opt-in/opt-out choice. For example, a policy where third-party advertising is mentioned in the middle of the policy while the opt-out choice about another action is mentioned at the end of the policy would still receive a yellow label. The icon distributions, in this case, are illustrated in Fig. 7,
Fig. 5: Conservative icons' interpretation (panels: (a) Exp. Use, (b) Exp. Collection; bars: Polisis vs. TRUSTe; x-axis: Percentage %).
Fig. 6: Permissive icons' interpretation (same layout).
Fig. 7: Very permissive icons' interpretation (same layout).

with Hellinger distance of 0.22 for Expected Use and 0.19 for Expected Collection. Only in this interpretation of the icons would the distributions of Disconnect and Polisis come within reasonable proximity. In order to delve more into the factors behind this finding, we conducted a manual analysis of the policies. We found that, due to the way privacy policies are typically written, data collection and sharing are discussed in dedicated parts of the policy, without mentioning user choices. The choices (mostly opt-out) are discussed in a separate section when present, and they cover a small subset of the collected/shared data. In several cases, these choices are neither about the unexpected use (i.e., advertising) nor unexpected collection by third parties (i.e., advertising/analytics). Although our primary hypothesis is that this is due to TRUSTe's database being generally permissive, it can be partially attributed to a potential discrepancy between our versions of analyzed policies and the versions used by TRUSTe (despite our efforts to reduce this discrepancy).

6.3 Discussion

There was no loss of generality when considering only two of the icons; they provided the needed evidence of TRUSTe/TrustArc potentially following a permissive strategy when assigning icons to policies. A developer could still utilize Polisis to extract the rest of the icons by either augmenting the existing taxonomy or by performing additional natural language processing on the segments returned by Polisis. In the vast majority of the cases, whenever the icon definition is to be changed (e.g., to reflect a modification in the regulations), this change can be supported at the rules level, without modifying Polisis itself. This is because Polisis already predicts a comprehensive set of labels, covering a wide variety of rules.

Furthermore, by automatically generating icons, we do not intend to push humans completely out of the loop, especially in situations where legal liability issues might arise. Polisis can assist human annotators by providing initial answers to their queries and the supporting evidence. In other words, it accurately flags the segments of interest to an annotator's query so that the annotator can make a final decision.

7 Free-form Question-Answering

Our second application of Polisis is PriBot, a system that enables free-form queries (in the form of user questions) on privacy policies. PriBot is primarily motivated by the rise of conversation-first devices, such as voice-activated digital assistants (e.g., Amazon Alexa and Google Assistant) and smartwatches. For these devices, the existing techniques of linking to a privacy policy or reading it aloud are not usable. They might require the user to access privacy-related information and controls on a different device, which is not desirable in the long run [8].

To support these new forms of services and the emerging need for automated customer support in this domain [43], we present PriBot as an intuitive and user-friendly method to communicate privacy information. PriBot answers free-form user questions from a previously unseen privacy policy, in real time and with high accuracy. Next, we formalize the problem of free-form privacy QA and then describe how we leverage Polisis to build PriBot.

7.1 Problem Formulation

The input to PriBot consists of a user question q about a privacy policy. PriBot passes q to the ML layer and the policy's link to the Data Layer. The ML layer probabilistically annotates q and each policy's segments with the privacy categories and attribute-value pairs of Fig. 3. The segments in the privacy policy constitute the pool of candidate answers {a1, a2, ..., aM}. A subset G of the answer pool is the ground-truth. We consider an answer ak as correct if ak ∈ G and as incorrect if ak ∉ G. If G is empty, then no answers exist in the privacy policy.

7.2 PriBot Ranking Algorithm

Ranking Score: In order to answer the user question, PriBot ranks each potential answer a (footnote 6) by computing a proximity score s(q, a) between a and the question q. This is within the Class Comparison module of the Application Layer. To compute s(q, a), we proceed as follows. Given the output of the Segment Classifier, an answer is represented as a vector:

α = {p(ci|a)² × p(vj|a) | ∀ci ∈ C, vj ∈ V(ci)}

Footnote 6: For notational simplicity, we henceforth use a to indicate an answer instead of ak.
for categories ci ∈ C and values vj ∈ V(ci) descending from ci. Similarly, given the output of the Query Analyzer, the question is represented as:

β = {p(ci|q)² × p(vj|q) | ∀ci ∈ C, vj ∈ V(ci)}

The category probability in both α and β is squared to put more weight on the categories at the time of comparison. Next, we compute a certainty measure of the answer's high-level categorization. This measure is derived from the entropy of the normalized probability distribution (pn) of the predicted categories:

cer(a) = 1 − (−∑i (pn(ci|a) × ln(pn(ci|a)))) / ln(|C|)   (1)

Akin to a dot product between two vectors, we compute the score s(q, a) as:

s(q, a) = (∑i (βi × min(βi, αi)) / ∑i βi²) × cer(a)   (2)

As answers are typically longer than the question and involve a higher number of significant features, this score prioritizes the answers containing significant features that are also significant in the question. The min function and the denominator are used to normalize the score within the range [0, 1].

To illustrate the strength of PriBot and its answer-ranking approach, we consider the following question (posed by a Twitter user): "Under what circumstances will you release to 3rd parties?" Then, we consider two examples of ranked segments by PriBot. The first segment has a ranking score of 0.63: "Personal information will not be used or disclosed for purposes other than those for which it was collected, except with the consent of the individual or as required by law..." The second has a ranking score of 0: "All personal information collected by the TTC will be protected by using appropriate safeguards against loss, theft and unauthorized access, disclosure, copying, use or modification." Although both example segments share terms such as "personal" and "information," PriBot ranks them differently. It accounts for the fact that the question and the first segment share the same high-level category, 3rd Party Collection, while the second segment is categorized under Data Security.

Confidence Indicator: The ranking score is an internal metric that specifies how close each segment is to the question, but does not relay PriBot's certainty in reporting a correct answer to a user. Intuitively, the confidence in an answer should be low when (1) the answer is semantically far from the question (i.e., s(q, a) is low), (2) the question is interpreted ambiguously by Polisis (i.e., classified into multiple high-level categories, resulting in a high classification entropy), or (3) when the question contains unknown words (e.g., in a non-English language or with too many spelling mistakes). Taking into consideration these criteria, we compute a confidence indicator as follows:

conf(q, a) = s(q, a) × (cer(q) + frac(q)) / 2   (3)

where the categorization certainty measure cer(q) is computed similarly to cer(a) in Eq. (1), and s(q, a) is computed according to Eq. (2). The fraction of known words frac(q) is based on the presence of the question's words in the vocabulary of our Policies Embeddings' corpus.

Potentially Conflicting Answers: Another challenge is displaying potentially conflicting answers to users. One answer could describe a general sharing clause while another specifies an exception (e.g., one answer specifies "share" and another specifies "do not share"). To mitigate this issue, we used the same CNN classifier of Sec. 4 and exploited the fact that the OPP-115 dataset had optional labels of the form "does" vs. "does not" to indicate the presence or absence of sharing/collection. Our classifier had a cross-validation F1 score of 95%. Hence, we can use this classifier to detect potential discrepancies between the top-ranked answers. The UI of PriBot can thus highlight the potentially conflicting answers to the user.

8 PriBot Evaluation

We assess the performance of PriBot with two metrics: the predictive accuracy (Sec. 8.3) of its QA-ranking model and the user-perceived utility (Sec. 8.4) of the provided answers. This is motivated by research on the evaluation of recommender systems, where the model with the best accuracy is not always rated to be the most helpful by users [44].

8.1 Twitter Dataset

In order to evaluate PriBot with realistic privacy questions, we created a new privacy QA dataset. It is worth noting that we utilize this dataset for the purpose of testing PriBot, not for training it. Our requirements for this dataset were that it (1) must include free-form questions about the privacy policies of different companies and (2) must have a ground-truth answer for each question from the associated policy.

To this end, we collected, from Twitter, privacy-related questions users had tweeted at companies. This approach avoids subject bias, which is likely to arise when eliciting privacy-related questions from individuals, who will not pose them out of genuine need. In our collection methodology, we aimed at a QA test set of size between 100 and 200 QA pairs, as is the convention in similar human-annotated QA evaluation domains, such
as the Text REtrieval Conference (TREC) and SemEval-2015 [45, 46, 47].

To avoid searching for questions via biased keywords, we started by searching for reply tweets that direct the users to a company's privacy policy (e.g., using queries such as "filter:replies our privacy policy" and "filter:replies our privacy statement"). We then backtracked these reply tweets to the (parent) question tweets asked by customers to obtain a set of 4,743 pairs of tweets, containing privacy questions but also substantial noise due to the backtracking approach. Following the best practices of noise reduction in computational social science, we automatically filtered the tweets to keep those containing question marks, at least four words (excluding links, hashtags, mentions, numbers and stop words), and a link to the privacy policy, leaving 260 pairs of question–reply tweets. This is an example of a tweet pair which was removed by the automatic filtering:

Question: "@Nixxit your site is very suspicious."
Answer: "@elitelinux Updated it with our privacy policy. Apologies, but we're not fully up yet and running shoe string."

Next, two of the authors independently validated each of the tweets to remove question tweets (a) that were not related to privacy policies, (b) to which the replies are not from the official company account, and (c) with inaccessible privacy policy links in their replies. The level of agreement (Cohen's Kappa) among both annotators for the labels valid vs. invalid was almost perfect (κ = 0.84) [40]. The two annotators agreed on 231 of the question tweets (of the 260), tagging 182 as valid and 49 as invalid. This is an example of a tweet pair which was annotated as invalid:

Question: "What is your worth then? You can't do it? Nuts."
Answer: "@skychief26 3/3 You can view our privacy policy at http://t.co/ksmaIK1WaY. Thanks."

This is an example of a tweet pair annotated as valid:

Question: "@myen Are Evernote notes encrypted at rest?"
Answer: "We're not encrypting at rest, but are encrypting in transit. Check out our Privacy Policy here: http://bit.ly/1tauyfh."

As we wanted to evaluate the answers to these questions with a user study, our estimates of an adequately-sized study led us to randomly sample 120 tweets out of the tweets which both annotators labeled as valid questions. We henceforth refer to them as the Twitter QA Dataset. It is worth mentioning that although our QA applications extend beyond the Twitter medium, this kind of questions is as close as we can get to testing with the worst-case scenario: informal discourse, with spelling and grammar errors, that is targeted at humans.

8.2 QA Baselines

We compare PriBot's QA model against three baseline approaches that we developed: (1) Retrieval, which reflects the state-of-the-art in term-matching retrieval algorithms; (2) SemVec, a single neural network classifier; and (3) Random, a control approach where questions are answered with random policy segments.

Our first baseline, Retrieval, builds on the BM25 algorithm [48], which is the state-of-the-art in ranking models employing term-matching. It has been used successfully across a range of search tasks, such as the TREC evaluations [49]. We improve on the basic BM25 model by computing the inverse document frequency on the Policies Corpus of Sec. 4.2 instead of a single policy. Retrieval ranks the segments in the policy according to their similarity score with the user's question. This score depends on the presence of distinctive words that link a user's question to an answer.

Our second baseline, SemVec, employs a single classifier trained to distinguish among all the (mandatory) attribute-values (with > 20 annotations) from the OPP-115 dataset (81 classes in total). An example segment is "geographic location information or other location-based information about you and your device". We obtain a micro-average precision of 0.56 (i.e., the classifier is, on average, predicting the right label across the 81 classes in 56% of the cases, compared to 3.6% precision for a random classifier). After training this model, we extract a "semantic vector": a representation vector that accounts for the distribution of attribute values in the input text. We extract this vector as the input to the second dense layer (shown in Fig. 4). SemVec ranks the similarity between a question and a policy segment using the Euclidean distance between semantic vectors. This approach is similar to what has been applied previously in image retrieval, where image representations learned from a large-scale image classification task were effective in visual search applications [50].

8.3 Predictive Accuracy Evaluation

Here, we evaluate the predictive accuracy of PriBot's QA model by comparing its predicted answers against expert-generated ground-truth answers for the questions of the Twitter QA Dataset.

Ground-Truth Generation: Two of the authors generated the ground-truth answers to the questions from the Twitter QA Dataset. They were given a user's question (tweet) and the segments of the corresponding policy. Each policy consists of 45 segments on average (min=12, max=344, std=37). Each annotator independently selected the subset of these segments which they consider as best responding to the user's question. This annotation took place prior to generating the answers using our models to avoid any bias. While deciding on the answers,
the annotators accounted for the fact that multiple segments of the policy might answer a question.

Fig. 8: Accuracy metrics as a function of k ((a) top-k score, (b) NDCG; curves for Random, Retrieval, SemVec, and PriBot).

After finishing the individual annotations, the two annotators consolidated the differences in their labels to reach an agreed-on set of segments, each assumed to be answering the question. We call this the ground-truth set for each question. The annotators agreed on at least one answer in 88% of the questions for which they found matching segments, thus signifying a substantial overlap. Cohen's κ, measuring the agreement on one or more answers, was 0.65, indicating substantial agreement [40]. We release this dataset, comprising the questions, the policy segments, and the ground-truth answers per question, at https://pribot.org/data.html.

We then generated, for each question, the predicted ranked list of answers according to each QA model (PriBot and the other three baselines). In what follows, we evaluate the predictive accuracy of these models.

Top-k Score: We first report the top-k score, a widely used and easily interpretable metric, which denotes the portion of questions having at least one correct answer in the top k returned answers. It is desirable to achieve a high top-k score for low values of k so that the user has to process less information before reaching a correct answer. Fig. 8a shows how the top-k score varies as a function of k. PriBot's model has the best performance over the other three models by a large margin, especially at the low values of k. For example, at k = 1, PriBot has a top-k score of 0.68, which is significantly larger than the scores of 0.39 (Retrieval), 0.27 (SemVec), and 0.08 (Random) (p-value < 0.05 according to pairwise Fisher's exact test, corrected with the Bonferroni method for multiple comparisons). PriBot further reaches a top-k score of 0.75, 0.82, and 0.87 for k ∈ {2, 3, 4}. To put these numbers in the wider context of free-form QA systems, we note that the top-1 accuracy reported by IBM Watson's team on a large insurance domain dataset (a training set of 12,889 questions and 21,325 answers) was 0.65 in 2015 [51] and was later improved to 0.69 in 2016 [52]. Given that PriBot had to overcome the absence of publicly available QA datasets, our top-1 accuracy value of 0.68 is on par with such systems. We also observe that the Retrieval model outperforms the SemVec model. This result is not entirely surprising since we seeded Retrieval with a large corpus of 130K unsupervised policies, thus improving its performance on answers with matching terms.

Policy Length: We now assess the impact of the policy length on PriBot's accuracy. First, we report the Normalized Discounted Cumulative Gain (NDCG) [53]. Intuitively, it indicates that a relevant document's usefulness decreases logarithmically with the rank. This metric captures how presenting the users with more choices affects their user experience, as they need to process more text. Also, it is not biased by the length of the policy. The DCG part of the metric is computed as DCG_k = ∑_{i=1}^{k} rel_i / log2(i + 1), where rel_i is 1 if answer ai is correct and 0 otherwise. NDCG at k is obtained by normalizing the DCG_k with the maximum possible DCG_k across all values of k. We show in Fig. 8b the average NDCG across questions for each value of k. It is clear that PriBot's model consistently exhibits superior NDCG. This indicates that PriBot is poised to perform better in a system where low values of k matter the most.

Second, to further focus on the effect of policy length, we categorize the policy lengths (#segments) into short, medium, and high, based on the 33rd and the 66th percentiles (i.e., corresponding to #segments of 28 and 46). We then compute a metric independent of k, namely, the Mean Average Precision (MAP), which is the mean of the area under the precision-recall curve across all questions. Informally, MAP is an indicator of whether all the correct answers get ranked highly. We see from Fig. 9 that, for short policies, the Retrieval model is within 15% of the MAP of PriBot's model, which makes sense given the smaller number of potential answers. With medium-sized policies, PriBot's model is better by a large margin. This margin is still considerable with long policies.

Confidence Indicator: Comparing the confidence (using the indicator from Eq. (3)) of incorrect answers predicted by PriBot (mean=0.37, variance=0.04) with the confidence of correct answers (mean=0.49, variance=0.05) shows that PriBot places lower confidence in the answers that turn out to be incorrect. Hence, we can use the confidence indicator to filter out the incorrect answers. For example, by setting the condition conf(q, a) ≥ 0.6 to accept PriBot's answers, we can enhance the top-1 accuracy to 70%. This indicator delivers another advantage: its components are independently interpretable by the application logic. If the score s(q, a) of the top-1 answer is too low, the user can be notified that the policy might not contain an answer to the question. A low value of cer(q) indicates that the user might have asked an ambiguous question; the system can ask the user back for a clarification.

Pre-trained Embeddings Choice: As discussed in Sec. 4, we utilize our custom Policies Embeddings,
Fig. 9: Variation of MAP across policy lengths (x-axis: short/medium/long; y-axis: Mean Average Precision; bars for Random, Retrieval, SemVec, and PriBot).
Fig. 10: top-k score of PriBot with different pre-trained embeddings (WP-NoSub, WP, PE-NoSub, PE).
Fig. 11: An example of a QA pair displayed to the respondents.
Pre-trained Embeddings Choice: As discussed in Sec. 4, we utilize our custom Policies Embeddings, which have the two properties of (1) being domain-specific and (2) using subword embeddings to handle out-of-vocabulary words. We test the efficacy of this choice by studying three variants of pre-trained embeddings. For the first variant, we start from our Policies Embeddings (PE) and disable the subword mode, thus only satisfying the first property; we call it PE-NoSub. The second variant is the fastText Wikipedia Embeddings from [54], trained on the English Wikipedia, thus only satisfying the second property; we denote it as WP. The third variant is WP with the subword mode disabled, thus satisfying neither property; we call it WP-NoSub. In Fig. 10, we show the top-k score of PriBot on our Twitter QA dataset with each of the four pre-trained embeddings. First, we can see that our Policies Embeddings outperform the other models for all values of k, scoring 14% and 5% more than the closest variant at k = 1 and k = 2, respectively. As expected, the domain-specific model without subword embeddings (PE-NoSub) has a weaker performance by a significant margin, especially for the top-1 answer. Interestingly, the difference is much narrower between the two Wikipedia embeddings since their vocabulary already covers more than 2.5M tokens; hence, subword embeddings play a less pronounced role there. In sum, the advantage of using subword embeddings with the PE model originates from their domain specificity and their ability to compensate for the words missing from the vocabulary.

8.4 User-Perceived Utility Evaluation

We conducted a user study to assess the user-perceived utility of the automatically generated answers. This assessment was done for each of the four different conditions (Retrieval, SemVec, PriBot, and Random). We evaluated the top-3 responses of each QA approach to each question. Thus, we assess the utility of 360 answers to 120 questions per approach.

Study Design: We used a between-subject design by constructing four surveys, each corresponding to a different evaluation condition. We display a series of 17 QA pairs (each on a different page). Of these, 15 are a random subset of the pool of 360 QA pairs (of the evaluated condition) such that a participant does not receive two QA pairs with the same question. The other two questions are randomly positioned anchor questions serving as attention checkers. Additionally, we enforce a minimum duration of 15 seconds for the respondent to evaluate each QA pair, with no maximum duration enforced. We include an open-ended Cloze reading comprehension test [55]; we used the test to weed out responses with a low score, indicating poor reading skills.

Participant Recruitment: After obtaining IRB approval, we recruited 700 Amazon MTurk workers with a previous success rate >95% to complete our survey. With this number of users, each QA pair received evaluations from at least 7 different individuals. We compensated each respondent with $2. With an average completion time of 14 minutes, this makes the average pay around $8.6 per hour (the US federal minimum wage is $7.25). While not fully representative of the general population, our set of participants exhibited high intra-group diversity, but little difference across the respondent groups. Across all respondents, the average age is 34 years (std=10.5), 62% are males, 38% are females, more than 82% are from North America, more than 87% have some level of college education, and more than 88% reported being employed.

QA Pair Evaluation: To evaluate the relevance of a QA pair, we display the question and the candidate answer as shown in Fig. 11. We asked the respondents to rate whether the candidate response provides an answer to the question on a 5-point Likert scale (1=Definitely Yes to 5=Definitely No), as evident in Fig. 11. We denote a respondent's evaluation of a single candidate answer corresponding to a QA pair as relevant (irrelevant) if s/he chooses either Definitely Yes (Definitely No) or Partially Yes (Partially No). We consolidate the evaluations of multiple users per answer by following the methodology outlined in similar studies [10], which consider the answer as relevant if labeled as relevant by a certain fraction of users. We took this fraction as 50% to ensure a majority agreement.
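The consolidation rule just described (an answer counts as relevant when at least 50% of its evaluators chose one of the two "Yes" options) can be sketched as follows. The label strings, including the middle Likert option, are illustrative stand-ins rather than the exact wording shown to respondents.

```python
# Illustrative sketch of majority-vote consolidation of Likert ratings.
# A rating counts toward relevance if it is "Definitely Yes" or
# "Partially Yes"; the answer is relevant when such ratings reach the
# given fraction (50% here, as in the study).

RELEVANT_CHOICES = {"Definitely Yes", "Partially Yes"}

def consolidate(ratings, fraction=0.5):
    """Return True if the share of relevant ratings reaches `fraction`."""
    relevant = sum(1 for r in ratings if r in RELEVANT_CHOICES)
    return relevant / len(ratings) >= fraction

votes = ["Definitely Yes", "Partially Yes", "Neutral", "Definitely No",
         "Partially Yes", "Definitely Yes", "Partially No"]
print(consolidate(votes))  # 4 of 7 ratings are relevant (~57%) -> True
```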
Generally, we observed the respondents to agree on the relevance of the answers. Highly mixed responses, where 45–55% of the workers tagged the answer as relevant, constituted less than 16% of the cases.

Table 4: top-k relevance score by evaluation group.

Group       N     k=1    k=2    k=3
Random      180   0.37   0.59   0.76
Retrieval   184   0.46   0.71   0.79
SemVec      153   0.48   0.71   0.85
PriBot      183   0.70   0.78   0.89

User Study Results: As in the previous section, we compute the top-k score for relevance (i.e., the portion of questions having at least one user-relevant answer in the top k returned answers). Table 4 shows this score for the four QA approaches with k ∈ {1, 2, 3}, where PriBot clearly outperforms the three baseline approaches. The respondents regarded at least one of the top-3 answers as relevant for 89% of the questions, with the first answer being relevant in 70% of the cases. In comparison, for k = 1, the scores were 46% and 48% for the Retrieval and the SemVec models, respectively (p-value ≤ 0.05 according to pairwise Fisher's exact test, corrected with the Holm-Bonferroni method for multiple comparisons). An avid reader might notice some differences between the predictive models' accuracy (Section 8.3) and the users' perceived quality. This is actually consistent with observations from research in recommender systems, where prediction accuracy does not always match user satisfaction [44]. For example, the top-k score metric for accuracy differs by 2%, -3%, and 6% with respect to the perceived relevance in the PriBot model. Another example is that the SemVec and the Retrieval models have smaller differences in this study than in Sec. 8.3. We conjecture that the score shift with the SemVec model is due to some users accepting answers which match the question's topic even when the actual details of the answer are irrelevant.

9 Discussion

Limitations: Polisis might be limited by the employed privacy taxonomy. Although the OPP-115 taxonomy covers a wide variety of privacy practices [11], there are certain types of applications that it does not fully capture. One mitigation is to use Polisis as an initial step in order to filter the relevant data at a high level before applying additional, application-specific text processing. Another mitigation is to leverage Polisis' modularity by amending it with new categories/attributes and training these new classes on the relevant annotated dataset.

Moreover, Polisis, like any automated approach, exhibits instances of misclassification that should be accounted for in any application building on it. One way to mitigate this problem is using confidence scores, similar to that of Eq. (3), to convey the (un)certainty of a reported result, whether it is an answer, an icon, or another form of short notice. Last but not least, Polisis is not guaranteed to be robust in handling an adversarially constructed privacy policy. An adversary could include valid and meaningful statements in the privacy policy, carefully crafted to mislead Polisis' automated classifiers. For example, an adversary can replace words in the policy with synonyms that are far apart in our embeddings space. While the modified policy has the same meaning, Polisis might misclassify the modified segments.

Deployment: We provide three prototype web applications for end-users. The first is an application that visualizes the different aspects in the privacy policy, powered by the annotations from Polisis (available as a web application and a browser extension for Chrome and Firefox). The second is a chatbot implementation of PriBot for answering questions about privacy policies in a conversational interface. The third is an application for extracting the privacy labels from several policies, given their links. These applications are available at https://pribot.org.

Legal Aspects: We also want to stress that Polisis is not intended to replace the legally binding privacy policy. Rather, it offers a complementary interface for privacy stakeholders to easily inquire about the contents of a privacy policy. Following the trend of automation in legal advice [56], insurance claim resolution [57], and privacy policy presentation [58, 16], third parties, such as automated legal services firms or regulators, can deploy Polisis as a solution for their users. As is the standard in such situations, these parties should amend Polisis with a disclaimer specifying that it is based on automatic analysis and does not represent the actual service provider [59].

Companies and service providers can internally deploy an application similar to PriBot as an assistance tool for their customer support agents to handle privacy-related inquiries. Putting the human in the loop allows for a favorable trade-off between the utility of Polisis and its legal implications. For a wider discussion of the issues surrounding automated legal analysis, we refer the interested reader to the works of McGinnis and Pearce [60] and Pasquale [61].

Privacy-Specificity of the Approach: Finally, our approach is uniquely tailored to the privacy domain, both from the data perspective and from the model-hierarchy perspective. However, we envision that applications with similar needs would benefit from extensions of our approach, both on the classification level and the QA level.

10 Related Work

Privacy Policy Analysis: There have been numerous attempts to create easy-to-navigate and alternative presentations of privacy policies. Kelley et al. [32] studied using nutrition labels as a paradigm for displaying privacy notices. Icons representing the privacy policies have also been proposed [31, 62]. Others have proposed standards to push service providers to encode privacy policies in a machine-readable format, such as P3P [13], but they have not been adopted by browser developers and service providers. Polisis has the potential to automate the generation of a lot of these notices, without relying on the respective parties to do it themselves.

Recently, several researchers have explored the potential of automated analysis of privacy policies. For example, Liu et al. [58] have used deep learning to model the vagueness of words in privacy policies. Zimmeck et al. [63] have been able to show significant inconsistencies between app practices and their privacy policies via automated analysis. These studies, among others [64, 65], have been largely enabled by the release of the OPP-115 dataset by Wilson et al. [11], containing 115 privacy policies extensively annotated by law students. Our work is the first to provide a generic system for the automated analysis of privacy policies. In terms of the comprehensiveness and the accuracy of the approach, Polisis makes a major improvement over the state of the art. It allows transitioning from labeling of policies with a few practices (e.g., the works by Zimmeck and Bellovin [16] and Sathyendra et al. [17]) to a much more fine-grained annotation (up to 10 high-level and 122 fine-grained classes), thus enabling a richer set of applications.

Evaluating the Compliance Industry: Regulators and researchers are continuously scrutinizing the practices of the privacy compliance industry [21, 38, 39]. Miyazaki and Krishnamurthy [21] found no support that participating in a seal program is an indicator of following privacy practice standards. The FTC has found discrepancies between the practical behaviors of the companies, as reported in their privacy policies, and the privacy seals they have been granted [39]. Polisis can be used by these researchers and regulators to automatically and continuously perform such checks at scale. It can provide the initial evidence that could be processed by skilled experts afterward, thus reducing the analysis time and the cost.

Automated Question Answering: Our QA system, PriBot, is focused on non-factoid questions, which are usually complex and open-ended. Over the past few years, deep learning has yielded superior results to traditional retrieval techniques in this domain [51, 52, 66]. Our main contribution is that we build a QA system, without a dataset that includes questions and answers, while achieving results on par with the state of the art in other domains. We envision that our approach could be transplanted to other problems that face similar issues.

11 Conclusion

We proposed Polisis, the first generic framework that enables detailed automatic analysis of privacy policies. It can assist users, researchers, and regulators in processing and understanding the content of privacy policies at scale. To build Polisis, we developed a new hierarchy of neural networks that extracts both high-level privacy practices and fine-grained information from privacy policies. Using this extracted information, Polisis enables several applications. In this paper, we demonstrated two applications: structured and free-form querying. In the first example, we use Polisis' output to extract short notices from the privacy policy in the form of privacy icons and to audit TRUSTe's policy analysis approach. In the second example, we build PriBot, which answers users' free-form questions in real time and with high accuracy. Our evaluation of both applications reveals that Polisis matches the accuracy of expert analysis of privacy policies. Besides these applications, Polisis opens opportunities for further innovative privacy policy presentation mechanisms, including summarizing policies into simpler language. It can also enable comparative shopping applications that advise the consumer by comparing the privacy aspects of multiple applications they want to choose from.

Acknowledgements

This research was partially funded by the Wisconsin Alumni Research Foundation and the US National Science Foundation under grant agreements CNS-1330596 and CNS-1646130.

References

[1] F. H. Cate, "The limits of notice and choice," IEEE Security & Privacy, vol. 8, no. 2, pp. 59–62, March 2010.

[2] Federal Trade Commission, "Protecting Consumer Privacy in an Era of Rapid Change," March 2012.

[3] J. Gluck, F. Schaub, A. Friedman, H. Habib, N. Sadeh, L. F. Cranor, and Y. Agarwal, "How short is too short? Implications of length and framing on the effectiveness of privacy notices," in Twelfth Symposium on Usable Privacy and Security (SOUPS 2016). Denver, CO: USENIX Association, 2016, pp. 321–340.

[4] A. M. McDonald and L. F. Cranor, "The cost of reading privacy policies," ISJLP, vol. 4, p. 543, 2008.

[5] President's Council of Advisors on Science and Technology, "Big data and privacy: A technological perspective. Report to the President, Executive Office of the President," May 2014.

[6] F. Schaub, R. Balebako, and L. F. Cranor, "Designing effective privacy notices and controls," IEEE Internet Computing, vol. 21, no. 3, pp. 70–77, 2017.

[7] Federal Trade Commission, "Internet of Things, Privacy & Security in a Connected World," Jan. 2015.

[8] F. Schaub, R. Balebako, A. L. Durity, and L. F. Cranor, "A design space for effective privacy notices," in Eleventh Symposium On Usable Privacy and Security (SOUPS 2015). Ottawa: USENIX Association, 2015, pp. 1–17.
[9] A. Rao, F. Schaub, N. Sadeh, A. Acquisti, and R. Kang, "Expecting the unexpected: Understanding mismatched privacy expectations online," in Twelfth Symposium on Usable Privacy and Security (SOUPS 2016). Denver, CO: USENIX Association, 2016, pp. 77–96.

[10] S. Wilson, F. Schaub, R. Ramanath, N. Sadeh, F. Liu, N. A. Smith, and F. Liu, "Crowdsourcing annotations for websites' privacy policies: Can it really work?" in Proceedings of the 25th International Conference on World Wide Web, ser. WWW '16. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2016, pp. 133–143.

[11] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, T. B. Norton, E. H. Hovy, J. R. Reidenberg, and N. M. Sadeh, "The creation and analysis of a website privacy policy corpus," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.

[12] U.S. Department of Commerce, "Privacy shield program overview," https://www.privacyshield.gov/Program-Overview, 2017, accessed: 10-01-2017.

[13] L. Cranor, Web privacy with P3P. O'Reilly Media, Inc., 2002.

[14] P. G. Kelley, J. Bresee, L. F. Cranor, and R. W. Reeder, "A "nutrition label" for privacy," in Proceedings of the 5th Symposium on Usable Privacy and Security, ser. SOUPS '09. New York, NY, USA: ACM, 2009, pp. 4:1–4:12.

[15] "Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)," Official Journal of the European Union, vol. L119, pp. 1–88, May 2016.

[16] S. Zimmeck and S. M. Bellovin, "Privee: An architecture for automatically analyzing web privacy policies," in USENIX Security, vol. 14, 2014.

[17] K. M. Sathyendra, S. Wilson, F. Schaub, S. Zimmeck, and N. Sadeh, "Identifying the provision of choices in privacy policy text," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2764–2769.

[18] Disconnect, "Privacy Icons," https://web.archive.org/web/20170709022651/disconnect.me/icons, accessed: 07-01-2017.

[19] B. Edelman, "Adverse selection in online "trust" certifications," in Proceedings of the 11th International Conference on Electronic Commerce, ser. ICEC '09. New York, NY, USA: ACM, 2009, pp. 205–212.

[20] T. Foremski, "TRUSTe responds to Facebook privacy problems..." http://www.zdnet.com/article/truste-responds-to-facebook-privacy-problems/, 2017, accessed: 2017-10-01.

[21] A. D. Miyazaki and S. Krishnamurthy, "Internet seals of approval: Effects on online privacy policies and consumer perceptions," Journal of Consumer Affairs, vol. 36, no. 1, pp. 28–49, 2002.

[22] G. Glavaš, F. Nanni, and S. P. Ponzetto, "Unsupervised text segmentation using semantic relatedness graphs," in *SEM 2016: The Fifth Joint Conference on Lexical and Computational Semantics, August 11-12, 2016, Berlin, Germany. Stroudsburg, PA: Association for Computational Linguistics, 2016, pp. 125–130.

[23] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, 2014, pp. 1746–1751.

[24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

[25] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for twitter sentiment classification," in ACL (1), 2014, pp. 1555–1565.

[27] N. Viennot, E. Garcia, and J. Nieh, "A measurement study of google play," in ACM SIGMETRICS Performance Evaluation Review, vol. 42, no. 1. ACM, 2014, pp. 221–233.

[28] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, 2016.

[29] S. Bird and E. Loper, "NLTK: The natural language toolkit," in Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2004, p. 31.

[30] D. Britz, "Understanding convolutional neural networks for NLP," http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/, 2015, accessed: 01-01-2017.

[31] L. F. Cranor, P. Guduru, and M. Arjula, "User interfaces for privacy agents," ACM Transactions on Computer-Human Interaction (TOCHI), vol. 13, no. 2, pp. 135–178, 2006.

[32] P. G. Kelley, J. Bresee, L. F. Cranor, and R. W. Reeder, "A nutrition label for privacy," in Proceedings of the 5th Symposium on Usable Privacy and Security. ACM, 2009, p. 4.

[33] J. Y. Tsai, S. Egelman, L. Cranor, and A. Acquisti, "The effect of online privacy information on purchasing behavior: An experimental study," Information Systems Research, vol. 22, no. 2, pp. 254–268, 2011.

[34] A. Razaghpanah, R. Nithyanand, N. Vallina-Rodriguez, S. Sundaresan, M. Allman, and C. K. P. Gill, "Apps, trackers, privacy, and regulators," in 25th Annual Network and Distributed System Security Symposium, NDSS 2018, 2018.

[35] A. Aktypi, J. Nurse, and M. Goldsmith, "Unwinding Ariadne's identity thread: Privacy risks with fitness trackers and online social networks," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017.

[36] E. Steel and A. Dembosky, "Health apps run into privacy snags," Financial Times, 2013.

[37] Norwegian Consumer Council, "Appfail report: Threats to consumers in mobile apps," Norwegian Consumer Council, Tech. Rep., 2016.

[38] E. M. Caudill and P. E. Murphy, "Consumer online privacy: Legal and ethical issues," Journal of Public Policy & Marketing, vol. 19, no. 1, pp. 7–19, 2000.

[39] R. Pitofsky, S. Anthony, M. Thompson, O. Swindle, and T. Leary, "Privacy online: Fair information practices in the electronic marketplace," Statement of the Federal Trade Commission before the Committee on Commerce, Science and Transportation, United States Senate, Washington, DC, 2000.

[40] J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, pp. 159–174, 1977.

[41] TRUSTe, "TRUSTe & Disconnect Introduce Visual Icons to Help Consumers Understand Privacy Policies," http://www.trustarc.com/blog/2014/06/23/truste-disconnect-introduce-visual-icons-to-help-consumers-understand-privacy-policies/, June 2013, accessed: 07-01-2017.
[42] R. Ramanath, F. Liu, N. M. Sadeh, and N. A. Smith, "Unsupervised alignment of privacy policies using hidden markov models," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, 2014, pp. 605–610.

[43] C. Schneider, "10 reasons why AI-powered, automated customer service is the future," https://www.ibm.com/blogs/watson/2017/10/10-reasons-ai-powered-automated-customer-service-future, October 2017, accessed: 10-01-2017.

[44] B. P. Knijnenburg, M. C. Willemsen, and S. Hirtbach, "Receiving recommendations and providing feedback: The user-experience of a recommender system," in International Conference on Electronic Commerce and Web Technologies. Springer, 2010, pp. 207–216.

[45] H. T. Dang, D. Kelly, and J. J. Lin, "Overview of the TREC 2007 question answering track," in TREC, vol. 7, 2007, p. 63.

[46] H. Llorens, N. Chambers, N. Mostafazadeh, J. Allen, and J. Pustejovsky, "QA TempEval: Evaluating temporal information understanding with QA."

[47] M. Wang, N. A. Smith, and T. Mitamura, "What is the jeopardy model? A quasi-synchronous grammar for QA," in EMNLP-CoNLL, vol. 7, 2007, pp. 22–32.

[48] S. Robertson, "Understanding inverse document frequency: On theoretical arguments for IDF," Journal of Documentation, vol. 60, pp. 503–520, 2004.

[49] M. Beaulieu, M. Gatford, X. Huang, S. Robertson, S. Walker, and P. Williams, "Okapi at TREC-5," NIST Special Publication SP, pp. 143–166, 1997.

[50] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, "Visual instance retrieval with deep convolutional networks," arXiv preprint arXiv:1412.6574, 2014.

[51] M. Feng, B. Xiang, M. R. Glass, L. Wang, and B. Zhou, "Applying deep learning to answer selection: A study and an open task," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015, 2015, pp. 813–820.

[52] M. Tan, C. dos Santos, B. Xiang, and B. Zhou, "Improved representation learning for question answer matching," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.

[53] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422–446, 2002.

[54] Facebook, "Wiki word vectors," https://fasttext.cc/docs/en/pretrained-vectors.html, 2017, accessed: 2017-10-01.

[55] Cambridge English Language Assessment, Cambridge English Proficiency Certificate of Proficiency in English CEFR level C2, Handbook for Teachers. University of Cambridge, 2013.

[56] J. Goodman, "Legal technology: The rise of the chatbots," https://www.lawgazette.co.uk/features/legal-technology-the-rise-of-the-chatbots/5060310.article, 2017, accessed: 2017-04-27.

[57] A. Levy, "Microsoft CEO Satya Nadella: For the future of chat bots, look at the insurance industry," http://www.cnbc.com/2017/01/09/microsoft-ceo-satya-nadella-bots-in-insurance-industry.html, 2017, accessed: 2017-04-27.

[58] F. Liu, N. L. Fella, and K. Liao, "Modeling language vagueness in privacy policies using deep neural networks," in 2016 AAAI Fall Symposium Series, 2016.

[59] T. Hwang, "The laws of (legal) robotics," Robot, Robot & Hwang LLP, Tech. Rep., 2013.

[60] J. O. McGinnis and R. G. Pearce, "The great disruption: How machine intelligence will transform the role of lawyers in the delivery of legal services," Fordham L. Rev., vol. 82, pp. 3041–3481, 2014.

[61] F. Pasquale and G. Cashwell, "Four futures of legal automation," UCLA L. Rev. Discourse, vol. 63, p. 26, 2015.

[62] L.-E. Holtz, H. Zwingelberg, and M. Hansen, "Privacy policy icons," in Privacy and Identity Management for Life. Springer, 2011, pp. 279–285.

[63] S. Zimmeck, Z. Wang, L. Zou, R. Iyengar, B. Liu, F. Schaub, S. Wilson, N. Sadeh, S. M. Bellovin, and J. Reidenberg, "Automated analysis of privacy requirements for mobile apps," in 24th Annual Network and Distributed System Security Symposium, NDSS 2017, 2017.

[64] F. Liu, S. Wilson, F. Schaub, and N. Sadeh, "Analyzing vocabulary intersections of expert annotations and topic models for data practices in privacy policies," in 2016 AAAI Fall Symposium Series, 2016.

[65] K. M. Sathyendra, F. Schaub, S. Wilson, and N. Sadeh, "Automatic extraction of opt-out choices from privacy policies," in 2016 AAAI Fall Symposium Series, 2016.

[66] J. Rao, H. He, and J. Lin, "Noise-contrastive estimation for answer selection with deep neural networks," in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ser. CIKM '16. New York, NY, USA: ACM, 2016, pp. 1913–1916.

Appendix A: Full Classification Results

We present the classification results at the category level for the Segment Classifier and at 15 selected attribute levels, using the hyperparameters of Table 1.

Classification results at the category level for the Segment Classifier
Label  Prec.  Recall  F1  Top-1 Prec.  Support
Data Retention 0.83 0.66 0.71 0.68 88
Data Security 0.88 0.83 0.85 0.79 201
Do Not Track 0.94 0.97 0.95 0.88 16
1st Party Collection 0.79 0.79 0.79 0.79 1211
Specific Audiences 0.96 0.94 0.95 0.93 156
Introductory/Generic 0.81 0.66 0.70 0.75 369
Policy Change 0.95 0.84 0.88 0.93 112
Non-covered Practice 0.76 0.67 0.70 0.60 280
Privacy Contact Info 0.90 0.85 0.87 0.88 137
3rd Party Sharing 0.79 0.80 0.79 0.82 908
Access, Edit, Delete 0.89 0.75 0.80 0.87 133
User Choice/Control 0.74 0.74 0.74 0.69 433
Average 0.85 0.79 0.81 0.80

Classification results for attribute: change-type
Label Prec. Recall F1 Support
privacy-relevant-change 0.78 0.76 0.77 77
unspecified 0.79 0.76 0.76 90
Average 0.78 0.76 0.76

Classification results for attribute: notification-type
Label Prec. Recall F1 Support
general-notice-in-privacy-policy 0.80 0.77 0.78 76
general-notice-on-website 0.64 0.62 0.62 52
personal-notice 0.69 0.66 0.67 50
unspecified 0.81 0.72 0.75 24
Average 0.73 0.69 0.71
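The per-label precision, recall, F1, and support columns used throughout these tables can be recomputed from raw label/prediction pairs. The sketch below uses invented toy data (with two label names borrowed from the do-not-track table) and plain Python rather than the authors' evaluation code:

```python
# Illustrative sketch: per-label precision/recall/F1/support of the kind
# reported in the Appendix A tables, computed from gold labels and
# predictions. The toy inputs are invented and do not reproduce the
# paper's numbers.
from collections import Counter

def per_label_scores(y_true, y_pred):
    scores = {}
    support = Counter(y_true)  # number of gold instances per label
    for lab in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1, support[lab])
    return scores

y_true = ["honored", "honored", "not-honored", "not-honored", "not-honored"]
y_pred = ["honored", "not-honored", "not-honored", "not-honored", "not-honored"]
for lab, (p, r, f, n) in sorted(per_label_scores(y_true, y_pred).items()):
    print(f"{lab}: prec={p:.2f} recall={r:.2f} f1={f:.2f} support={n}")
```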
Classification results for attribute: do-not-track-policy Classification results for attribute: third-party-entity
Label Prec. Recall F1 Support Label Prec. Recall F1 Support
honored 1.00 1.00 1.00 8 collect-on-first-party-website-
not-honored 1.00 1.00 1.00 26 app 0.78 0.64 0.68 113
Average 1.00 1.00 1.00 receive-shared-with 0.87 0.87 0.87 843
see 0.83 0.79 0.81 63
Classification results for attribute: security-measure track-on-first-party-website-app 0.75 0.86 0.79 107
unspecified 0.60 0.51 0.52 57
Label Prec. Recall F1 Support
Average 0.77 0.74 0.73
data-access-limitation 0.89 0.78 0.81 35
generic 0.84 0.83 0.83 102
privacy-review-audit 0.97 0.58 0.62 13 Classification results for attribute: access-type
privacy-security-program 0.87 0.69 0.73 31 Label Prec. Recall F1 Support
secure-data-storage 0.82 0.64 0.69 17
secure-data-transfer 0.91 0.80 0.84 26 edit-information 0.65 0.62 0.63 172
secure-user-authentication 0.97 0.58 0.63 12 unspecified 0.98 0.64 0.71 14
view 0.55 0.53 0.53 47
Average 0.90 0.70 0.74
Average 0.73 0.60 0.62
Classification results for attribute: personal-information-type
Label Prec. Recall F1 Support Classification results for attribute: audience-type
computer-information 0.84 0.80 0.82 88 Label Prec. Recall F1 Support
contact 0.90 0.89 0.90 342
cookies-and-tracking-elements 0.95 0.92 0.94 272 californians 0.98 0.97 0.98 60
Classification results for attribute: personal-information-type

Label  Prec.  Recall  F1  Support
demographic  0.93  0.90  0.92  86
financial  0.89  0.86  0.87  99
generic-personal-information  0.82  0.79  0.80  441
health  1.00  0.56  0.61  8
ip-address-and-device-ids  0.93  0.93  0.93  104
location  0.88  0.88  0.88  107
personal-identifier  0.67  0.61  0.63  31
social-media-data  0.73  0.84  0.78  23
survey-data  0.77  0.86  0.81  22
unspecified  0.71  0.70  0.71  456
user-online-activities  0.80  0.82  0.81  224
user-profile  0.79  0.68  0.72  96
Average  0.84  0.80  0.81

Classification results for attribute: user-type

Label  Prec.  Recall  F1  Support
children  0.98  0.97  0.97  161
europeans  0.97  0.95  0.96  23
Average  0.98  0.97  0.97

Classification results for attribute: purpose

Label  Prec.  Recall  F1  Support
additional-service-feature  0.75  0.76  0.75  374
advertising  0.92  0.91  0.92  286
analytics-research  0.88  0.86  0.87  239
basic-service-feature  0.76  0.73  0.74  401
legal-requirement  0.92  0.91  0.91  79
marketing  0.86  0.83  0.84  312
merger-acquisition  0.95  0.96  0.95  38
personalization-customization  0.79  0.80  0.80  149
service-operation-and-security  0.81  0.77  0.79  200
unspecified  0.72  0.68  0.70  249
Average  0.84  0.82  0.83

Classification results for attribute: choice-scope

Label  Prec.  Recall  F1  Support
both  0.61  0.53  0.54  71
collection  0.74  0.68  0.70  302
first-party-collection  0.63  0.55  0.56  109
first-party-use  0.80  0.68  0.71  236
third-party-sharing-collection  0.81  0.60  0.64  98
third-party-use  0.57  0.51  0.50  60
unspecified  0.55  0.55  0.55  76
use  0.62  0.55  0.56  140
Average  0.67  0.58  0.59

Classification results for attribute: action-first-party

Label  Prec.  Recall  F1  Support
collect-in-mobile-app  0.84  0.75  0.79  68
collect-on-mobile-website  0.58  0.54  0.56  11
collect-on-website  0.65  0.65  0.65  739
unspecified  0.61  0.60  0.60  294
Average  0.67  0.64  0.65

Classification results for attribute: choice-type

Label  Prec.  Recall  F1  Support
browser-device-privacy-controls  0.89  0.95  0.92  171
dont-use-service-feature  0.69  0.65  0.67  213
first-party-privacy-controls  0.75  0.62  0.66  71
opt-in  0.78  0.81  0.79  406
opt-out-link  0.82  0.74  0.77  167
opt-out-via-contacting-company  0.87  0.81  0.84  127
third-party-privacy-controls  0.82  0.62  0.66  99
unspecified  0.65  0.54  0.56  117
Average  0.78  0.72  0.73

Classification results for attribute: does-does-not

Label  Prec.  Recall  F1  Support
does  0.82  0.93  0.86  1436
does-not  0.82  0.93  0.86  200
Average  0.82  0.93  0.86

Classification results for attribute: retention-period

Label  Prec.  Recall  F1  Support
indefinitely  0.45  0.48  0.47  8
limited  0.74  0.75  0.75  27
stated-period  0.94  0.94  0.94  10
unspecified  0.82  0.77  0.77  41
Average  0.74  0.74  0.73

Classification results for attribute: identifiability
Label Prec. Recall F1 Support
aggregated-or-anonymized 0.89 0.89 0.89 284
identifiable 0.81 0.81 0.81 492
unspecified 0.63 0.63 0.63 98
Average 0.77 0.78 0.77
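The scores in these tables are per-label precision, recall, and F1 over the annotated segments, with an Average row per attribute. As a minimal, illustrative sketch (not the paper's evaluation code; the function name and data layout are ours), per-label metrics and an unweighted macro average for multi-label classification can be computed as follows:

```python
from collections import defaultdict


def per_label_metrics(true_sets, pred_sets, labels):
    """Per-label precision/recall/F1 for multi-label classification.

    true_sets / pred_sets: one set of labels per segment.
    Returns ({label: (prec, recall, f1)}, macro_average_tuple).
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(true_sets, pred_sets):
        for lab in labels:
            if lab in p and lab in t:
                tp[lab] += 1          # correctly predicted
            elif lab in p:
                fp[lab] += 1          # predicted but absent
            elif lab in t:
                fn[lab] += 1          # present but missed
    scores = {}
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    # Unweighted (macro) average over labels.
    macro = tuple(sum(s[i] for s in scores.values()) / len(labels)
                  for i in range(3))
    return scores, macro
```

For example, with true labels [{"limited"}, {"indefinitely"}, {"limited"}] and predictions [{"limited"}, {"limited"}, set()], "limited" scores 0.5/0.5/0.5 and "indefinitely" scores 0/0/0.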

Appendix B: Applications’ Screenshots


In this appendix, we first show screenshots of PriBot’s
web app, answering questions about multiple companies
(Fig. 12 to Fig. 17). Next, we show screenshots from our
web application for navigating the results produced by
Polisis (Fig. 18 to Fig. 20). These apps are available at
https://pribot.org.

Fig. 12: The first answer from our chatbot implementation of PriBot about third-party sharing in the case of Bose.com. Answers are annotated with a header mentioning the high-level category (e.g., Context of sharing with third parties). The confidence metric is also highlighted in the interface.

Fig. 13: The first answer about data retention in the case of Medium. Notice the semantic matching in the absence of common terms. Notice also that only one answer is shown, as it is the only one with high confidence; hence, the user is not distracted by irrelevant answers.

Fig. 14: The answer given the question "do you bla bla bla", showcasing the power of the confidence metric, which accounts for unknown words in the question and relays that particular reason to the user.
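The confidence metric itself is defined in the paper's body. Purely as a hypothetical illustration of the "unknown words" behavior (the function, its name, and the weighting are our assumptions, not the paper's formula), one could discount a question-answer similarity score by the share of question words the model recognizes:

```python
def discounted_confidence(similarity, question_tokens, vocab):
    """Hypothetical sketch, NOT the paper's actual formula.

    Scales a question-answer similarity score by the fraction of
    question tokens found in the model's vocabulary, so a question
    made mostly of unknown words ("do you bla bla bla") yields a
    low confidence that can be explained to the user.
    """
    if not question_tokens:
        return 0.0
    known = sum(1 for tok in question_tokens if tok in vocab)
    return similarity * (known / len(question_tokens))
```

With an assumed vocabulary {"do", "you", "share", "data"}, the question ["do", "you", "bla", "bla", "bla"] retains only 2/5 of its raw similarity score.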

Fig. 15: This case illustrates the scenario when PriBot finds no answer in the policy and explains the reason based on the automatically detected high-level category (explanations are preset in the application).

Fig. 16: This case illustrates the power of subword embeddings. Given a significantly misspelled question "how much shoud i wait for you to delet my data", PriBot still finds the most relevant answer.
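The subword embeddings in question are fastText-style: a word's vector is built from the vectors of its character n-grams. The sketch below (illustrative, using fastText's default n-gram lengths of 3 to 6) shows why a misspelling like "delet" still lands near "delete":

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract fastText-style character n-grams for a word.

    The word is wrapped in boundary markers '<' and '>', and all
    substrings of length n_min..n_max are collected. Because a
    word's embedding is composed from its n-gram vectors, a
    misspelling that shares most n-grams with the correct word
    also shares most of its embedding.
    """
    wrapped = f"<{word}>"
    return {wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)}


# The misspelled "delet" and the correct "delete" share most of
# their n-grams, so their embeddings stay close despite the typo.
shared = char_ngrams("delet") & char_ngrams("delete")
```

Out-of-vocabulary or misspelled words thus still receive meaningful vectors, which is what lets PriBot match the garbled question to the retention-related answer.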

Fig. 17: This case, with the policy of Oyoty.com, illustrates how PriBot automatically accounts for discrepancies across segments (Sec. 7.1) by warning the user about them.

Fig. 18: We show a case where our web app visualizes the result produced by Polisis. The app shows the flow of the data being collected, the reasons behind its collection, and the choices given to the user in the privacy policy. The user can check the policy statements for each link by hovering over it.

Fig. 19: In this case, the security aspects of the policy are highlighted based on the labels extracted from Polisis. The
user has the option to see the related statement by expanding each item in the app.

Fig. 20: Here, the user is presented with the possible choices, automatically retrieved from Polisis.
