
Computer Law & Security Review 27 (2011) 45–52

available at www.sciencedirect.com

www.compseconline.com/publications/prodclaw.htm

The limits of privacy in automated profiling and data mining

Bart W. Schermer 1

abstract

Automated profiling of groups and individuals is a common practice in our information society. The increasing possibilities of data mining significantly enhance the abilities to carry out such profiling. Depending on its application, profiling and data mining may cause particular risks such as discrimination, de-individualisation and information asymmetries. In this article we provide an overview of the risks associated with data mining and the strategies that have been proposed over the years to mitigate these risks. From there we shall examine whether current safeguards that are mainly based on privacy and data protection law (such as data minimisation and data exclusion) are sufficient. Based on these findings we shall suggest alternative policy options and regulatory instruments for dealing with the risks of data mining, integrating ideas from the field of computer science and that of law and ethics.

Keywords: Automated profiling; Data mining; Privacy
© 2011 Dr. Bart W. Schermer. Published by Elsevier Ltd. All rights reserved.

1. Introduction

Profiling is the process of discovering correlations between data in databases that can be used to identify and represent a human or nonhuman subject (individual or group) and/or the application of profiles (sets of correlated data) to individuate and represent a subject or to identify a subject as a member of a group or category.2 Profiling is used for purposes ranging from anti-terrorism to direct marketing. Profiling relies heavily on data mining for its effectiveness. Data mining, or knowledge discovery in databases (KDD), is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.3 Over the past decades it has evolved from an experimental technology into an important instrument for private companies and institutions to help overcome the problem of information overload.

While the benefits of profiling and data mining for public and private sector institutions are significant, misuse or abuse of profiling may have negative consequences for individuals and society as a whole. Much has been written over the years about the possible negative effects of profiling and data mining.4 Often, the negative effects of profiling and data mining are framed as threats to (informational) privacy. Subsequently, proposed regulatory solutions are predominantly aimed at protecting privacy. While this approach might provide some measure of protection, it is questionable whether it is the most effective regulatory approach. It might be that a more targeted (regulatory) strategy is necessary to

1 Bart W. Schermer is an assistant professor at eLaw@Leiden, the centre for law in the information society of Leiden University. This article was written as part of the research project ‘Data mining without discrimination’. Bart wishes to thank the project team (Bart Custers, Toon Calders, Sicco Verwer and Tal Zarsky) for their valuable comments on the original draft.
2 Hildebrandt, M. (2008), Defining Profiling: A New Type of Knowledge? In: Profiling the European Citizen, Cross-Disciplinary Perspectives (Hildebrandt, M., Gutwirth, S., eds.), Springer Science, p. 17.
3 Frawley, W.J., Piatetsky-Shapiro, G., Matheus, C.J. (1992), Knowledge Discovery in Databases: An Overview. In: AAAI Magazine, Fall 1992, p. 58.
4 See for instance: Marx, G.T. (1985), The surveillance society: the threat of 1984-style techniques. In: The Futurist, June 21-6; Vedder, A. (1999), KDD: The challenge to individualism. In: Ethics and Information Technology, Volume 1, Number 4/December, 1999; Lyon, D. (2003), Surveillance as social sorting, privacy risk and digital discrimination, New York: Routledge; Custers, B. (2003), Effects of Unreliable Group Profiling by means of Data Mining. In: Lecture Notes in Computer Science, Volume 2843/2003, Springer; and Hildebrandt, M., Gutwirth, S. (2008), Profiling the European Citizen, New York: Springer Verlag.
0267-3649/$ – see front matter © 2011 Dr. Bart W. Schermer. Published by Elsevier Ltd. All rights reserved.
doi:10.1016/j.clsr.2010.11.009

address these issues. In this article we shall explore the risks associated with automated profiling and data mining and examine whether privacy is still a viable regulatory approach to mitigate these risks, or whether additional safeguards are necessary. The twofold problem statement to be explored in this article is therefore:

Is privacy an effective approach when it comes to dealing with possible negative effects of data mining?
Are there other strategies available that might prove more effective?

In this article we shall approach these questions from a multi-disciplinary perspective, since this article was written as part of a scientific project that focuses specifically on the cooperation between the fields of computer science and law.5

2. Data mining approaches in profiling

When we study the use of data mining for profiling purposes, we can distinguish between two distinct data mining approaches, viz., “descriptive data mining” and “predictive data mining”.

2.1. Descriptive data mining

The goal of descriptive data mining is to discover unknown relations between different data objects in a database. Descriptive data mining algorithms try to discover knowledge about a certain domain by determining commonalities between different objects and attributes. By discovering correlations between data objects in a dataset that is representative of a certain domain, we can gain insight into it.6 Descriptive data mining is interesting because more insight into a particular domain allows for better planning and allocation of resources in this realm. For instance, descriptive data mining on tax declaration forms may reveal a natural division of people into a limited number of groups based on their economic activity, financial situation and occupation. Also, descriptive data mining could expose relations between the different fields on the forms. It is important to note here that in descriptive data mining no “target” has been given. Therefore, descriptive data mining is often also called “unsupervised data mining”. Moreover, descriptive data mining describes relations rather than explains them. The existence of a correlation in a dataset does not necessarily mean that this relation will always occur in the real world, nor does it explain why the correlation is there. In other words, it is vital not to mix up correlation with causation when it comes to descriptive data mining.

2.2. Predictive data mining

As the name implies, the goal of predictive data mining is to make a prediction about events based on patterns that were determined using known information. When it comes to profiling, this means that information about an individual is mined in order to determine whether he/she fits the previously established profile. The training data in predictive mining therefore usually consists of a collection of annotated objects or individuals; e.g., demographics of a set of individuals together with an annotation indicating which of them are known terrorists. As such, predictive data mining is often called “supervised data mining”.

An often-used method in predictive data mining is classification. By using data mining algorithms called classifiers, we can establish whether a new object fits a previously established class. Classes are based on input fields that contain different attributes associated with the class. When an object shares attributes with the data objects belonging to the class, it is likely that it also belongs to this class. For instance, the class label “canary” would be made up of attributes such as {yellow, beak, wings, tail}. The more of these attributes an object has, the more likely it becomes that it is indeed a canary.

It is important to note that classification only determines the likelihood of something belonging to a certain class. When classification is based on few determining factors (e.g., something is either 1, or it is not 1), classification can be quite accurate. However, when a certain event or occurrence is dependent on a multitude of factors, accurate classification becomes more difficult, if not impossible. Also, many of the factors that determine the class of the object are often not present in the dataset. For instance, whether or not someone will grow up to be a criminal is dependent on so many factors that accurate classification (i.e., criminal or not criminal) is impossible. As such, the outcome of a classifier on a new object will always be of a probabilistic nature.

3. Ethical and legal issues associated with profiling and data mining

While profiling and data mining have proven to be very useful tools in dealing with the information overload, they are not without controversy. The reason for this is that there are a number of ethical and legal issues associated with profiling and data mining, some of which I shall describe in this Section. I shall distinguish between the risks associated with profiling, and issues that may arise due to an incorrect application of data mining in the context of profiling.

3.1. Risks associated with profiling

While data mining and profiling are for the most part conceptually framed as threats to informational privacy, it is our opinion that the strong association with privacy obfuscates the actual risks that profiling and data mining may pose to groups and individuals. In our view the most significant risks associated with profiling and data mining are discrimination, de-individualisation and information asymmetries. While Article 15 of the European Data protection directive (1995/46/EC) addresses the issue of automated profiling and the risks associated with it, it does not feature prominently in our current privacy discourse.

5 NWO Project Data mining without discrimination. For information see: http://www.onderzoekinformatie.nl/en/oi/nod/onderzoek/OND1337064/
6 Cocx (2009), Algorithmic tools for data-oriented law enforcement, PhD. Thesis, University of Leiden, p. 3.
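
The attribute-matching intuition behind the “canary” example in Section 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring rule (the fraction of class attributes that an object exhibits), the function names class_score and fits_class, and the 0.75 threshold are hypothetical choices made here; they are not taken from the article or from any particular data mining library, and real classifiers are trained on annotated data rather than defined by hand.

```python
# Toy illustration of attribute-based classification (Section 2.2).
# Assumption: the "likelihood" of class membership is approximated by the
# fraction of the class attributes that the observed object exhibits.

CANARY_ATTRIBUTES = frozenset({"yellow", "beak", "wings", "tail"})


def class_score(observed, class_attributes):
    """Return the fraction of class attributes present in the observed object."""
    if not class_attributes:
        raise ValueError("class_attributes must not be empty")
    shared = set(observed) & set(class_attributes)
    return len(shared) / len(class_attributes)


def fits_class(observed, class_attributes, threshold=0.75):
    """Return True when the score reaches the (hypothetical) threshold."""
    return class_score(observed, class_attributes) >= threshold


if __name__ == "__main__":
    bird = {"yellow", "beak", "wings", "tail", "sings"}
    cat = {"tail", "whiskers", "fur"}
    print(class_score(bird, CANARY_ATTRIBUTES))  # 1.0
    print(class_score(cat, CANARY_ATTRIBUTES))   # 0.25
```

The score expresses a degree of fit, not a certainty: an object that exhibits every listed attribute scores 1.0 but need not be a canary, which is why the outcome of a classifier on a new object remains probabilistic.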

3.1.1. Discrimination

Classification and division are at the heart of (predictive) data mining. As such, discrimination is part and parcel of profiling and data mining. However, there are situations where discrimination is considered unethical and even illegal. This can occur for instance when a data mining exercise is focussed on characteristics such as ethnicity, gender, religion or sexual preference. But even without a prior desire to judge people on the basis of particular characteristics, there is the risk of inadvertently discriminating against particular groups or individuals. The reason for this is that predictive data mining algorithms may “learn” to discriminate on the basis of biased data used to train the algorithm.

A classifier must be trained in order to classify data. When the training data is contaminated (for instance because it discriminates against a particular group), the classifier will learn to classify in a biased way, strengthening discriminatory effects. Such cases occur naturally when the decision process leading to the labels was biased. An example in the area of law enforcement may help to explain this. When police officers have targeted an ethnic minority disproportionately in the past based on their own bias, it is likely that these minorities will feature more prominently in crime statistics. If these crime statistics are then used as training data for a classifier, chances are high that the classifier will learn that there is a strong correlation between ethnicity and crime. This in turn will lead to discriminating results that can constitute the basis for future discrimination. This effect is further strengthened by the fact that a classifier will most likely not have access to all the important factors on which to base a prediction because, e.g., they are missing in the data. Therefore, the factors that are present in the data gain importance and will weigh even more heavily in the prediction than they did in the input data.

3.1.2. De-individualisation

In many cases data mining is in large part concerned with classification and thus there is the risk that persons are judged on the basis of group characteristics rather than on their own individual characteristics and merits.7 Group profiles usually contain statistics and therefore the characteristics of group profiles may be valid for the group and for individuals as members of that group, though not for individuals as such. An example may illustrate this. For instance when people in Amsterdam are 3% more criminal than people in the rest of the Netherlands, this characteristic goes for the group (i.e., people in Amsterdam), for the individuals as members of that group (i.e., randomly chosen people living in Amsterdam), but not for the individuals as such (i.e., for John, Mary and William who all live in Amsterdam). When individuals are judged by group characteristics they do not possess as individuals, this may strongly influence the advantages and disadvantages of using group profiles.8 Apart from the negative effects group profiling may have on individuals, group profiling can also lead to stigmatisation of group members. Moreover, divisions into groups can damage societal cohesion. When group profiles, whether correct or not, become public knowledge, people may start treating each other accordingly. For instance, when people start believing that citizens of Amsterdam are more criminal, people may start to react and communicate with more suspicion towards citizens of Amsterdam, regardless of the correctness of such a profile.

3.1.3. Information asymmetries

Data mining can lead to valuable insights for those parties employing it. When data mining is aimed at gaining more insight into individuals or groups, we encounter the problem of information asymmetry. Information asymmetries may influence the level playing field between government and citizens, and between businesses and consumers, upsetting the current balance of power between different parties.

In the context of the relation between government and citizens, information asymmetries can affect individual autonomy. If data mining indeed yields actionable knowledge, the government will have more power. Moreover, the fear of strong data mining capabilities on the part of the government may “chill” the willingness of people to engage in, for instance, political activities, out of fear of being watched. For this panoptic fear to materialise, a data mining application does not even have to be effective.9

Information asymmetries may upset the level economic playing field between consumers and businesses. Furthermore, there are instances where data mining can aid in making decisions about consumers that are considered unwanted, unethical or illegal. An example of this would be excluding particular individuals or groups from goods and services based on their ethnicity.10

The issue of information asymmetry is exacerbated by the limited transparency of data mining. Since data mining (in particular predictive data mining) is used to make decisions about groups and individuals, people will be affected by data mining exercises. However, for the most part it will be unclear to persons why a particular decision has been made and on what grounds. This could lead to a sense of helplessness, a problem that is compounded by the fact that it is difficult to seek redress from automated decision processes. Professor of law at the George Washington University Law School, Daniel Solove, has likened this situation to that of Josef K. in Kafka’s Der Prozess.11

3.2. Application issues

The risks described above may manifest themselves regardless of the fact that the data mining was applied correctly in

7 Vedder, A. (1999), KDD: The challenge to individualism. In: Ethics and Information Technology, Volume 1, Number 4/December, 1999.
8 Custers, B.H.M. (2010), Data Mining with Discrimination Sensitive and Privacy Sensitive Attributes. Proceedings of ISP 2010, International Conference on Information Security and Privacy, 12–14 July 2010, Orlando, Florida (forthcoming).
9 Schermer, B.W. (2007), Software Agents, Surveillance, and the Right to Privacy: a Legislative Framework for Agent-enabled Surveillance, PhD. Thesis, Leiden University, p. 42.
10 Zarsky, T.Z. (2006), Online Privacy, Tailoring, and Persuasion. In: Privacy and technologies of Identity, a cross disciplinary conversation, Strandburg, K., Stan Raicu, D. (eds), Chapter 12, pp. 209–224, Springer, 2006.
11 Solove, D.J. (2004), The Digital Person: Technology and Privacy in the Information Age. New York: New York University Press, p. 47.

a technical sense. But as we have seen in the example of discrimination, when data mining is applied incorrectly, the risks associated with profiling and data mining may be strengthened. Therefore we shall discuss several common application issues associated with data mining that may pose additional risks to groups and individuals.

3.2.1. Accuracy and reliability

The success of a data mining exercise is dependent on the quality of the raw data being mined. If the data is inaccurate, the results will also be inaccurate. This is true for both descriptive and predictive data mining. In the area of predictive data mining, issues with accuracy and reliability are particularly problematic, given the fact that the results of a predictive data mining exercise are oftentimes used to make (automated) decisions about individuals and/or groups. But even if the raw data is free of errors, accuracy and reliability remain an issue.

In particular there is the problem of “false positives” and “false negatives.” This means that people who in fact do not fit the class are fitted in the class (a false positive), or people who fit the class are left out (a false negative). False positives and false negatives occur for various reasons, one being that not all information is available. For instance, in the example of the “canary classification”, the presence of attributes such as {yellow, beak, wings, tail} was a strong indication that an object could be classified as a canary. However, a duckling is also yellow and has a beak, wings and a tail, making it a false positive in our scenario.

3.2.2. Causation versus correlation

The goal of data mining is to find implicit and previously unknown relations between data. As such, data mining yields new knowledge about a given problem space. In descriptive data mining, this knowledge is based on the correlation between certain objects and attributes. However, while data mining can establish that there is a relationship between certain objects and attributes, it does not explain why this relationship exists. As such, it is important that we do not mistake correlation for causation.12 For instance, data mining may reveal that burglars use cannabis more often than other people. While it is tempting to point to cannabis as a cause for burglary, the data does not support such a conclusion.

Data mining experts warn that a correlation between certain facts does not imply a causal relation, nor does it explain why there is a correlation between facts.13 Such a warning needs to be heeded, given the fact that data mining efforts and statistics might provide input for policymaking. Since the goal of policymaking is addressing the causes of an issue rather than fighting symptoms, it is important to know more about the background, emergence and causation of certain events. In particular unguided descriptive data mining is less suited, if not unfit, for the discovery of this information.13

3.2.3. Data dredging

In unguided descriptive data mining we look for correlations in the data without using a pre-defined working hypothesis. Dependent on the size of the dataset and the “confidence interval” used to determine correlations, our data mining exercise will yield certain results. While these results might indeed be significant, there is also a chance that they are completely random. So, the results we find (and the hypothesis we formulate on the basis of these results) need to be validated to exclude the possibility that the correlation is in fact totally random. It is important to note that the dataset used in constructing the hypothesis cannot be used for the validation of that same hypothesis, since this data is by default “biased” towards supporting that particular hypothesis. The risk in data dredging is that we present the results of the initial data mining exercise as facts, rather than as a hypothesis that needs to be tested further. An example to illustrate this point is the following: suppose we want to test if some people have special mental abilities and are able to predict the future. We set up an experiment in which 1000 volunteers need to predict which sequence of heads and tails will emerge from flipping an unbiased coin 10 times. Statistics teaches us that if a participant makes random guesses, he or she still has a chance of 1 out of 1024 of getting the sequence fully correct. As such, we can expect that on average roughly one participant (1000/1024 ≈ 0.98) will predict the outcome of all 10 tosses correctly. Of course, this outcome does not at all support the claim that this participant has special mental abilities. First the participant needs to confirm his perfect scores in new controlled experiments. To link this example to data mining: one could consider the outcome of the prediction task as the input data. There are 1000 hypotheses being tested: “participant 1 can predict everything correctly” through “participant 1000 can predict everything correctly”. As such the outcome of the experiment is used to generate the “pattern”, e.g., “participant 174 can predict everything correctly”, that will be presented to the user. It is instructive to see that even a statistical test for significance will confirm the validity of this pattern.

4. The problem with privacy

In the previous section we have discussed several risks associated with data mining in the context of profiling. The current approach to mitigating these risks is by invoking the right to (informational) privacy. At the core of informational privacy is the notion that data subjects have the right and the ability to shield personal data from third parties. Given the societal importance of the free flow of information, this right to informational privacy is balanced by the legitimate interests of third parties to process personal data. In Europe, the Data Protection Directive sets forth the rules under which data controllers are allowed to process personal data on individuals. A data controller is not allowed to process personal data unless he has a legitimate purpose (see article 7 of the Directive). Furthermore, the Directive sets forth some core data protection principles such as data quality, security safeguards and data minimisation.

Personal data protection is important and has provided individuals with protection from misuse and abuse of their

12 See for an in depth analysis of causation: Pearl, J. (2009), Causality: Models, Reasoning, and Inference, Cambridge: Cambridge University Press (2nd edition).
13 Cocx (2009), Algorithmic tools for data-oriented law enforcement, PhD. Thesis, University of Leiden, p. 143.

personal data. There are however a number of closely linked issues of a technological, legal and societal nature with this approach.

From a data mining perspective the primary issue with informational privacy is that by limiting the use of (particular) personal data, we run the risk of reducing the accuracy of the data mining exercise.10 So while privacy may be protected, the utility of the data mining exercise is reduced. Informational privacy not only decreases the usefulness of data mining, it also increases the probability of false positives and false negatives. Moreover, data exclusion and data minimisation do not necessarily provide sufficient protection from the risks of profiling, because it is often still possible to uniquely identify particular persons from the data, even after key attributes such as name, address and social security number have been removed. This indirect identification is troublesome from a legal perspective, since personal data protection law is so dependent on the notion of personal data. Also, it is difficult to determine which pieces of data can lead to the identification of a person once combined. But even in those instances where identification is impossible, there is no guarantee that individuals are not adversely influenced. Take for instance the practice of behavioural targeting in online advertising. Actual identification is not necessary for targeting; merely being able to individualise users on the basis of their IP-address or their browser cookies is enough for the targeting to be effective. Privacy and data protection law provide little protection against this issue, since they are primarily focussed on a priori privacy protection through the minimisation of data.14

Invoking the right to informational privacy may even be counterproductive in some cases. New research suggests that this is the case for anti-discrimination.15 In order to counter discrimination, the Data Protection Directive prohibits the processing of sensitive attributes such as ethnicity or religion. However, it is questionable whether this approach is effective. In general it is to be expected that other demographic attributes in the dataset still allow for an accurate identification of the ethnicity of a person. For example, the ethnicity of a person might be strongly linked with the postal code of his residential area, leading to indirect racial discrimination based on postal code. This effect and its exploitation is often referred to as redlining, stemming from the practice of denying services such as mortgages or health care, or increasing their cost, to residents in certain, often racially determined, areas.8 Research shows that even after removing the sensitive attribute from the dataset discrimination persists.16 Since discrimination may also be based on underlying “neutral” attributes such as area code and income, we need a mechanism to determine whether this is actually the case. Removing sensitive attributes from the dataset may deny us the means to check ex post whether an algorithm has indeed discriminated against a certain group. The reason for this is that by removing sensitive attributes from the data, we merely remove the key indicators of discriminatory results, but not other, more indirect indicators that may lead to similarly discriminating results.

A final issue concerns user consent in relation to the transparency of automated profiling. While the data subject may give his or her consent for the processing of personal data and the use of automated profiling, it will likely be unclear to the data subject what the actual extent and impact of profiling will be on his person. As such, data subjects may underestimate the risks of automated profiling. Moreover, consenting to automated profiling may yield direct benefits for consumers (such as free goods or services), whereas the risks of automated profiling may be less clear and tangible. This raises the question to what extent the data subject can freely give his or her informed consent.

From the above considerations we may conclude that the concept of privacy and its application in data protection law does not provide adequate protection from the risks associated with automated profiling. Moreover, privacy law does not deal adequately with the application issues mentioned in this article. While the principle of data quality (article 6 of the Data Protection Directive) and the right not to be subjected to a decision solely based on the automated processing of data (article 15 of the Data Protection Directive) should provide some measure of protection against application issues, in practice their effect is limited. In part this can be attributed to the fact that the data mining community (adept at spotting application issues) and the legal community do not co-operate enough.

These technological and legal-technical issues highlight a key weakness in the current approach to data protection and informational privacy. Privacy and data protection law is based primarily on ex ante protection, but has little in the way of ex post protection mechanisms. This has everything to do with the nature of privacy as a mechanism for hiding. We can say that privacy in many cases is a means rather than an end in itself.17 The actual underlying goals of privacy protection (autonomy, equal treatment, economic equality) are protected through the right to privacy. However, once this layer of ex ante protection (i.e., the possibility to hide or shield information from observation) has been breached, there are few mechanisms for protecting the underlying interests. Because of the nature of information processing in today’s hyper-connected network society, this layer of ex ante protection is becoming weaker and weaker.

Moreover, a priori privacy protection also carries with it particular political issues. While privacy is a powerful meme in the political debate about fair information processing, it is also oftentimes seen as the antithesis of values that benefit from openness such as security, innovation and efficiency. Furthermore, given the fact that privacy is construed as an

14 Zarsky, T.Z. (2002), Mine Your Own Business! Making the Case for the Implications of the Data Mining of Personal Information in the Forum of Public Opinion. In: Yale Journal of Law and Technology, pp. 1–57.
15 Verwer and Calders (2010), Three Naive Bayes Approaches for Discrimination-Free Classification. In: Data Mining: special issue with selected papers from ECML-PKDD 2010; Springer.
16 Pedreschi, D., Ruggieri, S., and Turini, F. (2008), Discrimination-aware Data Mining. In: Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; Kamiran, F., Calders, T. (2009), Classification without discrimination, IEEE International Conference on Computer, Control & Communication (IEEE-IC4); IEEE press.
17 Schermer, B.W. (2007), Software Agents, Surveillance, and the Right to Privacy: a Legislative Framework for Agent-enabled Surveillance, PhD. Thesis, Leiden University, p. 198.

individual right it often loses out in the public debate, because it is positioned against the interests of society as a whole.18

5. The way forward

We have established that, while a useful mechanism for reducing potential risks of profiling and data mining, informational privacy by itself does not provide adequate protection. The current starting point of a priori privacy protection, based on access restrictions on personal information, is increasingly inadequate in a world of automated and interlinked databases and information networks. In most cases, the output of data mining is much more relevant, since this is the information that is being used for decision making. It thus seems useful to complement a priori access controls with a posteriori accountability.19 Currently there is very little accountability and oversight when it comes to data mining. For the most part this is because the data mining process is a black box for outsiders. The black box nature of data mining is compounded by the fact that the persons traditionally involved in assessing the legality of (business) processes (policymakers and legal professionals) are in large part unable to assess effectively the legality of data mining algorithms. In part this is because organisations are not open and transparent about their decision-making processes, but the fact that policymakers and legal professionals often lack the required knowledge of computer science and mathematics also plays a significant role.

5.1. Descriptive data mining

Since descriptive data mining is primarily used for decision support tasks and yields input for policymaking, it may (indirectly) affect the (legal) position of groups and individuals.20 While data mining professionals caution that data mining results are not always directly suited as input for policymaking, statistics and automated profiles created using descriptive data mining play an important role in policymaking and business decisions nonetheless. Measures should therefore be taken to identify and avoid errors that result from an incorrect application or interpretation of descriptive data mining. Problems with accuracy and reliability, mistaking correlation for causation, and data dredging are examples of application issues that need to be taken into account when it comes to descriptive data mining.

Above all, an increased awareness among data subjects and data controllers about the possibilities, limits and risks of data mining is necessary. One of the reasons for introducing article 15 in the Data Protection Directive is that human decision makers are inclined to attach too much weight to the results of automated decision-making tools, seeing their results as apparently objective and incontrovertible.21 The same pattern can be seen in politics; much, if not too much, weight is attached to data mining results and statistics. Therefore, the need for awareness also extends to policymakers.

While it would be beneficial to have clear-cut procedural mechanisms to verify and validate the results of data mining exercises used to support the political decision-making process, such mechanisms are hard if not impossible to design. Rather, we should look towards educating policymakers about the limits of data mining and the interpretation of results. Another approach might be to get a “second opinion” from data mining experts in collaboration with legal experts for data mining results that play a significant role in decision- or policymaking.

5.2. Predictive data mining

Predictive data mining is particularly suited for making decisions in concrete cases. In particular the practice of classification is used to make decisions about groups and individuals (for instance flagging someone as a potential terrorist, or denying someone a loan). Protection against the risks of predictive data mining should be primarily aimed at making the decision-making process more transparent, removing any possibilities for the inaccurate or incorrect application of predictive data mining algorithms, and detecting any illegal or unethical decisions (most likely a posteriori).

Consider, e.g., a bank that wants to use historical information on personal loans to learn models for predicting, for new loan applicants, the probability that they will default on their loan. It could very well be that this data shows that members of certain ethnic groups are more likely to default on their loan. Nevertheless, from an ethical and legal point of view it is unacceptable to use the ethnicity of a person to deny the loan to him or her, as this would constitute an infringement of anti-discrimination laws. In such cases, the ethnicity of a person is likely to be an information carrier rather than a distinguishing factor; people from a certain ethnic group are more likely to default on their loan because, e.g., the average level of education in this group is lower. In such a situation it is in general perfectly acceptable to use level of education for selecting loan candidates, even though this would lead to favouring one ethnic group over another. The bank could legally decide to split up the group of loan applicants according to their education level, and learn more fine-grained models for each of these groups separately. A prerequisite for this grouping or stratification approach is of course that the attribute education level is present in the dataset. When this attribute is not present in the dataset, because, e.g., it was not recorded, we are in a situation in which using ethnicity, directly or indirectly, will improve predictive accuracy. Currently, data mining algorithms for mining predictive models use accuracy as the single most important quality measure and criterion to be optimised. In cases as described above, this behaviour, however, may lead to discriminatory models.

18 Schermer, B.W. (2007), Software Agents, Surveillance, and the Right to Privacy: a Legislative Framework for Agent-enabled Surveillance, PhD. Thesis, Leiden University, p. 115.
19 Weitzner et al. (2006), Transparent Accountable Data Mining: New Strategies for Privacy Protection, MIT Technical Report.
20 The recent retrial and subsequent acquittal of Dutch murder suspect Lucia de Berk after serving six years in prison highlights the impact faulty (interpretation of) statistics may have on a person’s life.
21 COM (92) 422 final - SYN 287, 15 October 1992, p. 26.

5.2.1. Accountability

It must be noted that in individual cases it is almost impossible to prove discrimination; given the abundance of data available,
almost always it is possible to find ex post a neutral explanation why the person was denied a loan, marked as a terrorist, et cetera. Unless there is striking evidence, such as internal memos explicating the discriminatory procedure that was followed, data alone is insufficient. Therefore, the decision procedure itself should be part of the discrimination assessment.

There are two (complementary) approaches to enhance accountability in predictive data mining.22 The first is to improve detection of privacy and ethical sensitivities in data mining results ex ante. Fule and Roddick for instance have proposed a system that weighs the sensitivity of data processing and flags those algorithms that are most likely to contain ethical and privacy sensitivities.23 Schreuders has proposed a more generic approach that focuses on reviewing data mining and decision-making processes using a set of targeted questions.24

The second approach is to determine whether an algorithm has discriminated ex post. This is an area that is still in large part experimental. Ruggieri et al. suggest an approach tailored specifically towards detecting possible discrimination in data mining.25 In their approach, Ruggieri et al. use data of individuals enriched by the decision that was taken for these individuals to detect rules that specify subgroups in the data for which the decisions significantly differ from the overall decisions.

5.2.2. Transparency

Closely related to the issue of accountability is that of transparency. One of the key concerns regarding profiling and data mining is the lack of transparency and the subsequent limits of redress from automated decision making. Greater transparency is therefore important when it comes to the legitimate use of profiling and application of data mining. The European Data Protection Directive already contains a provision (article 15) that states that data subjects may not be subjected to automated decision-making processes that have legal effect or otherwise significantly influence their position. In the current privacy discourse however, article 15 plays but a minor role. Norwegian Data Protection expert Bygrave has argued that article 15 should be regarded as one of the core data protection principles.26 We agree with Bygrave on this point and argue further that more attention should be given to the information notice requirements of articles 10, 11 and 12 of the Directive. Apart from informing the data subject about the purpose of the information processing, data subjects should have more knowledge about the actual decision-making process underlying an automated decision.27 Article 12, paragraph a, third dash, of the Directive already states that data subjects should be informed about the logic and rationale behind the decision-making process, but as of yet the application of Article 12 in practice seems very limited. The legal duty to be transparent should carry over into the design of data mining algorithms and accompanying IT-infrastructures, i.e., “transparency by design”. In this area Weitzner et al., for instance, propose a system in which the user can verify certain data about him- or herself and check whether automated decision processes are indeed correct.19 Such a system of transparency supplemented by user control could, in large part, take away privacy fears that stem from the “Kafka metaphor”. Furthermore, it could help in reducing false positives. However, it must be noted that it does not take away any “Big Brother” effects (rather it could strengthen and enforce these effects as users are more aware, through confrontation, of the power of data mining), nor does transparency in itself legitimise the use of personal data.

5.2.3. Risk aware data mining algorithms

Apart from detecting data mining algorithms that might adversely affect groups and individuals, we must ensure that these risks are mitigated through the programming code itself. This application of Lessig’s idea of “Code as Law” can take various forms.28

The development of privacy-preserving data mining algorithms has been a field of research for quite some time now. The notion of privacy aware data mining algorithms, suggested amongst others by Agrawal and Srikant, uses methods such as the exclusion or masking of sensitive attributes to preserve privacy.29 However, as described above, what is likely more effective is to look beyond privacy (privacy being a means rather than an end in itself) and program algorithms to become “aware” of the underlying risks such as discrimination. In this approach we would need to instruct the system on which information it should (or should not) base its decisions.30 In the case of discrimination for instance, it is desirable to have a means to “tell” the algorithm that its predictions should be independent of sensitive attributes such as sex and ethnicity. We call such constraints “independency constraints”.31 By using these constraints, we can structure systems that are discrimination aware by design.

22 I use the term ‘accountability’ here in the sense of answerability.
23 Fule, P., Roddick, J.F. (2006), Detecting privacy and ethical sensitivity in data mining results. In: Proceedings of the 27th Australasian conference on Computer science, Volume 26, pp. 159-166.
24 Schreuders, E. (2001), Data mining, de Toetsing van Beslisregels en Privacy, ITeR reeks 48 (in Dutch).
25 Ruggieri, S., Pedreschi, D., and Turini, F. (2010), Integrating induction and deduction for finding evidence of discrimination. In: Artificial Intelligence & Law, 18:1-43; Springer.
26 Bygrave, L. (2001), Minding the machine: article 15 of the EC data protection directive and automated profiling. In: Computer Law & Security Report, 2001, Volume 17, pp. 17-24.
27 Groothuis, M. (2003), Beschikken en digitaliseren: over normering van de elektronische overheid, PhD Thesis, Leiden University, p. 69 (in Dutch).
28 Lessig, L. (2006), Code 2.0, New York: Perseus Book Group.
29 Agrawal, R., Srikant, R. (2000), Privacy-Preserving Data Mining. In: ACM SIGMOD Record, Volume 29, Issue 2 (June 2000), pp. 439-450.
30 Kamiran, F., Calders, T. (2009), Classification without discrimination, IEEE International Conference on Computer, Control & Communication (IEEE-IC4); IEEE press.
31 Kamiran, F., Calders, T., Pechenizkiy, M. (2009), Building Classifiers with Independency Constraints. In: IEEE ICDM Workshop on Domain Driven Data Mining. IEEE press 1.

6. Conclusions

While profiling and data mining are important tools for dealing with information overload, it is of vital importance that they are used in a responsible manner so as to avoid risks to groups and individuals. We have established that risks associated with descriptive and predictive data mining include
discrimination, de-individualisation and information asymmetries. Apart from these risks we have observed that application issues such as data dredging, issues with accuracy and reliability, and mistaking correlation for causation, can also negatively affect groups and individuals.

The traditional approach to mitigating these issues is invoking the right to privacy. In our opinion, such a strategy is no longer viable by itself. We have shown that traditional approaches rooted in the idea of ex ante privacy protection are by themselves insufficient for dealing with the negative consequences of automated profiling and data mining, and may even be counterproductive as they render detecting discrimination more difficult. In particular the notions of data minimisation and data exclusion are troublesome, as they may reduce the accuracy of data mining and may deny us the data necessary to detect discrimination in automated profiling. Furthermore, privacy law will not adequately protect us from the application issues mentioned in this article (viz., accuracy and reliability, data dredging and mistaking correlation for causation).

When it comes to additional and alternative strategies to mitigate the risks of automated profiling we must look towards mechanisms that increase the accountability (both through ex ante screening of data mining applications for possible risks and ex post checking of results) and transparency of automated profiling. The legal basis for implementing these mechanisms can be found in article 15 of the Data Protection Directive, which in our opinion should be featured far more prominently in the data mining and automated profiling debate.

To avoid the application issues mentioned we must raise the level of awareness among policymakers about the limits of automated profiling and data mining. Apart from raising awareness, mechanisms should be implemented that detect the improper use of data mining, profiling, and statistics in policymaking. Such a mechanism could be a “second opinion” or “sanity check” by a panel of data mining experts and legal professionals.

The last approach is to build algorithms and decision-making procedures in such a way that they minimise risk. This means going beyond the idea of ‘privacy by design’ by looking towards the underlying issues involved in a particular data mining and automated profiling process. The creation of independency constraints, for instance, is a way to minimise the risk of discrimination in automated profiling.

What becomes clear from these proposed solutions is that they require a strong interdisciplinary mindset and approach. Within our own project we found that the legal professionals and computer scientists use a totally different vocabulary. The most important recommendation is therefore to strengthen the ties between these different fields of science (computer science, law, ethics) in order to create necessary safeguards.

Bart W. Schermer (schermer@considerati.com) is an assistant professor at eLaw@Leiden, Centre for Law in the Information Society.
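
By way of illustration, the loan example of Section 5.2 and the ex post checks of Section 5.2.1 can be made concrete with a minimal Python sketch. It is not the method of Ruggieri et al. or of Kamiran and Calders; it only compares approval rates between groups, first overall and then within each education level. The attribute names (group, education, approved), the helper functions and the toy figures are all invented for this illustration.

```python
from collections import defaultdict


def approval_rates(records, key):
    """Share of approved applications for each value of attribute `key`."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        approved[record[key]] += int(record["approved"])
    return {value: approved[value] / totals[value] for value in totals}


def rate_gap(records, sensitive="group"):
    """Largest difference in approval rate between values of `sensitive`."""
    rates = approval_rates(records, sensitive)
    return max(rates.values()) - min(rates.values())


def stratified_gaps(records, stratum="education", sensitive="group"):
    """The same gap, computed separately within each stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[record[stratum]].append(record)
    return {name: rate_gap(rows, sensitive) for name, rows in strata.items()}


def make(group, education, n, n_approved):
    """Build n toy records, the first n_approved of them approved."""
    return [
        {"group": group, "education": education, "approved": i < n_approved}
        for i in range(n)
    ]


# Invented toy data (an assumption, not taken from the article):
# approval depends only on education (90% for high, 30% for low),
# but education is unevenly distributed over the two groups.
records = (
    make("A", "high", 80, 72) + make("A", "low", 20, 6)
    + make("B", "high", 20, 18) + make("B", "low", 80, 24)
)

print(f"overall gap between groups: {rate_gap(records):.2f}")
for name, gap in stratified_gaps(records).items():
    print(f"gap within education={name}: {gap:.2f}")
```

On this toy data the overall gap between the two groups is 0.36, while the gap within each education level is zero. That is the pattern one would expect when the sensitive attribute is an information carrier rather than a distinguishing factor. A gap that persists within strata would be a reason for closer scrutiny of the decision procedure, not in itself proof of discrimination.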