
Ethics and Information Technology 1: 275–281, 1999.

© 2000 Kluwer Academic Publishers. Printed in the Netherlands.

KDD: The challenge to individualism

Anton Vedder
Tilburg University, Faculty of Law / Eindhoven University of Technology, Faculty of Technology Management, The Netherlands

Abstract. KDD (Knowledge Discovery in Databases) confronts us with phenomena that can intuitively be
grasped as highly problematic, but are nevertheless difficult to understand and articulate. Many of these problems
have to do with what I call the “deindividualization of the person”: a tendency of judging and treating persons on
the basis of group characteristics instead of on their own individual characteristics and merits. This tendency will
be one of the consequences of the production and use of group profiles with the help of KDD. Current privacy
law and regulations, as well as current ethical theory concerning privacy, start from too narrow a definition of
“personal data” to capture these problems. In this paper, I introduce the notion of “categorical privacy” as a
starting point for a possible remedy for the failures of the current conceptions of privacy. I discuss some ways in
which the problems relating to group profiles definitely cannot be solved and I suggest a possible way out of these
problems. Finally, I suggest that it may take us a step forward if we would begin to question the predominance
of privacy norms in the social debate on information technologies and if we would be prepared to introduce
normative principles other than privacy rules for the assessment of new information technologies. If we do not
succeed in articulating the problems relating to KDD clearly, one day we may find ourselves in a situation where
KDD appears to have undermined the methodic and normative individualism which pervades the mainstream of
morality and moral theory.

Introduction

Knowledge discovery in databases (KDD) is the non-trivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus, 1991). The greatest opportunities of KDD are description and – foremost – prediction of behavior (Fayyad, Piatetsky-Shapiro and Smyth, 1996). The process of KDD focuses on finding understandable patterns, especially when working with large databases. In recent years, KDD has been acknowledged as an important set of techniques in analyzing data for purposes such as direct marketing and credit scoring. Other applications include checking for patterns in criminal behavior for forensic and judicial purposes and analyzing medical data or data about medical drug consumption in combination with demographic data to predict potential risk groups. The KDD process is usually divided into three phases: the data warehousing phase, the data mining phase, and the interpretation phase. Confusingly, the whole process of KDD, i.e., all three phases together, is often also referred to as “data mining.” It may be better to use the latter notion only for the middle phase of KDD. In the data warehousing phase, data is collected, enriched, checked, and coded. The data is analyzed in the data mining phase. Finally, the results are interpreted. During these phases, a search hypothesis is used to guide the […]

Much has been written already about the technical ins and outs of KDD. In this paper, I want to draw attention to another side of KDD. I will focus on ethical, legal, and public policy aspects. I will start by giving some preliminaries in order to delimit the phenomena and problems I want to discuss.

First, it should be noted that the views which I will put forward do not apply to all forms of KDD. Throughout this paper, I will discuss KDD only insofar as it involves the use of personal data. I include KDD insofar as it involves the use of personal data in combination with other data, but I will not address KDD involving no personal data at all. For instance, KDD focusing on production processes exclusively with the help of data about machines and materials will not be my concern here.

Secondly, by “personal data” I do not exclusively mean data immediately relating to individually identifiable persons. I refer to personal data in a broader sense, also including data which, at some time, was linked to individual persons, but at some other time was processed and became part of a set of anonymous and compiled data. These types of data originally contained or were accompanied by identifiers of individual persons. At some stage of aggregation or processing, the data was disconnected from these individual identifiers and was combined with an identifier of a group of individuals. These group identifiers can be group or class characteristics as diverse as a certain age, the ownership of a certain type of car, the residency in a certain area with a certain postal code, a certain position, characteristics indicating the use of a certain medicine, or a combination of some such characteristics.

I will argue that it is precisely the use of these group-linked data, and the production of generalizations and profiles of data or information subjects defined by such group characteristics, which can be highly problematic. However, I will first point out that current privacy law and regulations are based on a narrow definition of personal data. Next, I will draw attention to the serious social problems that may arise from the production and application of certain personal data in the broader sense. I will show that these problems cannot be captured on the basis of current conceptions of privacy in law, regulation, or even ethical theory for that matter. To address this, I will introduce the notion of “categorical privacy” as a starting point for a possible remedy for the defects of the current conceptions of privacy. To this, I will add some remarks about the ways in which the problems referred to certainly cannot be solved. I will close with the suggestion that our normative framework for the assessment of modern information technology stands in need of extension. The contemporary one-sidedness resulting from the preponderance of privacy norms, to my mind, should be counterbalanced by invoking additional normative principles, e.g. principles regarding social justice, equality, and fairness.

Personal data, law, and ethics

Personal data is often considered to be the exclusive kind of data eligible for protection by privacy law and privacy norms. Personal data, furthermore, is commonly defined as data and information relating to an identified or identifiable person. A clear illustration of this rather narrow starting point can be found in the highly influential European Directive 95/46/EC of the European Parliament and of the European Council of 24 October 1995 “on the protection of individuals with regard to the processing of personal data and on the free movement of such data.” With regard to the processing of personal data, the Directive lays down some basic principles. For the purposes of this paper, I will highlight some of these. It is important to notice that – as may be expected from the definition of personal data – most of these principles lean heavily on the idea that there is some kind of direct connection between a person and his or her data.

First, there are some principles regarding data quality. Personal data should only be collected for specified, explicit, legitimate purposes and should not be further processed in a way incompatible with these purposes. No excessive amounts of data should be collected, relative to the purpose for which the data is collected. Moreover, the data should be accurate and, if applicable, kept up to date. Every reasonable step must be taken to ensure that inaccurate or incomplete data is either rectified or erased. Also, personal data should be kept in a form which permits identification of data subjects for no longer than is necessary for the purpose for which the data was collected.

Secondly, some principles concern the legitimacy of personal data processing. If an individual has unambiguously given his or her consent, data processing is legitimate. Without such consent, there are only a few situations in which data processing is legitimate. For instance, personal data processing is legitimate if it is needed for the performance of a contract to which the data subject is a party, for compliance with a legal obligation, to protect a vital interest of the data subject, for the performance of a task which is carried out in the public interest, or for the purposes of the legitimate interests pursued by the controller or by third parties to whom the data is disclosed.

Thirdly, the data subject has some specific rights with regard to “his or her” personal data. Among these rights are the right of access (knowing what data is being stored and whether the data relating to the data subject is being processed), the right of rectification, the right to know to whom the data has been disclosed, and the right to object to the processing of data relating to the data subject.

The definitions and principles formulated in the European Directive are mirrored in the national privacy laws and regulations of the European Union countries, since a Directive must be implemented in national law and regulation. Therefore, the impact of the Directive’s definitions and principles should not be underestimated.

The Directive’s definitions and principles themselves certainly reflect ideas about informational privacy currently held amongst legal and ethical theorists. Sometimes, these theoretical views on informational privacy are not much more than implicit assumptions. However, things are different and more articulate where theorists define informational privacy as being in control of (the accessibility of) personal information, or where they indicate some kind of personal freedom, such as the preference freedom in the vein of John Stuart Mill’s individuality, as the ultimate point and key value behind privacy (see, for instance, Parent, 1983; Johnson, 1989). These theorists consider privacy to be mainly concerned with information relating to designated individuals. They also tend to advocate protective measures in terms of safeguarding an individual’s control and consent vis-à-vis certain dispersions of personal information.

Now, applying the narrow definition of personal data and protective measures like the Directive’s and some philosophers’ to the KDD process is not without difficulties. Of course, as long as the process involves personal data in the strict sense of data relating to an identified or identifiable individual, the principles apply without reservation. However, as soon as the data has ceased to be personal data in the strict sense, it is not at all clear how the principles should be applied. For instance, the right of rectification applies to the personal data in the strict sense itself; it does not apply to information derived from this data. The same goes for the requirement of consent. Once the data has become anonymous, or has been processed and generalized, an individual cannot exert any influence on the processing of the data at all. The rights and requirements make no sense regarding anonymous data and group profiles.

Deindividualization

KDD confronts us with a paradoxical situation. The data used and the generalizations and profiles created do not always qualify as personal data. Nevertheless, the ways in which the generalizations and profiles are applied may have a serious impact on the persons from whom the data was originally taken or, even more so, to whom the generalizations and profiles are eventually applied. In fact, since these generalizations and profiles are often used as if they were personal data while in fact they are not, the impact on individual persons is sometimes even stronger than with the use of “real” personal data. Let me elaborate on this.

First of all, where generalizations and profiles are used as a basis for formulating policies of public and private organizations, or where they just slip into the body of public knowledge, individuals are affected indirectly. Persons are judged and treated more and more as members of a group (i.e. the reference group that makes up the data or information subject) rather than as individuals with their own characteristics and merits. This consequence of KDD using or producing personal data in the broad sense may, at first sight, seem rather innocent. It loses, however, much of its innocent appearance where the information contained in the generalizations or profiles is of a sensitive nature, because it is typically susceptible to prejudice and taboo or because it can be used for selections in allocation procedures. So, for instance, information about persons having a certain probability of manifesting certain diseases, lifestyles, etc. may easily give rise to stigmatization and discrimination. The information may also be used for giving or denying access to provisions, like insurance, loans, or jobs.

Secondly, increasing the use and production of generalizations and profiles on the basis of personal information can lead to growing unfairness in social interaction. This is poignantly clear in the case of what I will call nondistributive generalizations and profiles, as opposed to distributive generalizations and profiles. Distributive generalizations and profiles assign certain properties to a data or information subject, consisting of a group of persons however defined, in such a way that these properties are actually and unconditionally manifested by all the members of that group. Distributive generalizations and profiles are put in the form of down-to-earth, matter-of-fact statements. Nondistributive generalizations and profiles, however, are framed in terms of probabilities, averages, and medians, or significant deviations from other groups. They are based on comparisons of members of the group with each other and/or on comparisons of one particular group with other groups. Nondistributive generalizations and profiles are, therefore, significantly different from distributive generalizations and profiles. The properties in nondistributive generalizations and profiles apply to individuals as members of the reference group, whereas these individuals taken as such need not in reality exhibit these properties. For instance, in a credit-scoring application a loan can be refused on the basis of the fact that an applicant belongs to a reference group, e.g. having a certain kind of job, which is the information subject of a nondistributive profile of a bad debtor, whereas the applicant himself is in fact an extremely trustworthy person who has not missed an instalment on a loan in his whole life. Or an applicant may be refused life insurance on the basis of a nondistributive generalization of certain health risks of the group (e.g. defined by a postal code) to which he or she happens to belong, whereas he or she is a clear exception to the average risks of his or her group. In all such cases, the individual is judged and treated on the basis of, coincidentally, belonging to the “wrong” category of persons.

Of course, these problematic consequences are not unique to KDD. As a matter of fact, they are inherent to many forms of matching and profiling, and even to certain forms of noncomputerized generalizations of personal data (Vedder, 1995, pp. 6–11, 105–114; Vedder, 1997). What is new with KDD, however, is the enormous scale on which data can be processed and generalizations and profiles can be produced. Relatively new, also, are the ever-growing possibilities of discovering hitherto unnoticed relationships between characteristics and features of persons, created by KDD. This, by the way, also creates ample opportunities for covering up or hiding the use of certain delicate pieces of information. On the basis of a statistical correlation between the ownership of a certain kind of car and belonging to a high-risk group for a certain disease, an insurance company could for instance allocate its health insurance according to the type of car owned by a candidate. The company would then be able to select candidates without asking or checking for their health condition and prospects; it would not arouse the suspicion of selecting on the basis of health criteria. This possibility may be used in countries where selection on the basis of health is forbidden for health insurance.

However this may be, the generalizations and profiles will be used more and more as a basis for policy-making by public and private organizations. In this way, they may ultimately give rise to a new social stratification. Although many uses of the products of KDD are morally acceptable, and even desirable, many other possible applications are at odds with commonly held values regarding the individuality of human persons. I think, therefore, that the time has come for a new orientation in data protection, or, perhaps even better, in the moral and legal assessment of information technology in general. I am not in a position to provide clear-cut solutions, but I will suggest a starting point for a discussion of these problems and indicate ways in which they definitely cannot be solved.

Categorical privacy

Most conceptions of individual privacy currently put forward in law, regulation, and ethical debate have one feature in common: not only do they assume that the personal data with which privacy is concerned originally contains statements about states of affairs or aspects accompanied by indicators of individual natural persons, but they also assume that the data as a result of processing continues to contain statements about states of affairs or aspects accompanied by identifiers of individual natural persons. This feature of current privacy conceptions has two important consequences: it makes it difficult to label the problematic aspects of using data abstracted from personal data and producing and applying group profiles and generalizations; it also makes it difficult to fathom the seriousness of these problems in practice.

It should be observed that group profiles and generalizations may occasionally be incompatible with respect for individual privacy and with laws and regulations regarding the protection of personal data, as it is commonly conceived of. For instance, distributive generalizations and profiles may sometimes be rightfully thought of as infringements of (individual) privacy when the individuals involved can easily be identified through a combination with other information available to the recipient or through spontaneous recognition.

In the case of nondistributive profiles and generalizations, however, the information remains attached to an information subject constituted by a group. It cannot be traced back to individual persons in any straightforward sense. The groups which are the information subjects of nondistributive profiles and generalizations can often only be identified by those who defined them for a special purpose. From the perspective of anyone other than the producers and certain users of the profiles and generalizations, the definition of the information subject will remain hidden because they do not know the specific purposes of the definition. If these others happen to discover the definition, they will probably regard it as arbitrarily chosen. Most importantly, however, the information contained in the profile or generalization envisages individuals as members of groups; it does not envisage the individuals as such. Supposing for the sake of argument that the profile or generalization has been produced in a methodically sound and reliable way, it only tells us some truth about individual members of those groups in a very qualified, conditional manner. Therefore, privacy rules and conventions, as they are traditionally conceived of, do not apply.

Regarding the privacy of the information subject, i.e. the reference group as a whole, one might think that perhaps we could be saved by a notion of collective privacy. However, collective privacy will not do the job properly. The notion of collective privacy is too easily associated with the concept of collective rights. The subjects of collective rights are groups or communities. In order to make sense of the idea of collective rights, these subjects are often treated as beings analogous to persons or moral agents, or at least as conglomerates having certain characteristics which cannot ultimately and exhaustively be explained by the input of the individual members. Furthermore, they are often thought to be structured or organized in some way so as to be able to exercise their rights or let their rights be advocated by vicarious agents (Hartney, 1991). All of these properties are out of the question as regards the reference groups of the profiles and generalizations we are considering. From the perspective of their members, these groups are mostly randomly defined. Their members do not have any special ties of loyalty among one another. They do not have organizational structures either. Therefore, they are not capable of taking decisions or acting in their quality of collectivities.

In order to remove the deficiencies of current conceptions of individual privacy as regards analytical and distinctive evaluative potential, we would be better served by something which I would prefer to call “categorical privacy”. I suggest that we conceive of categorical privacy as a value, in many respects similar to individual privacy, except that it relates to data or information to which two conditions apply: (1) the information was originally taken from the personal sphere of individuals, and – after aggregation and processing according to statistical methods – is no longer accompanied by identifiers of individual natural persons, but, instead, by identifiers of groups of persons; (2) when attached to identifiers of groups and when disclosed, the information is apt to carry with it the same kind of negative consequences for the members of those groups as it would for an individual natural person if the information were accompanied by identifiers of that individual.

Categorical privacy is strongly connected with individual privacy. The values which oppose infringements of individual privacy, such as personal autonomy, individuality, and certain social interests, equally oppose infringements of categorical privacy. Unlike collective privacy, however, categorical privacy has its point in respecting and protecting the individual rather than in respecting and protecting the group to which the individual belongs. Furthermore, the conception of categorical privacy presented here – just like many current conceptions of individual privacy – builds on a conventionally predefined conception of (information concerning) the personal sphere (Johnson, 1992). Categorical privacy, however, is different from its individual counterpart in that it draws attention to the attribution of generalized properties to members of groups, which may nevertheless result in the same effects as the attribution of particularized properties to individuals as such. In this respect, infringements of categorical privacy resemble stereotyping and wrongful discrimination on the basis of stereotypes (Harvey, 1991).

Infringements of categorical privacy cannot be dealt with in ways similar to those in which individuals are protected against possible infringements of individual informational privacy. The application of principles and rights of, for instance, rectification and consent to potential infringements of categorical privacy is to a large extent impossible. Even if it were possible, it would nevertheless be unacceptable for obvious reasons. First, as has been explained above, the reference group of the generalization or profile will only rarely be able to reach and enact collective decisions because of its lack of organizational structure and personal or social ties. Secondly, if one were to turn from the group as such to the individual members of the groups, then an individual’s possibility of refusal or opting out could be harmful to other members of the reference group as well as to the very person refusing to allow personal information to be used in producing the profile. For actual refusal will reduce the reliability of the profile or generalization, while, nevertheless, all members of the reference group, including the individual who opted out, are at risk of being judged and treated on the basis of just this profile or generalization with reduced reliability. Of course, the possibility of opting out may also, in some respects, benefit the members of the reference group. If, in the case of profiles or generalizations intended for application in selection procedures, only people with bad risks actually refuse the use of their information, this may turn out to be rather advantageous for the healthy and well-to-do. This, however, does not diminish the wrongfulness of, for instance, judging and treating persons on the basis of properties which they do not, if only with a decreased probability, instantiate.

Perhaps then the only way to protect individuals against the possible negative consequences of the use of generalizations and profiles based on personal information in the broad sense lies in a careful case-by-case assessment of the ways in which abstracted personal data, group profiles, and generalizations are in fact used and can be used. By meticulously investigating and evaluating these applications one may hope to find starting points for restrictions of the purposes for which these data, profiles, and generalizations may be produced and applied. An elaborate proposal concerning such acceptable and unacceptable purposes cannot be provided here. It is important, however, to keep in mind that solutions will not be found only in forbidding the production and application of profiles and generalizations for certain purposes. In many cases, it may be more appropriate to reconsider those purposes themselves. Sometimes it may be easier or even morally more desirable to do something about the social and economic arrangements that induce wrongful applications of information technologies like the ones used for KDD than to abolish those applications. This is the case especially where, for instance, profiles and generalizations can be used for desirable purposes and for undesirable purposes at the same time. Also, in such situations where there is a possibility of good use and bad use of the same newly produced information, doubtlessly some help is to be expected from cryptologists. Sometime in the future it must be possible to make the information in certain profiles and generalizations accessible to some people and not to others, for some purposes and not for others, and to protect databases against the possibility of applying certain types of queries.
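
The distinction between distributive and nondistributive profiles, and the credit-scoring case in which a trustworthy applicant is refused a loan because of the reference group he belongs to, can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the records, the group labels, the threshold, and all function names are hypothetical and are not taken from any actual credit-scoring or KDD system.

```python
from statistics import mean

# Hypothetical records: (reference group, whether this person ever defaulted).
records = [
    ("job_A", False), ("job_A", True), ("job_A", True), ("job_A", False),
    ("job_B", False), ("job_B", False), ("job_B", False), ("job_B", False),
]


def group_default_rate(records, group):
    """Nondistributive property: an average over the group,
    not a fact about each individual member."""
    outcomes = [defaulted for g, defaulted in records if g == group]
    return mean(1.0 if d else 0.0 for d in outcomes)


def is_distributive(records, group, value):
    """A property is distributive only if every member of the
    group actually exhibits it."""
    return all(defaulted == value for g, defaulted in records if g == group)


def refuse_loan(group, records, threshold=0.4):
    """Decision based solely on group membership; the applicant's
    own repayment history plays no role (assumed threshold)."""
    return group_default_rate(records, group) >= threshold


# An applicant from job_A who never defaulted is still refused,
# because the group average, not the individual record, drives the decision.
applicant_group, applicant_ever_defaulted = "job_A", False
print(refuse_loan(applicant_group, records), applicant_ever_defaulted)
```

In this toy data set, "never defaulted" is a distributive property of the hypothetical group job_B, since every member exhibits it, whereas the default rate of job_A is a nondistributive property that holds of the group as an average without holding of each member.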

Closing remarks

I have tried to draw attention to the problems of KDD using personal data in terms of categorical privacy, and I have indicated the shortcomings of traditional privacy conceptions. My main concern was to define some important problematic consequences which KDD may have for the ways in which individuals are judged and treated, so that they may not be overlooked, but be critically assessed. In this way I hope to have contributed to a fruitful discussion of the solution of these problems.

There is one other point I would like to make. I have tried, in a sense, to extend the notion of privacy and the traditional privacy norms. Of some of the problematic aspects of KDD I am still not completely sure whether they are best articulated in terms of privacy at all, be it individual or categorical. Especially regarding those problems where producing the information is not so much the problem, but applying it for certain purposes is, other normative principles may be more appropriate. For instance, articulations in terms of social justice and fairness may sometimes be more to the point than those in terms of (categorical) privacy.

My reason for still adhering to privacy is in a sense a strategic one. The notion of privacy is overwhelmingly dominant in the legal and moral vocabularies and conceptual frameworks of those working in the fields of information technology and law, information technology and public policy, and even to a large extent information technology and ethics. In these circles, the privacy vocabulary has almost become the lingua franca for the assessment of information technology. This is a good thing in many ways. It facilitates communication and guarantees in some respects a clear discussion. However, it also has its dark side. The one-sidedness that comes with the privacy vocabulary is apt to create an imbalance in our capacities to perceive, analyze, and articulate the multifarious moral aspects of information technology. Although in this paper I have chosen the compromise of staying within some limits of the privacy vocabulary, I hope to have shown by way of an exemplary case that the traditional privacy notion is not sufficient for grasping the problems relating to information technology.

Returning to KDD in particular, I think that becoming conscious of the problems of KDD is anything but a luxury. A bigger danger is lurking if we do not become alert to the problems described here. It is a danger that may affect our moral outlook as a whole. The unreflective and undirected production of generalizations about groups of persons with the help of KDD may in the end show itself to be a threat to some of the basic premises of traditional morality and moral theory. KDD may, through its steadily ongoing deindividualizing effects, little by little undermine both methodic and normative individualism, which are the hallmarks of the still dominant utilitarian and (neo-)Kantian moral frameworks (see Note 1). Perhaps that is the inevitable outcome of an unavoidable process of which we are now experiencing the beginnings and to which we will grow accustomed in the end. If, however, there are possibilities of influencing the process, then those possibilities will certainly start with our gradually becoming more articulate.

Acknowledgements

I am grateful to Robert van Kralingen and Eric Schreuders, Tilburg University, for their suggestions and contributions to an earlier version of this paper relating to my reading of the European Directive 95/46/EC. I presented this earlier version of the paper at the CEPE Conference, December 1998, at the London School of Economics. I also want to thank Herman Tavani, Rivier College, Nashua, for his inspiring comments on that occasion.

Note

1. Another interesting topic relating to KDD is that it challenges our traditional views and definitions of knowledge by depersonalizing knowledge. This might in turn affect our views about the relationship between knowledge and moral responsibility and the status of the moral actor. I will address questions concerning the depersonalization of knowledge and its moral consequences elsewhere.

References

U. Fayyad, G. Piatetsky-Shapiro and P. Smyth. Knowledge Discovery and Data Mining: Towards a Unifying Framework. In E. Simoudis, J. Han and U. Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery & Data Mining. AAAI Press / MIT Press, Menlo Park, Cal., 1996.

W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus. Knowledge Discovery in Databases: An Overview. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases. AAAI Press / MIT Press, Menlo Park, Cal. / Cambridge, Mass. / London, 1991.

M. Hartney. Some Confusions Concerning Collective Rights. Canadian Journal of Law and Jurisprudence 4: 293–314, 1991.

J. Harvey. Stereotypes and Group-claims: Epistemological and Moral Issues and Their Implications for Multiculturalism in Education. Journal of Philosophy and Education 24: 39–50, 1991.

J. Johnson. Privacy, Liberty and Integrity. Public Affairs Quarterly 3: 15–34, 1989.

J. Johnson. A Theory of the Nature and Value of Privacy. Public Affairs Quarterly 6: 271–288, 1992.

W. A. Parent. Recent Work on the Concept of Privacy. American Philosophical Quarterly 20: 341–356, 1983.

A. H. Vedder. The Values of Freedom. Thesis, Utrecht University, 1995.

A. H. Vedder. Privatization, Information Technology and Privacy: Reconsidering the Social Responsibilities of Organizations. In Geoff Moore, editor, Business Ethics: Principles and Practice, pp. 215–226. Business Education Publishers, Sunderland, 1997.