
Excavating AI

The Politics of Images in Machine Learning Training Sets

By Kate Crawford and Trevor Paglen

You open up a database of pictures used to train artificial intelligence systems. At first, things seem straightforward. You’re met with thousands of images: apples and oranges, birds, dogs, horses, mountains, clouds, houses, and street signs. But as you probe further into the dataset, people begin to appear: cheerleaders, scuba divers, welders, Boy Scouts, fire walkers, and flower girls. Things get strange: A photograph of a woman smiling in a bikini is labeled a “slattern, slut, slovenly woman, trollop.” A young man drinking beer is categorized as an “alcoholic, alky, dipsomaniac, boozer, lush, soaker, souse.” A child wearing sunglasses is classified as a “failure, loser, non-starter, unsuccessful person.” You’re looking at the “person” category in a dataset called ImageNet, one of the most widely used training sets for machine learning.

Something is wrong with this picture.

Where did these images come from? Why were the people in the photos labeled this way? What sorts of politics are at work when pictures are paired with labels, and what are the implications when they are used to train technical systems? In short, how did we get here?

There’s an urban legend about the early days of machine vision, the subfield of artificial intelligence (AI) concerned with teaching machines to detect and interpret images. In 1966, Marvin Minsky was a young professor at MIT, making a name for himself in the emerging field of artificial intelligence.[1] Deciding that the ability to interpret images was a core feature of intelligence, Minsky turned to an undergraduate student, Gerald Sussman, and asked him to “spend the summer linking a camera to a computer and getting the computer to describe what it saw.”[2] This became the Summer Vision Project.[3] Needless to say, the project of getting computers to “see” was much harder than anyone expected, and would take a lot longer than a single summer.

The story we’ve been told goes like this: brilliant men worked for decades on the problem of computer vision, proceeding in fits and starts, until the turn to probabilistic modeling and learning techniques in the 1990s accelerated progress. This led to the current moment, in which challenges such as object detection and facial recognition
have been largely solved.[4] This arc of inevitability recurs in many AI narratives, where it is assumed that ongoing technical improvements will resolve all problems and limitations.

But what if the opposite is true? What if the challenge of getting computers to “describe what they see” will always be a problem? In this essay, we will explore why the automated interpretation of images is an inherently social and political project, rather than a purely technical one. Understanding the politics within AI systems matters more than ever, as they are quickly moving into the architecture of social institutions: deciding whom to interview for a job, which students are paying attention in class, which suspects to arrest, and much else.

For the last two years, we have been studying the underlying logic of how images are used to train AI systems to “see” the world. We have looked at hundreds of collections of images used in artificial intelligence, from the first experiments with facial recognition in the early 1960s to contemporary training sets containing millions of images. Methodologically, we could call this project an archeology of datasets: we have been digging through the material layers, cataloguing the principles and values by which something was constructed, and analyzing what normative patterns of life were assumed, supported, and reproduced. By excavating the construction of these training sets and their underlying structures, many unquestioned assumptions are revealed. These assumptions inform the way AI systems work—and fail—to this day.

This essay begins with a deceptively simple question: What work do images do in AI systems? What are computers meant to recognize in an image and what is misrecognized or even completely invisible? Next, we look at the method for introducing images into computer systems and look at how taxonomies order the foundational concepts that will become intelligible to a computer system. Then we turn to the question of labeling: how do humans tell computers which words will relate to a given image? And what is at stake in the way AI systems use these labels to classify humans, including by race, gender, emotions, ability, sexuality, and personality? Finally, we turn to the purposes that computer vision is meant to serve in our society—the judgments, choices, and consequences of providing computers with these capacities.

Training AI

Building AI systems requires data. Supervised machine-learning systems designed for object or facial recognition are trained on vast amounts of data contained within datasets made up of many discrete images. To build a computer vision system that can, for example, recognize the difference between pictures of apples and oranges, a developer has to collect, label, and train a neural network on thousands of labeled images of apples and oranges. On the software side, the algorithms conduct a statistical survey of the images, and develop a model to recognize the difference between the two “classes.” If all goes according to plan, the trained model will be able to distinguish the difference between images of apples and oranges that it has never encountered before.
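As a concrete illustration of the pipeline just described, here is a minimal sketch of training a two-class image classifier with PyTorch and torchvision. The directory layout, model choice, and hyperparameters are illustrative assumptions, not a description of any particular production system.

    # A minimal sketch of the apples-vs-oranges pipeline described above.
    # Directory names ("data/train/apple", "data/train/orange") and all
    # hyperparameters are illustrative assumptions.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms, models

    # 1. Collect and label: ImageFolder treats each subdirectory as a class.
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_set = datasets.ImageFolder("data/train", transform=transform)
    loader = DataLoader(train_set, batch_size=32, shuffle=True)

    # 2. Train: the network conducts its "statistical survey" of the labeled
    # images and fits a model that separates the classes.
    model = models.resnet18(num_classes=len(train_set.classes))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(5):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    # 3. Classify an unseen image: the model answers only in terms of the
    # classes it was given, whether or not those labels fit the picture.

Whatever the specific architecture, the trained model can only ever answer in the vocabulary of the classes it was handed at the start.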
Training sets, then, are the foundation on which contemporary machine-learning systems are built.[5] They are central to how AI systems recognize and interpret the world. These datasets shape the epistemic boundaries governing how AI systems operate, and thus are an essential part of understanding socially significant questions about AI.

But when we look at the training images widely used in computer-vision systems, we find a bedrock composed of shaky and skewed assumptions. For reasons that are rarely discussed within the field of computer vision, and despite all that institutions like MIT and companies like Google and Facebook have done, the project of interpreting images is a
profoundly complex and relational
endeavor. Images are remarkably
slippery things, laden with multiple
potential meanings, irresolvable
questions, and contradictions. Entire
subfields of philosophy, art history, and
media theory are dedicated to teasing
out all the nuances of the unstable
relationship between images and
meanings.[6]

Images do not describe themselves.


This is a feature that artists have
explored for centuries. Agnes Martin
creates a grid-like painting and dubs it
“White Flower”; Magritte paints a picture
of an apple with the words “This is not
an apple.” We see those images
differently when we see how they’re
labeled. The circuit between image,
label, and referent is flexible and can be
reconstructed in any number of ways to
do different kinds of work. What’s more,
those circuits can change over time as
the cultural context of an image shifts,
and can mean different things depending
on who looks, and where they are
located. Images are open to
interpretation and reinterpretation.

This is part of the reason why the tasks of object recognition and classification are more complex than Minsky—and many of those who have come since—initially imagined.

Despite the common mythos that AI and the data it draws on are objectively and
scientifically classifying the world,
everywhere there is politics, ideology,
prejudices, and all of the subjective stuff
of history. When we survey the most
widely used training sets, we find that
this is the rule rather than the exception.

"White Flower" Agnes Martin, 1960

Anatomy of a Training Set

Although there can be considerable variation in the purposes and architectures of different training sets, they share some common properties. At their core, training sets for imaging systems consist of a collection of images that have been labeled in various ways and sorted into categories. As such, we can describe their overall architecture as generally consisting of three layers: the overall taxonomy (the aggregate of classes and their hierarchical nesting, if applicable), the individual classes (the singular categories that images are organized into, e.g., “apple”), and each individually labeled image (i.e., an individual picture that has been labeled an apple). Our contention is that every layer of a given training set’s architecture is infused with politics.
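Rendered as a data structure, the three layers look something like the following sketch; the taxonomy, class names, and file names are invented for illustration.

    # A schematic sketch of the three layers described above.
    training_set = {
        # Layer 1: the overall taxonomy (classes and their nesting, if any)
        "taxonomy": {"fruit": ["apple", "orange"]},
        # Layers 2 and 3: each class, and each individually labeled image
        # that has been sorted into it
        "labeled_images": {
            "apple":  ["img_0001.jpg", "img_0002.jpg"],
            "orange": ["img_0393.jpg"],
        },
    }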
Take the case of a dataset like the “Japanese Female Facial Expression (JAFFE) Database,” developed by Michael Lyons, Miyuki Kamachi, and Jiro Gyoba in 1998, and widely used in affective computing research and development. The dataset contains photographs of 10 Japanese female models making seven facial expressions that are meant to correlate with seven basic emotional states.[7] (The intended purpose of the dataset is to help machine-learning systems recognize and label these emotions for newly captured, unlabeled images). The implicit, top-level taxonomy here is something like “facial expressions depicting the emotions of Japanese women.”

Japanese Female Facial Expression (JAFFE) Database, image credit M. Lyons et al. (1999)

If we go down a level from taxonomy, we arrive at the level of the class. In the case of JAFFE, those classes are happiness, sadness, surprise, disgust, fear, anger, and neutral. These categories become the organizing buckets into which all of the individual images are stored. In a database used in facial recognition, as another example, the classes might correspond to the names of the individuals whose faces are in the dataset. In a dataset designed for object recognition, those classes correspond to things like apples and oranges. They are the distinct concepts used to order the underlying images.

At the most granular level of a training set’s architecture, we find the individual labeled image: be it a face labeled as indicating an emotional state; a specific person; or a specific object, among many examples. For JAFFE, this is where you can find an individual woman grimacing, smiling, or looking surprised.

There are several implicit assertions in the JAFFE set. First there’s the taxonomy itself: that “emotions” is a valid set of visual
concepts. Then there’s a string of additional assumptions: that the concepts within “emotions” can be applied to photographs of people’s faces (specifically Japanese women); that there are six emotions plus a neutral state; that there is a fixed relationship between a person’s facial expression and her true emotional state; and that this relationship between the face and the emotion is consistent, measurable, and uniform across the women in the photographs.

At the level of the class, we find assumptions such as “there is such a thing as a ‘neutral’ facial expression” and “the significant six emotional states are happy, sad, angry, disgusted, afraid, surprised.”[8] At the level of the labeled image, there are other implicit assumptions such as “this particular photograph depicts a woman with an ‘angry’ facial expression,” rather than, for example, the fact that this is an image of a woman mimicking an angry expression. These, of course, are all “performed” expressions—not relating to any interior state, but acted out in a laboratory setting. Every one of the implicit claims made at each level is, at best, open to question, and some are deeply contested.[9]

The JAFFE training set is relatively modest as far as contemporary training sets go. It was created before the advent of social media, before developers were able to scrape images from the internet at scale, and before piecemeal online labor platforms like Amazon Mechanical Turk allowed researchers and corporations to conduct the formidable task of labeling huge quantities of photographs. As training sets grew in scale and scope, so did the complexities, ideologies, semiologies, and politics from which they are constituted. To see this at work, let’s turn to the most iconic training set of all, ImageNet.

The Canonical Training Set: ImageNet

One of the most significant training sets in the history of AI so far is ImageNet, which is now celebrating its tenth anniversary. First presented as a research poster in 2009, ImageNet is a dataset of extraordinary scope and ambition. In the words of its cocreator, Stanford Professor Fei-Fei Li, the idea behind ImageNet was to “map out the entire world of objects.”[10] Over several years of development, ImageNet grew enormous: the development team scraped a collection of many millions of images from the internet and briefly became the world’s largest academic user of Amazon’s Mechanical Turk, using an army of piecemeal workers to sort an average of 50 images per minute into thousands of categories.[11] When it was finished, ImageNet consisted of over 14 million labeled images organized into more than 20 thousand categories. For a decade, it has been the colossus of object recognition for machine learning and a powerfully important benchmark for the field.

Interface used by Amazon Turk Workers to label pictures in ImageNet

Navigating ImageNet’s labyrinthine structure is like taking a stroll through Borges’s infinite library. It is vast and filled with all sorts of curiosities. There are categories for apples, apple aphids, apple butter, apple dumplings, apple geraniums, apple jelly, apple juice, apple maggots, apple rust, apple trees, apple turnovers, apple carts, applejack, and applesauce. There are pictures of hot lines, hot pants, hot plates, hot pots, hot rods, hot sauce, hot springs, hot toddies, hot tubs, hot-air balloons, hot fudge sauce, and hot water bottles.

ImageNet quickly became a critical asset for computer-vision research. It became the basis for an annual competition where labs around the world would try to outperform each other by pitting their algorithms against the training set, and seeing which one could most accurately label a subset of images. In 2012, a team from the University of Toronto used a Convolutional
Neural Network to handily win the top prize, bringing new attention to this technique. That moment is widely considered a turning point in the development of contemporary AI.[12] The final year of the ImageNet competition was 2017, and accuracy in classifying objects in the limited subset had risen from 71.8% to 97.3%. That subset did not include the “Person” category, for reasons that will soon become obvious.

Taxonomy

The underlying structure of ImageNet is based on the semantic structure of WordNet, a database of word classifications developed at Princeton University in the 1980s. The taxonomy is organized according to a nested structure of cognitive synonyms, or “synsets.” Each “synset” represents a distinct concept, with synonyms grouped together (for example, “auto” and “car” are treated as belonging to the same synset). Those synsets are then organized into a nested hierarchy, going from general concepts to more specific ones. For example, the concept “chair” is nested as artifact > furnishing > furniture > seat > chair. The classification system is broadly similar to those used in libraries to order books into increasingly specific categories.

While WordNet attempts to organize the entire English language,[13] ImageNet is restricted to nouns (the idea being that nouns are things that pictures can represent). In the ImageNet hierarchy, every concept is organized under one of nine top-level categories: plant, geologic formation, natural object, sport, artifact, fungus, person, animal, and miscellaneous. Below these are layers of additional nested classes.
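Because ImageNet inherits this structure from WordNet, the nesting can be inspected directly in the WordNet database itself. The sketch below uses the NLTK interface; the exact chain printed depends on the installed WordNet version.

    # Walking the hypernym ("is a") chain for "chair" in WordNet,
    # the same kind of nesting that ImageNet inherits for its synsets.
    # Requires: pip install nltk, then nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    chair = wn.synset("chair.n.01")       # the "seat for one person" sense
    path = chair.hypernym_paths()[0]      # one chain from the root down to "chair"
    print(" > ".join(s.lemmas()[0].name() for s in path))
    # e.g. entity > physical_entity > object > ... > furniture > seat > chair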
As the fields of information science and science and technology studies have long shown, all taxonomies or classificatory systems are political.[14] In ImageNet (inherited from WordNet), for example, the
category “human body” falls under the branch Natural Object > Body > Human Body. Its subcategories include “male body”; “person”; “juvenile body”; “adult body”; and “female body.” The “adult body” category contains the subclasses “adult female body” and “adult male body.” We find an implicit assumption here: only “male” and “female” bodies are “natural.” There is an ImageNet category for the term “Hermaphrodite” that is bizarrely (and offensively) situated within the branch Person > Sensualist > Bisexual, alongside the categories “Pseudohermaphrodite” and “Switch Hitter.”[15] The ImageNet classification hierarchy recalls the old Library of Congress classification of LGBTQ-themed books under the category “Abnormal Sexual Relations, Including Sexual Crimes,” which the American Library Association’s Task Force on Gay Liberation finally convinced the Library of Congress to change in 1972 after a sustained campaign.[16]

If we move from taxonomy down a level, to the 21,841 categories in the ImageNet hierarchy, we see another kind of politics emerge.

Categories

There’s a kind of sorcery that goes into the creation of categories. To create a category or to name things is to divide an almost infinitely complex universe into separate phenomena. To impose order onto an undifferentiated mass, to ascribe phenomena to a category—that is, to name a thing—is in turn a means of reifying the existence of that category.

In the case of ImageNet, noun categories such as “apple” or “apple butter” might seem reasonably uncontroversial, but not all nouns are created equal. To borrow an idea from linguist George Lakoff, the concept of an “apple” is more nouny than the concept of “light,” which in turn is more nouny than a concept such as “health.”[17] Nouns occupy various places on an axis from the concrete to the abstract, and from the descriptive to the judgmental. These gradients have been erased in the logic of ImageNet. Everything is flattened out and pinned to a label, like taxidermy butterflies in a display case. The results can be problematic, illogical, and cruel, especially when it comes to labels applied to people.

ImageNet contains 2,833 subcategories under the top-level category “Person.” The subcategory with the most associated pictures is “gal” (with 1,664 images) followed by “grandfather” (1,662), “dad” (1,643), and “chief executive officer” (1,614). With these highly populated categories, we can already begin to see the outlines of a worldview. ImageNet classifies people into a huge range of types including race, nationality, profession, economic status, behaviour, character, and even morality.

Selections from the "Person" classes, ImageNet

There are categories for racial and national
identities including Alaska Native, Anglo-
American, Black, Black African, Black
Woman, Central American, Eurasian,
German American, Japanese, Lapp, Latin
American, Mexican-American, Nicaraguan,
Nigerian, Pakistani, Papuan, South
American Indian, Spanish American, Texan,
Uzbek, White, Yemeni, and Zulu. Other
people are labeled by their careers or
hobbies: there are Boy Scouts,
cheerleaders, cognitive neuroscientists,
hairdressers, intelligence analysts,
mythologists, retailers, retirees, and so on.

As we go further into the depths of


ImageNet’s Person categories, the
classifications of humans within it take a
sharp and dark turn. There are categories
for Bad Person, Call Girl, Drug Addict,
Closet Queen, Convict, Crazy, Failure,
Flop, Fucker, Hypocrite, Jezebel, Kleptomaniac, Loser, Melancholic,
Nonperson, Pervert, Prima Donna,
Schizophrenic, Second-Rater, Spinster,
Streetwalker, Stud, Tosser, Unskilled
Person, Wanton, Waverer, and Wimp.
There are many racist slurs and
misogynistic terms.

Of course, ImageNet was typically used for


object recognition—so the Person category
was rarely discussed at technical
conferences, nor has it received much
public attention. However, this complex
architecture of images of real people,
tagged with often offensive labels, has
been publicly available on the internet for a
decade. It provides a powerful and
important example of the complexities and
dangers of human classification, and the
sliding spectrum between supposedly
unproblematic labels like “trumpeter” or
“tennis player” to concepts like “spastic,”
“mulatto,” or “redneck.” Regardless of the
supposed neutrality of any particular
category, the selection of images skews the
meaning in ways that are gendered,
racialized, ableist, and ageist. ImageNet is
an object lesson, if you will, in what
happens when people are categorized like
objects. And this practice has only become more common in recent years, often inside the big AI companies, where there is no way for outsiders to see how images are being ordered and classified.


ImageNet Roulette: An Experiment in Classification

The ImageNet dataset is typically used for object recognition. But as part of our archeological method, we were interested to see what would happen if we trained an AI model exclusively on its “person” categories. The result of that experiment is ImageNet Roulette.

ImageNet Roulette uses an open-source Caffe deep-learning framework (produced at UC Berkeley) trained on the images and labels in the “person” categories (which are currently “down for maintenance”). Proper nouns were removed.

When a user uploads a picture, the application first runs a face detector to locate any faces. If it finds any, it sends them to the Caffe model for classification. The application then returns the original images with a bounding box showing the detected face and the label the classifier has assigned to the image. If no faces are detected, the application sends the entire scene to the Caffe model and returns an image with a label in the upper left corner.
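In code, a pipeline of this shape might look like the sketch below. It uses OpenCV for both the face-detection and Caffe-classification steps; the model files, label list, and preprocessing values are placeholders rather than the actual ImageNet Roulette implementation.

    # A sketch of the detect-then-classify loop described above, not the
    # actual ImageNet Roulette code. Model files, labels, and blob
    # parameters are placeholders.
    import cv2
    import numpy as np

    face_detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    net = cv2.dnn.readNetFromCaffe("person_model.prototxt", "person_model.caffemodel")
    labels = open("person_labels.txt").read().splitlines()

    def classify(region):
        # Hand the cropped face (or the whole scene) to the Caffe model.
        blob = cv2.dnn.blobFromImage(cv2.resize(region, (224, 224)),
                                     1.0, (224, 224), (104, 117, 123))
        net.setInput(blob)
        return labels[int(np.argmax(net.forward()))]

    image = cv2.imread("upload.jpg")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)

    if len(faces) > 0:
        for (x, y, w, h) in faces:           # label each detected face
            label = classify(image[y:y + h, x:x + w])
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(image, label, (x, y - 8),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    else:                                    # no face: label the whole scene
        cv2.putText(image, classify(image), (10, 24),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imwrite("labeled.jpg", image)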

As we have shown, ImageNet contains a number of problematic, offensive, and bizarre categories. Hence, the results ImageNet Roulette returns often draw upon those categories. That is by design: we want to shed light on what happens when technical systems are trained using problematic training data. AI classifications of people are rarely made visible to the people being classified. ImageNet Roulette provides a glimpse into that process—and shows how things can go wrong.

ImageNet Roulette does not store the photos people upload.

Finally, there is the issue of where the thousands of images in ImageNet’s Person class were drawn from. By harvesting images en masse from image search engines like Google, ImageNet’s creators appropriated people’s selfies and vacation photos without their knowledge, and then labeled and repackaged them as the underlying data for much of an entire field.[18] When we take a look at the bedrock layer of labeled images, we find highly questionable semiotic assumptions, echoes of nineteenth-century phrenology, and the representational harm of classifying images of people without their consent or participation.

Labeled Images

Images are laden with potential meanings, irresolvable questions, and contradictions. In trying to resolve these ambiguities, ImageNet’s labels often compress and simplify images into deadpan banalities. One photograph shows a dark-skinned toddler wearing tattered and dirty clothes and clutching a soot-stained doll. The child’s mouth is open. The image is completely devoid of context. Who is this child? Where are they? The photograph is simply labeled “toy.”

But some labels are just nonsensical. A woman sleeps in an airplane seat, her right arm protectively curled around her pregnant stomach. The image is labeled “snob.” A photoshopped picture shows a smiling Barack Obama wearing a Nazi uniform, his arm raised and holding a Nazi flag. It is labeled “Bolshevik.”

Labeled Images, ImageNet (faces redacted by the authors). The labels shown include “Accused,” “Bolshevik,” “Anti-Semite,” “Kleptomaniac,” “Closet Queen,” “Mythologist,” “Toy,” “Good Person,” “Slattern,” “Hermaphrodite,” “Jewess,” “Loser,” “Mixed-Blood,” “Sharecropper,” “Snob,” “Tosser,” and “Swinger.”
At the image layer of the training set, like everywhere else, we find assumptions, politics, and worldviews. According to ImageNet, for example, Sigourney Weaver is a “hermaphrodite,” a young man wearing a straw hat is a “tosser,” and a young woman lying on a beach towel is a “kleptomaniac.” But the worldview of ImageNet isn’t limited to the bizarre or derogatory conjoining of pictures and labels.

Other assumptions about the relationship between pictures and concepts recall physiognomy, the pseudoscientific assumption that something about a person’s essential character can be gleaned by observing features of their bodies and faces. ImageNet takes this to an extreme, assuming that whether someone is a “debtor,” a “snob,” a “swinger,” or a “slav” can be determined by inspecting their photograph. In the weird metaphysics of ImageNet, there are separate image categories for “assistant professor” and “associate professor”—as though if someone were to get a promotion, their biometric signature would reflect the change in rank.

Of course, these sorts of assumptions have their own dark histories and attendant politics.

UTK: Making Race and Gender from Your Face

In 1839, the mathematician François Arago claimed that through photographs, “objects preserve mathematically their forms.”[19] Placed into the nineteenth-century context of imperialism and social Darwinism, photography helped to animate—and lend a “scientific” veneer to—various forms of phrenology, physiognomy, and eugenics.[20] Physiognomists such as Francis Galton and Cesare Lombroso created composite images of criminals, studied the feet of prostitutes, measured skulls, and compiled meticulous archives of labeled images and measurements, all in an effort to use “mechanical” processes to detect visual signals in classifications of race, criminality, and deviance from bourgeois ideals. This was done to capture and pathologize what was seen as deviant or criminal behavior, and make such behavior observable in the world.

And as we shall see, not only have the underlying assumptions of physiognomy made a comeback with contemporary training sets, but indeed a number of training sets are designed to use algorithms and facial landmarks as latter-day calipers to conduct contemporary versions of craniometry.

For example, the UTKFace dataset (produced by a group at the University of Tennessee at Knoxville) consists of over 20,000 images of faces with annotations for age, gender, and race. The dataset’s authors state that the dataset can be used for a variety of tasks, like automated face detection, age estimation, and age progression.[21]

The annotations for each image include an estimated age for each person, expressed in years from zero to 116. Gender is a binary choice: either zero for male or one for female. Second, race is categorized from zero to four, and places people in one of five classes: White, Black, Asian, Indian, or “Others.”
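To make concrete how flat this schema is, here is a sketch of decoding those annotations. It assumes the commonly documented UTKFace convention in which each filename begins with age, gender, and race as integers; that convention is an assumption here and should be checked against the dataset’s own documentation.

    # Decoding UTKFace-style annotations into the dataset's categories.
    # Assumes the filename convention [age]_[gender]_[race]_<timestamp>.jpg;
    # verify against the actual dataset documentation.
    GENDERS = {0: "male", 1: "female"}                 # a binary choice
    RACES = {0: "White", 1: "Black", 2: "Asian", 3: "Indian", 4: "Others"}

    def parse_annotation(filename: str) -> dict:
        age, gender, race = filename.split("_")[:3]
        return {
            "age": int(age),                           # 0 to 116
            "gender": GENDERS[int(gender)],
            "race": RACES[int(race)],
        }

    print(parse_annotation("26_1_3_20170116174525125.jpg"))
    # {'age': 26, 'gender': 'female', 'race': 'Indian'}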
The politics here are as obvious as they are troubling. At the category level, the researchers’ conception of gender is as a simple binary structure, with “male” and “female” the only alternatives. At the level of the image label is the assumption that someone’s gender identity can be ascertained through a photograph.

The classificatory schema for race recalls many of the deeply problematic racial classifications of the twentieth century. For example, the South African apartheid regime sought to classify the entire population into four categories: Black, White, Colored, or Indian.[22]


UTKFace Dataset


Classification scheme used in UTKFace Dataset

Around 1970, the South African government created a unified “identity passbook” called The Book of Life, which linked to a centrally managed database created by IBM. These classifications were based on dubious and shifting criteria of “appearance and general acceptance or repute,” and many people were reclassified, sometimes multiple times.[23] The South African system of racial classification was intentionally very different from the American “one-drop” rule, which stated that even one ancestor of African descent made somebody Black, likely because nearly all white South Africans had some traceable black African ancestry.[24] Above all, these systems of classifications caused enormous harm to people, and the elusive classifier of a pure “race” signifier was always in dispute. However, seeking to improve matters by producing “more diverse” AI training sets presents its own complications.

IBM’s Diversity in Faces

IBM’s “Diversity in Faces” dataset was created as a response to critics who had shown that the company’s facial-recognition software often simply did not recognize the faces of people with darker skin.[25] IBM publicly promised to improve their facial-recognition datasets to make them more “representative” and published the “Diversity in Faces” (DiF) dataset as a result.[26] Constructed to be “a computationally practical basis for ensuring fairness and accuracy in face recognition,” the DiF consists of almost a million images of people pulled from the Yahoo! Flickr Creative Commons dataset, assembled specifically to achieve statistical parity among categories of skin tone, facial structure, age, and gender.[27]

The dataset itself continued the practice of collecting hundreds of thousands of images of unsuspecting people who had uploaded pictures to sites like Flickr.[28] But the dataset contains a unique set of categories not previously seen in other face-image datasets. The IBM DiF team asks whether age, gender, and skin color are truly sufficient in generating a dataset that can ensure fairness and accuracy, and concludes that even more classifications are needed. So they move into truly strange territory: including facial symmetry and skull shapes to build a complete picture of the face. The researchers claim that the use of craniofacial features is justified because it captures much more granular information about a person’s face than just gender, age, and skin color alone.


IBM's Diversity in Faces

The paper accompanying the dataset specifically highlights prior work done to show that skin color is itself a weak predictor of race, but this begs the question of why moving to skull shapes is appropriate.

Craniometry was a leading methodological approach of biological determinism during the nineteenth century. As Stephen Jay Gould shows in his book The Mismeasure of Man, skull size was used by nineteenth- and twentieth-century pseudoscientists as a spurious way to claim inherent superiority of white people over black people, and different skull shapes and weights were said to determine people’s intelligence—always along racial lines.[29]

While the efforts of companies to build more diverse training sets are often put in the language of increasing “fairness” and “mitigating bias,” clearly there are strong business imperatives to produce tools that will work more effectively across wider markets. However, here too the technical process of categorizing and classifying people is shown to be a political act. For example, how is a “fair” distribution achieved within the dataset?

IBM decided to use a mathematical approach to quantifying “diversity” and “evenness,” so that a consistent measure of evenness exists throughout the dataset for every feature quantified. The dataset also contains subjective annotations for age and gender, which are generated using three independent Amazon Turk workers for each image, similar to the methods used by ImageNet.[30] So people’s gender and age are being “predicted” based on
three clickworkers’ guesses about what’s shown in a photograph scraped from the internet. It harkens back to the early carnival games of “Guess Your Weight!”, with similar levels of scientific validity.
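What does it mean, concretely, to quantify “evenness”? One standard formulation, borrowed from ecological diversity measures, is Shannon evenness: the entropy of a category distribution divided by its maximum possible value. The sketch below is offered only as an illustration of the kind of calculation involved; it is not drawn from the DiF paper, and IBM’s exact measures may differ.

    # Shannon evenness of a labeled attribute: 1.0 means every category is
    # equally populated; values near 0 mean one category dominates.
    # An illustration of "quantifying evenness," not IBM's exact measure.
    import math

    def shannon_evenness(counts):
        total = sum(counts)
        probs = [c / total for c in counts if c > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        return entropy / math.log(len(counts))

    # e.g. how "even" a binary gender annotation is across a hypothetical dataset
    print(shannon_evenness([480_000, 520_000]))   # close to 1.0: near parity
    print(shannon_evenness([950_000, 50_000]))    # far from 1.0: heavily skewed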
Ultimately, beyond these deep methodological concerns, the concept and political history of diversity is being drained of its meaning and left to refer merely to expanded biological phenotyping. Diversity in this context just means a wider range of skull shapes and facial symmetries. For computer vision researchers, this may seem like a “mathematization of fairness,” but it simply serves to improve the efficiency of surveillance systems. And even after all these attempts at expanding the ways people are classified, the Diversity in Faces set still relies on a binary classification for gender: people can only be labelled male or female. Achieving parity amongst different categories is not the same as achieving diversity or fairness, and IBM’s data construction and analysis perpetuates a harmful set of classifications within a narrow worldview.

Epistemics of Training Sets

What are the assumptions undergirding visual AI systems? First, the underlying theoretical paradigm of the training sets assumes that concepts—whether “corn,” “gender,” “emotions,” or “losers”—exist in the first place, and that those concepts are fixed, universal, and have some sort of transcendental grounding and internal consistency. Second, it assumes a fixed and universal correspondence between images and concepts, appearances and essences. What’s more, it assumes uncomplicated, self-evident, and measurable ties between images, referents, and labels. In other words, it assumes that different concepts—whether “corn” or “kleptomaniacs”—have some kind of essence that unites each instance of them, and that that underlying essence expresses itself visually. Moreover, the theory goes, that visual essence is discernible by using statistical methods to look for formal patterns across a collection of labeled images. Images of people dubbed “losers,” the theory goes, contain some kind of visual pattern that distinguishes them from, say, “farmers,” “assistant professors,” or, for that matter, apples. Finally, this approach assumes that all concrete nouns are created equally, and that many abstract nouns also express themselves concretely and visually (i.e., “happiness” or “anti-Semitism”).

The training sets of labeled images that are ubiquitous in contemporary computer vision and AI are built on a foundation of unsubstantiated and unstable epistemological and metaphysical assumptions about the nature of images, labels, categorization, and representation. Furthermore, those epistemological and metaphysical assumptions hark back to historical approaches where people were visually assessed and classified as a tool of oppression and race science.

Datasets aren’t simply raw materials to feed algorithms, but are political interventions. As such, much of the discussion around “bias” in AI systems misses the mark: there is no “neutral,” “natural,” or “apolitical” vantage point that training data can be built upon. There is no easy technical “fix” by shifting demographics, deleting offensive terms, or seeking equal representation by skin tone. The whole endeavor of collecting images, categorizing them, and labeling them is itself a form of politics, filled with questions about who gets to decide what images mean and what kinds of social and political work those representations perform.

Missing Persons

In January 2019, images in ImageNet’s “Person” category began disappearing. Suddenly, 1.2 million photos were no longer accessible on Stanford University’s servers. Gone were the pictures of cheerleaders, scuba divers, welders, altar boys, retirees,
and pilots. The picture of a man drinking beer characterized as an “alcoholic” disappeared, as did the pictures of a woman in a bikini dubbed a “slattern” and a young boy classified as a “loser.” The picture of a man eating a sandwich (labeled a “selfish person”) met the same fate. When you search for these images, the ImageNet website responds with a statement that it is under maintenance, and only the categories used in the ImageNet competition are still included in the search results.

But once it came back online, the search functionality on the site was modified so that it would only return results for categories that had been included in ImageNet’s annual computer-vision contest. As of this writing, the “Person” category is still browsable from the data set’s online interface, but the images fail to load. The URLs for the original images are still downloadable.[31]

Over the next few months, other image collections used in computer-vision and AI research also began to disappear. In response to research published by Adam Harvey and Jules LaPlace,[32] Duke University took down a massive photo repository of surveillance-camera footage of students attending classes (called the Duke Multi-Target, Multi-Camera [MTMC] dataset). It turned out that the authors of the dataset had violated the terms of their Institutional Review Board approval by collecting images from people in public space, and by making their dataset publicly available.[33]

Similar datasets created from surveillance footage disappeared from servers at the University of Colorado Colorado Springs, and more from Stanford University, where a collection of faces culled from a webcam installed at San Francisco’s iconic Brainwash Cafe was “removed from access at the request of the depositor.”[34]

MS CELEB dataset

By early June, Microsoft had followed suit, removing their landmark “MS-CELEB” collection of approximately 10 million photos from 100,000 people scraped from the internet in 2016. It was the largest public facial-recognition dataset in the
world, and the people included were not just famous actors and politicians, but also journalists, activists, policy makers, academics, and artists.[35] Ironically, several of the people who had been included in the set without any consent are known for their work critiquing surveillance and facial recognition itself, including filmmaker Laura Poitras, digital rights activist Jillian York, critic Evgeny Morozov, and author of Surveillance Capitalism Shoshana Zuboff. After an investigation in the Financial Times based on Harvey and LaPlace’s work was published, the set disappeared.[36] A spokesperson for Microsoft claimed simply that it was removed “because the research challenge is over.”[37]

On one hand, removing these problematic datasets from the internet may seem like a victory. The most obvious privacy and ethical violations are addressed by making them no longer accessible. However, taking them offline doesn’t stop their work in the world: these training sets have been downloaded countless times, and have made their way into many production AI systems and academic papers. By erasing them completely, not only is a significant part of the history of AI lost, but researchers are unable to see how the assumptions, labels, and classificatory approaches have been replicated in new systems, or trace the provenance of skews and biases exhibited in working systems. Facial-recognition and emotion-recognition AI systems are already propagating into hiring, education, and healthcare. They are part of security checks at airports and interview protocols at Fortune 500 companies. Not being able to see the basis on which AI systems are trained removes an important forensic method to understand how they work. This has serious consequences.

For example, a recent paper led by a PhD student at the University of Cambridge introduced a real-time drone surveillance system to identify violent individuals in public areas. It is trained on datasets of “violent behavior” and uses those models for drone surveillance systems to detect and isolate violent behavior in crowds. The team created the Aerial Violent Individual (AVI) Dataset, which consists of 2,000 images of people engaged in five activities: punching, stabbing, shooting, kicking, and strangling. In order to train their AI, they asked 25 volunteers between the ages of 18 and 25 to mimic these actions. Watching the videos is almost comic. The actors stand far apart and perform strangely exaggerated gestures. It looks like a children’s pantomime, or badly modeled game characters.[38] The full dataset is not available for the public to download. The lead researcher, Amarjot Singh (now at Stanford University), said he plans to test the AI system by flying drones over two major festivals, and potentially at national borders in India.[39][40]

An archeological analysis of the AVI dataset—similar to our analyses of ImageNet, JAFFE, and Diversity in Faces—could be very revealing. There is clearly a significant difference between staged performances of violence and real-world cases. The researchers are training drones to recognize pantomimes of violence, with all of the misunderstandings that might come with that. Furthermore, the AVI dataset doesn’t have anything for “actions that aren’t violence but might look like it”; neither do they publish any details about their false-positive rate (how often their system detects nonviolent behavior as violent).[41] Until their data is released, it is impossible to do forensic testing on how they classify and interpret human bodies, actions, or inactions.

This is the problem of inaccessible or disappearing datasets. If they are, or were, being used in systems that play a role in everyday life, it is important to be able to study and understand the worldview they normalize. Developing frameworks within which future researchers can access these data sets in ways that don’t perpetuate harm is a topic for further work.


Conclusion: Who Decides?

The Lombrosian criminologists and other phrenologists of the early twentieth century didn’t see themselves as political reactionaries. On the contrary, as Stephen Jay Gould points out, they tended to be liberals and socialists whose intention was “to use modern science as a cleansing broom to sweep away from jurisprudence the outdated philosophical baggage of free will and unmitigated moral responsibility.”[42] They believed their anthropometric method of studying criminality could lead to a more enlightened approach to the application of justice. Some of them truly believed they were “de-biasing” criminal justice systems, creating “fairer” outcomes through the application of their “scientific” and “objective” methods.

Amid the heyday of phrenology and “criminal anthropology,” the artist René Magritte completed a painting of a pipe and coupled it with the words “Ceci n’est pas une pipe.” Magritte called the painting La trahison des images, “The Treachery of Images.” That same year, he penned a text in the surrealist newsletter La Révolution surréaliste. “Les mots et les images” is a playful romp through the complexities and
subtleties of images, labels, icons, and references, underscoring the extent to which there is nothing at all straightforward about the relationship between images and words or linguistic concepts. The series would culminate in a series of paintings: “This Is Not an Apple.”

The contrast between Magritte and the physiognomists’ approach to representation speaks to two very different conceptions of the fundamental relationship between images and their labels, and of representation itself. For the physiognomists, there was an underlying faith that the relationship between an image of a person and the character of that person was inscribed in the images themselves. Magritte’s assumption was almost diametrically opposed: that images in and of themselves have, at best, a very unstable relationship to the things they seem to represent, one that can be sculpted by whoever has the power to say what a particular image means. For Magritte, the meaning of images is relational, open to contestation. At first blush, Magritte’s painting might seem like a simple semiotic stunt, but the underlying dynamic Magritte underlines in the painting points to a much broader politics of representation and self-representation.

Memphis Sanitation Workers Strike of 1968

Struggles for justice have always been, in part, struggles over the meaning of images and representations. In 1968, African American sanitation workers went on strike to protest dangerous working conditions and terrible treatment at the hands of Memphis’s racist government. They held up signs recalling language from the nineteenth-century abolitionist movement: “I AM A MAN.” In the 1970s, queer-liberation activists appropriated a symbol originally used in Nazi concentration camps to identify prisoners who had been labeled as homosexual, bisexual, and transgender. The pink triangle became a badge of pride, one of the most iconic symbols of queer-liberation movements. Examples such as these—of people trying to define the meaning of their own representations—are everywhere in struggles for justice.

Representations aren’t simply confined to the spheres of language and culture, but have real implications in terms of rights, liberties, and forms of self-determination.

There is much at stake in the architecture and contents of the training sets used in AI. They can promote or discriminate, approve or reject, render visible or invisible, judge or enforce. And so we need to examine them—because they are already used to examine us—and to have a wider public discussion about their consequences, rather than keeping it within academic corridors. As training sets are increasingly part of our urban, legal, logistical, and commercial infrastructures, they have an important but underexamined role: the power to shape the world in their own images.


END NOTES

[1] Minsky currently faces serious allegations related to convicted pedophile and rapist Jeffrey Epstein. Minsky was one of several scientists who met with Epstein and visited his island retreat where underage girls were forced to have sex with members of Epstein’s coterie. As scholar Meredith Broussard observed, this was part of a broader culture of exclusion that became endemic in AI: “as wonderfully creative as Minsky and his cohort were, they also solidified the culture of tech as a billionaire boys’ club. Math, physics, and the other ‘hard’ sciences have never been hospitable to women and people of color; tech followed this lead.” See Meredith Broussard, Artificial Unintelligence: How Computers Misunderstand the World (Cambridge, Massachusetts, and London: MIT Press, 2018), 174.

[2] See Daniel Crevier, AI: The Tumultuous History of the Search for Artificial Intelligence (New York: Basic Books, 1993), 88.

[3] Minsky gets the credit for this idea, but clearly Papert, Sussman, and teams of “summer workers” were all part of this early effort to get computers to describe objects in the world. See Seymour A. Papert, “The Summer Vision Project,” July 1, 1966, https://dspace.mit.edu/handle/1721.1/6125. As he wrote: “The summer vision project is an attempt to use our summer workers effectively in the construction of a significant part of a visual system. The particular task was chosen partly because it can be segmented into sub-problems which allow individuals to work independently and yet participate in the construction of a system complex enough to be a real landmark in the development of ‘pattern recognition’.”

[4] Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd ed., Prentice Hall Series in Artificial Intelligence (Upper Saddle River, NJ: Prentice Hall, 2010), 987.

[5] In the late 1970s, Ryszard Michalski wrote an algorithm based on “symbolic variables” and logical rules. This language was very popular in the 1980s and 1990s, but, as the rules of decision-making and qualification became more complex, the language became less usable. At the same moment, the potential of using large training sets triggered a shift from this conceptual clustering to contemporary machine-learning approaches. See Ryszard Michalski, “Pattern Recognition as Rule-Guided Inductive Inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence 2 (1980): 349–361.

[6] There are hundreds of scholarly books in this category, but for a good place to start, see W. J. T. Mitchell, Picture Theory: Essays on Verbal and Visual Representation, Paperback ed. (Chicago: University of Chicago Press, 2007).

[7] M. Lyons et al., “Coding Facial Expressions with Gabor Wavelets,” in Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (Nara, Japan: IEEE Computer Society, 1998), 200–205, https://doi.org/10.1109/AFGR.1998.670949.

[8] As described in the AI Now Report 2018, this classification of emotions into six categories has its root in the work of the psychologist Paul Ekman. “Studying faces, according to Ekman, produces an objective reading of authentic interior states—a direct window to the soul. Underlying his belief was the idea that emotions are fixed and universal, identical across individuals, and clearly visible in observable biological mechanisms regardless of cultural context. But Ekman’s work has been deeply criticized by psychologists, anthropologists, and other researchers who have found his theories do not hold up under sustained scrutiny. The psychologist Lisa Feldman Barrett and her colleagues have argued that an understanding of emotions in terms of these rigid categories and simplistic physiological causes is no longer tenable. Nonetheless, AI researchers have taken his work as fact, and used it as a basis for automating emotion detection.” Meredith Whittaker et al., “AI Now Report 2018” (AI Now Institute, December 2018), https://ainowinstitute.org/AI_Now_2018_Report.pdf. See also Lisa Feldman Barrett et al., “Emotional Expressions Reconsidered: Challenges to Inferring Emotion From Human Facial Movements,” Psychological Science in the Public Interest 20, no. 1 (July 17, 2019): 1–68, https://doi.org/10.1177/1529100619832930.

[9] See, for example, Ruth Leys, “How Did Fear Become a Scientific Object and What Kind of Object Is It?”, Representations 110, no. 1 (May 2010): 66–104, https://doi.org/10.1525/rep.2010.110.1.66. Leys has offered a number of critiques of Ekman’s research program, most recently in Ruth Leys, The Ascent of Affect: Genealogy and Critique (Chicago and London: University of Chicago Press, 2017). See also Lisa Feldman Barrett, “Are Emotions Natural Kinds?”, Perspectives on Psychological Science 1, no. 1 (March 2006): 28–58, https://doi.org/10.1111/j.1745-6916.2006.00003.x; Erika H. Siegel et al., “Emotion Fingerprints or Emotion Populations? A Meta-Analytic Investigation of Autonomic Features of Emotion Categories,” Psychological Bulletin (2018), https://doi.org/10.1037/bul0000128.

[10] Fei-Fei Li, as quoted in Dave Gershgorn, “The Data That Transformed AI Research—and Possibly the World,” Quartz, July 26, 2017, https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/. Emphasis added.

[11] John Markoff, “Seeking a Better Way to Find Web Images,” The New York Times, November 19, 2012, sec. Science, https://www.nytimes.com/2012/11/20/science/for-web-images-creating-new-technology-to-seek-and-find.html.

[12] Their paper can be found here: Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems 25, ed. F. Pereira et al. (Curran Associates, Inc., 2012), 1097–1105, http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[13] Released in the mid-1980s, this lexical database for the English language can be seen as a thesaurus that defines and groups English words into synsets, i.e., sets of synonyms: https://wordnet.princeton.edu. This project takes place in a broader history of computational linguistics and natural-language processing (NLP), which developed during the same period. This subfield aims at programming computers to process and analyze large amounts of natural language data, using machine-learning algorithms.

[14] See Geoffrey C. Bowker and Susan Leigh Star, Sorting Things Out: Classification and Its Consequences, First paperback edition, Inside Technology (Cambridge, Massachusetts, and London: MIT Press, 2000), 44, 107; Anja Bechmann and Geoffrey C. Bowker, “Unsupervised by Any Other Name: Hidden Layers of Knowledge Production in Artificial Intelligence on Social Media,” Big Data & Society 6, no. 1 (January 2019): 205395171881956, https://doi.org/10.1177/2053951718819569.

[15] These are some of the categories that have now been entirely deleted from ImageNet as of January 24, 2019.

[16] For an account of the politics of classification in the Library of Congress, see Sanford Berman, Prejudices and Antipathies: A Tract on the LC Subject Heads Concerning People (Metuchen, NJ: Scarecrow Press, 1971).

[17] We’re drawing in part here on the work of George Lakoff in Women, Fire, and Dangerous Things: What Categories Reveal about the Mind (Chicago: University of Chicago Press, 2012).

[18] See Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255.

[19] Quoted in Allan Sekula, “The Body and the Archive,” October 39 (1986): 3–64, https://doi.org/10.2307/778312.

[20] Ibid.; for a broader discussion of objectivity, scientific judgment, and a more nuanced take on photography’s role in it, see Lorraine Daston and Peter Galison, Objectivity, Paperback ed. (New York: Zone Books, 2010).

[21] “UTKFace - Aicip,” accessed August 28, 2019, http://aicip.eecs.utk.edu/wiki/UTKFace.

[22] See Paul N. Edwards and Gabrielle Hecht, “History and the Technopolitics of Identity: The Case of Apartheid South Africa,” Journal of Southern African Studies 36, no. 3 (September 2010): 619–39, https://doi.org/10.1080/03057070.2010.507568. Earlier classifications used in the 1950 Population Act and Group Areas Act used four classes: “Europeans, Asiatics, persons of mixed race or coloureds, and ‘natives’ or pure-blooded individuals of the Bantu race” (Bowker and Star, 197). Black South Africans were required to carry pass books and could not, for example, spend more than 72 hours in a white area without permission from the government for a work contract (198).

[23] Bowker and Star, 208.

[24] See F. James Davis, Who Is Black? One Nation’s Definition, 10th anniversary ed. (University Park, PA: Pennsylvania State University Press, 2001).

[25] See Joy Buolamwini and Timnit Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” in Conference on Fairness, Accountability, and Transparency, 2018, 77–91, http://proceedings.mlr.press/v81/buolamwini18a.html.

[26] Michele Merler et al., “Diversity in Faces,” arXiv:1901.10436 [cs], January 29, 2019, http://arxiv.org/abs/1901.10436.

[27] “Webscope | Yahoo Labs,” accessed August 28, 2019, https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67&guccounter=1.

[28] Olivia Solon, “Facial Recognition’s ‘Dirty Little Secret’: Millions of Online Photos Scraped without Consent,” NBC News, March 12, 2019, https://www.nbcnews.com/tech/internet/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921.

[29] Stephen Jay Gould, The Mismeasure of Man, revised and expanded (New York: Norton, 1996). The approach of measuring intelligence based on skull size was prevalent across Europe and the US; in France, for example, it was developed by Paul Broca and Gustave Le Bon. See Paul Broca, “Sur le crâne de Schiller et sur l’indice cubique des crânes,” Bulletin de la Société d’anthropologie de Paris, I° Série, t. 5, fasc. 1 (1864): 253–260; Gustave Le Bon, L’homme et les sociétés. Leurs origines et leur développement (Paris: Edition J. Rothschild, 1881). In Nazi Germany, the “anthropologist” Eva Justin wrote about Sinti and Roma people, based on anthropometric and skull measurements. See Eva Justin, Lebensschicksale artfremd erzogener Zigeunerkinder und ihrer Nachkommen [Biographical destinies of Gypsy children and their offspring who were educated in a manner inappropriate for their species], doctoral dissertation, Friedrich-Wilhelms-Universität Berlin, 1943.

[30] “Figure Eight | The Essential High-Quality Data Annotation Platform,” Figure Eight, accessed August 28, 2019, https://www.figure-eight.com/.

[31] The authors made a backup of the ImageNet dataset prior to much of its deletion.

[32] Their “MegaPixels” project is here: https://megapixels.cc/.

[33] Jake Satisky, “A Duke Study Recorded Thousands of Students’ Faces. Now They’re Being Used All over the World,” The Chronicle, June 12, 2019, https://www.dukechronicle.com/article/2019/06/duke-university-facial-recognition-data-set-study-surveillance-video-students-china-uyghur.

[34] “2nd Unconstrained Face Detection and Open Set Recognition Challenge,” accessed August 28, 2019, https://vast.uccs.edu/Opensetface/; Russell Stewart, Brainwash Dataset (Stanford Digital Repository, 2015), https://purl.stanford.edu/sx925dc9385.

[35] Melissa Locker, “Microsoft, Duke, and Stanford Quietly Delete Databases with Millions of Faces,” Fast Company, June 6, 2019, https://www.fastcompany.com/90360490/ms-celeb-microsoft-deletes-10m-faces-from-face-database.

[36] Madhumita Murgia, “Who’s Using Your Face? The Ugly Truth about Facial Recognition,” Financial Times, April 19, 2019, https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e.

[37] Locker, “Microsoft, Duke, and Stanford Quietly Delete Databases.”

[38] Full video here: Amarjot Singh, Eye in the Sky: Real-Time Drone Surveillance System (DSS) for Violent Individuals Identification, 2018, https://www.youtube.com/watch?time_continue=1&v=zYypJPJipYc.

[39] Steven Melendez, “Watch This Drone Use AI to Spot Violence in Crowds from the Sky,” Fast Company, June 6, 2018, https://www.fastcompany.com/40581669/watch-this-drone-use-ai-to-spot-violence-from-the-sky.

[40] James Vincent, “Drones Taught to Spot Violent Behavior in Crowds Using AI,” The Verge, June 6, 2018, https://www.theverge.com/2018/6/6/17433482/ai-automated-surveillance-drones-spot-violent-behavior-crowds.

[41] Ibid.

[42] Gould, The Mismeasure of Man, 140.
