
Roots of Human Sociality

Wenner-Gren International Symposium Series

Series Editor: Leslie C. Aiello, President, Wenner-Gren Foundation for
Anthropological Research, New York.

Since its inception in 1941, the Wenner-Gren Foundation has convened
more than 125 international symposia on pressing issues in anthropology.
These symposia affirm the worth of anthropology and its capacity
to address the nature of humankind from a wide variety of perspectives.
Each symposium brings together participants from around the world,
representing different theoretical disciplines and traditions, for a week-
long engagement on a specific issue. The Wenner-Gren International
Symposium Series was initiated in 2000 to ensure the publication and
distribution of the results of the foundation’s International Symposium
Program.
Prior to this series, some landmark Wenner-Gren volumes include: Man’s
Role in Changing the Face of the Earth (1956), ed. William L. Thomas; Man
the Hunter (1968), eds Irv DeVore and Richard B. Lee; Cloth and Human
Experience (1989), eds Jane Schneider and Annette Weiner; and Tools,
Language and Cognition in Human Evolution (1993), eds Kathleen Gibson
and Tim Ingold. Reports on recent symposia and further information
can be found on the foundation’s website at www.wennergren.org.
Roots of Human Sociality
Culture, Cognition and Interaction

Edited by
N. J. Enfield and Stephen C. Levinson


First published 2006 by Berg Publishers
Published 2020 by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
605 Third Avenue, New York, NY 10017

Routledge is an imprint of the Taylor & Francis Group, an informa business


Copyright © Wenner-Gren Foundation for Anthropological Research 2006
All rights reserved. No part of this book may be reprinted or reproduced or utilised in any
form or by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying and recording, or in any information storage or retrieval system,
without permission in writing from the publishers.

Notice:
Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Roots of human sociality : culture, cognition and interaction / edited
by N.J. Enfield and Stephen C. Levinson. – English ed.
p. cm. (Wenner-Gren international symposium series)
Includes bibliographical references and index.
ISBN-13: 978-1-84520-394-8 (pbk.)
ISBN-13: 978-1-84520-393-1 (cloth)
1. Social interaction. 2. Human evolution. 3. Social evolution. I. Enfield,
N. J., 1966- II. Levinson, Stephen C.
HM1111.E64 2006
306.01–dc22
2006021669

British Library Cataloguing-in-Publication Data

A catalogue record for this book is available from the British Library.

ISBN13: 978-1-8452-0394-8 (pbk)

Typeset by JS Typesetting Ltd, Porthcawl, Mid Glamorgan


Contents

Acknowledgments

List of Figures

List of Contributors

Introduction: Human Sociality as a New Interdisciplinary Field
N. J. Enfield and Stephen C. Levinson

Part 1: Properties of Human Interaction

1 On the Human "Interaction Engine"
Stephen C. Levinson

2 Interaction: The Infrastructure for Social Institutions, the
Natural Ecological Niche for Language, and the Arena in
which Culture is Enacted
Emanuel A. Schegloff

3 Human Sociality as Mutual Orientation in a Rich
Interactive Environment: Multimodal Utterances and
Pointing in Aphasia
Charles Goodwin

4 Social Actions, Social Commitments
Herbert H. Clark

Part 2: Psychological Foundations

5 Infant Pointing at 12 Months: Communicative Goals,
Motives, and Social-Cognitive Abilities
Ulf Liszkowski

6 The Developmental Interdependence of Theory of Mind
and Language
Janet Wilde Astington

7 Constructing the Social Mind: Language and False-Belief
Understanding
Jennie E. Pyers

8 Sylvia's Recipe: The Role of Imitation and Pedagogy in the
Transmission of Cultural Knowledge
György Gergely and Gergely Csibra

Part 3: Culture and Sociality

9 The Thought that Counts: Interactional Consequences of
Variation in Cultural Theories of Meaning
Eve Danziger

10 Cultural Perspectives on Infant–Caregiver Interaction
Suzanne Gaskins

11 Joint Commitment and Common Ground in a Ritual
Event
William F. Hanks

12 Habits and Innovations: Designing Language for New,
Technologically Mediated Sociality
Elizabeth Keating

Part 4: Cognition in Interaction

13 Meeting Other Minds through Gesture: How Children Use
their Hands to Reinvent Language and Distribute
Cognition
Susan Goldin-Meadow

14 The Distributed Cognition Perspective on Human
Interaction
Edwin Hutchins

15 Social Consequences of Common Ground
N. J. Enfield

16 Why a Deep Understanding of Cultural Evolution is
Incompatible with Shallow Psychology
Dan Sperber

Part 5: Evolutionary Perspectives

17 Culture and the Evolution of the Human Social Instincts
R. Boyd and P. J. Richerson

18 Parsing Behavior: A Mundane Origin for an Extraordinary
Ability?
Richard W. Byrne

19 Why Don't Apes Point?
Michael Tomasello

Index

Acknowledgments

This book is an outcome of the 134th Symposium of the Wenner-Gren
Foundation for Anthropological Research, which convened
October 2–9, 2004, in Duck, NC. Wenner-Gren symposia are renowned
for their intensity, and this one was no exception. After presentation,
discussion, and the scrutiny of 25 pairs of eyes, the precirculated papers
emerge phoenix-like as chapters of this book. The chapters reflect only
indirectly the intense weeklong discussion, and we would therefore
like to take the opportunity to thank our distinguished discussants
who did much to goad, encourage, and meld the discussion in useful
directions: Maurice Bloch, Alessandro Duranti, Richard Fox, Jane Hill,
and Catherine Snow. We also gratefully acknowledge the participation
of Federico Rossano (who served as symposium monitor) and Paul
Kockelman (whose presentation appeared as Kockelman 2005). We
thank the contributors for their generosity in accepting our invitation to
the meeting, producing papers for precirculation, putting so much into
the meeting itself, and finally getting the revisions done in good time.
We have also benefited enormously from the considered comments on
part or all of the text from Dick Fox, Bob Arundale, and three incisive
anonymous reviewers.
The editors, as conveners of the meeting, would like to thank above
all two officers of the Wenner-Gren Foundation: Dick Fox and Laurie
Obbink. Dick Fox, then-president of the foundation, approached us in
late 2002 with the idea that a meeting of this kind might help to remedy
the near absence of any productive relationship between anthropology
and the cognitive sciences. Dick remained closely engaged throughout
the project, and this book stands as he retires as a timely reminder of
his broad vision for anthropology. Our second special debt is owed
to Laurie Obbink, who as organizer of countless such symposia has
a deep grasp of the art involved in making these meetings work as
brainstorming sessions so different from your average conference or
workshop. She was responsible for most of the groundwork, which
made the meeting run so flawlessly. On countless occasions, Laurie went
well beyond the call of duty, always with inimitable grace. In addition,
we thank Mary Beth Moss, also of the Wenner-Gren Foundation, for
providing further organizational assistance during the meeting. During
production of the book, the efficient and professional contributions of
Victoria Malkin at Wenner-Gren and Ken Bruce at Berg were invaluable.
Finally, for much help in preparing the manuscript, we are indebted
to Edith Sjoerdsma.

N. J. E. and S. C. L.

Reference
Kockelman, P. 2005. The semiotic stance. Semiotica 157(1–4):233–304.
List of Figures

I.1. Interlocking concepts developed in the different chapters.
ToM = Theory of Mind.
1.1. Kpémuwó, deaf home signer on Rossel Island, inventing a
way to communicate about abstract ideas concerning
sorcery.
1.2. Guugu Yimithirr person reference sequence.
1.3. Rossel Island name-avoidance sequence.
3.1. Requesting the gaze of a hearer.
3.2. Displaying slots and alternatives.
3.3. Decomposing a noun phrase.
3.4. Multimodal assessment.
3.5. Topic initiation.
3.6. Pointing toward a distant alternative.
3.7. A complex gesture sequence grounded in a nonpresent
space.
5.1. Pointing across trials: Mean proportion of trials in which
infants pointed at least once.
5.2. Point repetitions within trials: Mean number of points per
trial with at least one point.
5.3. Looking behavior across trials with a point: Mean number
of looks to E per trial with a point.
5.4. Schematic setup of Study 2 with barriers.
5.5. Still-frame showing a 12-month-old pointing for the
experimenter to one of two objects on the shelves
behind her.
5.6. Mean percent of trials with the first point to target or
distractor.
8.1. Selective imitation of the modeled "head action" by
14-month-olds in the "hands occupied" vs "hands free"
demonstration conditions.
11.1. DC (the shaman) and patient's wife (number 200).
11.2. DC's (the shaman) altar (number 465).
12.1. Old technology: text typing tool.
12.2. New technology: Face-to-machine with signs.
12.3. Rose signing BABY at chin height rather than waist
height.
12.4. Sign made is oriented to webcam on top of computer.
12.5. Rose signing O-K near webcam location.
12.6. The signer (right) has a mirror image view of his own
actions.
12.7. The top image shows the signer's sign space is not in
camera range (note difference between the two
participants).
12.8. Text messaging and signing.
12.9. Modeling individual aspects of a joint activity.
14.1. Navigation team on the bridge of a navy ship.
14.2. Enacting provisional lines of position. The bearing
recorder is completing his conversation turn while
plotter positions his hand to take the (gesture and talk)
floor.
14.3. Three lines of position fix the position of the ship
(represented by the triangle). The anticipated course
extends from the fix triangle to the estimated position,
EP (half-circle), where the ship is expected to be at the
time of the next fix.
14.4. The dashed lines indicate a poorly chosen pair of
landmarks for the next fix. The angles of intersection
among the LOPs should be open.
14.5. The trajectory of the bearing recorder's gesture is
complex.
14.6. The trajectory of the bearing recorder's gesture as it was
performed over the chart.
15.1. Joint attention on washing machine console.
15.2. Conversation among Lao speakers, lowland Laos.
Foreground Woman shifts back, having finished
preparations to chew betel nut (in basket, lower
foreground).
15.3. Background Woman moves forward, to reach in
direction of basket (lower foreground).
15.4. Foreground Woman passes basket to Background
Woman, inferring the goal of her reaching action.
15.5. Two men waiting for lunch to be served, lowland Laos.
Woman in kitchen (out of frame) is calling out "Please
turn off the water!"
15.6. Man gets up to turn off switch of electric water pump.
15.7. Kou (light shirt) has just arrived at his village home in a
pickup truck loaded with passengers, mostly children.
Saj (dark shirt), a neighbor, comes by to investigate.
16.1. Two kinds of link in SCCCs.
16.2. Simplified fragment of the CCCC of a folktale: The
content of several public narratives heard over time is
remembered as a single mental story and may be retold
as a public narrative, contributing to the cultural
distribution of the tale.
17.1. Suppose that there were a population of people who
were paired at random and play the stag hunt. The
average payoff of each strategy is plotted as a function of
the fraction of players who choose to hunt stag.
17.2. Suppose that there were a population of people who
were paired at random and play the prisoner's dilemma.
The average payoff of each strategy is plotted as a
function of the fraction of players who choose to
cooperate.
18.1. The sprawling, umbelliferous plant Peucedanum linderi
presents a challenge to eat.
18.2. Flow chart for a typical adult gorilla processing nettle
Laportea alatipes leaves.
18.3. A proposed evolutionary path of cognition: monkey to
human.

List of Contributors

Janet Wilde Astington, University of Toronto
R. Boyd, University of California, Los Angeles
Richard W. Byrne, University of St Andrews
Herbert H. Clark, Stanford University
Gergely Csibra, Birkbeck, University of London
Eve Danziger, University of Virginia, Charlottesville
N. J. Enfield, Max-Planck-Institute for Psycholinguistics, Nijmegen
Suzanne Gaskins, Northeastern Illinois University, Chicago
György Gergely, Hungarian Academy of Sciences, Budapest
Susan Goldin-Meadow, University of Chicago
Charles Goodwin, Applied Linguistics, University of California, Los
Angeles [UCLA] and The Cotsen Institute of Archaeology, UCLA
William F. Hanks, University of California, Berkeley
Edwin Hutchins, University of California, San Diego [La Jolla]
Elizabeth Keating, University of Texas at Austin
Stephen C. Levinson, Max-Planck-Institute for Psycholinguistics,
Nijmegen
Ulf Liszkowski, Max-Planck-Institute for Evolutionary Anthropology,
Leipzig
Jennie E. Pyers, Department of Psychology, Wellesley College
P. J. Richerson, University of California, Davis
Emanuel A. Schegloff, University of California, Los Angeles
Dan Sperber, Institut Jean Nicod, Centre National de la Recherche
Scientifique–École des Hautes Études en Sciences Sociales–École Normale
Supérieure, Paris
Michael Tomasello, Max-Planck-Institute for Evolutionary Anthropology,
Leipzig

Introduction: Human Sociality as a
New Interdisciplinary Field
N. J. Enfield and Stephen C. Levinson

At the heart of the uniquely human way of life is our peculiarly intense,
mentally mediated, and highly structured way of interacting with
one another. This rests on participation in a common mental world, a
world in which we have detailed expectations about each other’s behavior,
beliefs about what we share and do not share in the way of knowledge,
intentions, and motivations. That itself relies both on communication
(linguistic and otherwise) and on a level of cooperation unique in the
animal world. This mode of cooperative, mentally mediated interaction
enables the accumulation of cultural capital and historical emergence
of cultures. By inheriting a world of social organizations and values,
individuals are released from reinventing the wheel. In turn, cultural
capital shapes the style of interaction in local social groups, hiding
shared commonalities behind the veil of distinct languages, cultural
styles, and forms of social organization.
This, at least, is the thesis of this book. It brings together anthropologists,
linguists, psychologists, and sociologists whose work has not been
juxtaposed before. When we put the pieces of the jigsaw together,
what emerges is a new map of a still underexplored terrain—the roots or
foundations of human sociality.1 We propose that this is a new scientific
domain, a coherent subject for investigation constituted by intersecting
principles of different orders (ethological, psychological, sociological,
and cultural) that work together to produce an emergent system, a
system of human sociality and social interaction.
In this introductory chapter, we want to give readers a sense of how
the rest of the chapters fit together to form an outline of this domain.
We first sketch some contributing research traditions and the ways they
fit together. We go on to delineate the different phenomena that are the
focus of the individual chapters, drawing attention to the connections
that run through them. Finally, we sketch our own synthesis of the
domain.
The ideas in this book ramify and connect with one another in
multiple ways. Although no linear order of chapters could capture such
a network of connections, our division of the book into five parts aims
to emphasize certain linking themes. Part 1 consists of four chapters
focusing on central properties of face-to-face interaction, the arena in
which human sociality is centrally exercised. Part 2 focuses on psychological
foundations of human sociality, exploring the question of just
what it takes to pull off human interaction as we know it. Part 3 deals
with issues of culture and cultural difference, and the ways sociocultural
forces may play a role in structuring interaction and interactional
expectations, and vice versa. Part 4 explores ways in which cognition
is defined by its being exercised in social interaction, and how the
social exercising of cognition has effects both on our understanding of
the individual’s psychology (part 2) and of the higher levels of social
organization and broader cultural conventions (part 3). Part 5 features
phylogenetic perspectives, with two chapters asking how key features
of the human system for interaction could have evolved and a third
chapter comparing human social abilities with those of the other great
apes.

Distinctive Properties of Human Sociality


The focus of this book is the distinctive nature of human sociality, the
character of the social interaction that underpins social life. We do not
mean its mere complexity. Many animals, not just humans, have complex
social lives. Ants, for example, have hierarchies, complex divisions of
labor, advanced fungal agriculture, communication, organized transport,
colonization, and warfare. In this chemical society, the essential glue that
holds the vast ant communities together is pheromonal. Our inquiry
into the roots of human sociality asks about the nature of the special
kinds of social bonds that set humans apart. Our brand of sociality
distinguishes us even from our nearest relatives, the apes. We perhaps
share with the apes some basic social principles: flexible coalitions, out-
marriage, short-lived hierarchies. But these few commonalities are not
going to explain the divergences: human advanced agriculture, elaborate
communication systems, organized transport, planned colonization or
warfare, to mention a few antlike properties. Above all, the primate
background does not explain the extraordinary cultural variety of
human social organization, communication, and lifestyle. The entire
enterprise of ethnographic research has been dedicated to understanding
this diversity, and the rich details of cultural worlds that individuals
inhabit. Less attention, certainly by anthropologists, has been paid to
understanding the commonalities, the shared foundations in human
cognition, motivation, instinct and social interaction that make these
variations possible. Here, we know much less because standard social
inquiry trades on these commonalities (e.g., in participant-observation)
without examining them, being prompted mostly by the discovery of
difference. But there is a hidden raft of commonality that makes the
expression of difference possible in the first place.
Supported by uniquely human abilities, and responsive to context-
specific motivations and accumulated cultural conventions, human
social interaction exhibits striking properties not found elsewhere in
the animal world. It involves frequent, intense, and highly structured
interaction, using complex communication systems, on which the rest
of culture depends for its realization. Robust parallels across cultures
in the organization of everyday talk suggest an ethological foundation
to human interaction. But above all what makes human interaction
qualitatively distinct in the animal kingdom is that it is built on
intersubjectivity, enabling a brand of joint action that is truly open-ended
in goals and structure. This provides the building blocks for human
cultural diversity.
Uniquely human phenomena such as cooperation, commensality,
morality, and the inhibitions that underlie it, prolonged dependence of
offspring, capacity for intention attribution, planned deception, and the
highly structured nature of social interaction form an interdependent
network. The researcher may be positioned at any point on this network
and see human sociality as branching from that point. One might
say, for example, that cooperation is the key: It is cooperation that
makes morality essential; allows collective investment in offspring;
and lies behind the sheer interest in social interaction, the special
communicative abilities, and the cultural shaping of shared lifestyle.
Other starting points are possible. Different authors in this book start
from different corners of this network (see also Kockelman 2005), but,
crucially, they agree that human social life is intricately structured
through the attribution of actions, motives, intentions, and beliefs to
fellow interactants. (They do not always agree, however, as to the best
analysis of how such attribution is achieved; see below.)

Contributing Research Traditions

Ideas presented in this book relate historically to a number of major
strands of research that have hitherto remained largely unconnected.
One of our motivations as conveners of this project was to bring different
traditions together, and allow a common focus on the foundations of
human interaction to emerge. We outline four domains: Theory of
Mind (ToM), Gricean pragmatics, the analysis of talk and action in
interaction, and related developments in anthropology.

ToM and the Psychology of Human Interaction


In developmental psychology, a keen research interest has arisen in
the human ability to attribute knowledge, intentions and beliefs to
other humans, and to monitor these attributed inner states, using such
ongoing models to interpret actions and events.2 Some developmental
disorders such as autism can be understood as a failure to achieve this
level of understanding (Baron-Cohen 1995).
The development of an understanding of other minds in human
children takes a partly puzzling course. A comprehensive understanding
of others’ inner states, and especially that others might have false beliefs,
develops at around four years of age, surprisingly late in childhood (see
Astington for discussion and references). By comparison, normal infants’
mental mastery of the material world (naive physics) seems complete
by even one year of age, and the fundamental grammatical structures
of a language are well in place by three at the latest. Some theorists
suggest that children gradually have to construct a fully articulated set
of psychological skills for modeling and reasoning about others’ internal
states—that is, what others might want, think, feel, and know (or not
know, as the case may be). This set of skills is known as ToM (in the
sense of an actor’s “theory,” not of an analyst’s).
An apparently difficult and particularly late-developing component
of ToM is the ability to attribute false belief. For example, a child in
the course of development learns that whereas he or she knows that
chocolate in a chocolate box has been replaced by pencils, others may
nevertheless expect there to be chocolate there. Children under the
age of four do not show evidence of being capable of attributing false
beliefs like this. (These false-belief tasks are sometimes taken to define
ToM in a strict sense.3)
But here is the puzzle. It seems on first principles that without beliefs
about what others do or do not know one could not be a competent
interactant. How would you know what to tell me or what would require
pointing out? How and why would you even begin communicating
if you had no developed concept of other minds? From age one and
even earlier, children show much evidence of taking others’ beliefs and
intentions into account. Their development rests on it (see Gergely
and Csibra, Liszkowski, Tomasello). That children of 12 months use
pointing gestures to inform—for example, telling an adult the location
of something the adult is looking for—shows that they have the ability
not only to produce action oriented to another’s mental state (e.g.,
someone not knowing where something is) but to presuppose that the
action (e.g., a pointing gesture) will be recognized by another as having a
communicative intention. There is much already in place by age one.
Clearly, then, ToM is a matter of incremental mastery. Among the
issues this volume struggles with are what the components of a ToM
must be, what the incremental stages are, and at what point down
that incremental hierarchy we share elements of ToM with our nearest
cousins among primates. Judging from ape behavior, it seems harder to
understand beliefs than desires. Judging from human infants, it is harder
to understand second-order beliefs (John thinks that Mary believes
the chocolate is in the box) than first-order ones (Mary believes the
chocolate is in the box). Does effective joint action—a possible precursor
to culture—presuppose ToM? Is language crucial for discovering the full
potential of a ToM? Are there culture-specific practices that encourage,
or constrain, its development in childhood? With a better grasp of
these components, the stages by which they develop, and the ways
they are deployed in daily life, we may better understand both human
ontogeny and phylogeny.

Gricean Pragmatics

A second line of work guiding the debates in this book originates in
philosophy, specifically in H. P. Grice’s (1957) idea that meaning is
grounded in the recognition of intention. Seeing you fall over ahead
of me up a steep path, I am relieved to see you get up and wave in
my direction, taking your wave as designed to make me think you are
OK. The wave works because you have correctly calculated that I will
recognize the plan behind your action, namely getting me to recognize
that you intend me to think you are OK. In this example, the wave has
a nonce or one-off meaning recoverable against a background of your
figuring what I would figure when I see it.
Grice’s idea is important because it shows meaning, in a broad sense,
to be independent of language or convention. This points to possible
precursors to conventional meaning, in ontogeny, diachrony, and,
perhaps, phylogeny. On this account, meaning is not a property of
signs or symbols, but a property of minds in (mediated) interaction
with other minds. Conventional meanings can be thought of as arising
from repeated use of what were once novel signals. If I fall down and
likewise wave, we might set up a miniconvention that then spreads
through the community of hikers.
Another important aspect of this psychologizing of meaning is that it
allows us to analyze the unspoken communicative contents associated
with conventional symbols. For conventional meanings never exhaust
the import of what is said. The simplest utterance usually carries with
it a penumbra of intended but unspoken thoughts. (Consider What are
you doing tonight? which is likely to be forecasting an invitation, not
simply asking a question.)
The whole business of exchanging intentions in communication relies
on background assumptions that help to narrow the range of intention
attribution. Grice (1975) suggested that the essential background
assumption by which interactants constrain and guide their inferences
about speaker intentions is a principle of cooperation. (The principle,
comprising maxims of quality, quantity, relevance, and manner, has
since been updated in modern recastings such as Levinson 2000 and
Sperber and Wilson 1986.) Recipients of others’ signals work on the
assumption that such signals have been designed specifically for them
to extract the intended meaning. In turn, senders of such signals design
those signals in such a way as to take into account such an expectation of
targeted design on the part of hearers. By a principle of audience design
(or “recipient design”; Sacks and Schegloff 1979), any utterance should
have been formulated by a speaker with the intention that it cause just
the right effect in the receiver, taking into account the common ground
of the particular combination of speaker and addressee(s). For example,
in telling you something about my colleague John, I will first refer to
him in a way appropriate to your knowledge of him—for example, as
John if we commonly know him as John, but as, say, a colleague of mine
if I suppose you have never met him (Enfield and Stivers in press).
In sum, Gricean principles require the modeling of others’ inner
states, and thus presuppose a ToM. They also entail a stock of common
ground, readily provided by culture (e.g., that How do you do? is not
seeking information, that it is OK to strip down on the beach but not
on the street, or that sweet desserts come after savory main courses;
Enfield 2000; Levinson 1995). E. Goody (1995) suggests that the entire
structure of social roles in a society should be understood against this
background, providing systematic constraints on appropriate social
intentions and their ascription.

Microanalysis of Social Interaction

Detailed study of the systematics of social interaction in its own right was
initiated by a string of 20th-century mavericks including G. Bateson, R.
Birdwhistell, H. Garfinkel, and E. Goffman. The study of the systematics
of social interaction has since passed largely to conversation analysts
and other students of talk and action in interaction, with research
resulting in a detailed inventory of observed interactional practices and
patterns.4 Most of these practices can be characterized as sequences of
interlocking social actions (e.g., turns at talk) whose interpretations are
associated with specific (sometimes culture-specific) expectations and
preferences. The taking of turns at talk, the openings and closings of
conversations, the structure of request sequences, practices for correcting
or repairing utterances, and so on, have been carefully explored in
English-language conversation. There is also an increasing knowledge of
how these things work in other languages. Emerging from this research
are candidate universals for the organization of human interaction
(see Schegloff), such as the mechanism for transition of turns at talk
in informal conversation, and the ways in which interlocutors correct
and repair their own and others’ utterances—a crucial mechanism for
maintaining intersubjectivity. There is a strong expectation that such
structures should be universal, given that they are essential for preserving
order and agreement in moment by moment social experience. (See
Goffman 1981:14, for a list of such “system requirements and system
constraints,” including “framing capabilities,” Gricean principles, and
“nonparticipant constraints.”)
Conversation analysts try to avoid the psychological turn that characterizes
ToM research and Gricean pragmatics. They prefer to talk in terms of
actions as recognizable through the details of their observable structure
and their specific placement in sequences of action. But they are equally
interested in intersubjectivity, the way in which a shared understanding
is arrived at. Hence the special interest in “intercalibrative” mechanisms
like repair and audience design.5

Related Developments in Anthropology


Biological anthropology has entertained various solutions to its central
puzzle of the evolution of human cognition and language, conceived
in terms of causal adaptational pressures (e.g., Dunbar et al. 1999;
Richerson and Boyd 2004). Most if not all of these hypotheses have an
interactional flavor. Primatologists and comparative psychologists have
tried to pinpoint exactly what properties humans share with their nearest
primate relatives, and they have tried to probe to what extent deception,
and more generally ToM capabilities, extend across the higher primates
(Byrne and Whiten 1988; Sussman and Chapman 2004; Tomasello et al.
2005; Whiten and Byrne 1997). Just as the ability for planned deception
has been seen as a defining Rubicon in human evolution, so imitation
has been seen as an important threshold for the possibility of acquiring
culture. The cooperative nature of human interaction raises fundamental
evolutionary puzzles (Boyd and Richerson 2005, this volume; Henrich et
al. 2004). Cooperative instincts are unlikely to evolve and persist under
natural selection, which means there must be higher-level checks on
good intentions, a calculus of motives pushing a spiraling cognitive arms
race into an intensely intentional world. The complexity of the social
world that resulted may have become a selecting environment that put
further pressures on cognitive development and interactional skills.
In sociocultural anthropology, as in the social sciences generally,
the interactional foundations of social life have not been a central
focus of research. Indeed, the key method of participant-observation
presupposes the transparency of the interactional medium through
which research is done, like the entomologist who uses the microscope
as a tool rather than analyzing it. We trade on our “common humanity”
to do anthropological research, yet typically without documenting or
analyzing the mediating interactional interface. But if we ask how
interaction itself works as a system, and how through the specifics
of social interaction we can come to learn the things we know—both
as analysts and as participants—important empirical questions arise.
To what extent is the unexamined interactional system a constant
across sociocultural space and time? To what extent is cultural context
constructed by modulating specific interactional parameters? To what
extent can differences in the conduct of social interaction affect cognitive
or cultural categories?
Despite a relative neglect of the details of face-to-face interaction
in mainstream sociocultural anthropology, our theme has numerous
points of contact with several vital strands of ethnographic research.
Work examining cultural conceptions of the person as agent and
actor suggests that in some traditional societies there is an ideological
reluctance to attribute thoughts and intentions to others (Shore
1998). (However, work on divination and religion proposes that these
practices have bases in our interactional, intention-attributing instincts:
Boyer 1994, 2002; Goody 1995; Zeitlyn 1995.) Different ideologies of
intersubjectivity may be related to child-rearing practices, as explored in
literature on language socialization (see Gaskins; Schieffelin and Ochs
1986). A succession of frameworks has arisen from within linguistic
anthropology for thinking about constraints on verbal interaction in
(culture-)specific settings (Duranti 2001; Gumperz and Hymes 1986;
Hymes 1964), and about the invocation of such frames on the fly
through “contextualization cues” (Gumperz 1982). This wider cultural
context is captured in Geertz’s notion of “thick description” (Geertz
1973), tying the ideological and historical particularities of culture to
the conduct of everyday life. Not unrelated in spirit is an important
cross-fertilization between ethnographic research and the microanalysis
of interaction (e.g., Goffman 1963, 1964, 1974; Goodwin 1994, 2000;
Sidnell 2001, 2005), which connects through to the new “cognitive
ethnography” traditions that examine distributed cognition, the idea
that social institutions work through a cognitive division of labor
organized in situated interpersonal interaction (Hutchins 1995).
Notwithstanding these glimpses of insight from within social and
linguistic anthropology—many of which are drawn on in the chapters
of this book—the foundational nature of social interaction, in all its
detail, is yet to be properly recognized in the larger compass of social
anthropology.

Human Interaction as the Focus of a New Interdisciplinary Field

One aim of this book is to define and consolidate a new field of research,
a multidisciplinary approach to human interaction, its organization,
and its constitutive role in social life. The project asserts the centrality
of social interaction in the organization of human societies. Research
in multiple disciplines shows just how intricately organized human
interaction is, using multimodal channels of communication, building
on detailed presumptions and shared understandings, foreseeing courses
of action, and attuning to cultural settings. Underlying all this is a
specialized cognition, crucially involving intention attribution or “mind
reading” and the accumulation of shared understandings that makes
historical culture possible.
From the chapters of this book, there emerges a set of closely interlocked
concepts and lines of enquiry, which we sketch in Fig. I.1.
In the diagram, the boxes represent some of the crucial concepts
in the discussion. The arrows show the links emphasized by different
authors in this book. The diagram can stand as a mnemonic for the
complex arguments adduced throughout the book, demonstrating
interconnections between what at first seem rather different areas
of research. The density of connections indexes a clear domain of
integrated inquiry.

Figure I.1. Interlocking concepts developed in the different chapters. ToM =
Theory of Mind.

To see this, let us take a single question, and follow where it leads us
through the network of ideas and the chapters that make up this book.
How did our unique brand of sociality evolve, with its varied linguistic
and cultural environments? If we start with cooperation (box 1), we
land in the thick of it. As discussed in the Boyd and Richerson chapter,
the origin of cooperation is a deep puzzle in evolutionary theory. If this
was social, a sort of enforced amity or Hobbesian contract, the answer
might lie outside a theory of biological evolution. Yet the chapters by
Gergely and Csibra, Tomasello, and Liszkowski each report very early
cooperative acts by infants, suggesting the existence of cooperative
instincts. Boyd and Richerson argue that this can only be accounted for
by group selection (box 2). But this would rest on abilities to emulate
and imitate, the learning basis for building groupwide behavioral
patterns (box 3). Byrne shows, observing gorillas, that these cognitive
preconditions for cultural learning can have simple roots in behavior
parsing. Gergely shows that human infants further analyze actions for
their means-ends rationality, and will imitate “irrational” arbitrary (and
thus potentially cultural) actions only when rational analysis fails. This
presupposes the central ability to attribute intentions, or “mind read”
(Theory of Mind, box 4). As a number of the chapters show, elements
of ToM appear early enough in human development to suggest an
instinctual basis. ToM abilities underlie joint action (box 5, see Clark),
resting in part on cooperative instincts. Joint action must have been a
crucial factor in the increased fitness of the group, which would have
incrementally established biological foundations for ToM. Once our
ancestors had developed some ToM, they would have had a basis for
designing communicative actions in just such a way as to get others to
recognize their intentions (Gricean intentions, box 4). This would then
provide foundations for a rich kind of communication system without
parallel in the animal world (see Levinson). These Gricean intentions
rely for their recognition on keeping track of shared experience and
knowledge (box 6), as discussed by Enfield. The exploitation of common
ground depends on public signaling and display, which can exploit all
the expressive modalities (box 7; see Goldin-Meadow), and indeed the
props provided by the environment (see Goodwin, Hutchins). Even
simple output systems can have rich interpretations, as demonstrated
by Goodwin. This predicts that languagelike communication can arise
without the provision of a conventional language, and this is what
indeed happens in “home-sign” systems (see Goldin-Meadow). Once
our ancestors had evolved a languagelike system of any advanced
complexity, it would by a feedback relation have greatly amplified the
power of ToM (see Astington, Pyers).
So far, we have considered a set of interlocked properties that the
human individual brings to the task of conducting social interaction,
which plausibly evolved together. But these properties are deployed in
a highly structured system of interaction (box 8), with its rapid turn
transitions, repair systems, overall sequential structure, and so forth
(see Schegloff). These seem to have a universal, cross-cultural base,
and are thus equally characteristic of human sociality. In fact, there
are multiple connections between the interaction system and ToM.
For example, the rapid turn-taking system and the repair system are
the major guarantors of shared subjective understandings (Schegloff
1992). When what I say reveals a misunderstanding of what you
said, you get an immediate opportunity to correct it. An interaction
system based on shared understandings provides an environment for
distributed cognition (box 9), that is, for the distribution of cognitive
labor that underlies effective joint action, as described by Hutchins. Such
a system has emergent properties, shifting the burden of explanation
away from properties of the individual to the shared activity. We can
“read” other minds in part because human interaction is organized so
as to engender intersubjective understanding. This fits the evolutionary
theory sketched by Boyd and Richerson: We can only have evolved
cooperative instincts in an environment in which joint action could
endow a group with selective advantages.
Those selective advantages rest on the rapid adaptability of groups to
circumstance. This entails cultural diversity (box 10). Social interaction
has a distinctly different flavor across cultures, with the central media
of communication—human languages—showing striking variation.
Cultural learning in ontogeny (box 11), with its foundations in
imitation (box 3), allows the accumulation of specific cultural practice,
building the common ground that makes shared subjectivity possible.
Cultural diversity in ideology and practice may feed back into local
specializations of ToM and interaction practice (see Danziger, Gaskins,
Hanks, and Pyers, in particular). Cultural systems are subject to their
own evolutionary mechanisms (see Sperber) and to rapid change and
adaptation (see Keating), allowing social groups to respond rapidly to
new opportunities or challenges. This brings us full circle, back to the
group selection that could have favored cooperative instincts in the
first place.
What this exercise shows is that we will not obtain a good grasp of
the evolutionary background to our species, and the unique properties
of our social life, without understanding the links between these diverse
aspects of our psychological and behavioral makeup. They form a web of
interconnected properties that together constitute human sociality.

The Logic of this Book and its Organization


The chapters of this book explore human sociality from a range of
disciplinary perspectives. As the previous section will have made clear,
this means that organizing them into a linear order is a challenge. Our
aim in this section is to outline our solution (one of many possible),
and to weave the threads into a structured whole.
Part 1: Properties of Human Interaction
The most directly accessible manifestation of human sociality is face-to-
face interaction, unfolding in real time, in conversation or some other
type of sustained copresent engagement. What are its properties? The
chapters in part 1 deal with the organization of interaction, touching
on issues of copresence and engagement, sequence and intersubjectivity,
and coordination and commitment.

An Interaction Engine (Levinson) Levinson begins with a bird’s-eye view of
human behavior, arguing for a universal base to the species-specific way
in which humans interact with one another. The underlying principles
governing human interaction appear to be independent of specific
languages or specific cultures. Indeed, they continue to operate where
there is no shared language and culture. The language independence
of these interaction principles, along with their facilitatory effect on
language, suggests a phylogenetic priority of interaction principles over
language in the history of the species.
What are these interaction principles? Levinson suggests that one
can think of humans as being endowed with an interaction engine,
consisting of a raft of motivations, cooperative tendencies, multimodal
communication systems, and psychological endowments. A crucial
ingredient is the mental equipment for Gricean communication—that
is, the ability to recognize intentions based on signals whose formulation
has been designed such that just those intentions be recognized. This
motivates many of the properties of interaction, including the turn-
taking machinery of verbal interaction, which effectively requires
understandings to be immediately tested and displayed. Despite this
universal base, interaction patterns can vary dramatically across cultures,
as every traveler knows. Indeed, they must do so, because they are the
carriers of culture. So how then does one reconcile these differences
with a rich universal basis to interactional behavior? Levinson explores
cross-cultural differences in the naming of persons, for example,
under taboo restrictions, and shows how local cultural constraints can
interact with universal principles (Enfield and Stivers, in press). The
suggestion is that much cultural variation can be accounted for in terms
of tweaking the interaction engine and the generic principles governing
social interaction. The overall idea then is not that the interaction
engine produces cross-cultural uniformity but that it provides generic
constructional principles on which cultural diversity may be built, in
human interaction.
Generic Problems and Their Generic Solutions (Schegloff) Schegloff’s
chapter picks up on this theme, focusing on the strikingly flexible yet
precise organization of the sequential structures of human interaction.
Schegloff reviews a set of candidate generic solutions to the basic
problems of coordination and intersubjectivity in social interaction.
For example, everywhere in the world, as far as we know, informal
talk in conversation is organized using a precise and rapid turn-taking
system; it is subject to repair or correction in similar ways; it exhibits
paired utterances like questions and answers; it has recognizable
openings and closings. Rapid alternation of turns at talking allows
misunderstandings to become clear, and the turn-taking system is so
organized as to allow them to be dealt with as near as possible to where
they occur. A conversational repair system acts as the main guarantor
of intersubjective understanding (Schegloff 1992; Schegloff et al. 1977),
playing a crucial role in any kind of human interaction, regardless of the
semiotic system employed. Generic interaction mechanisms of the kind
reviewed by Schegloff are what make interaction without conventional
language possible, as in the home-sign systems described in Goldin-
Meadow’s and Levinson’s chapters, or the interaction with an aphasic
man described in Goodwin’s chapter.

The Local Richness of Social Interaction (Goodwin) Goodwin draws
attention to the semiotically rich environment of human sociality,
the intensive mutual copresence definitive of social interaction as we
know it. Interactants provide and access information simultaneously
from a great array of sources, including lexical items, grammatical
constructions, prosody, deployment of gaze, facial expression, bodily
comportment, and hand gesture. Even the “imperfections” of natural
speech such as “errors” and their repair carry important information,
and may be strategically managed. For example, a speaker may break off
his or her own speech before completion to secure the visual attention
(i.e., eye gaze) of another. The types of linguistic break-offs and restarts
that result (and that pepper normal speech) have the effect of making
explicit their internal syntactic structure, potentially providing an
account for syntactic learning (cf. the discussion of action parsing in
Byrne’s chapter).
By Goodwin’s account, an individual’s production of talk and action
is an intrinsically public, collaborative process. His case study of
interaction with Chil, a severely aphasic man, draws into stark relief the
kind of collaborative meaning making that is going on all the time in
“normal” conversation. Chil’s communication problems are overcome
in collaboration with his interlocutors, via common exploitation of
semiotic resources in the immediate environment. (These “resources”
include the other people in the interaction.) Despite having only three
words and some use of gesture, Chil is able to engage successfully in social
interaction. Again, as with home-sign systems used by deaf children
(see Goldin-Meadow, Levinson), the human interactive system affords
collaborative construction of meaning with very slender resources.

Collaboration and Commitment (Clark) Clark argues that the sort of
focused, sustained interaction described by Levinson, Schegloff, and
Goodwin presupposes individuals’ commitment to the interaction as
a collaborative activity. This is the kind of social commitment that
makes it difficult to get off the telephone without first getting into (and
through!) a closing sequence. Any kind of joint action requires mutual
commitment, and has to be coordinated in some way. Consider the
simple coordination of action involved in moving a table together. You
have to pick up one end, and I the other, at more or less the same time,
then I must move a bit, and you too, relating my speed to your speed.
We have to know where we are going, and mutually monitor potential
hazards like steps, and so on. Clark proposes that the commitments
required to succeed in joint action are hierarchically organized, and that
minor commitments (like “let’s lift the table now”) are subordinate to
higher-level ones (like “let’s get the table into the living room”). As his
review of the extraordinary Milgram Experiments demonstrates, making
a higher-level social commitment entails lower-level commitments
we might not have foreseen. This is because committing to the larger
activity means committing to its subcomponents. In turn, refusing
to commit to subcomponents can mean reneging on one’s existing
commitment to the entire activity. This leads us into the powerful
emotional and moral dimensions of social life.

Part 2: Psychological Foundations


The organization of social interaction and the commitments it entails
presuppose special psychological underpinnings. The ability to recognize
others’ states of mind, whether attentional or volitional, and to share
these states of mind through mutual focus in the ongoing course of
interaction, is indispensable for human sociality. Chapters in the second
part focus on the nature and development of the psychological basis
for social interaction.
Pointing, 1—The Ontogenetic Kernel of Human Sociality (Liszkowski)
Liszkowski reports on a series of experiments designed to explore
the bases of infant pointing (see also Tomasello). As in the chapters
by Goodwin, Levinson, and Goldin-Meadow, this gives us a sense of
how social interaction can work without (full) language capacities.
Prelinguistic infants make pointing gestures, but it has been unclear
whether they are doing languagelike communication with this, as
opposed to, say, spontaneously expressing their internal response to
an object or event, or simply trying to get attention. Liszkowski gives
evidence that 12-month-old infants point to communicate, taking into
account others’ goals and apparent knowledge states. If an experimenter
appears to be looking for something he had a moment ago, but which
has now gone out of his view, a one-year-old infant will point to it
as a way of telling the adult where it is. Further, if the experimenter
misunderstands an infant’s pointing gesture, the infant will try again.
This is a spectacular finding, because ToM literature standardly suggests
that the ability crucial to this account (i.e., knowing that the other does
not know something) is a much later achievement in development,
coming not at 12 months but at four years. In Liszkowski’s studies,
the child is clearly using pointing for informing, one of the main
motivations for communication. The children in these experiments
are not only informing the adult experimenters but helping them. This
is suggestive of early cooperative instincts (see Boyd and Richerson,
Tomasello), particularly as the helping uses of pointing are employed
here in interaction with people other than a main caretaker.

ToM, 1—The Suite of Capacities and the Role of Language (Astington) The
phenomenon of pointing takes us more centrally to how “mind reading”
may work. The foundations here are (1) having a grasp that others have
mental states and (2) recognizing that these may diverge from one’s own.
This must involve an awareness of one’s own kinds of mental states,
and arguably the ability to employ such mental states in explaining
the actions of others. Astington’s chapter reviews what is known about
children’s development of such a ToM. A number of researchers (see
Gergely and Csibra, Liszkowski, Tomasello) believe that human infants
first grasp the nature of the other as an intentional agent from about
nine months. But it is also widely accepted that a fully comprehensive
ToM, as indicated by false belief understanding, is slow to mature,
coming significantly after the full essentials of language are in place.
Astington argues that language plays a key role. She reviews three ways
in which this has been proposed in existing literature: knowledge and
use of mental state verbs with meanings like “want,” “think,” “know,”
and “believe”; knowledge and use of the complex syntactic structures
associated with these mental state predicates; and firsthand experience
of face-to-face conversation.
ToM, 2—Consequences of Language Deficit (Pyers) Pyers’s chapter narrows
in more closely on the relation between language and ToM, with a case
study of a Nicaraguan sign-language community. Pyers’s research reveals
startling evidence for the crucial role that language may play in acquiring
ToM capacities. In Nicaragua, a substantial Deaf population was only
in the last decades brought together into a socially networked speech
community, thanks to the establishment of educational institutions for
the Deaf. This has led to the growth of a new natural language known
as Nicaraguan Sign Language, a creole born of many smaller home-sign
or village-sign systems. The first generation of signers learned what
was effectively a pidgin with limited expressive power. In addition,
they were late learners of language in any form. By contrast, younger
signers of the following generation have had the benefit of exposure
to a developed sign language from a young age. Pyers reports that tests
for ToM capacities show the younger signers to have a significant edge.
The older signers do not master standard false belief tasks. This is prima
facie evidence that language plays a determining role in the acquisition
and application of ToM.
Imitation and Rational Learning (Gergely and Csibra) A developmental
perspective on the question of how we read intentions into the actions
of others is pursued by Gergely and Csibra. They investigate human
infants’ imitation of adults’ actions, finding that infants do not just
copy actions, but analyze the goal-directedness of others’ behavior and
look for the rationale behind the means chosen for carrying out an
action, doing selective imitation accordingly. Thus, if a woman with
her hands tied turns on a light with her head, an infant imitating this
action will turn on the light with his or her hand (Gergely et al. 2002).
This imitation achieves the same goal (getting the light to go on), but
does not reproduce the means. The child surmises that the adult would
have used her hands if she could have: that is, given that her hands were
not free, the woman’s unusual action of using her head is rational. But in
a different experimental condition, in which the adult has hands free
and yet turns on the light using her head, the infant will use his or her
head as well in imitating this. In this case, the woman could have used
her hands to do the action, but does not. The child extracts a different
rationale for the marked manner of action, surmising that it was this
unusual manner that was intended (i.e., here the adult chooses to use
the head and not the hands), thus being a defining and not merely
contingent part of the action the adult performed.
The possibility for rational learning of this kind is critical for the
acquisition of culture. Cultural actions have both rational means–ends
aspects (like collecting food and preparing it to eat) and nonrational,
culturally constrained aspects (like eating with a knife and fork rather
than with the fingers). Our children have to acquire both. In each case,
the process of acquisition involves intention attribution based on direct
observation of others’ actions (cf. the description of action parsing in
Byrne, and syntactic parsing in Goodwin). It happens that sometimes
part of the goal of an action is that the action be done in a specific
manner. This applies in the case of culturally stylized action.

Part 3: Culture and Sociality


A number of chapters discussed so far touch on culture. In part 1, both
Schegloff and Goodwin suggest that the nature and details of copresent
engagement have a direct bearing on the establishment of a common
worldview of interactants (see also Enfield). Levinson’s chapter lays out
the view that social interaction shapes, and is shaped by, local norms
and routines. In addition, as Goodwin and Hutchins point out, the
artifactual environment directly shapes our interactions. The cultural
shaping of the material world can thereby feed back into the nature and
organization of sociality. In part 2, Gergely and Csibra deal with one of
the ways in which such cultural identification is signaled, providing the
possibility for rational acquisition of locally specific (and “nonrational”)
manners of carrying out practical actions. The conventional manners
people actually learn are locally defined, and historically emergent.
Further, in part 5, as previewed below, the cooperative instincts of central
importance to Boyd and Richerson are linked to a drive to maintain
social identification with specific social groups, leading ultimately to the
development of distinct cultures. Selective advantages of being cultural
beings rest on the rapid adaptability of culture to circumstance, which
underlies the cultural diversity that characterizes our world.
The four chapters in part 3 explore the relation of culture to human
sociality, including implications of cultural variation.

Local Ideologies of Intention Attribution (Danziger) Danziger deals frontally
with a question of cultural ideology in the analysis and interpretation
of human sociality. Given that notions such as intention attribution
and ToM have been developed in Western society, to what extent do
these notions reflect a bias in our own cultural practices? It is not
known in what way or to what degree ToM has uniform cross-cultural
relevance. There may be cultures with distinctly different ideas about
the readability of others’ intentions. Danziger explores a reluctance to
attribute intentions beyond the literal content of what is said among
the Mopan, a Mayan Indian group of Belize. The Mopan hold that a
sincere statement that turns out to be false is a “lie,” whereas an
insincere statement that turns out by accident to be true is not
considered a “lie.” But they are quite able to pass false-belief tasks. It is not that
the Mopan lack ToM. Rather, they place cultural limits on the inferences
one may make from behavior, including speech. Danziger concludes
that there are profound consequences of Mopan cultural ideology about
the role of intention in meaning.

Cultural Variation in Caregiver–Child Interaction (Gaskins) Although
most research on child development is conducted in Western settings,
there are significant differences across cultures of the world in the ways
in which caregivers and infants interact. These imply differences in the
socialization of children into interaction itself. Gaskins points out a
theoretical tension between two major presumptions in consideration of
this issue. On the one hand, the essential outcomes of human cognitive
development are presumed to be universal (see Astington, Tomasello).
Children everywhere become competent adults. On the other hand,
the processes that lead to successful socialization are dependent on very
distinct kinds of input for learning in highly varied cultural settings.
Gaskins’s chapter offers a review and analysis of ethnographic research
on child–caregiver interaction and socialization, detailing extreme
variation in how infants are treated across cultures. The Western
tendency for adult caregivers to try to induce focused interaction, with
its attendant motherese, peek-a-boo routines and the like, appears to
be exceptional. In many other cultures adult caregivers attempt to
forestall needs and thereby preempt interaction. Sustained interaction
with eye contact is rare. In Western cultures the emphasis is on the
adult attempting to interpret the child’s communications, whereas
in many other cultures the onus is on the child to understand the
adult. These striking contrasts raise questions about the universality of
both the developmental process and its outcomes (see Pyers’s evidence
from Nicaraguan signers). Gaskins argues, however, that no matter
how dramatic cultural variation appears to be, it must be providing an
environment in which certain fundamentals of sociality can develop.
Some concrete possibilities are suggested in chapters exploring the
early development of cognitive abilities critical to language and social
intelligence (e.g., Gergely and Csibra, Liszkowski, Tomasello), including
practices of establishing and maintaining joint attention (e.g., by finger
pointing), sufficient to impart the ability for shared intentionality.
Integrating Multiple Frames and Participant Roles (Hanks) Socialization
not only gives rise to general abilities and local retoolings of these but
it brings a mountain of shared background for organizing and framing
interaction in culture-specific ways. Culture supplies rich resources for
participants to frame their engagements, and to adopt culturally relevant
participant roles. In turn, interlocutors have to be able to recognize the
specific frames and roles that are relevant to the interaction at hand.
Further, as Hanks explores in his chapter, there may be multiple such
frames and roles, as locally specified in a given cultural setting. This
poses an integration problem for interactants. In addition, in many
situations, interlocutors have to deal with distinct discrepancies in
knowledge (or what Hanks dubs uncommon ground). Hanks explores
these themes with reference to an extended example of shamanic
curing sessions in Yucatan. These sessions have elaborate structure,
with phases of conversational exchange between shaman and patient,
phases of prayer, and phases in which the shaman addresses the divining
crystals, discerning answers to his queries as if reading the minds of the
spirits. There is a layering of interchange within interchange (talk to
the spirits within talk to the patient), as well as a layering of cultural
institutions. Hanks’s chapter raises the challenge that all accounts of
social interaction need to face: How do we integrate our general “social
instincts” with specific cultural and often multilayered settings?
Evolution of Cultural Convention Through Interaction (Keating) Although
cultural traditions like Mayan shamanism can be stable over millennia,
they can also quickly evolve. Keating’s chapter describes rapid adaptation
of a conventional communication system (American Sign Language) to
the new technology of videophone connections on computer. She shows
that signers are fast establishing conventions from the new possibilities
offered by the medium of communication. For example, one can move
the hands forward for emphasis, placing them close to the camera such
that they take up more of the visual field. (This would not work in real
signing space, i.e., during face-to-face interaction.) Similarly, using the
immediate feedback from the monitor of one’s own signing as seen by
the interlocutor, one can exploit the collapse of the third dimension on
the screen (e.g., pointing left to empty space so that it looks as if one is
pointing to someone behind and to the left). Signing is hereby acquiring
a new genre, with conventions of its own in the making. This case study
shows rapid exploitation of new affordances offered by a change in the
technological environment. Keating’s observations dovetail with those
of Goodwin and Hutchins concerning the key role of environmental
affordances in human communication and cognition. It is this kind of
potential speed of change in public conventions, compared with the
glacial pace of genetic change, that gives both culture and the particular
form of human sociality their adaptive value from an evolutionary
point of view.

Part 4: Cognition in Interaction


Chapters in part 4 focus on cognition in interaction, and its consequences,
cross-cutting the key concerns of parts 1–3—the organization of social
interaction, its psychological underpinnings, and its sociocultural
context(s). These chapters examine ways in which cognition and
interaction not only interlock but can also be coconstitutive.
The interactional setting is a primary context for the externalization of
cognitive processes, where the relevant “cognitive artifacts” (Norman
1991) may include graphic devices, hand gestures, and the very
people with whom we are interacting (see Goodwin). Such artifactual
externalization of cognition can have both local and global effects, with
consequences for the course of interaction itself, and for what ends
up being shared among interactants as individuals in ongoing social
relationships, and as common members of entire cultural systems.

Making Thought Public, Without Language (Goldin-Meadow) Goldin-Meadow
explores how both symbols and thoughts emerge in interaction.
First, she describes a striking example of communication working
without conventional signs or symbols: the case of deaf children who
are not exposed to a systematic conventional sign language, but instead
construct a system of manual signs de novo (a so-called home-sign
system; see Levinson). Sometimes, nondeaf parents of deaf children
address them using spoken language only. In these cases, the child will
invent a sign system of his or her own. The system is used one way,
with the parent talking and gesturing back. These home-sign systems
have languagelike properties: they show arbitrary form-meaning
mappings; they are formally categorical; scenarios distant from the
here and now can be effectively described, and so forth. Such a system
fundamentally relies on intention attribution (see Astington, Gergely
and Csibra, Levinson), together with generic mechanisms for solving
generic problems of communication and intersubjectivity (e.g., repair of
nonunderstanding; see Goodwin, Schegloff). Goldin-Meadow’s research
shows how a species that had first evolved advanced interactional
intelligence could, provided some cooperative instincts were in place,
evolve a languagelike communication system. Here, many of the issues
dealt with in the present book come together: the multimodality of
social engagement, commitment and cooperation in social interaction,
ToM and intention attribution, and emergence of convention.
The second section of Goldin-Meadow’s chapter, focusing on
gestures accompanying speech, shows how these freely inventive signals
adumbrate “liminal” thoughts, allowing interactants to bring them into
consciousness. Focusing on teacher–child interactions in arithmetic, in
which children are still struggling to understand basic operations like
subtraction, she finds that children as yet unable to articulate or execute
solutions still betray a partial understanding in their gestures. Teachers
unconsciously pick up on this inarticulate revelation of dawning
comprehension, and can build their explanations on it. The hand
betrays the thought, for gestures are cognitive artifacts (Enfield 2005b;
see Hutchins), allowing communication in interaction to proceed where
conventional language fails (as it did with the deaf children).
Online Interaction and the Emergence of Structure (Hutchins) Hutchins
proposes another way in which the interactive system derives greater
power than its structural components alone can contribute. His argument
begins from a point emphasized in Goodwin’s chapter, that there is a
great deal of information publicly available in the environment of any
given interaction. Environmentally coupled social interaction gives rise
to a higher-level or emergent shared system of cognition. Social systems
exploit this potentiality by structuring social activities such that they
will have just these emergent effects. Hutchins argues that standard
assumptions about the bases of social interaction, including ToM and
intention attribution, overestimate what the individual brings to the
task while underestimating what the task brings to the individual. This
is amply illustrated in Hutchins’s well-known example of what it takes
to navigate a battleship into harbor (Hutchins 1995). As he explicates
in his chapter, the navigation team on the bridge combine words and
gestures with a map representing their path and position, so deciding on
the bearings to use in navigating the massive vessel’s course. The rest of
the calculations are automated as it were through the highly structured
division of labor of the team and their instruments. Hutchins’s point
is that the entire overarching intelligence of the joint action cannot be
attributed to any single individual. It is not represented in any single
place but is emergent in the interactive activity. Hutchins suggests that
the key to understanding human intelligence (including ToM) and
its phylogeny (see part 5) is to see that higher-order cognition is first
instantiated in joint activity. It thereby provides a selective environment
for cognition about other minds, hence the development of ToM
abilities.
Building and Exploiting Common Ground (Enfield) The possibility of rich
interaction given scant semiotic resources, as described in chapters by
Goldin-Meadow, Levinson, and Goodwin, is due in great part to the
presence of a massive inventory of common ground, both cultural and
personal (Clark 1996). Common ground, or mutual knowledge shared
by social associates (whether based on common experience or common
cultural background), provides premises for ampliative inference (Goody
1995; Levinson 1995). Communication constantly exploits common
ground, partly to overcome the communication bottleneck entailed by
the slowness of speech (Levinson 2000). Enfield explores the notion
that common ground is strategically exploited not only in the service
of economy of expression, but for affiliative display of social closeness.
Enfield suggests that because common ground is so crucial to effective
communication, and to the display of affiliation, people go out of their
way to augment it, as when a mother points out new things to her child
yet without obvious or immediate purpose for doing so.

From Individual Interactions via Cognition to Entire Cultural Systems
(Sperber) The chapters reviewed so far enable us to assemble a range of
components of human sociality: its observable structures, its cognitive
underpinnings, its cultural bases, and its role in the coordination of
human cognition and activity. How are we to think about the link
between cultural diversity and the presumably universal cognitive
and ethological foundations of human sociality? A number of authors
wrestle with this question, especially Gaskins, Levinson, Schegloff,
and Astington. Sperber’s chapter offers us a sustained theoretical
panorama.
He develops the idea of the Cognitive Causal Chain (CCC), a
causal sequence that includes at least one cognitive representation.
A perception is a causal relation between a thing in the world and a
mental representation; an inference is a causal relation between two
representations; an action is a causal relation between an intention and
the behavior that attempts to realize it. In social interaction the output
of one individual’s CCC is the input to another’s, and in such cases we
can talk about social CCCs. Great chains of social CCCs are possible,
ultimately passing effects across whole populations. When these CCCs
have the function of preserving either behavioral form or mental
content or both (as in a song), they become cultural CCCs. This all
leads toward a model of the distribution of cultural forms and meanings
as if they were, say, viruses in a population—that is to say, subject to
the mechanisms of evolution of traits in a population (Enfield 2003,
2005a; Sperber 1985, 1996). This suggests a Darwinian model for cultural
evolution (Levinson and Jaisson 2006; Richerson and Boyd 2004). The
relation between cognitive universals, provided by the organism, and
the variability of cultural forms is simple enough: Cognition provides
the essential filter on what can be easily transmitted through a CCC.
To get feedback from CCCs to the cognitive system requires a further
kind of evolutionary mechanism, which leads us to the chapters in the
final part, focusing on the phylogeny of human sociality.

Part 5: Evolutionary Perspectives


The chapters in parts 1–4 establish defining properties of human social
interaction, including sustained coattentional engagement, common
commitment to cooperative activity, and attribution of communicative
intentions to others. These properties are not shared by even our closest
relatives among the apes. What are the critical differences? How could
they have evolved?

Cooperative Instincts and Group Selection (Boyd and Richerson) The mutual
commitment characteristic of human interaction (see Clark, Goodwin)
points to a classic puzzle in evolutionary theory: the riddle of human
cooperative behavior. Why are people so highly cooperative, when,
for an individual, it should always pay to take the benefits of others’
cooperative acts without reciprocating? The answer supplied in Boyd
and Richerson’s chapter is that cooperative behavior is instinctual. (This
is supported by work presented in a number of other chapters: Gergely
and Csibra, Liszkowski, and Tomasello report cooperative acts by infants
of around one year of age.) Boyd and Richerson discuss experimental
findings that adults, from societies of different kinds around the world,
do not maximize their own gains but, instead, feel an obligation to share
hidden benefits (Henrich et al. 2004). If our brand of cooperation is a
species-specific instinct, we then face the evolutionary puzzles: What
would have been the selective advantage of cooperative sociality for
the individual? How did the mechanisms that drive it develop?
Boyd and Richerson argue that group selection provides an account
for the evolution of human cooperative instincts. Group selection is an
unusual mechanism for evolutionary change, in which behavior shared
by a group, rather than by an individual or his or her immediate kin, gives
the entire group advantages over other groups. Because of its marginal
status as an evolutionary mechanism, group selection presupposes
earlier cultural adaptations that would have given sufficient adaptive
advantage to the group as a whole as well as behaviors that signal and
maintain boundaries between groups. Thus, the cognitive prerequisites
for cultural learning (see Byrne, Gergely and Csibra, Tomasello) would
have been essential for the evolution of cooperative instincts.

Evolution of Action Parsing and Intention Attribution (Byrne) Many of
the chapters in this book detail the structure and nature of interaction,
showing that humans depend in interaction on the ability to segment
and interpret complex and sustained sequences of action, to recognize
routines within them, and to see the intentions behind them. How
could such skills have developed in our species? Cultural learning clearly
involves the ability to learn from watching others’ behavior. But where
this behavior is of any complexity, some kind of parsing analysis is
required. Byrne describes how some groups of gorillas share techniques
for nettle stripping that are transmitted by cultural learning. He proposes
that a simple statistical and structural analysis of observed behavior
allows a novice not only to extract the essentials of the technique (see the
parallel account for syntactic parsing of speech in Goodwin’s chapter),
but to grasp its goal-oriented nature (cf. Gergely and Csibra’s discussion
of rational imitation). This provides a glimpse into the phylogenetic
precursors of intention attribution, human imitation, and learning. It
also has implications for our understanding of intention attribution in
modern human social interaction.
Like a number of other contributors (see Goodwin, Hutchins, Schegloff),
Byrne cautions against overestimating the degree to which people
explicitly model others’ mental states in interaction. The explicit
mentalism implied by much ToM research can be minimized by
behavior-based, statistical means for interpretation of others’ action.

Pointing, 2—The Phylogenetic Kernel of Human Sociality (Tomasello) Several
contributors discuss the importance of pointing in human interaction.
Goodwin, for example, describes the critical role of pointing in
communicating when language is unavailable (see also Goodwin 2003).
Tomasello’s chapter puts the theme in a phylogenetic perspective. He
starts from the observation that apes, our nearest relatives, not only lack
language but they do not point or comprehend pointing. (This claim
has been contested—De Waal 2001; Veà and Sabater-Pi 1998—but as
Tomasello points out, the reported empirical observations have not been
replicated; cf. Povinelli et al. 2003.) Although apes do monitor others’
eye gaze, and seem to understand that others might see what they
cannot see, they do not seem to grasp the idea that an interactant might
be trying to get them to shift their attention. Underlying this failure is
the absence of efforts to establish joint attention, and the absence of
complex collaborative action. Experiments by Tomasello and colleagues
show, by contrast, that human infants of 14 months systematically
distinguish between what an adult has already seen and what is new for
that adult. Tomasello argues that what distinguishes humans from other
apes are instincts for helping and sharing, manifest in collaborative
interactions based on “shared intentionality” (i.e., joint intentions and
joint attention). These instincts are manifest in the humble pointing
gesture, which, despite being well under control by a one-year-old
human (see Liszkowski), is never convincingly comprehended by any
other great ape. It is this gesture, Tomasello submits, that provides a
foundation for the evolution and acquisition of language, culture, and
the full richness of human sociality.

Concluding Remarks: Toward a Synthesis


A Framework for Integrating the Different Levels of Phenomena in the Domain
Here, we propose a synthesis of the ideas aired in this book, to show how
the contributing concepts, which relate to distinct levels of phenomena,
fit together to yield an integrated perspective on human sociality. The
framework helps us see the essential roles different disciplines play in the
study of this domain, and how they might better inform one another
in future work. We distinguish three distinct levels of phenomena:
1. Interaction Engine (individual level): The individual brings to
interaction an “interaction engine,” consisting of ToM abilities and
communicative capacities built on them, biological constraints, and
ethological proclivities (as outlined in Levinson and in part 2 of this
book). Crucial elements of this include the ability to recognize others’
intentions through modeling the minds of others in real contexts
(and to anticipate their modeling of our anticipation of their intention
attribution!). These elements together form the essential equipment
for formulating and interpreting actions in an interactional setting.
We think it likely that the foundations of the engine are biologically
endowed, or at least unfold in human development in comparable ways
given local parallels in interactional organization. But such development
depends on an interaction matrix, so that the engine may be fine-tuned
to a local cultural frame.
2. Interaction Matrix (interpersonal level): The “interaction matrix”
in which the interaction engine is deployed has special and peculiar
emergent properties, potentially accounting for the universality of
its inventories of turn-taking systems, repair mechanisms, sequential
organizations, and the like (issues explored largely in parts 1 and 4). An
interaction is a sequential, contingent structure in which what happens
next is as much determined by other parties as by oneself. There
need not be any particular prearranged plan or direction (as in casual
conversation). As yet, there is no adequate formal theory of this kind of
contingent interaction with shifting goals (despite game theory being
able to capture situations in which goals are zero-sum or fully shared).
The most complex properties of human interaction are emergent.
Consider a soccer team working together. The overall flow of movement
of the ball stems from the individual players’ movements and local
intentions, but the entire pattern cannot be coherently reduced to any
one player’s individual intentions, tacit understandings, or actions.
The emerging pattern depends on actual outcomes, overall sequence
and timing.
3. Sociocultural Frame (social-cultural level): The interaction matrix
provides the building blocks of social organization and its constituent
institutions, which constrain interaction within specific cultural frames
(focal in part 3). These are the frames in which the business of society is
conducted, whether they are legal hearings, gossip on the street corner,
or infant–caretaker settings. Social institutions are often robust, with
deep histories, and their fortunes are subject to patterns of cultural
evolution on a time scale different from the ephemeral interactions
that nevertheless instantiate them. This is the level to which the bulk
of ethnographic description and analysis has been devoted.

Consequences of this Framework


Within this framework, we can restate a number of interesting propositions
arising out of the work summarized in this book. Consider the
following points.
On Human Phylogeny A key to human evolution lies in understanding
the relation between phenomena of these three different scales
and ontological types (central issues in part 5). The interaction engine
is adapted to the interaction matrix, for the engine’s function is to
conduct mentally mediated interaction at the interpersonal level. The
interaction matrix of our forebears was the selecting environment for the
biological and ethological roots of the interaction engine—for example
ToM, or the foundations of language. In turn, the interaction matrix is
built out of the raw potential that the engine supplies. Limits to, say,
speed or complexity of communication are inherited from the engine’s
properties. Again, the interaction matrix is adapted to conducting the
business of higher-order social organizations. Among the properties
of these higher-order units are those that endowed groups with the
adaptive cultural edge over other groups, allowing group selection to
play a role in human evolution.

On Language Language plays a central role in human social life, as
suggested by its ubiquity, dominance, and elaboration. But the work
assembled in this book suggests that language itself rests on other
abilities that are ontogenetically, phylogenetically, and logically prior:
in particular, the ability to attribute action, meaning, and intention in
structured sequences of interaction. Communication is possible without
fully fledged language (as in home-sign systems or in interaction with
infants), operating on a basis of reflexive or anticipatory intention
attribution, which is always at work even in the use of languages
with full expressive power. Thus, the evolutionary basis for language
must be sought in the mutual adaptation of the interaction engine
to the interaction matrix, and to the sociocultural level in which it is
embedded. To be sure, the combination of the interactional engine
and a full preconventionalized symbolic system like a language yields a
quantum leap in expressive and computational power in the interactional
domain. The structured representational system of language also appears
to retool the ToM (see Astington, Pyers), allowing richer and more
complex inferences. But in the end, it seems, although language is
transformative of our cognitive and interactional powers, it rests on a more
fundamental cognitive specialization that appears earlier in human
phylogeny (and ontogeny; see below).

On Cultural Evolution Cultural diversity arises from the relative success
of particular institutions in local contexts, together with random effects
like drift (see Sperber). Particular cultural patterning of interaction reflects
feedback to the interaction matrix from specific forms of organization
in the sociocultural frame, and then to the local tuning in ontogeny
of the interaction engines of individuals. For example, in a society like
Java with social hierarchy and courtly traditions, decorum will specify
the proper deployment of body position, gesture, and honorific levels
in language, behavior inculcated during child development.
On Human Ontogeny The phylogenetic and historical perspectives are
complemented by an ontogenetic perspective. The interaction engine is
not literally delivered with the infant at birth, although core biological
and ethological constituents certainly are. The engine has to unfold
through experience in the interaction matrix, which will cause the
developing child’s interaction engine to inherit cultural specializations.
Gaskins’s catalogue of cultural differences in child rearing suggests,
however, that the initial ingredients are robust enough to give us
universal outcomes regardless of experience (as in the cross-cultural
parallels in Goldin-Meadow’s home-signing children). We suggest that
culture can reach deep down into the details of interaction, but only
by modulating tendencies that are universal or default.

Conclusion
The kind of synthesis we propose offers a closer integration of the
contributing research traditions. So psychological approaches will
benefit from expertise at the level of the interaction matrix. For example,
work on infant pointing gestures (see Liszkowski, Tomasello)
should be alert to the sequential contexts in which they occur and
on which their interpretation may crucially depend. Conversely, work
on the interaction matrix will be enriched by understanding what is
(psychologically) under the hood. Observational work on sequences of
interaction has revealed many kinds of contingencies between actions
in interaction (e.g., question–answer sequences or greetings), but we do
not know how some of these implicit classifications (e.g., of an action
as an X or a Y) are achieved online. We know little about the sources
and development in infancy of skills in navigating finely timed
and contingent interactional sequences such as conversational turn
taking. Do such skills have an instinctual basis, or are they built during
development on a more primitive instinctual testing of contingencies
in the physical world? We know that interactants are highly sensitive
to others’ mental states, but we do not know how these registers of
information for potential interlocutors are constructed or assessed—
experimental techniques will be critical here.
At another level, that of the sociocultural frame, the interaction
matrix offers insights into how cultural events and processes are
actually constructed. Slight modifications of a universal generic base
for conversational organization can yield all sorts of specific speech
events. For example, restricting interchanges to questions and answers
can give us a basis for courtroom interrogation or classroom teaching—
further assigning rights to question, and the role of overhearers, can
help us distinguish the conduct of the two cultural event types. Tracing
further back, if we know the psychological or developmental sources of
those universal tendencies, we might understand universal constraints
on social organization. Conversely, the analysis of social organization
can inform the conduct of interaction in myriad ways, helping us
understand background assumptions operative within specific events,
the choice of language and social role, and the like.
This raises an apparent tension in this volume between those who
emphasize the individual’s psychological abilities and those who focus
on the emergent properties of the interaction matrix, or the way in
which social interaction is adapted to local sociocultural organization.
We do not regard this as simply border warfare, with rival definitions
of Durkheim’s “psychological” versus “social facts.” Rather, it reflects a
disagreement about the primacy of one or other of the three levels—the
individual, the interactional, and the sociocultural. When A asks B a
question, and B answers it, is this because B discerns A’s intentions (a
psychological level of explanation)? Is it because B follows the rules of the
language game (an interactional level of explanation)? Or is it because
B recognizes that A is endowed with the social rights and authority to
ask that kind of question in the current situation (a sociocultural level
of explanation)? Different researchers rightly test the power of their
own lines of explanation by pushing the limits, and they are likely to
favor one or another level of explanation. This area of research is young
enough that there is no consensus about which level should bear the
major burden of explanation for specific phenomena. Thus, although
there are substantive concerns raised in some of the chapters regarding
the applicability of terms and concepts like “intention,” “action,” and
even “cognition,” true reduction to just one level or another is not
going to work: the levels have independent properties but are also
mutually interdependent. The interpersonally emergent interaction
matrix would not be possible without the individually seated interaction
engine, but it is not “generated” by it. The interaction matrix has higher-
order emergent properties, reflected in the way that local outcomes are
contingent on the actions and responses of all the players. Likewise,
although social institutions are realized through interaction, they have
long-term historical roots and interdependence with other aspects of
culture that require an independent level of analysis. For these reasons,
this will remain an interdisciplinary domain of inquiry, requiring input
from disciplines with insights special to the different levels that make it
up. And the contributors to this project will need to learn each other’s
languages if we are going to make real progress.
We thus bring to a close our preview of the range of ideas on human
sociality put forth in the chapters of this book. We hope the volume
does much to spur cross-border commerce between the different fields.
If this can be promoted, we believe that the field of social interaction
research will rightly come to be central in the human sciences, opening
fundamental insights into what kind of a beast we are, and how we
came to have our own uniquely complex form of sociality.

Notes
1. The term sociality is used with a narrower meaning than ours by Henrich
et al. (2004), to refer to cooperative and altruistic instincts, which “deviate
from an axiom of selfishness.” Sussman and Chapman (2004) use the term in
a related way, to refer to the orientation of individuals to group living.
Given that “group-living individuals must forgo some of their individual
freedoms in order to socialize within the ‘group,’ ” Sussman and Chapman’s
sense of “sociality” refers to “the compromises that individuals make, the
mechanisms they use, and the means by which they maintain these social
groups” (Sussman and Chapman 2004:10). Our sense of sociality includes these
features among a broader complex of psychological and social predispositions,
principles of interactional organization, and specific interactional practices.
2. Key references include Premack and Woodruff (1978), Byrne and Whiten
(1988), Astington et al. (1988), Davies and Stone (1995a, 1995b), Whiten and
Byrne (1997), and Carruthers and Smith (1996), among many others.
3. Our use of the term Theory of Mind refers more generally to the full
ensemble of “mind-reading” skills of which false-belief understanding is a
single and late-developing component.
4. Key references include Sacks (1992), Sudnow (1972), Sacks et al. (1974),
Goodwin (1981), Atkinson and Heritage (1984), Button and Lee (1987),
Schegloff (in press), among many others.
5. Interaction analysts have also invested effort in understanding the use
of gesture, gaze, and body position in social interaction (Goodwin 1981;
Schegloff 1984; see also Goodwin, Hutchins). (Psychologists, too, have been
especially interested in gesture; see Goldin-Meadow, Liszkowski, Tomasello.)
These studies underline the multimodal nature of human communication.
Again, there are clear universal tendencies here. For example, in all cultures,
as far as we know, people gesture when they talk, although the exact nature of
gesture, gaze, and body position is very much culturally constrained.

References
Astington, J. W., P. L. Harris, and D. R. Olson (eds.). 1988. Developing
Theories of Mind. Cambridge: Cambridge University Press.
Atkinson, J. M., and J. Heritage (eds.). 1984. Structures of social action:
Studies in conversation analysis. Cambridge: Cambridge University
Press.
Baron-Cohen, S. 1995. Mindblindness: An essay on autism and Theory of
Mind. Cambridge, MA: MIT Press.
Boyd, R., and P. J. Richerson. 2005. The origin and evolution of cultures.
New York: Oxford University Press.
Boyer, P. 1994. The naturalness of religious ideas: A cognitive theory of
religion. Berkeley: University of California Press.
——. 2002. Religion explained: The human instincts that fashion gods,
spirits, and ancestors. London: Vintage.
Button, G., and J. R. E. Lee (eds.). 1987. Talk and social organization.
Clevedon, UK: Multilingual Matters.
Byrne, R. W., and A. Whiten (eds.). 1988. Machiavellian intelligence: Social
expertise and the evolution of intellect in monkeys, apes, and humans.
Oxford: Clarendon Press.
Carruthers, P., and P. K. Smith (eds.). 1996. Theories of Theories of Mind.
Cambridge: Cambridge University Press.
Clark, H. 1996. Using language. Cambridge: Cambridge University
Press.
Davies, M., and T. Stone (eds.). 1995a. Folk psychology. Oxford:
Blackwell.
——, (eds.). 1995b. Mental simulation. Oxford: Blackwell.
De Waal, F. 2001. Pointing primates: Sharing knowledge without
language. Chronicle of Higher Education, January 19: B7–B9.
Dunbar, R., C. Knight, and C. Power (eds.). 1999. The evolution of culture.
New Brunswick, NJ: Rutgers University Press.
Duranti, A. (ed.). 2001. Linguistic anthropology: A reader. Malden, MA:
Blackwell.
Enfield, N. J. 2000. The theory of cultural logic: How individuals
combine social intelligence with semiotics to create and maintain
cultural meaning. Cultural Dynamics 12(1):35–64.
——. 2003. Linguistic epidemiology: Semantics and grammar of language
contact in mainland Southeast Asia. London: Routledge.
——. 2005a. Areal linguistics and mainland Southeast Asia. Annual
Review of Anthropology 34:181–206.
——. 2005b. The body as a cognitive artifact in kinship representations:
Hand gesture diagrams by speakers of Lao. Current Anthropology
41(6):51–81.
Enfield, N. J., and T. Stivers (eds.). In press. Person reference in interaction:
Linguistic, cultural, and social perspectives. Cambridge: Cambridge
University Press.
Geertz, C. 1973. The interpretation of cultures. New York: Basic Books.
Gergely, G., H. Bekkering, and I. Király. 2002. Rational imitation in
preverbal infants. Nature 415(6873):755.
Goffman, E. 1963. Behaviour in public places: Notes on the social organization
of gatherings. New York: Free Press.
——. 1964. The neglected situation. American Anthropologist 66(6):133–
36.
——. 1974. Frame analysis: An essay on the organization of experience.
Boston: Northeastern University Press.
——. 1981. Forms of talk. Philadelphia: University of Pennsylvania
Press.
Goodwin, C. 1981. Interactional organization: Interaction between speakers
and hearers. New York: Academic Press.
——. 1994. Professional vision. American Anthropologist 96(3):606–
633.
——. 2000. Action and embodiment within situated human interaction.
Journal of Pragmatics 32:1489–1522.
——. 2003. Pointing as situated practice. In Pointing: Where language,
culture, and cognition meet, edited by S. Kita, 217–242. Mahwah, NJ:
Erlbaum.
Goody, E. N. (ed.). 1995. Social intelligence and interaction: Expressions
and implications of the social bias in human intelligence. Cambridge:
Cambridge University Press.
Grice, H. P. 1957. Meaning. Philosophical Review 67:377–388.
——. 1975. Logic and conversation. In Speech Acts, edited by P. Cole
and J. L. Morgan, 41–58. New York: Academic Press.
Gumperz, J. J. 1982. Discourse strategies. Cambridge: Cambridge
University Press.
Gumperz, J. J., and D. Hymes (eds.). 1986[1972]. Directions in sociolinguistics:
The ethnography of communication. London: Blackwell.
Henrich, J., R. Boyd, S. Bowles, C. Camerer, E. Fehr, and H. Gintis
(eds.). 2004. Foundations of human sociality: Economic experiments and
ethnographic evidence from fifteen small-scale societies. Oxford: Oxford
University Press.
Hutchins, E. 1995. Cognition in the wild. Cambridge, MA: MIT Press.
Hymes, D. H. (ed.). 1964. Language in culture and society: A reader in
linguistics and anthropology. New York: Harper and Row.
Kockelman, P. 2005. The semiotic stance. Semiotica 157(1–4):233–
304.
Levinson, S. C. 1995. Interactional biases in human thinking. In Social
intelligence and interaction: Expressions and implications of the social
bias in human intelligence, edited by E. Goody, 221–260. Cambridge:
Cambridge University Press.
——. 2000. Presumptive meanings. Cambridge, MA: MIT Press.
Levinson, S. C., and P. Jaisson (eds.). 2006. Evolution and culture.
Cambridge, MA: MIT Press.
Norman, D. A. 1991. Cognitive Artifacts. In Designing interaction:
Psychology at the human-computer interface, edited by J. M. Carroll,
17–38. Cambridge: Cambridge University Press.
Povinelli, D. J., J. M. Bering, and S. Giambrone. 2003. Chimpanzees’
“pointing”: Another error of the argument by analogy? In Pointing:
Where language, culture, and cognition meet, edited by S. Kita, 35–68.
Mahwah, NJ: Erlbaum.
Premack, D., and G. Woodruff. 1978. Does the chimpanzee have a
Theory of Mind? Behavioral and Brain Sciences 1:515–526.
Richerson, P. J., and R. Boyd. 2004. Not by genes alone: How culture
transformed human evolution. Chicago: University of Chicago Press.
Sacks, H. 1992. Lectures on Conversation. London: Blackwell.
Sacks, H., and E. A. Schegloff. 1979. Two preferences in the organization
of reference to persons in conversation and their interaction. In
Everyday language: Studies in ethnomethodology, edited by G. Psathas,
15–21. New York: Irvington.
Sacks, H., E. A. Schegloff, and G. Jefferson. 1974. A simplest systematics
for the organization of turn-taking for conversation. Language
50(4):696–735.
Schegloff, E. A. 1984. On some gestures’ relation to talk. In Structures of
social action: Studies in conversation analysis, edited by J. M. Atkinson
and J. Heritage, 266–296. Cambridge: Cambridge University Press.
——. 1992. Repair after next turn: The last structurally provided defense
of intersubjectivity in conversation. American Journal of Sociology
97(5):1295–1345.
——. in press. Sequence organization in interaction: A primer in conversation
analysis, 1. Cambridge: Cambridge University Press.
Schegloff, E. A., G. Jefferson, and H. Sacks. 1977. The preference for
self-correction in the organization of repair in conversation. Language
53(2):361–382.
Schieffelin, B. B., and E. Ochs (eds.). 1986. Language socialization across
cultures. Cambridge: Cambridge University Press.
Shore, B. 1998. Culture in mind: Cognition, culture, and the problem of
meaning. Oxford: Oxford University Press.
Sidnell, J. 2001. Conversational turn-taking in a Caribbean English
Creole. Journal of Pragmatics 33(8):1263–1290.
——. 2005. Talk and practical epistemology: The social life of knowledge in
a Caribbean community. Amsterdam: Benjamins.
Sperber, D. 1985. Anthropology and Psychology: Towards an
epidemiology of representations. Man (n.s.) 20(1):73–89.
——. 1996. Explaining culture. A naturalistic approach. Oxford:
Blackwell.
Sperber, D., and D. Wilson. 1986. Relevance: Communication and cognition.
Cambridge, MA: Harvard University Press.
Sudnow, D. (ed.). 1972. Studies in social interaction. New York: Free
Press.
Sussman, R. W., and A. R. Chapman. 2004. The nature and evolution
of sociality: Introduction. In The origins and nature of sociality, edited
by R. W. Sussman and A. R. Chapman, 3–22. New York: Aldine de
Gruyter.
Tomasello, M., M. Carpenter, J. Call, T. Behne, and H. Moll. 2005.
Understanding and sharing intentions: The origins of cultural cognition.
Behavioral and Brain Sciences 28:675–735.
Veà, J., and J. Sabater-Pi. 1998. Spontaneous pointing behaviour in the
wild Pygmy Chimpanzee (Pan paniscus). Folia Primatologica 69:289–
290.
Whiten, A., and R. W. Byrne (eds.). 1997. Machiavellian intelligence II:
Extensions and evaluations. Cambridge: Cambridge University Press.
Zeitlyn, D. 1995. Divination as dialogue: Negotiation of meaning with
random responses. In Social intelligence and interaction: Expressions
and implications of the social bias in human intelligence, edited by E. N.
Goody, 189–205. Cambridge: Cambridge University Press.
Part 1

Properties of Human Interaction


one

On the Human "Interaction Engine"


Stephen C. Levinson

My goal in this chapter is to make the case that the roots of human
sociality lie in a special capacity for social interaction,1 which
itself holds the key to human evolution, the evolution of language,
the nature of much of our daily concerns, the building blocks of social
systems, and even the limitations of our political systems.
Much of the speculation about the origins and success of our species
centers on the source of our big brains, the structure of our cognition,
on the origins of language, the innate structures that support it, and
on the striking cooperative potential in the species. These are genuine
and important puzzles, but in the rush to understand them, we seem
to have overlooked a core human ability and propensity, the study of
which would throw a great deal of light on these other issues. It is right
under our noses, much more accessible than the recesses of our brains or
the fossils that track our evolutionary origins, and quite understudied.
It is the structure of everyday human interaction.
Despite the fact that it is over fifty years since human interaction was
first treated as a scientific object of inquiry deserving of a natural history
(Bateson 1955; Chapple and Arensberg 1940; see also Kendon 1990),
progress has been quite limited. One problem has simply been that
human interaction lies in an interdisciplinary no-man’s land: it belongs
equally to anthropology, sociology, biology, psychology, and ethology
but is owned by none of them. Observations, generalizations and theory
have therefore been pulled in different directions, and nothing close to
a synthesis has emerged. In this chapter, I therefore try to stand back
and extract some generalizations about the special human abilities that
seem to lie behind the structure of social interaction.

Are there Special Principles of Human Interaction?


Human interaction, by comparison with what goes on in even our
nearest relatives, looks very distinctive, suggesting that there may be
specific principles or abstract properties that underlie it. One starting
point would be to ask whether there is a core universal set of proclivities
and abilities that humans bring, by virtue of human nature, to the
business of interaction—properties of interaction that are at source
independent of variations in language and culture. Although much
might be attributable to language, there are quite good prima facie
grounds for thinking that human interactional abilities are at least
partially independent of both language and culture:
Travelers to foreign lands report successful transactions conducted
without language. Captain Cook’s unintended sojourn in Cape
York is a case in point, or Thomas Henry Huxley’s journeys on
HMS Rattlesnake. The best documentary evidence is probably the
film First Contact (Connolly and Anderson 1987), incorporating
footage made by the gold prospecting Leahy brothers contacting
tribes in Highland New Guinea for the first time in the 1930s: it
is as if the basis for transactional interactions exists independently
of culture and language, and the slots can in necessity be filled
by mime and iconic gesture (see Goodwin this volume).2 Quine’s
(1960) demonstration of the impediments to “radical translation”
notwithstanding, something like it seems anyway to occur.
Infants show an early appreciation of the give and take of interaction
(Bruner 1976) long before they speak, indeed arguably at four
months (Rochat et al. 1999), only two months old (Trevarthen
1979), or even 48 hours (Meltzoff and Moore 1977), depending on
the measure. By nine months old, infants are embarked on complex
triadic interactions between ego, alter, and an object in attention
(Striano and Tomasello 2001). We know that different cultures have
different infant-caretaker patterns (see Gaskins this volume), so it
is hard to rule out early cultural influence, but the infant evidence
is highly suggestive of an ethological basis on which cultures may
or may not choose to build in early infancy.
When language is lost, interaction doesn’t disappear—restricted
channels of communication, as in aphasia, can nevertheless support
rich interaction (Goodwin 2003).
There is some evidence for a distinct “social intelligence” (Gardner
1985; Goody 1995) from inherited deficits and neurological case
studies. The study of autism and Asperger’s syndrome, in comparison
with, say, Down’s syndrome kids, suggests a double dissociation:
high-reasoning abilities, low social skills (Asperger’s), low-reasoning skills,
high social skills (high-functioning Down’s)—see Baron-Cohen 2000
and Baron-Cohen et al. 1985. Similarly, different kinds of frontal
lobe lesions induce different kinds of interactional incompetence,
for example right temporal lesions correlate with flat affect and the
loss of nonsuperficial understandings as required for jokes (Kolb
and Whishaw 1990:607ff.; see Baron-Cohen 2000:1252 for brain-
imaging evidence).
Languages can switch midstream in interaction (“code-switching”),
leaving the interactional framework undisturbed, evidence that
interaction structure is independent of the “coded” signal systems
of language (Muysken 2000).
Ethnographic reports on interaction style rarely question the
applicability of the fundamentals. Where they do, as in Basso’s (1970)
account of massively delayed greetings in Apache, or Albert (1972)
on turn taking according to rank in Burundi, or Reisman (1974)
on “contrapuntal conversation” in the West Indies, there is reason
to believe they are describing something other than the unmarked
conversational norm (Sidnell 2001). What the ethnographic reports
nevertheless do make a good case for is cultural shaping of all the
modalities of interaction, from spacing, posture, and gesture to
linguistic form.
The small amount of work that has been done on the structure
of conversation cross-linguistically and cross-culturally (on Thai,
Japanese, Korean, Mandarin, etc.) shows remarkable convergence in
many details, supporting the idea of a shared universal framework
for verbal interaction (see, e.g., Clancy et al. 1996; Hayashi 2003;
Moerman 1989).
Humans look different from other primates in the amount of time
and effort invested in interaction—it would be interesting to see
them in a zoo. We don’t actually have good measures of this. Dunbar
(1997:116) reports a study of a New Guinea tribe (the Kapanora)
whose males spend 30 percent (the women slightly less) of daylight
time socializing (or gossiping),3 compared with 20 percent for gelada
baboons (doing grooming), but my suspicion is that such figures
hugely underestimate the amount of human social interaction
during the business of the day, not to mention the entertainments
of the night. Allowing for differences caused by population density,
age and gender, and subsistence mode (fishermen may spend the
day alone), and Hymes’s (1972) notes about cultural differences in
volubility,4 my guess is that humans on average spend somewhere
between 30 percent and 70 percent of waking hours in social
interaction, whether at work or play.5
I hope this sort of rough and ready list is enough to give the proposal
prima facie plausibility—the proposal that, from an ethological point of
view, humans have a distinctive, pan-specific pattern of interaction with
conspecifics, marked by (1) intensity and duration, (2) specific structural
properties, and (3) those properties separable from the language with
which it is normally conducted.
Scholars from some disciplines may be puzzled by the absence of
language from this catalogue of evidence for a human interactional
specialization. We are (along with the song birds) a distinctively
chattering species. The reason for the demotion of language is that so
much attention has been given to it that we have been damagingly
distracted from the interactional underpinnings that make it possible.
Students of language usage have tried to remedy this, from Grice 1975
to Sperber and Wilson 1995 to my own earlier self—it is quite clear to
us that “Language didn’t make interactional intelligence possible, it is
interactional intelligence that made language possible as a means of
communication” (Levinson 1995:232). So language is the explicandum,
not the explicans—humans did not evolve language, then get involved
in a special kind of social life, it was just the reverse. For language must
have evolved for something for which there was already a need—that
is, for communication in interaction.
Finally, there is another striking kind of evidence for the independence
of interaction principles from the specifics of language and culture.
Around the world, children are born deaf to hearing parents, who
sometimes raise their children without access to a conventional sign
language. What emerges is called “home sign,” an expressive signing
system invented by the child to make himself or herself understood,
and that is reciprocated by other means (see Goldin-Meadow this
volume). In societies without institutional education of the deaf or a
sizeable deaf community, such “home-sign” systems can remain the
only communication system for deaf adults. I have investigated a couple
of such cases on Rossel Island, a remote island community in Papua New
Guinea. Take the case of Kpémuwó, about twenty-eight years old, born
in a village where he is the only deaf person, which is three hours walk
away from any other deaf people. One day he came to me when I was
alone and proceeded to sign. To my intense surprise, I thought I could
understand quite a bit of what he was “saying” although we shared
no language, little culture and just a bit of background knowledge.
He seemed to be communicating, by means of pointings and iconic
gestures, about a woman who was dying of cancer in the neighborhood:
There was a lot of detail about the course of her disease, her futile trip
to the mainland for treatment, the visits of her daughters, and so forth.
Then, when my hosts returned, I got them to “translate” as best they
could Kpémuwó’s message—according to them, much of what I had
inferred was correct. They obtained much further detailed explication,
for example about the cause of the impending death, caused by the
antisorcery god Nkaa, depicted by mime as his eagle avatar (see Fig. 1.1),
and hence Kpémuwó’s reluctance to help the family of the convicted
sorcerer.6
A moment’s reflection will reveal the depth of the mystery here. How
is it possible for two people who share no language and little cultural
background (myself and Kpémuwó) to communicate at all? For Quine’s
“radical translation” to be possible after all, despite his scruples, there
has to be some powerful meaning-making machinery that we all share.
This depends, I claim, on a peculiar ability to match communicative
intentions within an interactional framework. Kpémuwó and I got as far
as we did because first he signed in such a way as to make his intentions
maximally clear to me, and then I gestured my understanding of what
he signed, and then he in response attempted to correct or narrow
my interpretation, until step by step we converged on an
understanding. Intention recognition and the mechanics of turn taking are
deeply interlocked. The focus of this chapter is on what exactly
Kpémuwó and I share that makes it possible for us to communicate,
when we share so little other background in conventions of culture
and communication.

Figure 1.1. Kpémuwó, deaf home signer on Rossel Island, inventing a way to
communicate about abstract ideas concerning sorcery.

The Idea of a Core "Interaction Engine." What the Output Shows

The idea in a nutshell is that humans are natively endowed with a set
of cognitive abilities and behavioral dispositions that synergistically
work together to endow human face-to-face interaction with certain
special qualities. I call these elements collectively the human interaction
engine (which is meant to suggest both dedicated mental machinery and
motive power, i.e., both “savvy” and “oomph”). Right away, I should
underline this is not a proposal for a “social cognition module,” “a
culture acquisition device,” “cognitive culture system” or an “interaction
gene” or anything of that simple-minded sort (see, e.g., Jackendoff
1992; Pinker 1997; Talmy 2000:373ff.). Those accounts assume that
the kind of approach taken to the “language module” or “language
instinct” can be copied across into a “social–cultural module,” and I
am arguing nothing of the kind. What I am entertaining is that there
are underlying universal properties of human interaction that can be
thought of as having a cognitive-and-ethological foundation. Evolution
is “bricolage” (to use Lévi-Strauss’s term), seizing what is at hand in the
organism’s phenotype to construct an often ramshackle but adaptive
system. So an “interaction engine” could be constructed of scraps of
motivational tendencies, temporal sensitivities (reaction contingencies),
semicooperative instincts, ancient ethological facial displays, the
capacity to analyze others’ actions through mental simulation, and so forth.
The model is a Jean Tinguely kinetic sculpture built of bric-a-brac, not
a Fodorean mental module (Fodor 1983), let alone a Chomskyan point
mutation (Bickerton 1998; Hauser et al. 2002).
Whatever your doubts, just entertain the idea for a moment (I turn
to the crucial question of cross-cultural variability in the next section).
Before we ask “What exactly are the elements of the interaction engine?”
we need to ask what it needs to account for, that is, what the crucial
properties of human interaction are. From the output, we can guess
at the properties of the machine. Here are some obvious properties of
the output:7

(1) Responses are to actions or intentions, not to behaviors (unlike,
e.g., the defensive reaction of a snake to someone who passes too close
by). That is, the interpretation of others’ behavior is a precondition for
interaction. Interpretation involves mapping intentions or goals onto
behavior, to yield component actions, bundles of behavior and mental
instigations (a cough can be just a cough—a reflex—or an intended
signal). This parsing of the other’s behavior stream clearly presupposes
some kind of simulation of the other’s mental world.
(2) In interaction, a simulation of the other’s simulation of oneself
is also involved. This is shown most clearly by the fact that actions are
generated taking into account that they will be interpreted by a specific
other—that is, they exhibit recipient design. So I call my neighbor “Dick”
only if I think you will recognize who I mean under that appellation
(see Clark et al. 1983; Clark and Wilkes-Gibbs 1986; Schegloff 1972a,
1995). This implies that the interpretation, based as in (1) on actions
or intentions, can make use of a further principle: The action to be
interpreted can be presumed to have been designed to be transparent to
this particular recipient.
(3) Although human interaction is dominated by the use of language,
language does not actually code the crucial actions being performed—
these are nearly always inferred, or indirectly conveyed (Levinson
1983:289–94, 2000; Sperber and Wilson 1995). In addition, “nonce
signals” are easily devised, and in the case of “home sign” can even
constitute the basis for an individual’s main communication system.
This implies that the fundamental signaling mechanism is independent
of language—language just enormously amplifies its potential.
(4) Interaction is by and large cooperative. This is not a Panglossian
claim that we all get on with one another. It is, rather, the claim that
there is some level, not necessarily at the level of ulterior motivation,
at which interactants intend their actions (a) to be interpretable (the
underlying intentions to be recoverable), and (b) to contribute to some
larger joint undertaking (having a conversation, making a hut, even
having a quarrel!).
(5) Interaction is characterized by action chains and sequences
(Schegloff in press, this volume) governed not by rule but by expectation.
Thus, there is an assumption that a question expects an answer, but
there is no rule that a question must be followed by an answer: “When
are you going?” → “Where?” is as well-formed a sequence as “When are
you going?” → “Ten o’clock.”8 The outcome of a momentary interaction
is something none of the parties can plan in advance—it is a contingent
product. That is why there is no such thing as a formal grammar of
discourse.
(6) Interaction is characterized by the reciprocity of roles (e.g.,
speaker–addressee, giver–taker), and typically by an alternation of roles
over time, yielding a turn-taking structure (Sacks et al. 1974).
(7) Interaction takes place within a (constantly modulating)
participation structure (specifying who is participating, and in what
role), which in turn presumes ratified mutual access (Goffman 1979;
Goodwin 1981). We can be copresent on a bus, but not be in such a
state of incipient interaction—often rights of mutual access have to be
negotiated (e.g., by greetings—see Duranti 1997).
(8) Interaction is characterized by expectation of close timing—an
action produced in an interactive context (say a hand wave) sets up an
expectation for an immediate response.
(9) Face-to-face interaction is characterized by multimodal signal
streams—visual, auditory, and haptic at the receiving end, and kinesic,
vocal, and motor at the producing end. These streams present a “binding
problem”—requiring linking of elements which belong to one another
across time and modality (e.g., a gesture may illustrate words that come
later, a hand grasp may go with the following greeting).
(10) Interaction appears to have detailed universal properties, even
if little cross-cultural work has actually been done to establish this.
What we do know is that for a wide range of features, from turn taking,
adjacency pairs (as in question–answer sequences), greetings, and repairs
of interactional hitches and misunderstandings, the languages and
cultural systems that have been studied reflect very similar, in some
cases eerily similar, subsystems.
This list may seem too self-evident and bland to yield any far-reaching
conclusions. But there is a lot more to say under each rubric. Let us
consider (4), the cooperative nature of most human interaction (at
least in the limited sense indicated), in a bit more detail, because of the
crucial, and puzzling, role it plays in evolutionary theory (it is just very
hard to see how cooperation could ever evolve under natural selection:
see Hammerstein 1996; Boyd and Richerson this volume). There seem
to be detailed properties of interaction that reflect cooperation, and that
contrast with the properties of agonistic interaction. For example, the
kind of intended transparency noted in (2) above derives ultimately
from (4) cooperation: in antagonistic interaction (as in predator–prey
relations), intentions should be hidden, as opaque as possible (even
copresence should be disguised, of course!).
Or consider this: cooperation seems to make possible the specific
properties of (5), action chains. In antagonistic interaction, as when
tiger chases antelope, we can see interaction chains of the kind: antelope
veers right, tiger veers right, antelope wheels left, tiger wheels left, and
so forth
<1> A1 B1 A2 B2

that is, long chains of immediate responses. What we do not seem
to find is anything like the embedded structures typical of human
interaction:

An example would be the following, in which B’s response to A’s first
action is deferred until clarification has been achieved:

The temporary shelving of one interactional task to solve another that
is a precondition to it seems to presuppose cooperation. The embedded
structure in <2> has formal properties that are quite different from
the response chain in <1>: The simple response chain in <1> can be
generated by a Markov process, whereas <2> requires something with
a push-down stack like a Phrase-Structure Grammar.9 When it comes
to parsing or comprehending a behavior sequence of the kind in <2>
as opposed to that in <1>, quite different procedures have to come into
play—now a response can be to an action way back in the behavior
stream. If this generalizes, then we have a formal test for cooperative
structures of interaction—they have long-distance dependencies of this
type.
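
The formal contrast between the two kinds of sequence can be made concrete with a minimal sketch. It rests on an encoding that is purely an assumption for illustration and does not come from the chapter: each move is a pair of a kind ("Q" for an initiating action, "A" for its response) and a hypothetical topic label. Under that assumption, a recognizer for simple response chains needs to remember only the one action currently awaiting a response, whereas a recognizer for embedded sequences needs a stack of actions that are still open.

```python
# Illustrative sketch only: the ("Q"/"A", topic) encoding is a hypothetical
# simplification, not a notation used in the chapter.

def is_response_chain(moves):
    """Accept sequences in which every action is answered immediately.

    Only the single pending action is remembered, so bounded memory
    (a finite-state, Markov-like device) is enough.
    """
    pending = None  # topic of the one action awaiting a response, if any
    for kind, topic in moves:
        if kind == "Q":
            if pending is not None:
                return False  # a new action before the last one was answered
            pending = topic
        else:
            if pending != topic:
                return False  # a response that does not match the last action
            pending = None
    return pending is None


def is_embedded_sequence(moves):
    """Accept sequences in which responses may be deferred but stay nested.

    Open actions are kept on a push-down stack, so a response can reach
    back to an action produced several moves earlier.
    """
    stack = []
    for kind, topic in moves:
        if kind == "Q":
            stack.append(topic)  # shelve this action until it is answered
        else:
            if not stack or stack[-1] != topic:
                return False  # response does not close the innermost action
            stack.pop()
    return not stack  # every shelved action must have been answered


chain = [("Q", "when"), ("A", "when"), ("Q", "where"), ("A", "where")]
embedded = [("Q", "when"), ("Q", "where"), ("A", "where"), ("A", "when")]
crossed = [("Q", "when"), ("Q", "where"), ("A", "when"), ("A", "where")]
```

In this sketch `embedded` mirrors a clarification question inserted before the answer to the first question. The chain recognizer rejects it, while the stack-based recognizer accepts both `chain` and `embedded` and rejects the crossed ordering, which is the long-distance dependency at issue.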
Consider another empirical finding with a bearing on the underlying
cooperation in interaction. Conversation analysts have established
that after a question, a request, offer, or the like, where a response is
immediately relevant, the response options are not equal but ranked.
Responses that are in the expected direction are immediate and brief,
responses that are in the opposite direction are typically delayed,
marked with hesitations and particles like well, and accompanied by
explanations. Thus, the absence of an immediate response after the
following indirect request apparently indicates quite clearly to the
requester that his request will be declined:
<4> C: “So I was wondering would you be in your office on
Monday (.) by any chance
(2 second silence) Probably not.” [Levinson 1983:320]
Many details of this kind of asymmetry between what are called “preferred
responses” and “dispreferred responses” show that the organization
of conversation biases actions in the preferred direction—the system
is set up so that it is just easier to comply with requests or accept
invitations than to decline them! In short, the system is biased toward
cooperation.10

Ingredients for an "Interaction Engine"


So far, we have seen that an interaction engine has to predict at least
those features of interaction listed above. Now we can ask: what kind
of a “machine” could produce those properties, that is, what does the
human interactant have to be endowed with to generate such behavior?
Let’s key the points to our numbered properties above:
(1’) To get property (1)—responses are to intentions not behaviors—
we need a “Theory of Mind” (ToM). That is, any being capable of
attributing goals and intentions to other actors must attribute a mind
to the other actor (hence have a folk theory of mind, or ToM, in the
broad sense explored by Astington this volume). ToM has typically
been operationalized as relativized belief attribution, for example as
attributing to Sam the false belief that p (Leslie 2000). But here instead
the heart of the matter is intention attribution: given the observed
behavior, the interaction engine must be able to infer likely goals
that would have motivated the behavior.11 Elsewhere, I have pointed
out that this is a highly intractable computational problem, because
it amounts to inferring premises from conclusions, which cannot be
done by any logical engine (Levinson 1995:230ff.). It could perhaps
be done on statistical grounds, using some low-level semiautomatic
simulation as in the theory of “mirror neurons.”12 That might account
for a simple class of interpretable actions, like you raising your fork to
your mouth. But it would never account for the meaningful cough,
or the ironic bow—actions whose interpretations are not in line with
the statistical associations. The solution to those must lie in having
powerful heuristics. But what exactly? This is where point (2) comes
in, the ability not only to simulate the other’s point of view but also
to imagine what he or she thinks your point of view is.
(2’) The simulation of the other’s simulation of oneself may seem
something more likely to occur in deception than in cooperation.13
But it is crucial to cooperative interaction. It was Schelling (1960) who
demonstrated the empirical power of this heuristic to solve coordination
problems implicitly: for example, offer two separated subjects $1,000
if they can both think of the same number without communicating—
they can beat the odds (they are likely to assume each will find 1,000
the salient solution). Or ask them to each go to where they think the
other will go in a crowded department store (they may fixate on the
“lost and found”). Exactly how it works has been much discussed, but
clearly it involves a special kind of reflexive thinking: thinking what you
would be thinking I would be thinking when I did the action. This
coordination ability presupposes the notion of mutual knowledge (or
common ground)—the things that I know you know, you know I know,
and I know you know I know. But it also involves a notion of mutual
salience—what leaps out of the common ground as a solution likely to
independently catch our joint attention (the number 1,000 or the lost
and found). This is what is involved in recipient design, the choice of just
that phrase that will allow you to find the unique thing I am referring
to, when it could be referred to in 1,000 myriad ways, none of them
uniquely referring (Clark et al. 1983). A nice example of this is the use
of phrases like “The what do you call it,” which seem typically used
where the speaker estimates the addressee can guess what the speaker
has in mind (Enfield 2003). These are mental coordinations, meetings
of the mind, in what I shall call the “Schelling mirror world.”
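
The coordination heuristic described above can be made concrete with a small sketch. It is only an illustration under strong assumptions: the salience scores are invented numbers, not data from Schelling (1960), and the function name pick_focal_point is hypothetical. Each agent independently picks the option it takes to be most mutually salient, and coordination succeeds when both land on the same one.

```python
# Illustrative sketch only: the salience scores are made-up assumptions,
# not data from Schelling (1960).

def pick_focal_point(mutual_salience):
    """Return the option this agent takes to be most mutually salient."""
    # Sorting first makes ties resolve alphabetically, so the choice
    # is deterministic.
    return max(sorted(mutual_salience), key=lambda option: mutual_salience[option])

# Each agent holds its own estimate of what "leaps out" of the common ground.
agent_a = {"lost and found": 0.9, "shoe department": 0.2, "cafe": 0.4}
agent_b = {"lost and found": 0.8, "shoe department": 0.3, "cafe": 0.5}

choice_a = pick_focal_point(agent_a)
choice_b = pick_focal_point(agent_b)
coordinated = choice_a == choice_b
print(choice_a, "|", choice_b, "|", coordinated)
```

The two agents never exchange a message; they converge only because their separate estimates of mutual salience happen to rank the same option first, which is the point of the department store example.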
(3’) Our property (3) was a fundamental underlying signaling system
independent of language or conventional code. This is provided
by Grice’s (1957) theory of meaning, which holds that a signaler S
communicates z by behavior B if S intends to cause a recipient R to
think z, just by getting R to recognize that intention. In other words,
a communicative intention is one that achieves its goal as soon as it
is recognized (the action B has no other instrumental efficacy). One
way of thinking about this is: S tosses behavior B into the Schelling-
mirror world, implying “I bet you can figure out why I did this, just
by knowing that I know you can.”14 Grice’s theory gives us an account
both of how we can communicate without conventional signals at all
(as in First Contact or when I met Kpémuwó, the deaf man without a
language), and of how we can communicate something distinct from
what the conventional signals actually mean (as in irony, metaphor,
hints, etc.).
This is how we can understand the meaningful cough, or the ironical
bow, in which statistical inference will only allow the attribution of
the reflexive cough and the genuine bow. And that is why intention
attribution in interaction is altogether a different thing than intention
attribution outside interaction. This is the difference between meaning
attribution in the Gricean sense and mere action interpretation by
an observer. Compare: I appear to smooth down my hair—I could be
making the action to smooth down my hair, or I could be signaling “Your
hair is standing up.” The same behavior has distinct interpretations
in (third-person) action-interpretation simpliciter, and interaction
interpretation.
(4’) Cooperative interaction—our property (4)—differs from antagonistic
interaction in precisely the same way: antagonistic interaction only
requires mere intention attribution (you’d better believe that however
quietly the tiger sneaks up on you, he’s out to eat you!),15 cooperative
interaction requires the much more heady reflexive thinking: if we are
going to carry out a joint action, say building something together, each
contributive action has to be so designed that the other can see, just
by how it is done, that it is intended to achieve the contributing role
it is meant to play (Clark 1996:191ff.).16 That is another reason, in case
there were not enough, why cooperative social systems occupy a remote
corner of evolutionarily possible design space (Dennett 1995)—you
have to have minds capable of simulating other minds simulating your
own.
(5’) What accounts for the fact (our property 5) that interaction is
(a) composed of action sequences, and (b) governed not by rules but
only by expectations? In principle, actions in interaction could be
simultaneous, if complementary, as in duets (see Clark 1996). But they
are typically chained, one after the other. It might be thought that,
for communication anyway, simultaneous broadcast would mask the
message, but that does not deter the cicadas, and it does not explain
the human case either, because we can listen and speak at the same
time, as in simultaneous translation. No, there must be a reason for the
alternation. One fundamental motivation is that, given that what I say
has been designed for you to be able to see what I mean, it would be
a good idea to see whether my design was actually as good as I hoped
it was, which your response will make clear (Sacks et al. 1974). At the
birth of cognitive science, Miller et al. (1960) suggested that the Test–
Operate–Test–Exit (TOTE) unit should replace stimulus–response as the
basic theoretical unit of human behavior: we test to see if the intended
goal was achieved, if not, operate on it and try again. In cooperative
interaction, the only way to test is to see what the other person made
of our actions. This is part of the motivation for taking turns, and it
motivates too the priority accorded in interaction to correction and
repair sequences—that is why “When are you going?” “Where?” is
“well formed,” or more properly, interpretable.
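
The TOTE unit is in effect a small control loop, and a minimal sketch may help to show why it yields sequences. The code below is an illustration only: the function names, the cycle limit, and the toy repair scenario (including the more explicit second formulation) are assumptions introduced here, not part of Miller et al. (1960).

```python
# Illustrative sketch of a Test-Operate-Test-Exit (TOTE) loop.
# Function names and the toy "repair" scenario are assumptions.

def tote(test, operate, state, max_cycles=10):
    """Test; while the test fails, operate and test again; exit on success."""
    cycles = 0
    while not test(state) and cycles < max_cycles:
        state = operate(state)
        cycles += 1
    return state, cycles

# Toy scenario: the speaker moves to a more explicit formulation until the
# addressee's response shows that the question was understood.
formulations = ["When are you going?", "When are you going to the market?"]

def understood(index):
    # Hypothetical addressee: only the fuller formulation is interpretable.
    return "market" in formulations[index]

def repair(index):
    # Operate: move to the next, more explicit formulation.
    return min(index + 1, len(formulations) - 1)

final_index, repairs = tote(understood, repair, 0)
print(formulations[final_index], "| repairs:", repairs)
```

In cooperative interaction the test step cannot be run privately: it depends on the other party's response, which is why the loop produces alternating turns.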
Linguists and anthropologists had hoped that there might be rules
of conversational sequences, like rules of grammar, but the search was
fundamentally misguided (Levinson 1983:286–94).17 There are indeed
templates that specify sequences of actions, but these do not have the
status of sequential rules (a few rituals excepted). The initial actions in
these templates introduce a mental scenario as it were—consider “Are
you doing anything tonight?” which introduces the frame Preinvitation →
Go ahead → Invitation → Acceptance, as in:18
A: “Are you doing anything tonight?”
B: “No, why?”
A: “I was wondering if you’d like to catch a movie.”
B: “Sure, what’s on?”
The whole sequence can gracefully abort at the second turn, without
the invitation ever surfacing—a sequential template is a mental entity,
around which actions will be directed and interpreted, however the
actual sequence transpires.
Which brings us back to Schelling, who pointed out that we can tacitly
coordinate by each thinking what the other would think (so, e.g., when
inadvertently separated in a department store, we both go to the last
place we saw each other). Sequential templates hang in this mirror world
of simulated mental spaces, and the ability to traverse those spaces was
the theme of (2’). In sum, TOTE gives us sequences, and the Schelling
mirror world gives us guiding expectations instead of rules.
(6’) The sequential taking of turns (our property 6) may be partially
motivated by TOTE, but turn taking itself does not necessarily imply
alternation of roles in a single joint activity (compare taking turns at the
gas pump). But some forms of cooperation at least require sharing bites
of the same cherry. Sharing a resource is an elemental symbol of sociality,
as in commensality. Informal human interaction is characterized by
a conversational mode of exchange, in which the erstwhile speaker
becomes a listener, and the erstwhile listener becomes a speaker, the
valued commodity apparently being speaking while attentive others
hold their tongues. This alternation of roles seems to be universally
built into the deictic system of languages (“I” refers to the current
speaker, “you” to the current addressee, and my “I” becomes your
“you”). Many human societies have asymmetrical assignments of roles
and elaborate divisions of labor, but in all of them informal interaction
seems to be built on the alternation of conversational roles. Given
that human language processing is obligate and automatic (hearing
you speak English, I automatically comprehend even if I would rather
not), the alternation of listening roles implies an obligatory inhabiting
of others’ mental worlds. So it seems that cooperative sharing of the
communicational resource guarantees our mutual sharing of the
Schelling mirror world.
(7’) Interaction is organized around a participation structure (our
property 7). Part of this is a byproduct of turn taking: in a two-party
conversation, the alternation of speaker–addressee roles may exhaust
the roles involved. But clearly this is not so when we have three or more
participants: then we can have speaker, addressee, and nonaddressed
participant. And in addition to passive participants there can be
bystanders, nonparticipants with access to the interaction. Much finer
discriminations of roles are also possible (Goffman 1979; Goodwin 1981;
Levinson 1988). Interaction always presupposes a participation structure,
which itself presupposes a distinction between being copresent but not
in interaction versus copresent and participating. This distinction is
precisely what motivates “access rituals” like greetings (Duranti 1997).
And this distinction maps once again onto our two kinds of action–
interpretation: (a) nonparticipants are engaged in action–interpretation
simpliciter, without directly being able to benefit from recipient design,
(b) full participants can presuppose that each inhabits the shared Schelling
mirror world, allowing each to assume that the other has constructed
his or her actions to be interpretable to the intended participants. Thus,
participation structure seems to be a property of human interaction
that emerges from a number of other properties—turn taking and the
mental simulations behind recipient design amongst them.
(8’) The close-timing characteristics of human interaction (property
8) may be partially attributed to some independent ethological source,
some kind of mental metabolism as it were.19 But they partly follow
from the design-and-test characteristic mentioned in (5’): a response
that indicates how the prior action was understood needs to be adjacent
to that action—and given the free turn competition, the only way to
ensure that is to be immediately next (otherwise another participant
may take the next turn, or the prior speaker resume speaking). Hence,
the timing properties of human interaction can be partly attributed to
the turn-taking properties discussed in (5’) and (6’).
(9’) Humans are not unique in using multimodal signal streams in
interaction (our property 9): many animal displays have these characteristics,
and since Darwin (1872) there have been many attempts to
understand the evolutionary and ethological background. Simultaneous
display of bared teeth, flared nostrils, and a growl can serve to signal
aggression. But in the human case, at least, the whole display is not
just a gestalt: the multimodal behavior streams have to interlock at
many minute borders and boundaries. There is no doubt that the digital
nature of language is partly responsible for this microstructuring of
the signal stream. Yet careful inspection of video records shows that
synchrony alone will not do the trick of hooking up the bits in the
different signal streams: gestures, facial expressions, nods, and the like
can come earlier or later than the words they go with. There seems
to be a significant “binding problem” in hooking up the signals that
go together. If temporal binding is not sufficient, what will do the
trick? Suppose I bow low to the Dean and then wink at you—to see
if and how the two signals link requires an analysis of what might
have driven each behavior, and how those two intentions might be
put together. It is, if one likes, a problem in meaning composition, or
goal analysis—the mental reconstruction of communicative intentions
expressed through clues which are designed to be just sufficient. Human
multimodal communication is as much artifice as ethology, and the
capacities that drive it will crucially include the mental simulation of
the other (2’ above) and the reconstruction of motives or intentions
(3’ above).
(10’) What could be responsible for the apparent universals of
interaction, like the turn-taking and repair systems of casual conversation?
At least some of the properties seem to follow from, or be
motivated by, features we have already considered—thus, turn taking
may partly derive from the cooperative sharing of a common resource,
and the need to test interpretations that are in a way only guesses
in a Schelling mirror world. But it seems likely that many aspects of
human interaction (turn taking among them) have a long phylogenetic
history. Face and voice recognition, known to implicate specialized
circuitry in the brain, are preconditions to any social interaction, in the
sense of interaction tailored to specific social persons. Language is also
biologically underpinned in many complex ways. The simultaneous
activation of gesture, gaze, posture, and paralanguage—the particular
channels of human multimodal interaction—all contribute to the
distinctive ensemble of interactional signals typical of human ethology,
picked up as the flotsam of evolution.20
Let me summarize. A review of properties of human social interaction
suggests that the core interaction engine consists of a bunch of ingredients,
but crucially:

◻ Attribution of intention, or “mind-reading” in a broad sense, is a
crucial precondition (call it level 1), but is itself nowhere near
the abilities needed to generate human interaction. Here, some
automatic link from action–perception to the action–production
system, as with mirror neurons, together with statistical induction
(as Byrne this volume implies) may be enough to map behaviors
onto intentions or goals.
◻ A crucial additional level is the ability to enter Schelling mirror
worlds, to do the mental computations that allow us to simulate
the other simulating us. Here, we have the ingredient of mutual
salience for us right now (reliant on common ground). This allows
mental coordination without communication. Now we can beat the
statistical odds hands down. The distinction between level 1 above
and this second level correlates with antagonistic versus cooperative
interaction: level 2 is not necessary for antagonistic interaction
(although it may be ruthlessly exploited in it), but it is necessary
for cooperative interaction.
◻ A third crucial level is having Gricean intentions, intentions that
drive behaviors whose sole function is to have an effect by virtue
of having their intentions recognized. It is this level 3 that makes
high-level communication possible, and on its foundation language
has evolved, and still relies on nearly every occasion of use.
◻ Woven in and out of this is the cooperative nature of human
interaction—there would not be any point of getting into Schelling
mirror worlds unless cooperation was a reasonable presumption
(once there, sure, Machiavellian exploitation of the system will be
a thing to guard against).
◻ There is a set of empirically observable practices—turn taking,
sequence templates and repair among them—which look universal
and are only partly derivable from other features. These form part
of a raft of ethological proclivities which help to account for the
species-specific character of human multimodal communication.
Note that this analysis, if correct, has some utility for the comparative
study of interaction in phylogeny and ontogeny, for it predicts an
ordered series of steps toward human interaction, from level 1 to level 3, while
noting that ethological patterns like turn taking may develop earlier
and independently.

Culture and the "Interaction Engine"


Many sociocultural anthropologists may react with hostility to the ideas
so far discussed—the direction of argument may seem to belittle the
role of the cultural construction of social life, like so many ideas in
sociobiology, human ethology, or evolutionary psychology. But that
would be the wrong reaction. We are trying to probe what underlies
that cultural construction, what makes cultures get off the ground so to
speak—what makes them learnable, and what provides the framework
within which the cultural can do its work. The positive reward for
speculating about common human potentialities is that we may
understand much better what generates the diversity itself.

Cultural Variation
Interaction is shot through and through with culture. It had better be,
because it is the vehicle of culture—without it, there would not be any.
Even though culture conditions and shapes private acts—the way we
urinate or defecate, for example, or even the way we walk—it is through
public, and especially interactive, acts that culture propagates itself.
And every anthropologist, indeed every traveler, has been impressed
with differences in interactional mores. Just to mention a few of my
own observations, consider:
(1) In rural Tamilnadu, in a typical village of 18 castes, who can
interact with whom, and in what ways, is elaborately specified in a
mental 17 × 17 matrix (Levinson 1982). One indelible memory is of a
high-caste foreman arriving on bicycle at a building site, engendering
the total cessation of works as all the low-caste workers scramble down
the scaffolding so that they can receive instructions while not having
their heads higher than their caste better.
(2) In Cape York, the aboriginal speakers of Guugu Yimithirr incorporate
gestures into their verbal interaction in a much more fundamental way
than Europeans do. For example, a negative gesture preceding a positive
assertion signals a negative proposition, or the subject and object of
a verb may be omitted but indicated by gesture. The great majority
of gestures are intended to have directional veracity—no mere hand
waving here (see Levinson 2003).
(3) In Chiapas, Mexico, Tzeltal-speaking Tenejapans are peasants who
maintain a decorum appropriate to a royal court: Long and elaborate
greeting sequences specify whether the intruder is merely passing by
(and if so in the same or different direction as the intruded, or past the
intruded’s home base) or arriving to visit (Stross 1967). Once begun,
interaction is properly conducted sitting side by side with the minimum
of mutual gaze, each assertion being partially repeated by the recipient,
with long sequences of the kind: “I’ve come to visit you” “You’ve
come to visit perhaps” “I have come” “You have indeed” “Indeed
I have.” . . . (Brown 1998; Levinson and Brown 2005).
(4) On Rossel Island, Papua New Guinea, interaction is typically
dyadic, squatting eyeball to eyeball, with sustained mutual gaze, and
incorporating many facial displays, and eye-pointings. Fast, informal,
with much mutual touching, two big bankers of shell-money can
conduct important business for the whole island with a nod and a
wink, making a striking contrast to the apparent Tenejapan formality
of interaction over matters much more trivial (Levinson and Brown
2005).
These observations, and a thousand like them, raise the question:
What sense does it make to talk about a core interaction engine as if
it was a universal property of mankind, given all this rich texture of
cultural diversity?
The answer is that the interaction engine is not to be understood
as an invariant, a fixed machine with a fixed output, but as a set of
principles that can interdigitate with local principles, to generate
different local flavors. Let me outline just one example of the kind of
interplay between the universal and the culturally particular I have
in mind (the details appear in Levinson 2005). Sacks and Schegloff
(1979) suggested that two principles govern the reference to persons
in English conversation: a preference for using a minimal form (e.g.,
a name), and a preference for using a form (a “recognitional”) under
which the referent can be recognized by the recipient. Usually these two
preferences can be satisfied simultaneously. But sometimes they come
apart. For example, if the speaker is unsure whether the recipient will
recognize the referent under a single name, he may try it out, marking
the “try” with rising intonation—if there is no uptake, a second name
may then be introduced, also with a “try” intonation, then a description,
and so on. So we get a sequence like this:
<5> A: .. . well I was the only one other than the uhm tchFords?, (1)
Uh Mrs Holmes Ford? (2)
You know uh the the cellist? (3)
[
B: Oh yes. She’s she’s the cellist (4)
A: Yes well she and .. . . . . . . .
[Sacks and Schegloff 1979:19]

At (1) the speaker tries a single name, upgrading at (2) with a second
name and a title, and at (3) with a description, whereupon getting
acknowledgement of recognition at (4), the speaker proceeds. What the
sequence displays is that recognition takes priority, the minimization
being successively relaxed till recognition is achieved (common ground
established). It shows that a minimal clue to a Schelling solution is
tried first.
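
The ranking of the two preferences can be sketched as a simple procedure: try the most minimal form first and relax minimization step by step until recognition is achieved. The sketch below is a hedged illustration only. The function name refer and the model of the recipient as a set of known forms are assumptions introduced here; the three forms are taken from example <5>.

```python
# Sketch of the two ranked preferences: recognition outranks minimization.
# The recipient model (a set of known forms) is an assumption for illustration.

def refer(candidate_forms, recipient_recognizes):
    """Try forms from most to least minimal until one is recognized.

    Returns the list of forms tried and whether recognition was achieved.
    """
    tried = []
    for form in candidate_forms:          # ordered: most minimal form first
        tried.append(form)
        if recipient_recognizes(form):    # uptake: stop upgrading
            return tried, True
    return tried, False                   # ran out of upgrades

# Forms from example <5>, ordered by increasing explicitness.
forms = ["the Fords", "Mrs Holmes Ford", "the cellist"]
known_to_recipient = {"the cellist"}      # assumed recipient knowledge

tried, recognized = refer(forms, lambda form: form in known_to_recipient)
print(tried, recognized)
```

Had the recipient recognized the first name, the loop would have stopped after a single try, satisfying both preferences at once.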
Very similar sequences can be found in other quite unrelated languages
I have worked on, including Guugu Yimithirr, Tzeltal, and Yélî Dnye
(the language of Rossel Island). One has to allow for the fact that
upgradings might take different forms (e.g., identifying conventions
might employ place of origin specifications), and even that different
modalities might be involved (e.g., pointings, eye glances at places of
origin)—but allowances made, the sequences are eerily familiar. Here
is one from Guugu Yimithirr:
<6> B: ngayu nubuun nhaaway waami dyibaalu warra Milga-mul? 1
1s one there found to.South old ears-without
“I came across one fellow there to the South, old ‘without ears’?”
<points>
(0.3) 2
R: aa 3
“Oh”
(0.4) 4
B: oo Tommy Confen? 5
“old Tommy Confen?”
R: ee 6
“ah”
B: nyulu nhamuun bamaal nganhi wangaarmun nhaathi durrginbigu
gaadariyga bada
“That fellow saw me, as I was coming down Indian Head”
[Revgest 00:17:01]
At line 1, B tries Tommy Confen’s nickname, namely “Without Ears”
(he was deaf), with intonational rise on “Ears.” Now critically, he has
supplemented this reference with an earlier quick pointing gesture to
where the Confen household used to be, coinciding with the underlined
word -nubuun—but unfortunately R was not looking (see Fig. 1.2 [a]).
B therefore has reason to doubt that R has got the reference: he gazes
straight at R throughout this sequence until point 5, to assess whether
recognition has occurred (see Fig. 1.2 [b]). R’s response at point 3 is
slightly delayed, and has a form (indicating “news”) suggesting that
it could be a response to the earlier part of what B said. B therefore
tries again at 5 with rising intonation, with both English names of the
referent. R responds positively, with mutual gaze, and B then turns
away and resumes the story.
That suggests that cross-culturally there seem to be the same two
preferences, they seem to have the same ranking, and when they cannot
be satisfied simultaneously, minimization is successively relaxed. Let
us take this, on the basis of parallels in four unrelated cultures, as a
candidate universal, acknowledging that we would need a lot of further
evidence to firm this up.
On Rossel Island, there is an additional wrinkle, a cultural taboo on
naming that interacts with these preferences. The taboo specifies that
one may not name close in-laws or relatives recently deceased. How does
this then interact with the candidate universal preferences? Let us take a
look. In the following excerpt, J out of the blue refers to someone as “that
(distant) girl,” pointing south up over the mountain (utterance [1]).

Figure 1.2. Guugu Yimithirr person reference sequence: (a) frame showing
unobserved quick point to referent’s home base; (b) frame showing mutual gaze
at point at which recognition is achieved.
Figure 1.3. Rossel Island name-avoidance sequence: (a) points up on “that
thing”; (b) points over mountain on “that girl”; (c) points W on second “that
girl”; (d) ditto on “you see,” widening eyes; (e) recipient gives eyebrow flash; (f)
recipient says “ah” in overlap.
The utterance is “try-marked” with rising intonation, and the gesture is
held while looking at R (Fig. 1.3[a], [b]). R does not respond in the gap
(2). J upgrades the description in (3), not by adding a verbal description
or name but by pointing West while widening his eyes and gazing at
R (Fig. 1.3[c], [d]). At this point R responds with an eyebrow flash (a
local “yes, continue” marker), followed by a verbal acknowledgment,
and recognition achieved, J continues.
<7> J: mu kópu mwo a pyaa wo, mu dmââdî ngê? (1)
that affair over.there happened that girl topic
<—points South over mountain————holds point, looking at R>
“That thing (pointing) that happened a while ago, that girl?”
(.) (2)
mu dmââdî ngê? cha w:ee? (3)
that girl topic you understand
<—opens eyes wide, points West>
“that girl, you see?”
[
R: (eye-brow flash) éé (4)
“right, yes”
J: yi dmââdî pi kuu, yed:oo nipi nmî dmââdî cha w:ee (5)
“that girl is our affair, she’s one of ours”
[Rossel Island R02_V4 00:03:27]

The odd thing about the episode is the reference at (1) to a new referent
with such a general description (“that girl”) with the presumption
nevertheless that the referent is recognizable. In holding his gesture,
waiting, repeating the description with a new gesture, J is clearly
persisting in seeking recognition. Nevertheless, he systematically avoids
a name, instead using the same general description but providing two
distinct gestural clues, first over the mountain to his own village where
the girl was raised, and then West where she has just died (see Fig.
1.3). The recent death (itself only alluded to by “that thing”) requires
the name avoidance. Thus, R is faced with a Schelling problem: a very
general description (“that girl”) supplemented with gestural clues, and
with the background knowledge that one reason for not naming a person
is their recent death. The clues evidently prove sufficient, as R claims to
have recognized the referent at (4).
Notice how the culturally specific rule (a name taboo) folds into
our candidate universal preferences. The speaker goes for recognition.
He is blocked from using a name, but uses a brief general description,
satisfying minimization, with a gestural clue. When this is not sufficient,
he is again blocked from using a name, and tries an upgrade using a
second gestural clue, while claiming with wide eyes (see Fig. 1.3 [d])
and intonation that the addressee can locate the referent in Schelling
space. All three preferences are interlocked: do not name, yet go for
recognition, while seeking minimal reference. Further cases of name
taboo on Rossel show similar patterns: nonverbal upgrades are preferred
to verbal ones, as they better satisfy the ban on speaking of the taboo
person. Space precludes extensive discussion of this theme, but the
point is that the culturally specific does not necessarily eclipse the
(candidate) universal procedures—they are woven together to make a
coherent local practice.
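
How a local rule can fold into the candidate universal preferences may be easier to see in a small sketch. Everything in it is an assumption made for illustration: the function plan_reference, the tuple layout, the placeholder forms, and the ordering rule that puts gestural upgrades ahead of verbal ones when a name is taboo. It is not a model proposed in the sources cited.

```python
# Sketch: a local constraint (name taboo) filters and reorders the candidate
# clues before the same recognition-over-minimization procedure applies.
# The data layout and the ordering rule are illustrative assumptions.

def plan_reference(candidates, taboo_on_name):
    """Return the forms to try, in order.

    candidates: (form, is_name, modality) tuples in order of increasing
    explicitness; modality is "verbal" or "gesture".
    """
    usable = [c for c in candidates if not (taboo_on_name and c[1])]
    if taboo_on_name:
        # Keep the first (most minimal) clue, then prefer nonverbal upgrades.
        first, rest = usable[:1], usable[1:]
        rest = sorted(rest, key=lambda c: c[2] != "gesture")  # stable sort
        usable = first + rest
    return [c[0] for c in usable]

# Placeholder candidates; no actual personal name is supplied.
candidates = [
    ("that girl", False, "verbal"),
    ("<personal name>", True, "verbal"),
    ("longer verbal description", False, "verbal"),
    ("point West", False, "gesture"),
]

print(plan_reference(candidates, taboo_on_name=True))
print(plan_reference(candidates, taboo_on_name=False))
```

With the taboo in force the name drops out and the gestural clue is promoted; without it the candidates are tried in their original order. The search for recognition itself is unchanged in both cases.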
The identification and naming of persons is, if anything is, a cultural
matter, and yet it seems to mesh seamlessly with the universal
systematics of interaction. The hypothesis is that the interaction engine
will be most recognizable in informal, everyday conversation, which
forms the normal matrix for language acquisition and socialization.
The ethnography of speaking has long established that when we look
at special, ritual or institutional speech events, we find ourselves in the
culture-specific territory of séances, ceremonies, investitures, political
oratory, and the like (Bauman and Sherzer 1974; Duranti 2001). Even here,
though, the interesting suggestion emerging from work in conversation
analysis is that specialized speech events are built by tweaking the rules
and principles governing informal conversation. Thus, the differences
between a press conference and a classroom can be partly captured by
considering both the similarities (multiple persons, but only two parties,
one singular—teacher or press officer) and the differences (questioning
assigned to the party with the multiple persons, as in press conferences,
or to the singular party, as in classrooms; see Schegloff 1987).
This idea—that the local, cultural specialization is a variation off a
universal theme—is potentially powerful, because as we learn more about
conversational organization we see that there are relatively few, crucial
organizing principles. For example, ringing the changes on different
possible systems of turn taking, participation–structure, and action
sequences will give us many key aspects of culture-specific speech events.
We also see that at a finer level of structure, the modulation of the way in
which actions are expressed (e.g., directly vs. indirectly, with or against
preference organization) conveys the qualities of social relations (Brown
and Levinson 1987). Conversation analysts have therefore sometimes
taken a “constructionist” view of social organization (see again Schegloff
1987): you are, as it were, what you say. This does not always accord
with the anthropological experience (Levinson 2005): it may work in
New Guinea, but in India you are what you are born. However, viewed
as a system of principles that predicts, for each possible manipulation
of the systematics, what the consequences will be, it promises to be a
powerful tool for understanding cross-cultural variation.
The idea, then, is not that the interaction engine produces cross-
cultural uniformity but, rather, that it provides the building blocks for
cultural diversity in social interaction. Or in a less crude analogy, it
provides the parameters for variation, with default values that account
for the surprising commonalities in the patterns of informal interchange
across cultures.
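
The analogy of parameters with default values can be sketched directly. The class below is a hypothetical illustration: the field names and the string labels are assumptions introduced here, and only the contrast between press conferences and classrooms (who is assigned the questioning) is taken from the discussion above.

```python
from dataclasses import dataclass, fields, replace

# Hypothetical parameters; the defaults stand for informal conversation.
@dataclass(frozen=True)
class SpeechEventSettings:
    parties: str = "one per participant"
    questioner: str = "any party"

def differences(a, b):
    """Names of the parameters on which two speech events differ."""
    return {f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)}

conversation = SpeechEventSettings()

# Specialized events are built by overriding a few defaults.
press_conference = replace(
    conversation,
    parties="two (one singular, one multiple)",
    questioner="multiple-person party",
)
classroom = replace(press_conference, questioner="singular party")

print(differences(conversation, press_conference))
print(differences(press_conference, classroom))
```

On this toy view a press conference and a classroom share one override and differ in a single setting, which is the sense in which specialized events are variations on a common theme.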
One reason that sociocultural anthropologists should be interested in
grasping the nature of these parameters is that interactional principles
clearly play a central role in higher level social processes. This is entirely
transparent in tribal societies, where since Sir Henry Maine (1861) it
has been appreciated that larger entities like descent groups act like
individuals, contracting marriages and alliances or conducting feuds.
Less obviously, politics and diplomacy among modern nation states
have much the same character, of a conversation conducted according
to the principles of interaction, albeit between representatives of huge
agglomerates. We attribute intentions to political maneuvers as if states
were individuals, instead of the rambling conglomerates with different
factional interests that they really are (Levinson 1995:225).
In short, the analysis of interaction could and should play a major
role in our analysis of social institutions and international politics.
Humans come natively equipped for interacting with conspecifics. We
use this interpretive apparatus for understanding large scale polities of a
kind that we have only recently innovated in our evolutionary history,
and for which it may be inappropriate. For, however inappropriate,
whatever other natural model would we have?

Conclusion
My thesis has been that the notion of a core interaction engine driving
human social life makes eminently good sense. There is good prima
facie evidence for it, and work in psychology, linguistics, anthropology,
sociology and philosophy all points toward it. It is not easy to isolate the
critical features of such an ability, because they range from the abstract
mental simulations of Schelling mirror worlds, to the concrete problems
of binding across multimodal signals, or the processes generating striking
cross-cultural parallels across procedures for person reference. But the
effort has to be worth it. Progress promises the key to understanding
human evolution, and it offers to shed light on human ontogeny,
higher level social processes, and the limitations of a mentality forged
in face-to-face contact in the present world of nation states, superhuman
agglomerations endowed by us with personal attributes they mostly do
not have. It is an effort in which anthropology should have a central
role to play.

Notes
1. This chapter takes off from the position paper authored by Nick
Enfield and myself, and precirculated in June 2004. Thanks are owed to the
participants in the Wenner-Gren symposium in which these ideas were first
aired and discussed. I am also grateful for help from Penelope Brown.
2. See Connolly and Anderson 1987. It could have been a chance match
of cultural mores, the raw greed of colonial mercantilism happening to meet
its match in Melanesian exchange—see Strathern 1992.
3. Dunbar’s point is that human verbal interaction replaces primate
grooming, and he is therefore especially interested to find that 60 percent of
conversation concerns social relationships and person topics (1997:123).
4. Hymes (1972:40) mentions a number of cases in which the ethnographers
have noticed the extreme taciturnity of the people—he cites Gardner for
example on the Paliyans of South India, who “communicate very little at all
times and become silent by the age of 40. Verbal, communicative persons are
regarded as abnormal and often as offensive.”
5. There are problems quantifying what counts as interaction. Are
nonaddressed listeners interacting? Perhaps only if they are ratified participants
(see Goffman 1979; Levinson 1988). Is talk essential? No, signs, winks, and
nods will do—we are interested in mutual, interlocking sequences of actions
(see below), which are not dependent on language. Is a mother rocking a baby
“interacting” in the favored sense? Yes, if in response to baby’s actions, but no
if baby is asleep.
6. My neighbors got further than I did for a number of reasons. First,
although Kpémuwó’s village is some distance away, he is familiar to them.
Second, they shared much more background knowledge of the situations
being described. Third, their signing was more perspicuous to Kpémuwó than
mine because it made use of conventional elements of the gesture system—
the spoken language is accompanied by a rich set of conventional gestures or
“emblems.”
7. This list, derived from the empirical literature, is not so far removed
from the philosophical view derived by H. P. Grice, whose theory of “meaning”
(1957) covers points (1) and (2), and whose “maxims of conversation” (1975)
cover (3), (4) and (7—“relevance”) at least.
8. Conversational analysts have introduced the technical term “conditional
relevance” for this expectation (see, e.g., Schegloff 1972b).
9. On the hierarchy of grammars modeling behavior sequences see Partee
et al. 1990:433ff.
10. More strictly, the system is asymmetrically structured in such a way
that interactants can deploy it to try and extract cooperation (thanks to
Tanya Stivers for helping me see this connection between preference and
cooperation). See Levinson 1983:332ff. and Schegloff in press for exposition
of “preference.”
11. False belief tasks are not mastered by normal Western children until
almost four years old, but by that age children are experienced interactants.
Leslie 1994 suggests that action interpretation begins at around eight months
without the notion of propositional attitudes essential to attributions of belief,
which begins only at 24 months. Mastery of false belief requires, he argues, a
further special kind of inhibition not available for another two years (Leslie
2000:1242).
12. See Rizzolatti and Arbib (1998) on the discovery that specialized neurons
fire when the same action is both perceived and executed—suggesting a low-
level solution to “reading” other minds. But this correlation is learned, as
shown by recent experiments, so there still has to be a higher-level mechanism
relating action and perception.
13. In ToM models, this is often called “second-order belief” (what A
believes B believes about p: see Baron-Cohen 2000). Here, though, we are
actually interested in something that has some of the properties of potentially
infinitely nested beliefs: what A believes B believes that A believes. . . about
p. Although that is not psychologically plausible, there are psychologically
plausible heuristics that approximate it—see Clark 1996:92ff. for review.
14. Usually this has been thought about the other way around, with
Schelling processes embedded in Gricean intentions, rather than the reverse
as here suggested.
15. Antagonistic interaction can be Machiavellian, that is, designed to look
cooperative but with hidden ulterior motives. In that case it is exploiting
cooperative interaction—in a trivial sense, every cooperative interaction can
be embedded in a Machiavellian one. The point here is reflexive thinking
is not an essential feature of antagonistic interaction as it is of cooperative
interaction. See following note on the definition of “interaction” here.
16. Why, one might ask, is all this mentalism necessary? Symbiosis after all
has two forms, mutualism and parasitism, and both forms, cooperative and
antagonistic, can occur without minds. But here I am using “interaction” in a
special sense, in terms of sequences of actions, where by definition an action
is a pairing of a mental intention or goal and the behavior designed to achieve
it.
17. There may be ritual sequences, like greetings and partings, that allow
a rule-governed treatment, as in Irvine (1974), but these do not cover the
central business transacted in between.
18. Fabricated data for reasons of compression—see Levinson (1983:345);
and Schegloff in press.
19. Conversational analysts have noted (of English conversation) that
pauses or gaps of between one-tenth and two-tenths of a second—roughly the
duration of an unstressed syllable—can often be treated as significant failures
to respond. Psycholinguists have tried to link the duration of the segment, the
syllable, and the word to the temporal binding properties of the brain—a real
temporal metabolism.
20. Earlier attempts to build a science of human ethology (Eibl-Eibesfeldt
1989; von Cranach et al. 1979) have largely petered out. Current evolutionary
psychology seems headed quite elsewhere, away from the observation of
natural human behavior.

References
Albert, E. 1972. Culture patterning of speech behavior in Burundi. In
Directions in sociolinguistics, edited by J. J. Gumperz and D. Hymes,
72–105. New York: Holt.
Baron-Cohen, S. 2000. The cognitive neuroscience of autism: Evolutionary
approaches. In The new cognitive neurosciences, edited by M.
Gazzaniga, 1249–1257. Cambridge, MA: MIT Press.
Baron-Cohen, S., A. Leslie, and U. Frith. 1985. Does the autistic child
have a “theory of mind”? Cognition 21:37–46.
Basso, K. 1970. To give up on words: Silence in the western Apache
culture. Southwestern Journal of Anthropology 26(3):213–230.
Bateson, G. 1955. A theory of play and fantasy. Approaches to the
study of human personality. American Psychiatric Association Report
2:39–51.
Bauman, R., and J. Sherzer (eds.). 1974. Explorations in the ethnography
of speaking. Cambridge: Cambridge University Press.
Bickerton, D. 1998. Catastrophic evolution: The case for a single
step from protolanguage to full human language. In Approaches to
the evolution of language: Social and cognitive bases, edited by J. R.
Hurford, M. Studdert-Kennedy, and C. Knight, 341–358. Cambridge:
Cambridge University Press.
Brown, P. 1998. Conversational structure and language acquisition:
The role of repetition in Tzeltal adult and child speech. Journal of
Linguistic Anthropology 8(2):197–221.
Brown, P., and S. C. Levinson. 1987. Politeness. Cambridge: Cambridge
University Press.
Bruner, J. 1976. From communication to language—a psychological
perspective. Cognition 3:255–287.
Chapple, E. D., and Arensberg, C. M. 1940. Measuring human relations.
Genetic Psychology Monographs 22:3–147.
Clancy, P., S. Thompson, R. Suzuki, and H. Tao. 1996. The conversational
use of reactive tokens in English, Japanese, and Mandarin. Journal of
Pragmatics 26:355–387.
Clark, H. 1996. Using language. Cambridge: Cambridge University
Press.
Clark, H., R. Schreuder, and S. Buttrick. 1983. Common ground and the
understanding of demonstrative reference. Journal of Verbal Learning
and Verbal Behavior 22:245–258.
Clark, H., and D. Wilkes-Gibbs. 1986. Referring as a collaborative
process. Cognition 22:1–39.
Connolly, B., and R. Anderson. 1987. First contact. New York: Viking.
[Book accompanying film First Contact, Bob Connolly and Robin
Anderson, dirs. 54 min. Film Makers Library. New York.]
Darwin, C. 1872. The Expression of the emotions in man and animals.
London: John Murray.
Dennett, D. 1995. Darwin’s dangerous idea: Evolution and the meanings
of life. London: Penguin.
Dunbar, R. 1997. Grooming, gossip and the evolution of language. Harmonds-
worth: Penguin.
Duranti, A. 1997. Universal and culture-specific properties of greetings.
Journal of Linguistic Anthropology 7:63–97.
——, (ed.). 2001. Linguistic anthropology: A reader. Oxford: Blackwell.
Eibl-Eibesfeldt, I. 1989. Human ethology. New York: Aldine de Gruyter.
Enfield, N. 2003. The definition of what-d’you-call-it. Journal of Pragmatics
31:101–117.
Fodor, J. 1983. Modularity of mind. Cambridge, MA: MIT Press.
Gardner, H. 1985. Frames of mind. New York: Paladin.
Goffman, E. 1979. Footing. Semiotica 25:1–29.
Goodwin, C. 1981. Conversational organization. New York: Academic
Press.
——, (ed.). 2003. Conversation and brain damage. Oxford: Oxford
University Press.
Goody, E. (ed.). 1995. Social intelligence and interaction. Cambridge:
Cambridge University Press.
Grice, H. P. 1957. Meaning. Philosophical Review 67:377–388.
——. 1975. Logic and conversation. In Syntax and semantics 3: Speech
Acts, edited by P. Cole and J. Morgan, 41–58. New York: Academic
Press.
Hammerstein, P. 1996. The evolution of cooperation within and between
generations. In Interactive minds, edited by P. Baltes and U. Staudinger,
35–58. Cambridge: Cambridge University Press.
Hauser, M. D., N. Chomsky, and W. T. Fitch. 2002. The faculty of
language: What is it, who has it, and how did it evolve? Science
298:1569–1579.
Hayashi, M. 2003. Joint utterance construction in Japanese conversation.
Amsterdam: Benjamins.
Hymes, D. 1972. Models of the interaction of language and social life.
In Directions in sociolinguistics, edited by J. J. Gumperz and D. Hymes,
35–71. New York: Holt.
Irvine, J. 1974. Strategies of status manipulation in the Wolof greeting.
In Explorations in the ethnography of speaking, edited by R. Bauman and
J. Sherzer, 167–191. Cambridge: Cambridge University Press.
Jackendoff, R. 1992. Is there a faculty of social cognition? In Languages
of the mind, edited by R. Jackendoff, 69–91. Cambridge, MA: MIT
Press.
Kendon, A. 1990. Conducting interaction. Cambridge: Cambridge
University Press.
Kolb, B., and I. Whishaw. 1990. Fundamentals of human neuropsychology.
New York: Freeman.
Leslie, A. 1994. ToMM, ToBy, and Agency. In Mapping the mind: Domain
specificity in cognition and culture, edited by L. Hirschfeld and S.
Gelman, 119–148. Cambridge: Cambridge University Press.
——. 2000. “Theory of mind” as a mechanism of selective attention. In
The new cognitive neurosciences, edited by M. Gazzaniga, 1235–1247.
Cambridge, MA: MIT Press.
Levinson, S. C. 1982. Caste rank and verbal interaction in Western
Tamilnadu. In Caste Ideology and Interaction, edited by D. McGilvray,
98–203. Cambridge: Cambridge University Press.
——. 1983. Pragmatics. Cambridge: Cambridge University Press.
——. 1988. Putting linguistics on a proper footing. In Erving Goffman,
edited by P. Drew and A. Wootton, 161–227. Cambridge: Polity
Press.
——. 1995. Interactional biases in human thinking. In Social intelligence
and interaction, edited by E. Goody, 221–260. Cambridge: Cambridge
University Press.
——. 2000. Presumptive Meanings. Cambridge, MA: MIT Press.
——. 2003. Space in language and cognition: Explorations in cognitive
diversity. Cambridge: Cambridge University Press.
——. 2005. Manny Schegloff’s dangerous idea. Discourse studies 7(4–
5):431–453.
Levinson, S. C., and P. Brown. 2005. Comparative response systems. Paper
presented at the 104th Annual Meeting of the American Anthropological
Association, Washington, DC, November 30–December 4.
Maine, Sir Henry. 1861. Ancient law. London: Murray.
Meltzoff, A., and M. Moore. 1977. Imitation of facial and manual
gestures by human neonates. Science 198:75–78.
Miller, G., E. Galanter, and K. Pribram. 1960. Plans and the structure of
human behavior. New York: Holt.
Moerman, M. 1989. Talking culture: Ethnography and conversation analysis.
Philadelphia: University of Pennsylvania Press.
Muysken, P. 2000. Bilingual speech: A typology of code-mixing. Cambridge:
Cambridge University Press.
Partee, B., A. ter Meulen, and R. Wall. 1990. Mathematical methods in
linguistics. Dordrecht, the Netherlands: Kluwer Academic.
Pinker, S. 1997. How the mind works. New York: Penguin.
Quine, W. V. 1960. Word and object. Cambridge, MA: MIT Press.
Reisman, K. 1974. Contrapuntal conversations in an Antiguan village.
In Explorations in the ethnography of speaking, edited by R. Bauman
and J. Sherzer, 110–124. Cambridge: Cambridge University Press.
Rizzolatti, G., and M. Arbib. 1998. Language within our grasp. Trends
in Neuroscience 21(5):188–194.
Rochat, P., J. Querido, and T. Striano. 1999. Emerging sensitivity to
the timing and structure of protoconversation in early infancy.
Developmental Psychology 35(4):950–957.
Sacks, H., and E. A. Schegloff. 1979. Two preferences in the organization of
reference to persons in conversation and their interaction. In Everyday
language, edited by G. Psathas, 15–21. New York: Irvington.
Sacks, H., E. A. Schegloff, and G. Jefferson. 1974. A simplest systematics
for the organization of turn-taking for conversation. Language
50:696–735.
Schegloff, E. A. 1972a. Notes on a conversational practice: Formulating
place. In Language and social context, edited by P. Giglioli. Pp. 95–135.
Harmondsworth: Penguin.
——. 1972b. Sequencing in conversational openings. In Directions in
sociolinguistics, edited by J. J. Gumperz and D. Hymes, 346–380. New
York: Holt.
——. 1987. Between micro and macro: Context and other connections.
In The Micro-Macro Link, edited by J. Alexander, 207–234. Berkeley:
University of California Press.
——. 1996. Some practices for referring to persons in talk-in-interaction:
A partial sketch of a systematics. In Studies in anaphora, edited by B.
Fox, 437–485. Amsterdam: Benjamins.
——. in press. Sequence organization. Cambridge: Cambridge University
Press.
Schelling, T. 1960. The strategy of conflict. Cambridge, MA: MIT Press.
Sidnell, J. 2001. Conversational turn-taking in a Caribbean English
creole. Journal of Pragmatics 33:1263–1290.
Sperber, D., and D. Wilson. 1995[1986]. Relevance. 2nd edition. Oxford:
Blackwell.
Strathern, M. 1992. The decomposition of an event. Cultural Anthropology
7:244–254.
Striano, T., and M. Tomasello. 2001. Infant development: Physical
and social cognition. In International Encyclopedia of the Social and
Behavioral Sciences. N. J. Smelser and P. Baltes, eds. Pp. 7410–7414.
Oxford: Pergamon.
Stross, B. 1967. Tzeltal greetings. Language Behavior Research Lab,
University of California at Berkeley. [Mimeograph]
Talmy, L. 2000. The cognitive culture system. In Towards a cognitive
semantics, vol. 2, 373–416. Cambridge, MA: MIT Press.
Trevarthen, C. 1979. Instincts for human understanding and for cultural
cooperation: Their development in infancy. In Human ethology:
Claims and limits of a new discipline, edited by M. von Cranach, K.
Foppa, W. Lepenies, and D. Ploog, 530–571. Cambridge: Cambridge
University Press.
von Cranach, M., K. Foppa, W. Lepenies, and D. Ploog (eds.). 1979.
Human ethology: Claims and limits of a new discipline. Cambridge:
Cambridge University Press.
two

Interaction: The Infrastructure for
Social Institutions, the Natural
Ecological Niche for Language,
and the Arena in which Culture is
Enacted
Emanuel A. Schegloff

The central theme of my contribution to this volume is that interaction
is the primary, fundamental embodiment of sociality—what I have
called elsewhere (1996d) “the primordial site of sociality.” From this
point of view, the “roots of human sociality” refers to those features of
the organization of human interaction that provide the flexibility and
robustness that allow it to supply the infrastructure that supports the
overall or macrostructure of societies in the same sense that roads and
railways serve as infrastructure for the economy, and that grounds all
of the traditionally recognized institutions of societies and the lives of
their members.
If one reflects on the concrete activities that make up these abstractly
named institutions—the economy, the polity, and the institutions for
the reproduction of the society (courtship, marriage, family, socialization,
and education), the law, religion, and so forth, it turns out that
interaction—and talk in interaction—figure centrally in them. When
the most powerful macrostructures of society fail and crumble (as, e.g.,
after the demise of the communist regimes in Eastern Europe), the
social structure that is left is interaction, in a largely unaffected state.
People talk in turns, which compose orderly sequences through which
courses of action are developed; they deal with transient problems of
speaking, hearing or understanding the talk and reset the interaction
on its course; they organize themselves so as to allow stories to be told;
they fill out occasions of interaction from approaches and greetings
through to closure, and part in an orderly way. I mention this here to
bring to the forefront of attention what rests on the back of interaction:
the organization of interaction needs to be—and is—robust enough,
flexible enough, and sufficiently self-maintaining to sustain social order
at family dinners and in coal mining pits, around the surgical operating
table and on skid row, in New York City and Montenegro and Rossel
Island, and so forth, in every nook and cranny where human life is to
be found.
Accordingly, my plan is to sketch the contours of half a dozen generic
organizations of practice central to the conduct of interaction,
and, more specifically, that form of interaction that is distinctive to
humans—talk in interaction.1 By referring to them as generic, I mean to
convey that where stable talk in interaction is sustained, solutions to
key organizational problems are in operation, and these organizations
of practice are the basis for these solutions. After sketching some of
these basic organizations of practice, I turn to some of the queries our
editors have put before us; in my case, each of them would require a
book and will need to be treated in a few paragraphs.

Generic Problems and Practice(d) Solutions


Although it is almost certainly the case that many important organizational
problems of talk in interaction and their solutions are as yet
unknown, let alone understood, it appears that the following ones will
have a continuing claim on researchers’ attention.2

The “Turn-taking” Problem: Who should Talk or Move or Act Next
and when should they do so? How does this Affect the Construction
and Understanding of the Turns or Acts Themselves?
So far it seems to be the case that wherever investigators have looked
carefully, talk in interaction is organized to be done one speaker at a
time.3 Achieving and maintaining such a state of talk may prompt the
invocation of conventionalized arrangements like a chair to allocate the
turns, or mapping the order of turn allocation onto ordered features of
the candidate participants such as relative status (Albert 1964). But the
first of these marks the setting as institutionally or ceremonially distinct
from “ordinary talk,” and the latter engenders a range of problems that
make it unsustainable as a general organization of interaction. What
is at stake in “turn taking” is not politeness or civility, but the very
possibility of coordinated courses of action between the participants
(e.g., allowing for initiative and response)—very high stakes indeed.
Even with just two participants, achieving one at a time poses a
problem of coordination if the talk is to be without recurrent substantial
silences and overlaps: how to coordinate the ending of one speaker
and the starting up by another. If there are more than two “ratified
participants” (Goffman 1963), there is the additional issue of having
at least one of the current nonspeakers, and not more than one of the
current nonspeakers, start up on completion of the current speaker’s
turn. One can imagine a variety of putative solutions to these problems
of coordination, but none of them can be reconciled with the data of
actual, naturally occurring ordinary conversation (Schegloff n.d.a).
The “simplest systematics” article on turn taking by Sacks et al. (1974)
sketches an organization of practices that works well, and has led to
nonintuitive enhancements (Schegloff 2000b, 2002). It describes units
and practices for constructing turns at talk, practices for allocating
turns at talk, and a set of practices that integrates the two. So far this
account works across quite a wide range of settings, languages, and
cultures. Departures from interactional formats familiar to Western
industrialized nations involve what might be called “differences in the
values of variables”—for example, different lengths of time that count
as a silence, rather than differences in the underlying organization of
practices.
To give a brief example, there may be differences between cultures or
subcultures in what the unmarked value of a silence between the end of
one turn and the start of a next should be. Leaving less than the normative
“beat” of silence or more than that can engender inferences among
parties to the conversation; starting a next turn “early” or starting a next
turn “late” are ways of doing things in interaction, and conversation
between people from different cultural settings can result in misfiring
with one another. For example, one difference often remarked on by
urban, metropolitan people about rural or indigenous people is that
they seem to be dimwitted and somewhat hostile; comments range from
Marx on the “idiocy of the rural classes” to Ron Scollon and Suzanne
Scollon’s work (1981) on the relation between migrants from the “lower
48” states in the United States and Alaska Natives. Having asked them
a question, the urbanites—or should I say urbane-ites—find themselves
not getting a timely reply and sense resistance, nonunderstanding,
nonforthcomingness, and so forth. Often they break what they perceive
as “the silence” that greeted their question with a follow-up question,
which may be taken by their interlocutor to exemplify the high-pressure
aggressiveness of “city slickers.” But what differs between them is not
that their turn-taking practices are different or differently organized,
but the way they “reckon” the invisible, normative beat between one
turn and the next.
I have just pointed at the organization of turn taking; an account
of what that organization is, and how it works, will have to be sought
out in the by-now substantial literature addressed to those matters (cf.
esp. Lerner 2003).

The “Sequence-organizational” Problem: How are Successive Turns
or Actions Formed up to be “Coherent” with the Prior one (or some
Prior one) and Constitute a “Course of Action”? What is the Nature
of that Coherence?
The most common way researchers have addressed actual spates of talk
has been to ask what it is about, and how movement from one “topic” to
another occurs, and what it reveals about the intentions and meanings
being conveyed by the speaker or the several participants. Talking about
things—“doing topic talk”—is surely one observable feature of talk in
interaction. But it is only one of the things people do in talk in
interaction. We would do well to open inquiry to the full range of things
that people do in their talking in interaction—asking, requesting, inviting,
offering, complaining, reporting, answering, agreeing, disagreeing,
accepting, rejecting, assessing, and so forth. Indeed, doing topic talk is
itself largely composed of such doings—telling, agreeing, disagreeing,
assessing, rejecting, and so forth. Proceeding in this way treats action
and courses of action as the more general tack and doing topic talk as
one of its varieties.
A comparable contrast surfaces in contributions to this volume
between what might be called “mentation” on the one hand and
“action” or “practices” on the other hand. The discourse is full of terms
like understanding, knowing, inferring, reasoning, establishing common
ground, intention, motive, construal, theories (e.g., Theory of Mind [ToM]),
and so forth. But the central question is whether human sociality is a
matter of knowing together or of doing together.
At least part of this contrast turns on the terms of description
used in inquiry. For example, one account of work on ToM describes
an experimental setting in which infants figured out which of two
previously unknown objects is being named by determining which
one the investigator–speaker was looking at. But we might well ask why
this is treated as a theory of mind, rather than a theory of interactional
practice: speakers can indicate what they are talking about by looking
at it, and recipients can therefore look in the direction of the speaker’s
gaze to find what to look at to determine what the speaker is referring
to. If this question and the issue underlying it have any cogency, they
should prompt examination of the conversational and interactional
settings in which so-called ToMs develop: what is said to the children?
What is being done by what is being said to them? What do the children
say back? What are they doing by saying that? Almost certainly what
the children are learning is what others are doing and what they should
do in turn. If there are theories like ToM, they are built up from and
for contingencies of interaction and these are contingencies of action
or conduct, not contingencies of theorizing. It is to the organization of
action, and action realized through talk, that sequence organization
is addressed.
The stance embodied in the foregoing remarks resonates harmoniously
with the contributions of Goodwin and of Hutchins to this volume—
most obviously in the shared preoccupation with praxis and the
action–implicated and publicly displayed features of knowledge. But
these remarks are not incompatible in principle with psychological inquiry, only
in currently predominant outlook and expression. For example, I take
the contribution of Gergely and Csibra in this volume to be about
action and courses of action and practices and interaction—which runs
through the account of the experimental procedure as beginning with
what conversation analysts call a summons–answer sequence (Schegloff
1968) to attract the infants’ attention, and that is what their invocation
of “pedagogy” introduces; the very notion of pedagogy is through and
through interactional in character.
If we ask how actions and courses of action get organized in talk in
interaction, it turns out that there are a few kernel forms of organization
that appear to supply the formal framework within which the context-
specific actual actions and trajectories of action are shaped. By far the
most common and consequential is the one we call “adjacency pair
based” (Sacks 1992, vol. 2:521–569; Schegloff in press; Schegloff and Sacks
1973). The simplest and minimal form of a sequence is two turns long:
the first initiating some kind of action trajectory—such as requesting,
complaining, announcing, and the like; the second responding to that
action in either a compliant or aligning way (granting, remedying,
assessing, and the like, respectively) or in a misaligning or noncompliant
way (rejecting, disagreeing, claiming prior knowledge, and the like,
respectively).
Around and inside such “simple” pairs of actions, quite elaborate
expansions can be fashioned by the participants. There are, for example,
expansions before the first part of such a pair, such as “preannouncements”
(“Didju hear who’s coming?”), “preinvitations” (“Are you doing
anything this weekend?”), and the like. Or, to cite actual data of a
preinvitation:
(1) CG,1 (Nelson is the caller; Clara is called to the phone)

1 Clara: Hello
2 Nelson: Hi.
3 Clara: Hi.
4 Nelson: -> Whatcha doin'.
5 Clara: -> Not much.
6 Nelson: Y'wanna drink?
7 Clara: Yeah.
8 Nelson: Okay.

And of a preannouncement:
(2) Terasaki (2004):207
1 Jim: -> Y’wanna know who I got stoned with a few(hh) weeks ago? hh!
2 Ginny: -> Who.
3 Jim: Mary Carter ‘n her boy(hh)frie(hh)nd. hh.

Notice that these “pre”s themselves make a response relevant, and so
themselves constitute an adjacency pair, and can therefore themselves
be expanded (e.g., “Hey Steve,” “Yeah?” “Didju hear who pulled out of
the conference?” “No, who?”).
And there can be expansions after the first action–turn in an adjacency
pair and before the responding second part—an inserted sequence. For
example:
(3) Schegloff, Jefferson, and Sacks (1977):368

1 B: ->Fb Was last night the first time you met Missiz Kelly?
2 (1.0)
3 M: ->Fi Met whom?
4 B: ->Si Missiz Kelly.
5 M: ->Sb Yes.

Again, notice that if a first pair part is not followed by an action–turn,
which could be its second pair part, then what occurs in its place is
itself a first pair part and requires a response, so it too is an adjacency
pair and it too can get expanded.
And after the response to the initiating action–turn there can be
further talk that clearly is extending that trajectory of action. Sometimes
that can be a single turn, which does not make a response to it relevant
next, as at lines 3 and 8 in the following specimen, which has two such
sequences.
(4) HG, 16:25-33
1 Nancy: =·hhh Dz he av iz own apa:rt[mint?]
2 Hyla: [·hhhh] Yea:h,=
3 Nancy: -> =Oh:,
4 (1.0)
5 Nancy: How didju git iz number,
6 (·)
7 Hyla: I(h) (·) c(h)alled infermation’n San Fr’ncissc(h)[uh!
8 Nancy: -> [Oh::::
9 (·)

But it can also be something that does make a response to it relevant
next; so it too is itself an adjacency pair and can take the kinds of
expansions I have been sketching here.
(5) Connie and Dee, 9
1 Dee: Well who’r you workin for.
2 Connie: ·hhh Well I’m working through the Amfat Corporation.
3 Dee: -> The who?
4 Connie: -> Amfah Corpora[tion. T’s a holding company.
5 Dee: ->> [Oh
6 Dee: ->> Yeah.

Note here that the question–answer sequence at lines 1–2 is expanded
after the answer by another at lines 3–4 (addressing a hearing or
understanding problem), and that the latter is expanded by a single
turn expansion, first at line 5 (where the “got it”-registering “oh” is
caught in overlap) and then again at line 6 (with the now “knew it”-
registering “yeah”).
I hope that it is clear that what started as a simple two turn–action
sequence can be a framework that “carries” an extensive stretch of talk.4
There are some deep connections between what are nonetheless largely
autonomous organizations of practice—the organization of turn taking
and the organization of action sequences. Just as interaction cannot
do without practices for allocating opportunities to participate and
practices for constraining the size of those opportunities—that is, an
organization of turn taking, so it cannot do without an organization
of practices for using those opportunities to fashion coherent and
sustained trajectories or courses of action—sequence organization.
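
For readers who find a formal rendering helpful, the recursive character of adjacency-pair expansion sketched above can be illustrated in a few lines of Python. This is only an illustrative sketch and not a notation used in the conversation-analytic literature: the names `Turn`, `Sequence`, `pre`, `insert`, and `post` are assumptions introduced here, silences and overlap are left out, and a single-turn expansion such as “Oh” is represented as a sequence whose second pair part is absent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    speaker: str
    talk: str


@dataclass
class Sequence:
    """Hypothetical model: a base adjacency pair plus its expansions."""
    first: Turn                     # first pair part (initiating action)
    second: Optional[Turn] = None   # second pair part (response), if any
    pre: List["Sequence"] = field(default_factory=list)     # before the first part
    insert: List["Sequence"] = field(default_factory=list)  # between first and second
    post: List["Sequence"] = field(default_factory=list)    # after the second part

    def turns(self) -> List[Turn]:
        """Flatten the sequence into the order in which the turns are spoken."""
        out: List[Turn] = []
        for seq in self.pre:
            out.extend(seq.turns())
        out.append(self.first)
        for seq in self.insert:
            out.extend(seq.turns())
        if self.second is not None:
            out.append(self.second)
        for seq in self.post:
            out.extend(seq.turns())
        return out


# Excerpt (3): a base pair with an inserted repair sequence.
excerpt_3 = Sequence(
    first=Turn("B", "Was last night the first time you met Missiz Kelly?"),
    second=Turn("M", "Yes."),
    insert=[Sequence(first=Turn("M", "Met whom?"),
                     second=Turn("B", "Missiz Kelly."))],
)

# Excerpt (5): a base pair expanded after the answer by another pair,
# which is itself expanded by two single-turn expansions.
excerpt_5 = Sequence(
    first=Turn("Dee", "Well who'r you workin for."),
    second=Turn("Connie", "Well I'm working through the Amfat Corporation."),
    post=[Sequence(
        first=Turn("Dee", "The who?"),
        second=Turn("Connie", "Amfah Corporation. T's a holding company."),
        post=[Sequence(first=Turn("Dee", "Oh")),
              Sequence(first=Turn("Dee", "Yeah."))],
    )],
)

for turn in excerpt_3.turns():
    print(f"{turn.speaker}: {turn.talk}")
```

Because every expansion is itself a `Sequence`, each can take expansions of its own, which is the sense in which a two-turn pair can “carry” an extensive stretch of talk.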

The “Trouble” Problem: How to Deal with Trouble in Speaking,
Hearing, or Understanding the Talk or Other Conduct such that
the Interaction does not Freeze in Place; that Intersubjectivity is
Maintained or Restored; and that the Turn, Sequence, and Activity
can Progress to Possible Completion.
If the organization of talk in interaction supplies the basic infrastructure
through which the institutions and social organization of quotidian
life are implemented, it had better be pretty reliable, and have ways
of getting righted if beset by trouble. And so it is. Talk in interaction
is as prone as any organization is to transient problems of integration
and execution; speakers cannot find the word they want, find that
they have started telling about something that needs something else
to be told first, hear that they articulated just the opposite word from
the one they are after, find that another is talking at the same time
as they are, and so forth. And talk in interaction is as vulnerable as
any activity is to interference from altogether unrelated events in its
environment—overflight by airplanes, an outburst of traffic noise, or
other ambient noise that interferes with their recipient’s ability to hear,
and so forth.
For such inescapable contingencies there is an organization of
practices for dealing with trouble or problems in speaking, hearing, and
understanding the talk. It turns out that this organization—which we
term an organization of repair—is extraordinarily effective at allowing
the parties to locate and diagnose the trouble and, in virtually all cases,
to deal with it quickly and successfully.
The organization of repair differentiates between repair initiated and
carried through by the speaker of the trouble source, on the one hand, and
other participants in the interaction, on the other hand. The practices of
repair are focused in a sharply defined window of opportunity in which
virtually all repair that is initiated is initiated (Schegloff et al. 1977). This
“repair initiation opportunity space” begins in the same turn—indeed,
in the same turn-constructional unit (TCU)—in which the trouble source
occurred and extends to the next turn by that speaker.5 The consequence
is that the initial opportunity to initiate repair falls to the speaker of the
trouble source, and a very large proportion of repairs are addressed and
resolved in the same turn, and same TCU, in which the trouble source
occurred (“same-turn repair”), or in its immediate aftermath (“transition
space repair”). These largely involve troubles in speaking, but can also be
directed to anticipatable problems for recipients—problems of hearing
or understanding. The “preferences for self-initiation of repair and self-
repair” have as one of their manifestations that recipients of talk that
is for them problematic regularly withhold initiating repair in the next
turn to allow the trouble-source speakers an additional opportunity to
themselves initiate repair. If they do not do so, the next opportunity
for addressing the trouble falls to recipients—ordinarily in the next
turn. Finally (for our purposes), a speaker may have produced a turn at
talk and had a recipient reply to it with no indication of trouble, only
to find that the reply displayed what is to the speaker a problematic
understanding of that turn. Then, in the turn following the one that
has displayed the problematic understanding, the speaker of what now
turns out to have been a trouble-source turn may take the next turn
to address that problematic understanding (the canonical form being
“No, I didn’t mean X, I meant Y”; cf. Schegloff 1992b).
As the talk develops through the repair space, there are fewer and
fewer troubles or repairables that get addressed. Most are dealt with
in the same or next turn, and these range from production problems
(such as word selection, word retrieval, articulation, management of
prosody, etc.) and reception problems (hearing and understanding
of inappropriately selected usages, such as person reference terms,
technical terms, complicated syntax, etc.) to issues of intersubjectivity
and strategic issues of delicateness. It is hard to say which are more
important: without virtually immediate resolution of the production
and reception problems, the interaction can be stalled indefinitely with
unpredictable consequences; without ways of spotting departures from
intersubjectivity and restoring it, the shared reality of the moment is
lost, again with unpredictable consequences.
It is hard to imagine a society or culture whose organization of
interaction does not include a repair component, and one that works more
or less like the one I have sketched. We know that details may vary in
ways linked to the linguistic structure of the language spoken—either
its grammatical structure (cf., e.g., Fox et al. 1996) or its phonological
inventory (cf., e.g., Schegloff 1987b). But the structure of the repair
space and the terms of its differentiation between same and other repair
are likely not to vary. For, among its other virtues, it is the availability
of the practices of repair that allows us to make do with the natural
languages that philosophers and logicians have long shown to be so
inadequate as to require the invention of artificial, formal ones. It is
repair that allows our language use not only to allow but to exploit many
of the features that have been treated as its faults—ambiguity, polysemy,
contradiction, and so forth. Designed not for automatic parsers but for
sentient beings, should these usages not be transparently solvable, the
practices of repair are available to get solutions (Schegloff 1989).
The practices of repair and their ordered deployment are probably
the main guarantors of intersubjectivity and common ground in
interaction. Intersubjectivity can, therefore, not require grounding
in static bodies of shared knowledge or common ground—grounding
that, if taken strictly, has often been found unattainable in any case
(see, e.g., Garfinkel [1967:24–31, 35–103] for one demonstration of
this). The practices of repair make intersubjectivity always a matter of
immediate and local determination, not one of abstract and general
shared facts, views, or stances. Built off the basic interactivity of ordinary
talk, each next turn displays some understanding of the just prior or
some prior other talk, action, scene, and so forth, or it displays the
problematicity of such understanding for its speaker. Intersubjectivity
or shared understanding is thereby always addressed for practical
purposes about some determinate object at some here and now, with
resources—practical resources, that is, resources that are practices—for
dealing with the trouble and restoring intersubjectivity. The practices
of recipient design (see below) get the talk designed for its current
recipients, which serves to minimize the likelihood of trouble in the
first instance, and the practices of repair provide resources for spotting,
diagnosing, and fixing trouble that somehow occurs nonetheless. The
reader may wish to explore the resonances between this account of
repair and intersubjectivity and the chapters of Enfield, Goodwin,
Hanks, and Levinson in this volume.

The Word Selection Problem: How do the Elements of a Turn
get Selected? How does that Selection Inform and Shape the
Understanding Achieved by the Turn’s Recipients?
Turns are composed of TCUs—sentential, clausal, phrasal, and lexical,
in English and a great many other languages.6 But of what are TCUs
composed? Referring to this generic organization as “word selection” is
a vernacular way of putting it, or perhaps a linguistic or psycholinguistic
one for some varieties of those disciplines. And sometimes it is a relevant
way of putting it in conversation–analytic work. But here I want to call
attention to the interactional practices that are only incidentally lexical
or about words. These are practices of referring, or describing, or—
perhaps most generally—practices of formulating. In talk in interaction,
participants formulate or refer to persons (Sacks 1972a, 1972b; Sacks and
Schegloff 1979; Schegloff 1996c), places (Schegloff 1972), time, actions,
and so on. The use of particular formulations cannot be adequately
understood simply by reference to their correctness. The person writing
this (and that is one formulation already) is not only a sociologist; he
is also (as the pronoun inescapably revealed) male, Californian, Jewish,
and so forth. The place I am writing is not only my office, it is in Haines
Hall, at University of California, Los Angeles, in Los Angeles, on the
west side, in the United States, and so forth. And although I already
formulated my current activity as “writing this,” it is also typing, rushing
to finish before a student arrives, and so forth. That is, “correctness”
will not do as the grounds for using this or that formulation, because
there are always other formulations that are equally correct. What is
central is relevance (not, obviously, in the sense of Sperber and Wilson
1986)—what action or actions the speaker is designing the utterance
to embody.
Consider, for example, this bit of interaction. Hyla has invited Nancy
(the two of them college juniors in the early 1970s) earlier in the day
to go to the theater that night to see a performance of The Dark at the
Top of the Stairs (Inge 1958), and they are talking on the phone in the
late afternoon about that upcoming event (among other things). After
a brief exchange about when they will meet, Nancy asks,
(6) Hyla & Nancy, 05:07-39

1   Nancy:      How didju hear about it from the pape[r?
2   Hyla:                                            ['hhhhh I sa:w-
3               (0.4)
4   Hyla:   ->  A'right when was:(it,)/(this,)
5               (0.3)
6   Hyla:   ->  The week before my birthda:[y,]
7   Nancy:                                 [Ye]a[:h,
8   Hyla:   ->                                  [I wz looking in the Calendar
9           ->  section en there was u:n, (·) un a:d yihknow a liddle:: u-
10              thi:ng, .hh[hh
11  Nancy:                 [Uh hu:h,=
12  Hyla:       =At- th'-th'theater's called the Met Theater it's on
13              Point[setta.]
14  Nancy:           [The Me]:t,
15              (·)
16  Nancy:      I never heard of i[t.
17  Hyla:                         [I hadn’t either..hhh But anyways,-en
18              theh the moo- thing wz th’↓Dark e’th’ ↓Top a’th’ ↑ Stai[:rs.]
19  Nancy:                                                             [Mm-h]m[:,
20  Hyla:                                                                     [En
21              I nearly wen’chhrazy cz I [: I: lo:ve ]that] mo:vie.]
22  Nancy:                                [y:Yeah I kn]ow y]ou lo:ve] tha::t.=
23  Hyla:       =s:So::,.hh an’ like the first sho:w,=
24  Nancy:      =M[m hmm, ]
25  Hyla:         [wz g’nna] be:,
26              (·)
27  Hyla:       on my birthday.=
28  Nancy:      =Uh hu[h, ]
29  Hyla:             [I’m] go’[n awhh whould hI love-
30  Nancy:                     [(So-)
31              (·)
32  Hyla:       yihknow fer Sim tuh [take me tuh that.]
33  Nancy:                          [ Y a y u : : h, ]

I want to call attention here to only two bits of Hyla’s responsive talk
starting at line 8: the time formulation “the week before my birthday,”
and the activity formulation “I was looking in the Calendar section” (an
ethnographic note: the “Calendar” section of the Los Angeles Times is
the Culture and Entertainment section). First note that Hyla conducts
an out-loud search for “when it was”; she is taking care with this time
formulation. There are many other ways of referring to the time in
question: how many weeks ago; which week of the month; the date; and
so forth. She chooses “the week before my birthday.” And now “I was
looking in the Calendar section”: not “reading the paper”; not “looking
at the Calendar section”; not the “I saw” with which she had initially
begun (at line 8) and so forth. By coselecting these two formulations,
she is “doing” a description of “I was looking for what to do on my
birthday” although not articulating that description.
So, in turns at talk that make up sequences of actions, the elements
of the talk are selected and deployed to accomplish actions and to
do so recognizably; and recipients attend the talk to find what the
speaker is doing by saying it in those words, in that way. Using “words”
or “usages” or “formulations” is a generic organization of practices for
talk in interaction because that talk is designed to do things, things
that fit with other things in the talk—most often the just preceding
ones. Talk in interaction is about constructing actions, which is why
it does not reduce to language; treating talk in interaction only for
its properties as a system of symbols or a medium for articulation or
deploying propositions does not get at its core. And the actions that
are constructed by talk and other conduct in interaction compose, and
are parts of, trajectories or courses of action, which is why a pragmatics
that does not attend to the sequential organization of actions is at risk
for aridity.

The Overall Structural-organization Problem: How Does an
Occasion of Interaction get Structured? What are those Structures?
And How Does Placement in the Overall Structure Inform the
Construction and Understanding of the Talk and Other Conduct as
Turns, as Sequences of Actions, and so Forth?
Some actions are positioned not with respect to turns or sequences
(although they are done in turns and sequences) or the repair space
but by reference to the occasion of interaction as a unit with its own
organization. Greetings and good-byes are the most obvious exemplars,
being positioned at the beginning and ending of interactional occasions,
respectively. Less obvious, perhaps, is that greetings are just one of a
number of action sequence types that may compose an opening phase of
an interaction (Schegloff 1986), and good-byes are the last of a number
of components that make up a closing section of an interaction. What
happens in between can take either of two forms (as far as we know
now)—a state of continuously sustained talk and what we can call a
continuing state of incipient talk (Schegloff and Sacks 1973). The latter
term is meant to refer to settings in which the parties talk for a while
and then lapse into silence (silence that does not prompt a closing of
the interactional occasion), at any point in which the talk may start up
again. Characteristic settings in contemporary industrial societies might
be families or roommates in the living room in the evening, occupants
of a car in a carpool or a long journey, seatmates on an airplane, diners
at table, coworkers at a workbench, and so forth. In some societies, this
may be the default organization of talk in interaction.
Although greetings and good-byes are pretty much tied to their
positions in the overall structural organization, other types of action
may take on a distinctive character depending on where in the overall
structural organization of a conversation they occur. Some types of
action are commonly withheld from occurrence early in a conversation;
“requests” are a case in point. Doing a request early in the organization
of an interaction can be a way of marking its urgency, or some other
feature known to be recognizable to the recipient(s). By contrast, many
kinds of “noticings” are ordinarily meant to occur as soon as possible
after the “noticeable” is detectable. Withholding the noticing from early
enactment can be taken as failing to have noticed the noticeable, or as
treating the noticeable as negatively valenced.
The generic character of the overall structural organization of the
unit “a single conversation” consists straightforwardly in its provision
of the practices for launching and closing episodes of interaction with
the commitments of attention that they place on their participants. If
talk in interaction is going on, the parties will find themselves to be
someplace in it by reference to this order of organization.

Interactional Practices at the Roots of Human
Sociality
Contributions to this volume explore the relations, if any, between the
variability of human culture and language, the workings of human
cognition, and the organization of human interaction. Disciplinarily,
this amounts to a reconciliation of anthropology, ethology, psychology,
and sociology—not a small undertaking. Research on interaction
suggests a number of beginning steps.

Candidate Universals in Human Interaction and Cultural
Variability
As I have intimated, if not stated explicitly, in the preceding sections,
I take the generic orders of organization in talk in interaction to be
candidate universals. Other social species display an organization
of interaction with conspecifics, and there is no compelling reason
that I am aware of for doubting that this holds true for humans. The
capacity of travelers, missionaries, conquerors, and so forth to get on
with host populations they are visiting while ignorant of the language
and culture—both historically and contemporaneously—is, at the very
least, commonsense grounds for this as a starting position. Its import is
that interaction in societies and cultures that appear different from our
own be examined for their solution to what I have termed the generic
organizational issues: how do they allocate opportunities to talk in
interaction and constrain the duration of the talk in those opportunities?
How is the talk in turns designed to embody actions and how are those
actions combined to form courses of action across speakers and other
participants? How are problems of speaking, hearing and understanding
the talk managed? What practices underlie the formulation of what
people talk about—persons, places, actions, and whatever else enters
into their talk? How are occasions of interaction launched (or avoided),
how are they ended, and how is the continuity or noncontinuity of
talk within some occasion organized?
The import of the claim that these organizations are generic is not
that the way talk in interaction is done in the United States, or modern
industrialized societies, is generic; it is that the organizational issues
to which these organizations of practice are addressed are generic.
Conversation analysis is not averse to finding, indeed celebrating, what
appear to be differences in interaction in other cultures, societies, and
languages. In some instances, the differences are readily understood by
reference to differences in the linguistic or cultural resources of that
population; in others, they serve to trigger a search for a more general
and formal account, under which our previous understanding and the
newly encountered one are both subsumable.
Here is an example of the first of these (see Schegloff 1987b). Some
years ago a graduate student working in the highlands of Guatemala
in a village where Quiche was the language reported that same-turn
repairs were initiated very differently there than they were in the several
languages that she knew (Daden and McClaren n.d.). What is most
familiar to speakers of Indo-European languages are cutoffs (e.g., glottal
or dental stops) and sound stretches. But in Quiche, both stops and
stretches were phonemic, and, accordingly, not used by speakers to
initiate repair on the talk earlier in their turn. Long stretches, which
were not phonemic for Quiche, were used to initiate same-turn repair.
On the one hand, the variation in practice could be straightforwardly
traced to differences in the phonemic inventory of the languages; on the
other hand, our understanding of the practices of repair was reinforced
by finding in this very different linguistic and cultural environment a
“place” findable only by reference to the organization of repair—the
initiation of same-turn repair.
Sometimes what appeared to be a major difference in the practices
of talk in interaction turns out, on closer inspection made possible by
modern technology, not to be different at all. For example, a classic
chapter by Reisman (1974) described what he called a “contrapuntal
conversational” system that, in effect, was without any turn-taking
organization at all. Subsequently, Sidnell’s (2001) examination of video-
recorded data from the same area revealed a turn-taking organization
virtually identical to the one described in Sacks et al. (1974).
The second of these examples appears to involve the technology of
observation more than issues of universality or variation; it was the
possibility of examining and reexamining the data at a level of detail
not accessible to one exposure in real time that allowed specification of
where and when each participant began and stopped talking. The first
of the preceding examples, however, is one sort of instance of this issue,
and it exemplifies a familiar polarity in inquiry—a preoccupation with
variation versus a preoccupation with generality. Both are important,
but in the domain we are concerned with, generality seems to me to have
the priority (although not exclusivity). For the dimensions on which
variability is observed and rendered consequential are framed by the
dimensions of generality that render the comparison relevant to begin
with. If I ask you how a pear is different from honesty, you will think
I have a joke or a clever riddle up my sleeve; they lack the common
class membership that renders comparison relevant. The generic
organizations of talk in interaction offer some proposed dimensions of
relevance for talk in interaction per se; languages, cultures and societies
can be examined by reference to these organizations; whether what
is found will be best understood as variability and differences, or as
variations on a same underlying solution to a generic problem, remains
to be found out.
Aside from these organizations of practice, or rather by virtue of them,
certain other features of talk in interaction are plausible candidates for
universal relevance and merit mention here.
One is minimization. For various of the domains we have studied,
the default or base form is the minimal form. For example:
When a party begins talk in a turn, they have initially the right
(and responsibility) to produce one TCU to possible completion
(Sacks et al. 1974). Getting to produce more is contingent on the
conduct of the speaker and of the coparticipants (cf. Schegloff 1982)
to overcome a minimization constraint embodied in the transition
relevance of possible turn completion.
The basic form of a sequence is two turns—the minimum for it
to be a sequence (Schegloff in press; Schegloff and Sacks 1973);
additional turns represent expansions, inspectable for what they
are being used to do.
In referring to someone, there is preference for minimization—that
is, for a single reference form (Sacks and Schegloff 1979); anything
more is marked and is examinable for what else, over and above
simply referring, it is being used to do.
In all of these domains of practice, and others, we find this transparently
simplest design: the minimal form is the unmarked default; special
import is attached to expansions of it.
A second feature is the special character of “nextness,” or next–prior
positioning, operating at various levels of granularity (Schegloff 2000a).
For example:
The turn-taking organization serves to allocate one turn at a time—
next turn.
Absent any provision to the contrary, any turn will be heard as
addressed to the just prior, that is, the one it is next after.
The production and parsing of a turn at talk is by reference to a
succession of “next elements,” where elements can be words, parts
of words, or sounds. This holds as well for the deployment of self-
initiated repair, which turns out to be regularly placed by reference
to “next word” or “next sound” of word.
“Nextness” can operate for sequences; if a sequence type can be
reciprocal (i.e., after Alan initiates to Bill, Bill reciprocates to Alan),
then the default position for the reciprocal is next sequence (most
familiarly in “Alan: How are you, Bill: Fine, and you?”); or, if a
presequence is done (e.g., a summons making an answer relevant
next), then if the response gives a go-ahead, then the base sequence
should occur next (cf. Schegloff 1968).
Most fundamentally, the basic place to look to see how someone
understood a turn is to see what they produced in next turn. In other
words, overwhelmingly talk in interaction is locally organized—one
turn at a time, one sequence at a time, and so forth.
A third feature is a preference for progressivity, again, at work at
various levels of granularity.
Recipients orient to each next sound as a next piece in the developing
trajectory of what the speaker is saying or doing; pauses, cutoffs,
repeats, in-breaths, and the like all involve some interference with
progressivity, and are examinable for what import they have in the
production and recognition of what is going on.
Other initiations of repair are understood as stopping the course of
action that was in progress to deal with some problem in hearing
or understanding the talk, are on that count dispreferred, and may
serve as harbingers of other dispreferred conduct in the offing.
There is plainly a relationship between these three features: progressivity
is realized when some trajectory of action moves from the last-reached
point to the next; delay means something occurs next other than what
was due next; expansion of some unit—a turn, a sequence, a person
reference—beyond its default, minimal realization can constitute a loss
of progressivity, and so forth. The formality of these observations makes
possible examination of a variety of cultural and behavioral settings
as a way of assessing the degree to which, and the levels at which, the
undergirdings of human sociality are species-generic or variable.

Implications for Human Cognition: Action Recognition and ToM
A good starting point for exploring the fit with conversation analysis
is to remark on the obvious point that, whatever is to be found in the
cognitive apparatus, it is not working on a blank field in its engagement
with the world. As central a theme as any in the preceding sketch
of conversation analysis is that talk in interaction is about action
and courses of action (requesting, complaining, asking, answering,
(dis)agreeing, correcting, aligning, etc.). The talk speakers do is designed
to embody one or more actions and to do so recognizably; the uptake
coparticipants manage is designed to recognize what the speaker (and
other coparticipants) mean to be doing with their conduct so as to
underwrite an appropriate next action. (Note the resonance here with
the early discussion on the parsing of action in the chapter by Byrne.) A
ToM has in the first instance to be furnished with methods for designing
talk to do recognizable actions and methods for recognizing the actions
so designed by coparticipants. In a nutshell, that is a large chunk of
what conversation analysis is about. Evidences of this are scattered in
the preceding pages.
Presequences like preinvitations, preannouncements and the like
are designed to be recognizable to recipients as foreshadowing doing
an invitation or an announcement unless the recipient discourages
doing so in their reply. “Are you doing anything tonight?” “Yeah,
I’ve got a paper to write” warns the prospective inviter that an
invitation will be rejected. That is what it is designed to do and do
recognizably. A recipient hears it as something asked not for itself,
not in its own right, but as a harbinger of something contingently to
follow, depending on the response. That is why a question like “Are
you doing anything tonight?” is often met with a return question,
“Why?” The “why” askers know they are not being asked for a
behaviorally accurate account; they are being asked about their
availability. I take it that this is one sort of thing that ToM studies
are interested in. Getting at them will, I think, require knowing
about the organization of adjacency pair based sequences and their
expansions.
How do ordinary sentences accomplish actions recognizable as
requests, announcements, complaints, and so forth? As with virtually
everything in talk in interaction, it is a matter of position and
composition—how the talk is constructed and where it is. Consider,
for example, this exchange when an undergraduate student—
Carol—comes back to her room with her roommates and friends
there.
(7) SN-4, 5:1-13

1   Sherie:     Hi Carol.=
2   Carol:      =H[i: .]
3   Ruthie:       [CA:RO]L, HI::
4   Sherie: ->  You didn' get en icecream sandwich,
5   Carol:      I kno:w, hh I decided that my body didn't need it,
6   Sherie:     Yes but ours di:d=
7   Sherie:     =hh heh-heh-heh [heh-heh-heh [.hhih
8   (??):                       [ehh heh heh [
9   (??):                                    [( )
10  Carol:      hh Awright gimme some money en you c'n treat me to one an
11              I'll buy you a:ll some [too.]
12  Sherie:                            [I'm ] kidding, I don't need it.
13              (0.3)

It matters that Sherie’s turn at line 4 is a noticing. Noticings are meant
to be done as early as possible, and one place that qualifies is just after
coming into mutually visible copresence; here it is done directly after
the exchange of greetings. But to leave it at that would be to miss the
boat.
This is a “possible complaint,”7 and the sequence continues past the
point at which I have ended the transcript, the participants working
it through as a complaint sequence. How is it a complaint? It is not a
matter of divining intentions. Designing one’s talk by formulating an
absence is a way of doing a possible complaint; it is a practice by which
complaining can get done and done recognizably. Not any absence,
of course, and more needs to be said, but this is one direction that
conversation analytic work pursues: how recognizable actions get done
and get recognized as such; here it is the negative formulation that is at
the heart of the practice—a practice for doing “possible complaint.” I
take it that this is another sort of thing that ToM studies are interested
in.

And more generally, formulations are part of the design of some
talk to do some action. For example, referring to a person by name
or by what we call a recognitional description, a speaker can build
into a turn designed to do something else an invitation or demand
to a recipient to recognize who is being referred to as someone that
they know. Or the speaker can refer to that person as “this guy” and
convey that this is not a person the recipient should try to recognize.
Here again, practices of talking build into the talk something for
the recipient to find in it.
This last point exemplifies another practice so central to talk and other
conduct in interaction that it is as compelling a practice as any for
universal status, and that is the practice of recipient design. The things
one talks about with another are selected and configured for who that
other is—either individually or categorically. And how one speaks about
them—what words, reference forms, and so forth are to be used is also
shaped by reference to who the recipient relevantly is at that moment,
for this speaker, at this juncture of this interaction. The centrality of
recipient design may have a profound bearing on ToM and on human
cognition more generally, for what persons are required to deal with in
the mundane intercourse of ordinary interaction is not the broad range
of things that could possibly occur, could possibly require immediate
understanding, and so forth but, rather, a presorted set of elements of
interaction designed for who they relevantly are at that moment in that
interaction. Talk in interaction is, in other words, designed for accessibility
to its recipient, and overwhelmingly successfully so. This is the first line
of defense of intersubjectivity and common ground. The demands on
cognition—at least for interaction—are thereby substantially reduced
and shaped. It is because the conditions of language use in ordinary
interaction are very different from those in the discourse of logic and
science that the problems that natural language poses for logic and
science do not arise in quotidian talk in interaction. The relevant ways
of studying human cognition may, therefore, not be ones designed for
anonymous “subjects,” because that is not what human cognition for
interaction is designed to deal with.
Closing
Let me end by repeating some of the final words of Erving Goffman’s
Presidential Address to the American Sociological Association, “The
Interaction Order” (1983:17). He wrote:
For myself, I believe that human social life is ours to study
naturalistically, sub specie aeternitatis. From the perspective of the physical and
biological sciences, human social life is only a small irregular scab on the
face of nature, not particularly amenable to deep systematic analysis.
And so it is. But it’s ours.

And, one might now add, it is only this species’ social life that has
made possible those physical and biological sciences, and the very
notion of “deep systematic analysis.”
Although Goffman was virtually apologetic for the stature of interaction
studies when put next to traditional studies of social structure, this was
a comparison forced on him by a career in sociology and a presidential
address appropriately shaped for practitioners of its entire reach. In the
present context, interaction studies need no apology, nor is it necessary
to eschew the possibility of deep, systematic analysis. Such studies offer
the possibility of connecting the disparate threads of anthropological,
ethological, linguistic, psychological, and sociological inquiry, bringing
us closer to an understanding of human sociality, and, with it, of what
makes us distinctively human in the first place.

Notes
1. I mean to include under this term “talk” implemented by sign language
and other forms of communication in interaction that share the basic
characteristics of vocalized talking; so telephone conversation but not
computer chats, for the former are synchronous moment to moment and the
latter are not. It should go without saying (although the contemporary use of
the term multimodal interaction suggests otherwise) that “talk in interaction”
should be understood as “talk and other conduct in interaction,” that is, as
including posture, gesture, facial expression, ongoing other activities with
which the talk may be cotemporal and potentially coordinated, and any other
features of the setting by which the talk may be informed and on which it
may draw.
2. Ideally this account would be supplemented by empirical exemplars of
the several organizations of practices that are here discursively described, but,
with a few exceptions, this is not possible within our space limitations. It will
have to suffice to refer the reader to the works in which these organizations
have been introduced: Schegloff and Sacks (1973) on overall structural
organization; Sacks et al. 1974 on turn taking; Schegloff et al. 1977 on repair;
Schegloff 1996d on turn organization; and Schegloff and Sacks 1973, Sacks
1992, vol. 1:521ff., and Schegloff n.d.b, in press, on sequence organization.
Some works in which further specification of practices within these domains
has been advanced are: Schegloff 1982 and Lerner 2002 on turn taking;
Schegloff 1979, 1992b, 1997, 2000c on repair; Lerner 1991, 1996 on turn
organization; and Schegloff 1996a on action formation. Work designed as
exercises displaying how the conduct of analysis works, and how it supports
the stances adopted in this kind of inquiry are Schegloff 1987a and 1996b.
3. Two sorts of exception should be mentioned here. One involves the
claim that there is a place in which talk in interaction is not so organized, as
in Reisman’s (1974) claim for “contrapuntal conversation” in Antigua; Sidnell
(2001) casts considerable doubt on Reisman’s account. The other involves
specifications of where in conversation the “one at a time” claim does not hold,
for example Lerner (2002) on “choral co-production” or Duranti (1997) on
“polyphonic discourse”; here the phenomenon being described is virtually
defined as an object of interest by its departure from the otherwise default
organization of talk. Work on “overlapping talk” (e.g., Jefferson 1984, 1986,
2004; Schegloff 2000b, 2002) locates the topic by reference to its problematic
relation to the default one-at-a-time organization.
4. For an analysis of quite an elaborate sequence—125 lines of transcript
composing a single sequence, see Schegloff 1990.
5. The way repair is organized can have the consequence that it is sometimes
initiated at a greater “distance” from the trouble while still being within the
boundaries that can here be only roughly characterized. For an account of
this, see Schegloff 1992b.
6. To conserve time and space, I have omitted the practices of turn
construction as a generic organization in talk in interaction, although it has a key role
in the organization of turn taking, on the one hand, and the organization of
sequences, on the other hand (cf. Schegloff 1996d).
7. This sequence is explicated in some detail in Schegloff 1988:118–131.
It may be useful to clarify the usage here and in some other conversation-
analytic writing of the term format “a possible X,” as in the text’s “a possible
complaint.” What follows is taken from Schegloff 1996d:116–117 n. 8:
The usage is not meant as a token of analytic uncertainty or hedging.
Its analytic locus is not in the first instance the world of the author and
reader, but the world of the parties to the interaction. To describe some
utterance, for example, as ‘a possible invitation’ (Sacks 1992: I: 300–2;
Schegloff 1992a:xxvi–xxvii) or ‘a possible complaint’ (Schegloff 1988:
120–2) is to claim that there is a describable practice of talk-in-interaction
which is usable to do recognizable invitations or complaints (a claim
which can be documented by exemplars of exchanges in which such
utterances were so recognized by their recipients), and that the utterance
now being described can be understood to have been produced by such
a practice, and is thus analyzable as an invitation or as a complaint. This
claim is made, and can be defended, independent of whether the actual
recipient on this occasion has treated it as an invitation or not, and
independent of whether the speaker can be shown to have produced
it for recognition as such on this occasion. Such an analytic stance is
required to provide resources for accounts of ‘failures’ to recognize an
utterance as an invitation or complaint, for in order to claim that a
recipient failed to recognize it as such or respond to it as such, one must
be able to show that it was recognizable as such, i.e. that it was ‘a possible
X’—for the participants (Schegloff n.d.b, to appear [sic; in press]). The
analyst’s treatment of an utterance as ‘a possible X’ is then grounded
in a claim about its having such a status for the participants. (For an
extended exploration of how a form of turn construction—repetition—
can constitute a practice for producing possible instances of a previously
undescribed action—‘confirming allusions,’ cf. Schegloff 1996b.)

References
Albert, E. 1964. “Rhetoric,” “Logic,” and “Poetics” in Burundi: Cultural
patterning of speech behavior. In The ethnography of communication.
Special issue of the American Anthropologist 66:6, vol. 2, edited by
J. J. Gumperz and D. Hymes, 35–54. Menasha, WI: George Banta
Publishing.
Daden, I., and M. McClaren. n.d. Same turn repair in Quiche (Maya)
Conversation: An initial report. Unpublished paper, Department of
Anthropology, University of California, Los Angeles.
Duranti, A. 1997. Polyphonic discourse: Overlapping in Samoan
ceremonial greetings. Text 17:349–381.
Fox, B. A., M. Hayashi, and R. Jasperson. 1996. Resources and repair: A
cross-linguistic study of syntax and repair. In Interaction and grammar,
edited by E. Ochs, E. A. Schegloff, and S. A. Thompson, 185–237.
Cambridge: Cambridge University Press.
Garfinkel, H. 1967. Studies in ethnomethodology. Englewood Cliffs, NJ:
Prentice-Hall.
Goffman, E. 1963. Behavior in public places: Notes on the social organization
of gathering. New York: Free Press.
——. 1983. The interaction order. American Sociological Review 48:1–17.
Inge, W. 1958. The dark at the top of the stairs. New York: Random
House.
Jefferson, G. 1984. Notes on some orderlinesses of overlap onset. In
Discourse analysis and natural rhetorics, edited by V. D’Urso and P.
Leonardi, 11–38. Padova: CLEUP Editore.
——. 1986. Notes on “latency” in overlap onset. Human Studies 9:153–183.
——. 2004. A sketch of some orderly aspects of overlap in ordinary
conversation. In Conversation analysis: Studies from the first generation,
edited by G. Lerner, 43–62. Amsterdam: Benjamins.
Lerner, G. H. 1991. On the syntax of sentences-in-progress. Language
in Society 20:441–458.
——. 1996. On the “semi-permeable” character of grammatical units
in conversation: Conditional entry into the turn space of another
speaker. In Interaction and grammar, edited by E. Ochs, E. A. Schegloff,
and S. A. Thompson, 238–276. Cambridge: Cambridge University
Press.
——. 2002. Turn-sharing: The choral co-production of talk in interaction.
In The language of turn and sequence, edited by C. A. Ford, B. A. Fox,
and S. A. Thompson, 225–256. Oxford: Oxford University Press.
——. 2003. Selecting next speaker: The context-sensitive operation of
a context-free organization. Language in Society 32:177–201.
Reisman, K. 1974. Contrapuntal conversations in an Antiguan village.
In Explorations in the ethnography of speaking, edited by R. Bauman
and J. Sherzer, 110–124. Cambridge: Cambridge University Press.
Sacks, H. 1972a. An initial investigation of the usability of conversational
data for doing sociology. In Studies in social interaction, edited by D.
N. Sudnow, 31–74. New York: Free Press.
——. 1972b. On the analyzability of stories by children. In Directions
in sociolinguistics: The ethnography of communication, edited by J. J.
Gumperz and D. Hymes, 325–345. New York: Holt, Rinehart and
Winston.
——. 1992. Lectures on Conversation, edited by G. Jefferson, 2 vols. With
introductions by E. A. Schegloff. Oxford: Blackwell.
Sacks, H., and E. A. Schegloff. 1979. Two preferences in the organization
of reference to persons and their interaction. In Everyday language:
Studies in ethnomethodology, edited by G. Psathas, 15–21. New York:
Irvington Publishers.
Sacks, H., E. A. Schegloff, and G. Jefferson. 1974. A simplest systematics
for the organization of turn-taking for conversation. Language
50:696–735.
Schegloff, E. A. 1968. Sequencing in conversational openings. American
Anthropologist 70(6):1075–1095.
——. 1972. Notes on a conversational practice: Formulating place. In
Studies in social interaction, edited by D. N. Sudnow, 75–119. New
York: Free Press.
——. 1979. The relevance of repair for syntax-for-conversation. In Syntax
and semantics 12: Discourse and syntax, edited by T. Givón, 261–288.
New York: Academic Press.
——. 1982. Discourse as an interactional achievement: Some uses of “uh
huh” and other things that come between sentences. In Georgetown
University roundtable on languages and linguistics 1981; Analyzing
discourse: Text and talk, edited by D. Tannen, 71–93. Washington,
DC: Georgetown University Press.
——. 1986. The routine as achievement. Human Studies 9:111–151.
——. 1987a. Analyzing single episodes of interaction: An exercise in
conversation analysis. Social Psychology Quarterly 50(2):101–114.
——. 1987b. Between macro and micro: Contexts and other connections.
In The micro-macro link, edited by J. Alexander, B. Giesen, R. Munch,
and N. Smelser, 207–234. Berkeley: University of California Press.
——. 1988. Goffman and the analysis of conversation. In Erving Goffman:
Exploring the interaction order, edited by P. Drew and A. Wootton,
89–135. Cambridge: Polity Press.
——. 1989. Reflections on language, development, and the interactional
character of talk-in-interaction. In Interaction in human development,
edited by M. Bornstein and J. S. Bruner, 139–153. Hillsdale, NJ:
Erlbaum.
——. 1990. On the organization of sequences as a source of “coherence” in
talk-in-interaction. In Conversational organization and its development,
edited by B. Dorval, 51–77. Norwood, NJ: Ablex Publishing.
——. 1992a. Introduction. In Lectures on Conversation, vol. 1. H. Sacks,
author, edited by G. Jefferson, ix–lxii. Oxford: Blackwell.
——. 1992b. Repair after next turn: the last structurally provided place
for the defense of intersubjectivity in conversation. American Journal
of Sociology 95(5):1295–1345.
——. 1996a. Confirming allusions: Toward an empirical account of
action. American Journal of Sociology 102(1):161–216.
——. 1996b. Issues of relevance for discourse analysis: Contingency in
action, interaction and co-participant context. In Computational and
conversational discourse: Burning issues—An interdisciplinary account,
edited by E. H. Hovy and D. Scott, 3–38. Heidelberg: Springer
Verlag.
——. 1996c. Some Practices for Referring to Persons in Talk-in-Interaction:
A Partial Sketch of a Systematics. In Studies in anaphora, edited by B.
A. Fox, 437–485. Amsterdam: Benjamins.
——. 1996d. Turn organization: One intersection of grammar and
interaction. In Interaction and grammar, edited by E. Ochs, E. A.
Schegloff, and S. A. Thompson, 52–133. Cambridge: Cambridge
University Press.
——. 1997. Practices and actions: Boundary cases of other-initiated
repair. Discourse Processes 23:499–545.
——. 2000a. On granularity. Annual Review of Sociology 26:715–720.
——. 2000b. Overlapping talk and the organization of turn-taking for
conversation. Language in Society 29(1):1–63.
——. 2000c. When “others” initiate repair. Applied Linguistics 21(2):205–243.
——. 2002. Accounts of conduct in interaction: Interruption, overlap
and turn-taking. In Handbook of sociological theory, edited by J. H.
Turner, 287–321. New York: Plenum Press.
——. in press. Sequence organization in interaction: A primer in conversation
analysis I. Cambridge: Cambridge University Press.
——. n.d.a. An Introduction to Turn-Taking. Unpublished MS,
Department of Sociology, University of California, Los Angeles.
——. n.d.b. Sequence organization. Unpublished MS, Department of
Sociology, University of California, Los Angeles.
Schegloff, E. A., G. Jefferson, and H. Sacks. 1977. The preference for
self-correction in the organization of repair in conversation. Language
53(2):361–382.
Schegloff, E. A., and H. Sacks. 1973. Opening up closings. Semiotica
8:289–327.
Scollon, R., and S. Scollon. 1981. Narrative, literacy and face in interethnic
communication. Norwood, NJ: Ablex.
Sidnell, J. 2001. Conversational turn-taking in a Caribbean English
Creole. Journal of Pragmatics 33(8):1263–1290.
Sperber, D., and D. Wilson. 1986. Relevance: Communication and cognition.
Cambridge, MA: Harvard University Press.
Terasaki, A. K. 2004. Pre-announcement sequences in conversation.
In Conversation analysis: Studies from the first generation, edited by G.
Lerner, 174–223. Amsterdam: Benjamins.
three

Human Sociality as Mutual Orientation in a Rich Interactive
Environment: Multimodal Utterances and Pointing in Aphasia
Charles Goodwin

A primordial site for the study of human sociality can be found in a
situation in which multiple participants are carrying out courses of
action together, frequently through use of language.1 These situations
are not only pervasive, but in their intricacy, their processes of dynamic
change, and the range of resources they draw on, quite unlike anything
else found in the animal kingdom (although building from processes
found in other animals). The practices used to build collaborative action
frequently encompass a range of quite diverse phenomena including
language structure, gesture, participation frameworks, practices for
seeing and formulating structure in the environment, and embodied
action and tool use. This diversity has frequently obscured the intrinsic
organization of the process itself. For example, in part because of the way
in which the human sciences have each claimed distinctive phenomena,
language structure was treated as the special domain of linguistics, and the
organization of action through language was not a focus of mainstream
sociology (despite most important work by the Prague school, Boasian
linguistic anthropology, Bakhtin and his followers, Mead, Goffman,
and Bateson, and most recently conversation analysis).
To build collaborative action, each party must in some relevant sense
understand the nature of the activities they are engaged in together.
The accomplishment of joint action is also a central environment for
cognitive activity. The ability of participants to publicly scrutinize both
what each other is doing, and the unfolding structuring of events is
central to this process. Note that in many cases what must be attended
to extends far beyond talk, to encompass, for example, the embodied
activity of hearers. Following Wittgenstein (1958), this suggests a public
order of multimodal sign use lodged within action.
In this chapter, I examine such issues in the following way. First,
language provides a central resource for the organization of action
within human interaction. Drawing on some of my earlier research
(Goodwin 1981), I begin by investigating the production of individual
utterances as multiparty activities, something done through the
collaborative actions of both a speaker and a hearer. During this process
hearers are largely (although not completely) silent. They display how
they are participating in the activities of the moment through use of
their visible bodies. The construction of an utterance in face-to-face
interaction is not only a multiparty activity, but also a multimodal
one, something that is accomplished through the joint interplay of
structurally different kinds of sign systems, including both the language
of the speaker, and the embodied displays of both the hearer and the
speaker. Our default practices for representing such events, especially
writing (but also parties’ own later reports about what happened in an
encounter, i.e., they talk about what others “said”), typically privilege
one component of this process, language, that is what was said, while
rendering other embodied displays, and just about everything the hearer
did, invisible. This leads quite easily to an ideology in which language
is conceptualized as an isolated self-contained system, the outcome
of private psychological processes situated within a single individual,
the speaker, rather than as a form of public practice lodged within the
organization of action within human interaction.
I then look at the pointing activities of Chil, a man with very severe
aphasia. Despite his almost complete lack of productive language, he
nonetheless acts as a powerful speaker in conversation. He accomplishes
this by using a range of meaning-making practices beyond language
itself to bring phenomena to the attention of his interlocutors who
attribute relevant communicative intentions to his actions and who
work hard to figure out what he wants to tell them. This calibration
of meaning is accomplished through the sequential organization of
talk in interaction. The way in which Chil uses systematic practices to
get others to produce the language he needs again demonstrates the
relevance of focusing on the public organization of collaborative action
within interaction. This analysis also contributes to the set of chapters
that focus on pointing in this volume.

Embodied Hearers and Language Structure


In an important recent volume, Tomasello (2003) offers a detailed,
usage-based theory of language acquisition as an alternative to, and
critique of, Chomsky’s theory of an innate language module. Central to
Tomasello’s argument is a distinctively human form of intentionality,
the ability to recognize in actions embedded communicative intentions,
a process that is lodged within the common ground provided by joint
attentional frames (Tomasello 2003:22–26). Tomasello describes the
mutual attentiveness of speaker and addressee within a framework
that focuses primarily on the mental life of the actors. It is, however,
possible to investigate crucial aspects of this process as forms of public
practice. The gaze direction of an actor (which is typically displayed
not only by the eyes themselves but also through the head and postural
configuration) allows others to make inferences about what that party is
attending to. Goodwin (1981, 2000b) finds that speakers who discover
that they do not have the gaze of their addressee interrupt the utterance
in progress. Fig. 3.1 provides an example.

Figure 3.1. Requesting the gaze of a hearer.

Marking the sentence in progress as defective precisely at the point
at which the absence of a hearer is discovered provides some evidence
first, that speakers treat mutual attentiveness as something that is
demonstrated through public, embodied sign use, and second that
the visible coparticipation of a hearer is central to the constitution
of an utterance. Further support for this is provided by what happens
next. Independent of content, the marked interruption of the emerging
utterance’s prosodic contour before it has reached a point of possible
completion produces a very salient signal in the stream of speech.
Immediately after hearing this, hearers typically start to move their
gaze to the speaker, who now produces a version of the utterance that
is visibly being attended to by a hearer. In essence, the very noticeable
phrasal break acts as a request for a hearer. When the speaker actually
gazes at the nongazing hearer, the utterance in progress at that point is
typically abandoned, and a new sentence begun after the phrasal break.
However, if the speaker has not actually gazed at the nonattending
hearer, what follows is frequently a pause, a silence during which
the hearer moves gaze to the speaker. When this happens, the unit in
progress is continued.

Repairs and the Display of Language Structure


Such public practices for negotiating a state of mutual attentiveness are
relevant to another issue raised by Tomasello (2003:38–39), that of how
someone (such as a baby) who does not yet know a language can figure
out how to segment the stream of speech into relevant subunits (see also
Pinker 1994:267 for the argument that this problem demonstrates the
necessity of innate linguistic knowledge). Despite persistent claims by
linguists and psycholinguists that people are unaware of the pervasive
“performance errors” in their actual speech, the way in which parties
who have not been attending to the speaker immediately start to shift
their gaze after such a phrasal break demonstrates that these breaks are not only
heard but treated as making relevant particular kinds of subsequent
action (see also Clark and Fox Tree 2002). Moreover, insofar as the
phrasal breaks used to request the gaze of a hearer involve not only a
rupture in the emerging syntax of the utterance in progress but also
a very noticeable cutoff of the current prosodic contour, they can be
recognized even by someone who has not yet mastered the structure
of that language. Could they be in any way useful to someone in such
a position? Precisely because of the way in which such speech errors
disrupt smooth syntactic flow they provide crucial information about
the structure of the language in progress. Consider Fig. 3.2.

Figure 3.2. Displaying slots and alternatives.

As is characteristic of repair in conversation (Schegloff 1979, 1987;
Schegloff et al. 1977) the talk that follows the cutoff reuses, although
with significant changes, some of the initial talk. Such repetition
has the effect of delineating the boundaries and structure of many
different units in the stream of speech (for a related argument about
the importance of the visibility of such parsing see Byrne this volume).
Thus, by analyzing what is the same and what is different in these
examples one is able to discover: first, where the stream of speech
can be divided into significant subunits; second, that alternatives are
possible in a particular slot; third, what some of these alternatives are
(here different pronouns); and fourth, that these alternatives contrast
with each other in some significant fashion, or else the repair would
not be warranted. In essence, these repairs provide a distributional
analysis of relevant phenomena in the stream of speech, and, indeed,
their form is in many respects analogous to techniques developed by
linguists, such as elicitation frames and minimal pairs, for determining
structure in the stream of speech.
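
The comparison just described can be sketched in a few lines of Python. This is only an illustrative toy, not a method used in the chapter: the function name analyze_repair, the convention that a trailing hyphen marks the cut-off word, and the use of the standard library's difflib for the alignment are all assumptions made for the example. It takes the fragment quoted from Fig. 3.2 ("We went t- I went to"), splits it at the cutoff, and reports what is repeated and what contrasts.

```python
from difflib import SequenceMatcher


def analyze_repair(utterance, cutoff_marker="-"):
    """Split an utterance at its first cut-off word and compare both halves.

    Returns a dict with the repeated frame, the contrasting alternatives
    found in the same slot, and words that appear only in the redone talk.
    """
    words = utterance.split()
    # Assumption: a trailing hyphen marks the cut-off word, as in "t-".
    cut = next(i for i, w in enumerate(words) if w.endswith(cutoff_marker))
    before, after = words[:cut], words[cut + 1:]

    # Align the two versions word by word, ignoring case.
    matcher = SequenceMatcher(a=[w.lower() for w in before],
                              b=[w.lower() for w in after],
                              autojunk=False)
    result = {"frame": [], "alternatives": [], "added": []}
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":        # repeated material: the stable frame
            result["frame"].extend(after[b0:b1])
        elif tag == "replace":    # same slot, different filler
            result["alternatives"].append((before[a0:a1], after[b0:b1]))
        elif tag == "insert":     # present only in the second version
            result["added"].extend(after[b0:b1])
    return result


analysis = analyze_repair("We went t- I went to")
print(analysis)
```

On this fragment the repeated frame is "went", the slot before it holds the contrasting pronouns "We" and "I", and "to" shows up only in the redone version, where it completes the cut-off "t-". A learner's task is of course far richer than string alignment; the sketch only makes concrete the sense in which repetition with variation exposes slots and their alternatives.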
Repairs in other examples not only delineate basic units in the stream
of speech (e.g., noun phrases) but also demonstrate the different forms
such units can take, and the types of operations that can be performed
on them (see Goodwin 1981:170–173). Consider Fig. 3.3.

Figure 3.3. Decomposing a noun phrase.

The repair in this utterance provides a range of information about
structures utilized in the language. First, it separates out a relevant unit,
a noun phrase, from the stream of speech. Second, it shows where that
unit can itself be subdivided (see Byrne this volume). Third, it provides
an example of the type of unit, an adjective, that can be added to the
noun phrase. Fourth, it locates at least one place in the noun phrase in
which such an addition is permitted. Finally, in the contrast between the
first and second version of the noun phrase, the repair shows that such
an addition is optional. Thus, insofar as repairs provide for significant
differences in form to be displayed within a context of repetition, they
give clear information about contrasts within the language that are
significant to its users, as well as information about how the stream of
speech is divided into appropriate units, the operations that are possible
on those units, and the combinations they can form.
Repairs further require that a listener learn to recognize that not all of
the sequences within the stream of speech are possible sequences within
the language, for example that in Fig. 3.2 “I” does not follow “to” in “We
went t- I went to. . . .” To deal with such a repair, a hearer is required to
make one of the most basic distinctions posed for anyone attempting
to decipher the structure of a language: to differentiate what are and
are not possible sequences in the language, that is between grammatical
and ungrammatical structures. The fact that this task is posed may be
crucial to any learning process. If the party attempting to learn the
language did not have to deal with ungrammatical possibilities, if, for
example, he or she were exposed to only well-formed sentences, he or
she might not have the data necessary to determine the boundaries, or
even the structure of the system. Chomsky’s (1965) argument that the
repairs found in natural speech so flaw it that a child is faced with data
of very “degenerate quality” is unwarranted. Rather, it might be argued
that if children grew up in an ideal world in which they heard only well-
formed sentences they would not learn to produce sentences themselves
because they would lack the analysis of their structure provided by
events such as repair.
These practices also contain, as part of their organization, a public
structure of intentionality, a displayed reason for why the speaker is
repairing the talk in progress. Thus, in Fig. 3.2 the party being talked
about has been misidentified, and this is remedied by the change in
pronouns. Noticeably missing is any indication that the talk is being
disrupted to request the orientation of a hearer. Consider what would
happen if the addressee’s disattention were officially and explicitly
noted, for example with a request such as “Look at me.” The talk in
progress would shift from what the speaker had been in the process
of saying to talk about the current orientation of the participants
toward each other. This would be a very poor way to get the hearer to
listen to what the speaker had been in the process of saying. By way of
contrast the salient repair with its visible reorganization of the emerging
utterance draws heightened focus to the details of the talk in progress.
It displays a reason for its occurrence that is functionally adapted to
the specific tasks it is accomplishing (see also Goodwin 1987). There is
visible motivation for the speaker to perform this action now (to repair
something in the talk), and for the hearer to give the talk in progress
heightened attention.
Speakers’ ongoing analysis of their hearers can have a range of other
significant consequences for the content and organization of the talk
in progress. Goodwin (1981) describes how a speaker who addresses
three separate hearers during a single sentence by moving his or her
gaze from one to the other, changes the emerging content and structure
of the sentence in progress at each gaze shift so as to maintain the
appropriateness of the talk of the moment for its current addressee.
The sentence that finally gets spoken is not the one that the speaker
began with. What seems crucial in such a process is not the syntactic
organization of the final sentence, a single complex tree structure for
example, but, rather, the way in which each emerging unit of talk
projects a constrained but nonetheless variable range of possible next
units that might follow it. Parties building action through use of units
with these properties are working with resources that simultaneously
provide both rich structure and enormous, although constrained,
flexibility. This can be exploited on a moment by moment basis to
adapt in a relevant fashion to changing circumstances while still visibly
remaining within the framework provided by an existing course of
action (such as an emerging sentence). Moreover this is not simply a
linguistic or a symbolic process. When the actions of the hearer are taken
into account it constitutes a distinctively human form of collaborative
social organization.

Pointing in Aphasia
The pointing activities of Chil, a man with severe aphasia, will now
be examined. This phenomenon is relevant to the study of the interactive
infrastructure of human sociality, and to other work in this volume, in
a number of different ways. First, it provides a particularly clear, indeed
dramatic, example of how human meaning and action are constructed
through systematic processes of human interaction. Chil requires
others to produce the words he needs to say something meaningful.
Second, pointing is a topic of a number of other chapters in this volume
including Tomasello’s on pointing (or rather its absence) in apes,
Liszkowski’s on pointing in infants and Goldin-Meadow’s on pointing
in both deaf children communicating through home sign and learners
and teachers working on math problems. Chil’s situation provides yet
another perspective on pointing, a phenomenon that has emerged as
an interesting subtheme in this volume and the conference that led to
it. Simultaneously the methods used by Chil and his interlocutors to
construct meaning together are instances or variations of more general
practices described by Schegloff, such as repair (Schegloff et al. 1977), for
the organization of action in talk in interaction and require the recipient
design and mutual relevance noted by Levinson, Enfield, and others.
The way in which Chil draws heavily on structure that is already present
in his lifeworld is relevant to Clark’s discussion of common ground.
In 1979, when Chil was 65 years old, a blood vessel in the left
hemisphere of his brain ruptured. He was left completely paralyzed on
the right side of his body and with a vocabulary that consisted of only
three words: yes, no, and and. Despite this, he continued to function as
a powerful actor in conversation, and indeed had an active social life
in his community, going by himself to a coffee shop in the morning,
doing some of the family shopping, and so forth.
Chil was my father. I visited him several times a year from the time of
his stroke in 1979 until his death in 2000. In 1992, I began to videotape
him, eventually recording approximately 210 hours of interaction in
which Chil was a participant. None of the recordings were in clinical
environments. Most took place in his home, although a few were made
in settings such as stores where Chil was shopping. The sequences to
be examined here were recorded in 1995 and 1997, 16–18 years after
his stroke, when Chil was in his early eighties. In most of them Chil is
sitting in his kitchen talking to me, his son Chuck, who was then in
his early fifties.
How is it possible for someone with a vocabulary of three words to say
something relevant and perform complicated action in conversation? In
brief, by creatively using the sequential organization of action in human
interaction Chil got others to produce the language he needed to say
what he wanted to say (Goodwin 1995, 2003b, 2004). Despite his lack
of productive language, Chil possessed a wide and important range of
communicative resources that could be used to guide his interlocutors.
First, his ability to understand what others were saying was excellent.
Second, he was able to use prosody to display both affect and a range of
subtly differentiated stances toward talk, other participants and events.
Indeed, what appear in printed transcripts as strings of nonsense
syllables (e.g., “ih dih dih dih dih:::!”) frequently function as carriers
for subtle and quite expressive prosodic tunes (Goodwin et al. 2002).
When placed precisely with reference to the actions of others such
expressive prosody could create a variety of locally relevant actions.
Moreover by producing single words or strings of his three words—for
example “No no no”—with varying prosody he could use his limited
vocabulary to create a range of quite different actions with varying
meaning (see also Stivers 2004). Indeed, when the unit being analyzed
includes both his words and their intonation, it would be accurate to
say his vocabulary was larger than written versions of his semantic
repertoire would indicate.
Third, Chil was able to produce many different kinds of gestures. In
Fig. 3.4 during line 1 Chil points toward a bagel he has just tasted.
In line 2 Chuck responds to the pointing gesture (indeed his deictic
“it” ties his talk to the target of Chil’s point). Perfectly consistent with
the arguments of Tomasello (Tomasello 2003, this volume), Chuck
recognizes that with his gesture Chil is intending to communicate
something to his addressee, to focus the attention of his addressee on
something. By using Chil’s pointing gesture as the point of departure
for subsequent action that attempts to explicate it Chuck is treating
Chil as someone with a rich mental life, as someone who is trying to say
something relevant through his gesture, rather than as a body waving
its hand around randomly.

Figure 3.4. Multimodal assessment.

Such attributions of intentionality are frequently argued to be essential
for crucial forms of human action (see, e.g., Tomasello, Levinson, Enfield,
and others in this volume), and indeed what distinguishes us from other
animals. It would, however, be possible to interact with Chil and not
make such attributions, for example to treat the nonsense syllables
in line 1 as the incoherent ravings of an idiot. When Chil was in the
hospital several days after his stroke doctors inserted a catheter. As they
were doing this Chil vividly gestured and spoke, although without being
able to produce meaningful language. The doctors dismissed what he
was doing as the ravings of a man whose brain had visibly just suffered
great injury, and did not in any way treat his talk and gesture as relevant
to what they were doing. Several days later they discovered that they had
inserted the catheter inappropriately, and that Chil had been attempting
to tell them this. Attributing communicative intent to another’s sounds
and gesture and thereby treating that person as a full-fledged human
being capable of performing relevant, consequential action, and being
willing to do the work to find out what the other is saying, thus has
not only a cognitive dimension, but also a moral one.
Recognizing that Chil’s pointing finger embodies the intention to
indicate something to Chuck is not, however, sufficient for Chuck,
or an addressee in general, to grasp the action being performed by
the gesture. The addressee is not being asked to simply attend to the
indicated object, to contemplate it, but instead to construe it in a way
that is relevant to the activities in progress at the moment, and to
use the pointing gesture as the point of departure for a relevant next
move. In line 2 Chuck proposes that Chil was making an assessment,
that he was telling Chuck that he “likes” the indicated bagel that he
just tasted. What leads Chuck to understand the pointing gesture in
just this way, or more generally what are the practices through which
the intelligibility of Chil’s gesture as a specific, locally relevant form of
action is achieved?
In brief, I argue that Chil’s gesture does not stand alone as an isolated
pointing hand, but is instead elaborated by a number of other
co-occurring signs, including a range of quite different kinds of embodied
displays. This multimodality is not specific to aphasia, but is instead
quite general in the organization of human gesture and action. However,
because of Chil’s inability to explicate his gestures with rich, explicit
language, what he is trying to say and do with his gestures is calibrated
with his interlocutors through distinctive action sequences that do
not typically occur after the gestures of fluent speakers (in essence his
addressee provides a candidate understanding after each gesture, which
Chil then accepts or rejects). Through this process very general forms of
practical logic that are central to the organization of gesture, specifically
the way in which the intelligibility of gesture is accomplished through
the mutual elaboration of the gesture and the talk that accompanies
it, are sustained, but with a significant rearrangement of participant
roles. The activity of making a meaningful gesture is here accomplished
through the collaborative work of multiple actors, with one party, Chil,
producing the gesture and someone else the talk that explicates it.
First, Chil’s gesture occurs within a joint attentional frame (Tomasello
this volume), a participation framework (Goodwin 1981, 2000a, 2002;
Kendon 1990) constituted through the embodied mutual orientation of
Chil and Chuck. Spatially Chil’s pointing gesture is organized not only
to indicate the object that is the target of the point (the bagel under his
index finger) but also with reference to the gaze of its addressee. Chil’s
pointing finger is positioned right where Chuck is looking. This is not
accidental. In other data (Goodwin 2003d) Chil can be seen actively
working to line up a recipient’s gaze before proceeding to produce a
relevant gesture.
It is common to speak, sometimes loosely, about the embodied nature
of human action, cognition, and experience. It is therefore important
to note that Chil’s body is contributing to the organization and
intelligibility of the action in progress in a number of quite different
ways (Goodwin 2000a, 2002). His pointing gesture is bringing something
in the immediate environment to the attention of his addressee, and
is about that object. Like individual utterances in conversation, the
temporal duration of the gesture is quite short and linked to what
is being said and done at the current moment. By way of contrast,
his embodied orientation toward his addressee, and the multiparty
participation framework it helps to construct, is not about the specifics
of what is being talked about at the moment (here a particular bagel), but
instead about the relevant orientation of the participants toward both
each other and the events they are attending to in common. Moreover, it
has a quite different temporal duration. Rather than changing moment
by moment as the talk unfolds, such participation frameworks can
frame extended stretches of focused interaction, multiple topics and
so forth. Although both are displayed through the visible body, the
gesture and the participation framework are structurally different kinds
of sign systems. The action in progress is not only multimodal but also
constructed in part through the mutual elaboration of quite different
kinds of sign systems (e.g., the participation framework makes visible the
joint attentional frame that enables Chil’s pointing gesture to function
as a communicative action).
Second, as Chil points he also speaks. Although the syllables in line 1
lack semantic content, their prosody can be heard as displaying
appreciation, indeed enthusiasm.2 Chuck’s “Oh you like it” (line 2) proposes,
correctly (line 3), that the evaluation visible in Chil’s prosody is what
he wants to say about the bagel he is pointing at.
The use of pointing and embodied displays by persons suffering from
aphasia to compensate for limited language structure is not unique to
Chil. Wilkinson et al. (2003) describe how an aphasic man is able to
communicate quite fluently shortly after a stroke by producing limited
deictic terms in his talk while pointing toward a relevant enactment
(of someone not able to walk) being done with his feet, and how such
economy enables him to maintain crucial features of the organization
of talk in interaction.
Chil’s talk, specifically its prosody, and his gesture form a larger
package of meaning-making practices within which each elaborates
the other. Hearers can use the talk to figure out why the indicated object
is being brought to their attention, what is being said about it, and
simultaneously can use the pointing gesture to locate what the talk is
appreciating. Looking at this from a slightly different perspective, Chil’s
prosody is making a comment about the entity topicalized through this
gesture. Understanding Chil’s gesture requires not only recognizing that
it is embodying a communicative intention but also taking into account
the other meaning-making practices Chil is using to contextualize it
(Goodwin 2003b).
The mutual elaboration of talk and gesture is not unique to Chil or
aphasia, but instead is central to the organization of action in fully fluent
speakers as well. There is, however, one way in which Chil’s gestures
differ from those of normal speakers. A single individual typically
produces gesture and the language structure that contextualizes it.
Indeed the characteristic locus of these different but interrelated forms
of expression within a single individual has formed the basis for much
important research, which explains the relationship between language
and gesture as parallel outcomes of a unitary psychological process,
as in McNeill’s analysis (McNeill 1992) of two interrelated forms of
expression that emerge from a common growth point. Chil cannot
produce the rich language structure required for this process to work
rapidly and transparently. Frequently what he is trying to say and do
with a gesture that he can contextualize only through prosody and
sequential placement is genuinely problematic. Rather than simply
decoding an utterance, his hearers are faced with the task of using the
signs he has produced as a point of departure for trying to figure out
what he is saying. Instead of advancing the conversation further, moves
following one of his gestures frequently take the form of a guess, a
candidate understanding of what Chil is trying to say, as in line 2 of Fig.
3.4. Chil then rejects or accepts this guess, as he does with his nod in
line 3. The net effect of this is that the gesture (line 1) and the talk that
explicates it (line 2) are produced by separate individuals, something
that cannot be described by focusing exclusively on psychological
processes within the individual. Instead what is at issue is an interactive
field that calibrates the psychological processes of separate individuals
within a common course of action.
Chil’s actions are constructed through a complex footing (Goffman
1981) in which he is the principal and author of the statement being made
through the gesture while his interlocutor animates the talk required
to explicate the gesture. Cast in terms of the finer distinctions offered
by Kockelman (2004:145), Chil and his interlocutor animate different
elements of the complex carrier (gesture + talk) used to construct Chil’s
action, although Chil alone is the principal who commits himself to
what is being asserted. His genuine agency arises from the way in which
he is implicated in different stages of this process and visibly responsible
for the proposition voiced by his interlocutor. Consistent with the
arguments of Hutchins (this volume), Chil’s action, and the cognitive
activities required to accomplish it, is distributed across multiple actors
and sign media (see also Hutchins 1995 and Goodwin 2000a).
Not only must hearers attribute rich communicative intentions to
Chil, but he in turn requires active, cognitively complex interlocutors to
make sense out of those gestures, that process being organized through
systematic sequences of interaction. Others produce the language
structure Chil requires to make himself understood. Description of
the forms of sociality through which his actions and meaning are
constituted requires an analytic framework that takes into account
not only the mental, cognitive, and psychological lives of individual
actors but also the public organization of the sign systems, including
language, they are using to build action together, and how these systems
are calibrated, linked to each other, and articulated in real time through
sequential organization.
Figure 3.5 provides an example of how Chil can use a complex two-part
pointing gesture to initiate a new topic. Chuck and Chil have
been discussing a recent storm, with Chil in line 2 showing agreement
and appreciation of what Chuck has just said through both “yes” and
syllables that carry expressive prosody. In line 4, he produces a three-
syllable unit with a marked rise in pitch. During the first two units Chil
raises his open hand to the top of his head and taps it. Then, during the
longer third unit,3 that also displays stronger appreciation, he moves his
hand forward while simultaneously changing its shape so that it ends
with his index finger pointing toward Chuck’s head. Chuck, in line 5,
immediately and correctly (line 6) sees this gesture as an appreciative
comment about Chuck’s haircut.
Chil’s action contains two quite separate, although linked, pointing
gestures, a first to his own head and hair (locating just what Chil’s hand
is indicating is not automatic or transparent but a genuine, problematic
task for an addressee), and then a point to Chuck’s head and the actual
haircut that is being topicalized. Why does Chil produce two points?
In the abstract it might seem far more economical to use only the
second, especially because the ultimate target is Chuck’s hair, not Chil’s.
However, if only the second point occurred, it could be quite difficult
to locate just what Chil was indicating, and what kind of action he was
trying to invoke. Given the distance between the pointing finger and
its target, it requires work to figure out whether Chil is pointing toward
Chuck in general, to his hair, to something on his face, and so forth.
Moreover, a general point toward someone could be implicated in many
different actions, such as a request that the addressee do something. By
first indicating a particular region of his own head and then transferring
that place to his addressee (the continuity of the gestures, the way
in which they are organized as parts of a single unfolding action is a
crucial feature of their organization), Chil is able to create a context that
strongly, and successfully, constrains how the final point will be seen
and interpreted (see Enfield’s discussion of “grounding for inferring”
in this volume).

Figure 3.5. Topic initiation.

Through the way in which he organizes his actions here, Chil demonstrates
that he has a reflexive awareness of both the interpretive tasks
faced by an addressee trying to locate a relevant action in his gestures
(something quite relevant to the analysis of theory of mind offered
in other chapters in this volume), and of the limitations of the sign
displays he must use to make himself understood. Like other speakers in
conversation, the organization of Chil’s action reveals subtle attention
to issues of recipient design, and indeed as demonstrated by the chapters
of Enfield, Levinson, Schegloff, and others in this volume this seems to
be central to the interactive organization of human sociality. From a
slightly different perspective, despite his severely impoverished ability
to produce syntactic structures, here we see that in the realm of gesture
Chil can combine different signs in an ordered pattern to make visible
a particular kind of action.
Prosody also plays a significant role in the organization of this action.
As in the examples above, the evaluative stance displayed by Chil’s
voice plays a crucial role in specifying what is being said about the
object being pointed at. However, additional work seems to be being
done by the marked pitch rise. Chil’s point occurs immediately after
prior talk, but that talk should not be used as a point of departure for
its interpretation. The sudden, very noticeable pitch change seems to
create a boundary with what went just before it, and thus to act as
a misplacement marker (Schegloff and Sacks 1984). In brief, despite
his almost complete lack of productive language, the organization of
this complex pointing action provides some demonstration that Chil,
nonetheless, retains the ability to build subtle actions with fine attention
to the tasks his addressee must perform to make appropriate sense out
of them.
The pointing gestures of Chil examined above are all “concrete” in
McNeill’s sense (McNeill 1992:18) in that they point toward objects
that can be seen by the participants in the immediate environment.
However, Chil frequently points toward objects that cannot be seen.
Figure 3.6 provides one example. Chuck suggests that they take a drive
and Chil enthusiastically agrees. Chuck then suggests a place for the
drive: “up along the river.” In line 13 Chil produces a string of syllables
that through their prosody display that he is proposing an alternative. As
he does this he makes an emphatic pointing gesture, and from this
gesture Chuck is able to correctly figure out where Chil wants to go
instead: Bear Mountain, a park many miles up the river.
In common with all of the gestures examined above the vector established
by Chil’s pointing arm correctly indicates the direction of the
target. In this respect Chil’s pointing gestures are similar to the absolute
pointing described by Levinson (1996) and Haviland (1996, 2003). The
success of his points to locations that cannot be seen in the local space
depends on Chil and his interlocutor inhabiting together an extended,
meaningful geographic and social space with features and directions
that they recognize in common, indeed quite literally what Clark (1996,
this volume) refers to as common ground. Indeed, such practices for
building intelligible action by embedding comparatively simple gestures
within a cognitively rich space are structurally similar to the way in
which Chil amplifies his limited vocabulary by tying to the rich talk
of others.

Figure 3.6. Pointing toward a distant alternative.

How is Chil able to indicate that his addressee should not try to find
the object being indicated in the local space, but should instead extend
the vector created by the point to some considerable distance (in this
case many miles)? The embodied performance of Chil’s gesture displays
more than simply a particular direction. First, unlike, for example, the
point toward Chuck’s hair in Fig. 3.5, this gesture is made with the arm
stretched upward, indeed well over Chuck’s head (see Fig. 3.6). Second,
the point is done several times with the hand vigorously thrusting
forward toward the indicated direction. Chil inflects his directional
gesture with additional components that both intercept a default local
reading (e.g., by not dropping to relevant objects in the current space)
and visibly mark an extended distance. The addition of embodied
movements to the gesture is structurally similar to the way in which
he adds consequential prosody to his syllables. What is being indicated
is further constrained by the activity in progress (choosing a destination
for a drive), the immediately prior talk, and the way in which Chil’s
speech is proposing an objection and alternative to what Chuck has just
said. Chil’s pointing gestures are capable of some complexity, and this
can be used to extend their reference well beyond the local scene.
Despite the skill with which Chil used pointing to construct a range
of quite diverse action, the gestures so far examined are in a number of
important ways more limited than those of fully fluent speakers. For
example, unlike some of the gesturers described by Haviland (2003)
he has not constructed gestures in narrated, transposed, or laminated
spaces in which his pointing takes as its frame of reference something
other than the space of the current interaction. Thus, even when Chil
indicated something many miles distant in Fig. 3.6, he built a point that
constructed a vector from his current position to that location. In this
respect, his pointing made extensive use of what Levinson (Levinson
2000) calls an absolute frame, but not either a relative frame of reference
calculated in terms such as “left” and “right,” or an intrinsic frame
relying on properties of the objects being located, such as the “front”
of a house. Such restrictions seem quite plausible in view of Chil’s
situation. A pointing arm can easily construct absolute vectors starting
from a speaker’s current position.
However, in appropriate circumstances Chil can point within a
transposed space, and use both relative and intrinsic frames of reference
for the organization of such points. Figure 3.7 provides an example.


After a series of unusually cold and snowy winters Chil’s house has
been leaking because of ice forming on the roof. He and Chuck have
measured relevant dimensions of the house and gone to a local hardware
and building supply store to buy electric heating cables to prevent the
ice from forming. They find the cables but discover that the package
shows that they should be installed in a series of triangles, which makes
figuring out how much cable to buy quite difficult. As Fig. 3.7 begins
Chuck is staring at his notes about dimensions and at the diagram on
the box trying to figure out how many boxes of cable to buy (A). Chil,
in line 8, image B, produces talk with a two-fingered hand gesture.
Chuck turns to him and makes a guess about what “two” might refer
to (line 10). In line 12, image C, Chil rejects Chuck’s proposal while
pointing in front of him. Chuck reads this gesture as Chil indicating
one cable for the front of the house. Chil agrees and then in line 16,
image D, points behind him, which Chuck correctly reads as indicating
the back of the house.
Unlike all of the examples above, here Chil’s pointing is not organized
within local or extended actual geographic space, but instead with
reference to the intrinsic properties of a particular kind of object, the
front and back of his house. Although the hardware store around him
provides an extremely rich collection of objects that could be pointed
at, his pointing activity is located instead within a transposed space,
the house he is discussing with his interlocutor.
A number of resources and practices enable Chil and his addressee
to rapidly locate the spatial organization of the nonpresent house as
the appropriate ground for the interpretation of his pointing. First, the
activity they are explicitly engaged in is buying cables for the roof of that
house. This is why Chuck is staring so intently at the diagrams on the
package that show a house roof using the cables, and musing aloud about
what they mean for what they should buy. Thus, although the house is
not physically present, relevant features of its spatial organization are
what the participants are looking at in a generic image, and also what
they are talking about. By virtue of such sequential placement the space
of the house is the most salient and relevant frame for the organization
of the pointing gestures that occur here.
Second, to draw attention to the intrinsic features of the house Chil
makes use of the orientation of his own body. Thus, the point in front
of his body is interpreted as a point toward the front of the house
being talked about, and then a contrasting point behind him locates
the back of the house. The intrinsic spatial organization of the house
is laminated on the intrinsic organization of Chil’s body. As in Fig. 3.5
where Chil pointed first to his own hair to topicalize Chuck’s haircut,
Chil repeatedly relies on the intrinsic properties of his own body as a
resource for indicating something else, and indeed the use of such local
metrics appears to be a quite general practice for rendering nonpresent
events within talk in interaction (Goodwin 2003c).

Figure 3.7. A complex gesture sequence grounded in a nonpresent space.

Despite Chil’s inability to produce linguistic syntax, the gestures that
occur here do not simply follow one after another as isolated
single actions, but instead have a systematic, complex organization. At
B in Fig. 3.7 Chil shoves a hand with two fingers between Chuck and
the box he is staring at. This gesture accomplishes two actions that are
crucial to the subsequent organization of the sequence. First, it secures
Chuck’s gaze, and creates a relevant joint attentional frame in which
Chuck is looking at Chil’s hand when the gestures that follow are made.
Second, it can be seen as prefacing, and indeed projecting, the two-
item list (front of the house and then the back) constructed through
Chil’s two subsequent points. The two gestures that follow are built as
a systematic contrast, both spatially in the salient difference between
points toward the front and back of his body, and sequentially with the
first point providing a frame for the contrasting second. Chil’s “yes!” in
line 19, in which he enthusiastically accepts Chuck’s gloss of what he
has been saying through his gestures, marks the end of the projected
two-part list. Here both participants shift their gaze from each other
back to the box and notes they have been working with. Chil’s gestures
here are recognized as a complex three-part action, with the two-finger
handshape at B projecting a two-part list, the point at C providing the
first item, and that at D a contrasting second, at which point both
parties bring the activity to completion.
Chil is able to accomplish meaning and action through pointing
precisely because his gesturing hand does not stand alone as a complete,
self-contained sign or action, but is instead embedded within a constellation
of other semiotic activities and meaning-making activities. These
include among others (1) multiparty embodied participation frameworks
that create shared attentional frames within which his gestures can
be both seen and treated as relevant to the organization of current
activities; (2) the way in which those interacting with Chil treat his
hand movements as embodying a relevant communicative intention,
and indeed work hard to figure out what he is trying to say and do by
pointing; (3) existing structure in his environment, including a world
full of meaningful objects, social organization, space, and a local context
that is continuously being sustained, modified and updated by the
unfolding activities, including talk, that he is engaged in with others;
(4) the sequential organization of talk in interaction, and action more
generally, which provides both a crucial contextual point of departure
for the interpretation of his gestures and limited talk, and sequential
structures after a gesture that allow specification and calibration of what
he is saying with it as others propose possible readings that he can then
accept, reject, or further specify.
In many respects these resources are quite unremarkable, and not
in any way specific to aphasia. With a few significant exceptions, such
as the way in which Chil relies on others to produce the words he
needs to explicate his gestures, these same practices are central to the
organization of talk and action by fully fluent speakers as well. However,
appreciation of their importance requires an analytic framework that
takes into account the social and multimodal organization of human
language, cognition and action, and indeed this has recently become the
focus of research by a number of scholars in different parts of the world
(Goodwin 2003a; Wilkinson 1999). By way of contrast the vast majority
of research on aphasia has taken processes inside the brain of the speaker
as the primary object of interest (e.g., attempts to correlate damage to
a particular area of the brain with a specific language deficit). To study
this rigorously the person suffering from aphasia is typically examined
in a laboratory setting where almost all of the resources that Chil used
to accomplish meaning and action in concert with others—the talk
of his interlocutors, meaningful objects, the material, geographic and
social structure of his home environment, and so forth—have been
systematically removed. In such a setting Chil would be rendered a far
more impoverished actor. This is not to deny the great and enduring
importance of such research, and the way in which its methods artfully
make a range of crucial phenomena accessible to study. However, it
does demonstrate the importance of investigating human language
not only as a complex symbolic calculus but also as itself a primordial form
of human sociality. Only within such a framework does Chil’s genuine
competence as a speaker, his ability to make relevant, consequential
moves in conversation, emerge.

Conclusion
Sitting at the center of much of what is most distinctive about human
sociality, cognition, and language use is the utterance, that is, the
action through which one party says something to someone else. No
other animal is able to construct anything like human utterances.
The utterance constitutes the prototypical environment within which
language emerges in the natural world. It is a central locus for human
symbolic and cognitive activity. Moreover, as amply demonstrated by
the findings of conversation analysis (Sacks et al. 1974; Schegloff 1968;
Schegloff et al. 1977), talk in interaction constitutes a central form
of human social organization, a primordial site for human sociality.
Indeed, documenting the thoroughly pervasive practices through which
human beings build consequential action through interaction with
each other would seem to be a first task for any ethologist attempting
to provide a general description of human social behavior.
At first glance an utterance might be characterized as a strip of talk
produced by a speaker, that is, as the outcome of linguistic activity by
a single individual. Analysis could, and indeed frequently does, focus
exclusively on structure in the talk provided by the utterance, and on
linguistic, psychological, and neurological processes within the mind
and brain of a speaker that might account for the production of complex
strips of talk. It might seem possible for there to be a comfortable division
of labor with linguistics and psychology describing the mechanisms
required to produce the language structure found within an utterance,
while students of social life take over at its boundaries as multiple parties
exchange talk with each other.
In opposition to such a view, I have attempted in this chapter to
demonstrate that individual utterances are intrinsically multiparty,
requiring at a minimum both a hearer and a speaker, and are built
through coordinated social action from the outset. Moreover, to
describe the social coordination that builds an utterance it is necessary
to encompass analytically not only the structure of talk but also the
visible embodied displays of hearers, and frequently structure in the
surround. Utterances are multiparty, multimodal activities constructed
through the mutual elaboration of different kinds of signs.
The social, cognitive, and multimodal organization of utterances
has been investigated here by examining two quite different, but
mutually relevant, processes. First, “performance errors” have been
argued by linguists to demonstrate that actual speech provides only
degenerate data for the analysis of language structure (although there
is very important analysis in linguistics of how such errors might shed
light on mental processes implicated in the production of language, e.g.,
Fromkin 1971). Here, however, restarts were found to be systematically
used by speakers to secure the gaze and orientation of hearers. Rather
than providing evidence for a loose acceptance of flawed, fragmentary
speech in actual conversation, restarts allow a speaker to begin anew
a sentence when its hearer is at last orienting to it. They demonstrate
speakers’ precise concern for producing coherent sentences, not into
the air, but instead when their addressees are actually attending to
the speaker. Moreover, the processes of repair used to do this typically
involve recycling of a structure already produced with some significant
modification. Repairs provide within ongoing talk itself an endogenous
analysis of how the stream of speech can be divided into relevant units,
and the kinds of operations that are possible on those units. Such
performance errors are not only a locus for the ongoing achievement of
mutual orientation between speaker and hearer, that is, for constituting
through ongoing practice the multiparty participation framework that
sits at the center of human language, but also a crucial resource for the
task posed for someone who does not yet know a language of uncovering
its structure.
Second, the pointing activities of Chil, a man with very severe aphasia,
were examined. It was found that despite his almost complete lack of
productive language (his vocabulary consisted of only three words), Chil
was able to construct locally relevant meaningful utterances, and indeed
to function as a powerful actor in conversation. Again the multiparty,
multimodal organization of utterances constructed through multiple
sign systems was central to this process.
A range of diverse factors contributed to the organization of Chil’s
pointing. First, his points frequently, although not always (see Fig. 3.5),
emerged within a local sequential context and larger activity. These
provided a detailed interpretive point of departure for what he might
be indicating through a point. Second, his points typically invoked
meaningful structure, an historically shaped common ground, that had
been sedimented into the social and physical world that he inhabited
with relevant others. He builds action within a world that has already
been shaped by the semiotic activities of others. Their actions provide
him with both a prior sequential context, and a surround filled with
meaningful structure. Third, through sequential practices following
the pointing gesture, Chil and his interlocutors could calibrate both
what he was indicating through the point, and more crucially the
action he was attempting to accomplish by pointing. Chil got others
to produce the words he needed, with the effect that his utterances
(such as a proposal to visit Bear Mountain in Fig. 3.6) were constructed
through the collaborative activities of several different participants,
within a process that included embodied participation frameworks, and
meaningful structure in the environment. The multiparty, multimodal
organization of utterances, and the way in which action is sequentially
organized within ongoing interaction, provide the crucial environments
that enable Chil to make rich meaning, and act in concert with others
despite his catastrophic loss of productive language.
Such a perspective on the practices through which utterances and
actions are built might be relevant to investigation of the roots of
human sociality in a number of different ways. First, an initial, but
most important stage in any analysis occurs when the boundaries of
the phenomenon to be studied are defined. If crucial components of
the process being examined are rendered invisible and inaccessible to
study, phenomena that might be seen as rather straightforward within
a more inclusive view become deeply mysterious. Thus, the decision
to exclude performance errors and treat only well-formed sentences
as appropriate data for the study of how grammatical structure might
be recognized leads Pinker (1994:267) and others to posit a deus
ex machina outside the system itself, an innate module, to explain
how someone using language might be able to divide the stream of
speech into relevant units. By way of contrast, consider what happens
when the analytic frame is expanded to include not only well-formed
sentences and abstract speakers but also repair and embodied hearers.
The decomposition of the stream of speech into relevant subunits, the
different ways in which these units can and cannot be arranged, and
the task of distinguishing grammatical from ungrammatical structures,
are now made visible through the endogenous practices participants use
to build action together through talk. Such autopoietic organization, in
which the resources necessary to produce, sustain, and modify a system
are continuously reconstituted through the workings of the system itself,
is precisely what would be expected of any natural system built through
evolutionary processes (Favareau 2004). A framework that lodges the
production of strips of talk within the activities of multiple, embodied
actors building action together, frequently in relevant, consequential
environments, is also most relevant to the study of human sociality
in that it links the details of language use, with all of its symbolic and
cognitive import, to not only the psychology and the mental life of the
speaker, but also to elementary forms of human social organization.
Second, attempting to specify an analytic frame that does not exclude
crucial components of the phenomenon being examined might enable
us to ask more sensible questions. For example, in light of what has been
seen in this chapter, asking how language as an isolated self-contained
system might have evolved does not seem to be a reasonable question.
Clearly what must have evolved is this entire ecology of embodied
interactive practices being used by a species to build in concert with each
other the actions that make up their lifeworld (i.e., not only linguistic
structures in the stream of speech but also embodied participation
frameworks through which participants publicly display to each other
frames of mutual attention and relevance within which those units
can function as meaningful events). Sign systems do not evolve in
isolation as self-sufficient wholes, but rather through their use by agents
to accomplish relevant actions.
From this perspective it is interesting to examine the interactive matrix
that makes it possible for Chil to construct relevant meaning and action.
Consider Fig. 3.4 in which Chil accompanies a point toward a bagel
with an appreciative prosodic contour. First, these actions are lodged
within a participation framework in which he and his interlocutor are
visibly attending to each other and thus are able to take what each other
is doing into account. Second, his addressee treats Chil’s pointing as
a communicative act. Tomasello (this volume) argues that attributing
such communicative intentions to a pointing gesture is something
that distinguishes us from highly intelligent apes. Third, within this
framework Chil produces talk, although it is semantically empty and
encodes no propositional content whatsoever. If one had only the
stream of speech it would be impossible to figure out what was being
talked about. However, by virtue of the other co-occurring sign systems
within which Chil’s speech is embedded it is possible, indeed easy, for
his interlocutor to see the talk as in some way commenting on what is
being pointed at, and in Fig. 3.4 to locate a possible positive assessment
from Chil’s appreciative prosody. Chil is able to locate a topic and make
a comment about it without language. Note that this is not done entirely
through gesture alone but, rather, through the interdigitation of a number
of quite different systems (prosody, embodied participation frameworks,
pointing, sequential organization, etc.) that mutually elaborate each
other within an embodied shared attentional frame that constitutes a
primordial site for human sociality. On many, many occasions Chil’s
interlocutors have difficulty figuring out what he wants to say. However,
lack of understanding can be remedied through subsequent sequences
of action in which interlocutors propose candidate readings that Chil
then accepts, rejects or modifies.
Chil’s situation provides a tragic natural experiment that allows us to
probe taken-for-granted assumptions about the generic organization of
talk. Although Chil’s case appears exceptional, the practices that make it
possible for him to build relevant meaning and action in concert with
others are central to the organization of all talk in interaction.
It is interesting to speculate how linguistic structure might emerge
within such a framework. Chil’s big problem as a semiotic actor is
that he is imprisoned in a web of intrinsically meaningful signs; his
gestures and prosody are indexical and iconic and thus capable of being
read in multiple ways. After almost every one of his utterances his
interlocutors have to check if they have correctly grasped what he
wants to say. Consider what would happen if such meaningful, analogic
displays were replaced with meaningless signs (e.g., something like the
precursors of phonetic units). It would then be necessary to operate
with conventionalized shared understandings about how to interpret
these units. Structures already in place provide the resources necessary
to interactively organize arbitrary sounds as public, meaningful signs
and avoid the calibration sequence found after every new sign by
Chil. These resources include relevant interpretive frames created by
the local organization of collaborative action, and endogenous repair
processes that provide practices for working out misunderstandings
and calibrating meaning. Bakhtin (1999:124) suggestively alluded to a
“first speaker, the one who disturbs the eternal silence of the universe.”
However, were linguistic structure to emerge within existing frameworks
of shared intersubjectivity and action it would already be positioned
within a host of other meaning-making practices and tied to the ongoing
organization of collaborative action. It would not emerge from a prior
silence but, rather, within a world, and a framework for collaborative
action that was already rich in relevant meaning and structure. Rather
than depending on a single dramatic change in neurology, such a process
could be incremental and would from the outset not be lodged within
a single individual but, rather, be implicated advantageously in the
consequential social life of the group, setting the stage for progressive
elaboration of both the system and the neurological machinery required
to support it.

Notes
1. I am deeply indebted to Candy Goodwin and John Haviland for insightful
discussions about the phenomena described in this chapter, and to Nick
Enfield, Steve Levinson, and two anonymous reviewers for very helpful comments
on an earlier draft.
2. I recognize only too well that I am unable to adequately re-represent this
prosody on the printed page, and that unfortunately the reader will have to
accept on faith my gloss for what I hear on the tape and what Chuck heard
while he was listening. However, because the tape exists it is possible for
others to listen themselves and possibly challenge my gloss, and certainly for
phoneticians to more precisely describe what in the stream of speech leads
to such hearing. However, that is beyond my ability and the scope of this
chapter.
3. See Jefferson (1979) for three-part units, with two sames followed by a
different, in laughter.
References
Bakhtin, M. M. 1999. The problem of speech genres. In The Discourse
Reader, edited by A. Jaworski and N. Coupland, 121–132. London:
Routledge.
Chomsky, N. 1965. Aspects of the theory of syntax. Cambridge, MA: MIT
Press.
Clark, H. H. 1996. Using language. Cambridge: Cambridge University
Press.
Clark, H. H., and J. E. Fox Tree. 2002. Using uh and um in spontaneous
speaking. Cognition 84:73–111.
Favareau, D. 2004. The biosemiotic turn: Towards a natural history
of signs. Ph.D. dissertation, Department of Applied Linguistics,
University of California, Los Angeles.
Fromkin, V. 1971. The non-anomalous nature of anomalous utterances.
Language 47:27–52.
Goffman, E. 1981. Footing. In Forms of talk, edited by E. Goffman,
124–159. Philadelphia: University of Pennsylvania Press.
Goodwin, C. 1981. Conversational organization: Interaction between
speakers and hearers. New York: Academic Press.
——. 1987. Forgetfulness as an interactive resource. Social Psychology
Quarterly 50(2):115–130.
——. 1995. Co-constructing meaning in conversations with an aphasic
man. Research on Language and Social Interaction 28(3):233–260.
——. 2000a. Action and embodiment within situated human interaction.
Journal of Pragmatics 32:1489–1522.
——. 2000b. Practices of seeing, visual analysis: An ethnomethodological
approach. In Handbook of visual analysis, edited by T. van Leeuwen
and C. Jewitt, 157–182. London: Sage.
——. 2002. Time in action. Current Anthropology 43(4–5):19–35.
——, (ed.). 2003a. Conversation and brain damage. Oxford: Oxford
University Press.
——. 2003b. Conversational frameworks for the accomplishment of
meaning in aphasia. In Conversation and brain damage, edited by C.
Goodwin, 90–116. Oxford: Oxford University Press.
——. 2003c. Embedded context. Research on Language and Social
Interaction 36(4):323–350.
——. 2003d. Pointing as situated practice. In Pointing: Where language,
culture, and cognition meet, edited by S. Kita, 217–242. Mahwah, NJ:
Erlbaum.
——. 2004. A competent speaker who can’t speak: The social life of
aphasia. Journal of Linguistic Anthropology 14(2):151–170.
Goodwin, C., M. H. Goodwin, and D. Olsher. 2002. Producing sense
with nonsense syllables: Turn and sequence in the conversations
of a man with severe aphasia. In The language of turn and sequence,
edited by C. Ford, B. Fox, and S. Thompson, 56–80. Oxford: Oxford
University Press.
Haviland, J. B. 1996. Projections, transpositions, and relativity. In
Rethinking Linguistic Relativity, edited by J. J. Gumperz and S. C.
Levinson, 271–323. Cambridge: Cambridge University Press.
——. 2003. How to point in Zinacantán. In Pointing: Where language,
culture, and cognition meet, edited by S. Kita, 139–170. Mahwah, NJ:
Erlbaum.
Hutchins, E. 1995. Cognition in the wild. Cambridge, MA: MIT Press.
Jefferson, G. 1979. A technique for inviting laughter and its
subsequent acceptance/declination. In Everyday language: Studies in
ethnomethodology, edited by G. Psathas, 79–96. New York: Irvington
Publishers.
Kendon, A. 1990. Conducting interaction: Patterns of behavior in focused
encounters. Cambridge: Cambridge University Press.
Kockelman, P. 2004. Stance and subjectivity. Journal of Linguistic
Anthropology 14(2):127–150.
Levinson, S. C. 1996. Language and Space. Annual Review of Anthropology
25:353–382.
——. 2000. Frames of spatial reference and their acquisition in Tenejapan
Tzeltal. In Culture, thought, and development, edited by L. Nucci, G.
Saxe and E. Turiel, 167–197. Hillsdale, NJ: Erlbaum.
McNeill, D. 1992. Hand and mind: What gestures reveal about thought.
Chicago: University of Chicago Press.
Pinker, S. 1994. The Language instinct: How the mind creates language.
New York: HarperCollins.
Sacks, H., E. A. Schegloff, and G. Jefferson. 1974. A simplest systematics
for the organization of turn-taking for conversation. Language
50:696–735.
Schegloff, E. A. 1968. Sequencing in conversational openings. American
Anthropologist 70(6):1075–1095.
——. 1979. The relevance of repair for syntax-for-conversation. In Syntax
and semantics 12: Discourse and syntax, edited by T. Givón, 261–288.
New York: Academic Press.
——. 1987. Recycled turn beginnings: A precise repair mechanism in
conversation’s turn-taking organisation. In Talk and social organisation,
edited by G. Button and J. R. E. Lee, 70–85. Clevedon: Multilingual
Matters.
Schegloff, E., and H. Sacks. 1984. Opening up closings. In Language
in use: Readings in sociolinguistics, edited by J. Baugh and J. Sherzer,
263–274. Englewood Cliffs, NJ: Prentice-Hall.
Schegloff, E. A., G. Jefferson, and H. Sacks. 1977. The preference for
self-correction in the organization of repair in conversation. Language
53:361–382.
Stivers, T. 2004. “No no no” and other types of multiple sayings in social
interaction. Human Communication Research 30(2):260–293.
Tomasello, M. 2003. Constructing a language: A usage-based theory of
language acquisition. Cambridge, MA: Harvard University Press.
Wilkinson, R. 1999. Special issue on Conversation analysis and aphasia.
Aphasiology 13(4–5):327–343.
Wilkinson, R., S. Beeke, and J. Maxim. 2003. Adapting to conversation:
On the use of linguistic resources by speakers with fluent aphasia in
the construction of turns at talk. In Conversation and brain damage,
edited by C. Goodwin, 59–89. Oxford: Oxford University Press.
Wittgenstein, L. 1958. Philosophical Investigations, edited by G. E. M.
Anscombe and R. Rhees; translated by G. E. M. Anscombe. 2nd
edition. Oxford: Blackwell.
four

Social Actions, Social Commitments


Herbert H. Clark

Social actions are the stuff of daily life. Walking on crowded sidewalks,
working with colleagues, or eating with friends—these are activities
we cannot carry out alone. It takes coordination to avoid collisions,
negotiate business, and share food. Most of these activities are joint
activities—activities in which two or more participants coordinate with
each other to reach what they take to be a common set of goals. Without
joint activities life would be impossible. We do more than merely work
around each other. We work with each other, and on a range of common
goals. Humans come equipped for joint action—with what Levinson
(this volume) calls the “interaction engine”—and they engage in it
from infancy on (see Boyd and Richerson, Gergely and Csibra, and
Liszkowski in this volume).
Joint activities are managed through joint commitments (Clark 1996).
I can commit myself privately to doing something and then act on that
commitment. I may tell myself, “I’ll have a beer when I get home,” and
when I get home, I have a beer. But for you and me to do something
together—say, shake hands—it is not enough for me to commit privately
to grasping your hand, or for you to commit privately to grasping mine.
We must act on a joint commitment to shake hands. The argument here
is that joint commitments are essential to all true joint activities. They
are the guiding force inside Levinson’s interaction engine.
When we take on joint commitments, we ordinarily do so for the
benefits they afford—in avoiding collisions, negotiating fair contracts,
and sharing food efficiently. But joint commitments also carry risks.
Some risks come from ceding partial control over one’s actions to others.
Once you and I are committed to shaking hands, you might crush
my hand or withdraw at the last minute. Other risks come from the
indeterminacy of joint commitments. When you and I agree “to talk,”
we may have only a vague idea of what about. Later, you may draw me
into topics I did not anticipate or want to talk about. Joint commitments
have moral and emotional repercussions. We may be happy and trusting
when they benefit us, but angry and reproachful when they do not.
The goal here is to show how joint commitments are the driving force
behind joint activities. I will illustrate with two joint activities, one quite
ordinary and the other equally out of the ordinary. The ordinary one is
of two people assembling a piece of furniture. It allows us to examine
the normal, cooperative course of establishing joint commitments. The
out-of-the-ordinary joint activity is a famous study of obedience by
Stanley Milgram (1974). That activity, in contrast, illustrates the risks
and the moral and emotional consequences of joint commitments.

Partitioning Joint Activities


People do not just happen to do things together. In shaking hands
with you, I cannot grasp your hand without some sense, belief, or trust
that you are going to do your part, and do it here and now. The idea is
that people coordinate their parts in joint activities by means of joint
commitments. To illustrate, I will begin with the cooperative assembly
of a TV stand.
Two people I will call Ann and Burton were ushered into a small
room, given the parts of a commercial kit for a wooden TV stand, and
asked to assemble the stand from its parts.1 They took about 15 minutes
and were videotaped as they worked. Consider a 20-second segment
in which they attached a crosspiece onto a sidepiece. They did this in
a sequence of five paired actions, as represented here:
     Ann’s action              Burton’s action
1    A gets crosspiece         B holds sidepiece
2    A holds crosspiece        B inserts peg
3    A affixes crosspiece      B holds sidepiece
4    A inserts peg             B holds side-, crosspiece
5    A affixes sidepiece       B holds side-, crosspiece

In line 3, for example, Burton holds a sidepiece steady while Ann affixes
the crosspiece onto it. This is a joint action pure and simple. The two
of them coordinate their individual actions—Ann affixing one board
while Burton holds the other one steady—to reach a common goal, the
attachment of the two boards. Ann does what she does contingent on
what Burton is doing, and he does what he does contingent on what
she is doing. Note that these five joint actions together constitute a
joint action at a higher level, “attaching the two side-pieces to the
cross-piece,” and that, in turn, is but one segment of a still higher level
joint action, “assembling the TV stand.” So, assembling the TV stand
emerges as a hierarchy of joint actions, which is typical of joint activities
(Bangerter and Clark 2003).
But joint activities take more than these joint actions. If we look
again at the 20-second segment, we discover Ann and Burton talking
about what they are doing:
(1) Ann Should we put this in, this, this little like kinda cross
bar, like the T? like the I bar?
Burton Yeah ((we can do that))
Ann So, you wanna stick the ((screws in)). Or wait is, is,
are these these things, or?
Burton That’s these things I bet. Because there’s no screws.
Ann Yeah, you’re right. Yeah, probably. If they’ll stay in.
Burton I don’t know how they’ll stay in ((but))
Ann Right there.
Burton Is this one big enough?
Ann Oh ((xxx)) I guess cause like there’s no other side for
it to come out.
Burton M-hm.
[8.15 sec]
Burton ((Now let’s do this one))
Ann Okay

Ann and Burton’s talk is not idle. It is what allows them to arrange,
agree on, or coordinate who is to do what when and where. Here,
too, Ann and Burton carry out paired actions, but the pairs are turn
sequences like this:

(2) Ann Should we put this in, this, this little like kinda cross
bar, like the T? like the I bar?
Burton Yeah ((we can do that))
In the first turn, Ann proposes that they attach the crosspiece, and in
the second, Burton takes up her proposal and agrees to it. The two of
them proceed this way throughout the TV stand assembly. They make
agreement after agreement about which pieces to connect when, how
to orient each piece, who is to hold, and who is to attach.
As this example illustrates, joint activities ordinarily can be partitioned
into two activities: (1) a basic joint activity; and (2) coordinating joint
actions. Consider the two parts of Ann and Burton’s assembly of the
TV stand:
The basic joint activity, or joint activity proper, is what Ann and Burton
are basically doing—assembling a TV stand. It consists of the actions
and positions they consider essential to their basic goal—the assembly
of the TV stand.
The coordinating joint actions are what Ann and Burton do to coordinate
their basic activity. They consist of communicative acts about the basic
activity.
It takes both sets of actions to assemble the TV stand. The first set
effects the assembly proper, and the second coordinates the joint actions
needed to effect the assembly proper. Ann and Burton surely see these
two activities as different. What they were asked to do was “assemble a
TV stand.” If asked, “But weren’t you talking?” they might have replied,
“Oh yes. That was to figure out who was to do what.”
To complicate the picture, communicative acts are themselves joint
actions (Clark 1996). For each utterance, speakers and addressees must
coordinate the speaker’s vocalizations with the addressee’s attention to
those vocalizations, the speaker’s wording with the addressee’s identification
of that wording, and what the speaker means with what the
addressee understands the speaker to mean. In earlier work, I have called
the process of coordinating on these points collateral communication.
So just as basic joint actions are coordinated by communicative acts,
communicative acts are coordinated by collateral acts (Clark 1996,
2004). I will say no more here about collateral acts.
It takes coordination, therefore, to carry out joint activities, and
communicative acts to achieve that coordination. But to agree on a joint
course of action is really to establish a joint commitment to that course
of action. How is that done?

Establishing Joint Commitments


The very concept of joint commitment is a puzzle. In common parlance,
we can speak of an ensemble of people making a joint commitment, as
in “The football team is determined to play better next year,” or “The
orchestra will now play Brahms.” Yet it is only individuals who can
make commitments (or have intentions). The members of an ensemble
can each make up his or her own mind, but they can hardly make up
each other’s minds. How, then, do joint commitments get created from
individual commitments?2
Varieties of Commitment
Individual commitments come in many types. Consider four types of
commitments to “go for coffee at Joe’s Café at noon”:3
1. Private self-commitment. Privately, without letting anyone know, I
can commit myself to myself to go for coffee at noon.
2. Public self-commitment. I can make the same commitment in front of
you—say, by telling you of my plan. It is not that I commit myself
to you that I will go for coffee at noon. It is just that I make my
self-commitment public between us.
3. Simple other-commitment. I can commit myself to you that I will go
for coffee at noon—say, by promising you I will go. Not only do
I make my commitment public between us, but I grant you the
right to hold me responsible for fulfilling it. People make other-
commitments for a range of social obligations.
4. Participatory commitment. Suppose you and I agree to meet for coffee
at noon. I commit myself to you to taking part in this meeting just so
long as you commit yourself to me to taking part in the same meeting,
and vice versa. Our individual commitments are conditional on
both of us being committed to the joint action.
A joint commitment is simply the sum of the participatory commitments
of its participants.
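
The conditional structure of a joint commitment can be made concrete with a minimal sketch in Python. It assumes only what is stated above: a joint commitment is in force when every participant holds a participatory commitment to the same joint action. The names `CommitmentType`, `Commitment`, and `joint_commitment_holds` are illustrative assumptions and are not part of the analysis itself.

```python
from dataclasses import dataclass
from enum import Enum


class CommitmentType(Enum):
    # The four types of individual commitment listed above.
    PRIVATE_SELF = "private self-commitment"
    PUBLIC_SELF = "public self-commitment"
    OTHER = "simple other-commitment"
    PARTICIPATORY = "participatory commitment"


@dataclass(frozen=True)
class Commitment:
    agent: str            # who is committed
    action: str           # what they are committed to
    kind: CommitmentType  # which of the four types it is


def joint_commitment_holds(participants, action, commitments):
    """Return True only if every participant holds a participatory
    commitment to the same action (illustrative reading of the text)."""
    committed = {
        c.agent
        for c in commitments
        if c.kind is CommitmentType.PARTICIPATORY and c.action == action
    }
    return set(participants) <= committed


# Hypothetical usage: you and I agree to meet for coffee at noon.
coffee = "meet for coffee at noon"
agreement = [
    Commitment("me", coffee, CommitmentType.PARTICIPATORY),
    Commitment("you", coffee, CommitmentType.PARTICIPATORY),
]
both_in = joint_commitment_holds(["me", "you"], coffee, agreement)
only_me = joint_commitment_holds(["me", "you"], coffee, agreement[:1])
```

On this reading, a promise from one side alone (a simple other-commitment) does not produce a joint commitment, because the other party has not committed in return.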
These four types of commitment differ in how binding they are:
1. Private self-commitment. If I commit myself privately to going for
coffee at noon, I can change my mind, or fail to get there, with no
consequences for anyone else.
2. Public self-commitment. If I tell you about the commitment and
then change my mind or fail to get there, these changes are public,
perhaps to my embarrassment.
3. Other-commitment. If, instead, I commit myself to you and then
change my mind or fail, I expect you to hold me responsible for
the consequences. Perhaps you told a friend I would be there, and
you are angry that I disappointed him or her.
4. Participatory commitment. If, finally, you and I are jointly committed
and I unilaterally change my mind or fail, I expect you to hold me
responsible not only for my individual failure but for subverting
our joint action—what we would have accomplished jointly. If you
turn up at noon for coffee, I will have wasted your time and abused
your trust.
Plainly, private self-commitments are easier to alter midcourse than
public self-commitments and other-commitments. But the most binding
of all are joint commitments.
Meeting for coffee at noon is what David Lewis (1969) called a coordination
problem, and our agreement to meet is a solution to that problem.
With the agreement, you and I establish the mutual belief that we each
expect both of us to go for coffee at noon.
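
Why an agreement is needed can be illustrated with a small sketch of a coordination problem in Python. The two meeting times and the payoffs (1 if we turn up at the same time, 0 otherwise) are hypothetical values chosen for illustration; they are assumptions, not figures from Lewis or from the discussion above.

```python
from itertools import product

# Hypothetical alternatives each of us could choose independently.
OPTIONS = ["noon", "one o'clock"]


def payoff(mine, yours):
    """Illustrative payoff: meeting succeeds only if our choices match."""
    return 1 if mine == yours else 0


def equilibria(options):
    """Return the pairs of choices from which neither person would gain
    by switching alone (pure-strategy equilibria of the matching game)."""
    result = []
    for a, b in product(options, repeat=2):
        a_is_best = all(payoff(a, b) >= payoff(alt, b) for alt in options)
        b_is_best = all(payoff(b, a) >= payoff(alt, a) for alt in options)
        if a_is_best and b_is_best:
            result.append((a, b))
    return result


stable_outcomes = equilibria(OPTIONS)
```

Under these assumptions there are two stable outcomes, one for each meeting time, and nothing in the payoffs favors either. The agreement does the work of selecting one of them by establishing what each of us expects the other to do.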
Joint commitments are subject to the sedan-chair principle. Suppose
Susan and Tom are two porters carrying Veronica in a sedan chair. They
cannot pick the chair up, or set it down, without doing it together. If
one of them tries, they risk not only spilling Veronica onto the street,
but injuring each other. Likewise, in assembling the TV stand, Ann and
Burton cannot start, or stop, without doing so together. Acting alone
risks causing harm. If Ann suddenly stops holding a sidepiece while
Burton is screwing in a screw, she may damage the sidepiece, hurt
herself, or hurt Burton.

Projective Pairs
It is one thing to characterize joint commitments, but quite another
to say how they get established. One way is with projective pairs.
After Ann and Burton attached the crosspiece to the two sidepieces,
they were at a choice point: What to do next. They needed to establish
a joint commitment to a course of action. They could not count on
such a commitment arising spontaneously and simultaneously. They
had to make it happen, and they did it this way:
(3) Burton ((Now let’s do this one)) [picking up the top-piece]
Ann Okay
Burton proposed to Ann that the two of them (“let’s”) assemble the top
piece (“do this one”) next (“now”). Ann took up his proposal by agreeing
to it (“Okay”). In just two turns, they established a joint commitment
to assemble the top piece next. They specified the ensemble (“us”) and
goal (“do this one”) as well as the commitments to do their parts in
reaching the goal.
This pair of turns is what Schegloff and Sacks (1973; see also Schegloff
this volume) called an adjacency pair. In such pairs, one person produces
the first part, and another person, the second part. The first part is
of a type for which it is conditionally relevant for the second part to
be of a type projected by the first part. Burton produced a suggestion;
that projected her consent to go ahead as the second part; and Ann
immediately gave her consent with “Okay” (see Bangerter and Clark
2003).
What is needed here, however, is the more general notion of projective
pair (Clark 2004). In Schegloff and Sacks’s account of adjacency pairs,
both parts must be turns at talk, yet in many situations, one or both
parts of analogous pairs are gestural. Later in assembling the TV stand,
Ann and Burton produce this sequence of actions:

(4) Ann [Extends hand with screw] So you want to stick
the screws in?
Burton [Extends hand to take screw]
In line 1, Ann proposes that Burton stick the screws in. In line 2, Burton
could take her up with “Okay.” Instead, he extends his hand to take
the screw. She construes that move as signaling consent roughly as if
he had said “Okay.” I will use the term projective pair to cover adjacency
pairs as well as analogous pairs with gestures.
A projective pair, then, is a proposal plus an uptake. By proposal, I
mean any signal that raises the possibility, at any strength, of a joint
action or position by the initiator and addressees. By uptake, I mean any
action that addresses that possibility. So when Ann makes her proposal
to Burton, she is simply initiating a process. Burton has the options of
accepting, altering, rejecting, or even disregarding her proposal. Here
are examples of the four options.
Full acceptance of proposal. In 4, Ann proposes, “So you want to stick
the screws in [extending her hand with a screw].” Burton could have
replied, “No, you do it” or “Hold on. I’ve got a better idea,” but instead
he takes hold of the screw and thereby accepts her proposal in full. Now,
they are jointly committed to transferring the screw from her to him.
Altered acceptance of proposal. In 5, Burton asks a question that projects
“Yes, it is” or “No, it isn’t” as uptake:

(5) Burton Is this one big enough?
Ann Oh ((xxx)) I guess cause like there’s no other side
for it to come out.
Burton M-hm.

In line 2, instead of saying yes or no, Ann accepts an altered version of
Burton’s proposal, and he accepts her alteration, “m-hm.”
Rejection of proposal. In 6, from another corpus (Svartvik and Quirk
1980), we find yet another pattern:
(6) Betty what happens if anybody breaks in and steals it, —
are are is are we covered or .
Cathy Um — I don’t know quite honestly .

Betty’s question projects an explanation for what happens if anybody
breaks in. Cathy is not able to provide that explanation, so she turns
down, or rejects, the proposal that she do so. Not only does Cathy reject
the proposal, but she gives a reason why she is rejecting it.
Disregard of proposal. In 7, Ann asks Burton a question, which projects
a yes or no in agreement:
(7) Ann They snap in? [said as she snaps the rollers in]
Burton [Silence, no visible gesture or response]
Although Burton presumably has heard Ann, he appears to disregard her
proposal by going on without addressing it. He appears unwilling to let
her engage him in the potential projective pair. He simply opts out.
Projective pairs are efficient ways of creating joint commitments.
When Burton realizes that he and Ann need to plan their next joint
action, he initiates a projective pair, “Now let’s do this one,” and Ann
completes it, “Okay.” With projective pairs people are really negotiating
joint commitments. No matter what the first person proposes, the
second person has options in taking it up. The joint commitments
that emerge are shaped by them both.

Joint Actions and Joint Positions


What do people make joint commitments about? Recall that the first
pair part of an adjacency pair is an action of a particular type (e.g., a
suggestion), which projects an action of a second type (e.g., a consent)
as the second pair part. A question projects an answer, a greeting a
greeting, a request a promise, and so on. Action types like these have
long been studied, from quite a different perspective, as illocutionary
acts (e.g., Austin 1962; Bach and Harnish 1979; Searle 1969, 1975).
Although the analysis of illocutionary acts has its problems, it still
offers useful insights.
Illocutionary acts have usually been treated as autonomous. For Searle
(1969), a question “counts as an attempt to elicit [certain information]
from [the hearer] H.” But surely, questions are more than that. In 5,
when Burton asks, “Is this one big enough?” he is not trying simply to
elicit “yes” or “no.” He is proposing that Ann join him in establishing
whether or not a particular peg is big enough. In her uptake, she offers
useful information, but without answering yes or no. Together, they
establish a joint commitment to the proposition, roughly stated, that
“the peg is probably big enough because there’s no other side for it to
come out.” This proposition is not either Ann’s or Burton’s alone. It is
their joint position—an amalgam of contributions by them both.
Projective pairs can also be used to establish joint courses of action.
Recall this exchange from the TV stand assembly:

(3) Burton ((Now let’s do this one)) [picking up the top-piece]
Ann Okay

Burton suggests a course of action with one illocutionary act, and Ann
consents to it with another. The result is a joint commitment to a
course of action.
The idea, then, is that illocutionary acts are better viewed as
participatory acts. A question is a question because it can be the first part of a
projective pair in which the expectable second part is an assertion that
“answers” it. The projective pair establishes a joint position. Likewise, a
suggestion is a suggestion because it can be the first part of a projective
pair in which the expectable second part is consent. The projective pair
establishes a joint course of action. According to Searle (1975; see also
Bach and Harnish 1979), there are four main types of illocutionary
acts: assertives, directives (which include questions), commissives, and
expressives.4 In this system, assertives, questions, and expressives are
used for establishing joint positions, whereas directives (other than
questions) and commissives are used for establishing joint courses of
action. All are used for establishing joint commitments.
The picture so far is this. Ann and Burton need to coordinate their
actions and their positions if they are to assemble the TV stand together.
They do that largely with projective pairs in which one of them proposes
a next joint step, and the other takes up that proposal—accepting,
altering, rejecting, or disregarding it. In this way, they negotiate joint
commitments that are mutually satisfactory.

Emergence of Joint Commitments


No matter what the joint action, agreement must be reached, explicitly
or implicitly, on at least five elements:
Participants. Who are to take part in the joint action?
Roles. In what roles?
Content. What actions are they to perform, or what positions are
they to adopt?
Timing. When are the actions to take place, or the positions to be
in effect?
Location. And where?
Let me call these joint elements. Reaching agreement on these elements
tends to be incremental and hierarchical, leading to the gradual
emergence of joint activities.

Incremental Commitments
Joint elements tend to get fully specified piecemeal. When Ann and
Burton arrived at the lab room, they were asked to participate in a
psychology experiment. When they agreed, all they were committed
to was “doing some activity together in this room for the next hour.” It
was only after further instructions that this commitment got narrowed
to “assembling a TV stand together.” It got narrowed further to “doing
the top-piece together” with this adjacency pair:

(3) Burton ((Now let’s do this one)) [picking up the top-piece]
Ann Okay

With this exchange, Ann and Burton agreed on the content (“doing the
top-piece”), timing (“now”), and roles (Burton in control, Ann helping)
of their next joint action.
For these elements to be part of a joint commitment, they must
be taken as common ground (Clark 1996). There are many ways to
establish them as common ground (see Clark and Marshall 1981; Clark
and Schaefer 1989; Clark et al. 1983; Enfield this volume; Goodwin this
volume; Hutchins this volume; Lewis 1969; Schelling 1960):

Explicit commitments. Burton and Ann used all the talk in 1 to commit
explicitly to the roles, content, and timing of their next joint actions.
Joint salience. In 3, Burton picked up the top piece as he spoke, making it
obvious that he would affix the top piece to the piece Ann was holding.
That helped fix their roles and the content of their action. Indeed, they
presupposed that as they carried out the next joint action.
Precedent. Ann and Burton often established who would do what on
the basis of what they had just done. Once, for example, when Ann
had just inserted one peg, the two of them presupposed that she
would insert the second one too.
Conventional practice. Several couples assembling the TV stand
(although not Ann and Burton) presupposed that the man was in
charge, and the woman was the assistant: Building furniture was a
man’s job. This presupposition was apparently based on their idea
of conventional practice.

Hierarchies of Commitment
Most joint activities, as I noted earlier, can be viewed as hierarchies of
joint positions and actions. These hierarchies, too, emerge bit by bit,
and so, therefore, do the joint commitments that coordinate them.
Consider the assembly of the TV stand first by Peter working alone and
then by Ann and Burton working together.5
Peter assembles the TV stand more or less according to a standard
means–end analysis (Newell and Simon 1972). He begins with the
problem, “How to assemble the TV stand from its parts,” which he then
decomposes recursively into subproblems. He does the decomposition
one piece at a time. What emerges is a hierarchy of self-commitments
that can be represented as a standard outline:
1. Build TV stand
    1.1. Arrange parts
        1.1.1. Put sides in pile
        1.1.2. Put screws, pegs in pile
        1.1.3. Put wheels in pile
    1.2. Assemble parts
        1.2.1. Attach top-piece to side 1
            1.2.1.1. Insert pegs
            1.2.1.2. Affix top piece to pegs
        1.2.2. Attach side 2 to top-piece
    Etc.
Peter first decomposes the entire task into “arranging the parts” and
“assembling the parts.” He then decomposes “arranging the parts” into
“gathering the sides into a pile” plus “gathering the screws and pegs”
plus “gathering the wheels.” And so on. Each line represents a self-
commitment to a state or action.
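
The recursive decomposition described above can be sketched in a few lines of Python. This is only an illustration: the nested (name, subtasks) representation, the `outline` function, and the `tv_stand` variable are assumptions introduced here, not part of the original analysis. The task names are taken from Peter's outline.

```python
# Illustrative sketch: a task is a (name, subtasks) pair, and the outline
# numbering is produced by walking the hierarchy recursively.
# The representation and all names here are hypothetical.

def outline(task, prefix="1"):
    """Return outline lines such as '1.2.1. Attach top-piece to side 1'."""
    name, subtasks = task
    lines = [f"{prefix}. {name}"]
    for i, sub in enumerate(subtasks, start=1):
        # Each subproblem is decomposed in the same way as its parent.
        lines.extend(outline(sub, f"{prefix}.{i}"))
    return lines


tv_stand = ("Build TV stand", [
    ("Arrange parts", [
        ("Put sides in pile", []),
        ("Put screws, pegs in pile", []),
        ("Put wheels in pile", []),
    ]),
    ("Assemble parts", [
        ("Attach top-piece to side 1", [
            ("Insert pegs", []),
            ("Affix top piece to pegs", []),
        ]),
        ("Attach side 2 to top-piece", []),
    ]),
])

for line in outline(tv_stand):
    print(line)
```

Each printed line corresponds to one self-commitment in the hierarchy; the depth of the numbering reflects how far the problem has been decomposed.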
Ann and Burton, too, do a means–end analysis (more or less), but
with a crucial difference: They do it together. They establish a hierarchy
of joint commitments, not self-commitments, which looks something
like this:
1. Build TV stand
    1.1. Attach cross-piece to side-piece
        1.1.1. Stick pegs into side-piece
            1.1.1.1. Find pegs
            1.1.1.2. Insert pegs into side-piece
        1.1.2. Affix cross-piece to side piece
    1.2. Attach top-piece to side-piece
    Etc.
Ann and Burton establish most of these joint commitments by
negotiation. They agree on 1.1, for example, by means of the adjacency pair
in (2), repeated here:
(2) Ann Should we put this in, this, this little like kinda cross
bar, like the T? like the I bar?
Burton Yeah ((we can do that))

They next agree on 1.1.1, but that takes eight more turns. And so it goes.
Although Ann and Burton assembled the same TV stand as Peter did,
they had to coordinate two people’s ideas, two people’s commitments,
two people’s actions.

Stacking and Persistence of Joint Commitments


Joint commitments are complicated, therefore, not just because they
require two decision makers—recall the sedan-chair principle—but
because their emergence is incremental and hierarchical. This leads to
two properties that I will call stacking and persistence.
People’s commitments to each other accumulate, or stack up, the
further they get into any joint activity.6 At the beginning of their task,
Ann and Burton’s joint commitment was simple: “1 build TV stand.”
Their next move added a commitment to the stack: “1.1 attach
cross-piece to side-piece.” Then they added: “1.1.1 stick pegs into side-piece.”
If we think of each commitment as written on a sheet of paper, then
Ann and Burton stacked up more and more sheets the further they got
into the hierarchy. And to complete their task, they had to discharge
all of these commitments from the top of the stack down.
Joint commitments get added to a stack in two ways. One is vertical.
For Ann and Burton to establish “1.1.1 stick pegs into side-piece,” they
had first to establish, or presuppose, all the joint commitments in
the stack below it—“1 build TV stand” and “1.1 attach cross-piece to
side-piece.” The other way is horizontal. Once Ann and Burton were
committed to, “1.1.1.1 find the pegs,” they were also committed to the
next steps at that level, here “1.1.1.2 insert the pegs into the side-piece.”
That is, joint commitments 1.1.1.1 and 1.1.1.2 were placed next to each
other on top of the stack so far. Ann and Burton had to discharge both
of them before considering the joint commitment below it (1.1.1) to
be complete.
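
The analogy with pushdown stacks can be made concrete with a small Python sketch. It is a toy model under stated assumptions: the class `CommitmentStack` and its methods are hypothetical names introduced for illustration, and the rule that only commitments on the top level can be discharged is one simple reading of the stacking property, not a claim about how people actually keep track of commitments.

```python
class CommitmentStack:
    """Toy model of stacked joint commitments (illustrative only)."""

    def __init__(self):
        # Each level holds the commitments still open at that level.
        self.levels = []

    def push_level(self, *commitments):
        """Vertical stacking: add a new level on top of the stack.

        Passing several commitments at once models horizontal stacking,
        in which sibling commitments sit side by side on one level.
        """
        if not commitments:
            raise ValueError("a level needs at least one commitment")
        self.levels.append(list(commitments))

    def discharge(self, commitment):
        """Discharge a commitment, which must sit on the top level."""
        if not self.levels or commitment not in self.levels[-1]:
            raise ValueError(f"{commitment!r} is not on top of the stack")
        self.levels[-1].remove(commitment)
        if not self.levels[-1]:
            self.levels.pop()  # the whole level has been discharged

    def height(self):
        return len(self.levels)


stack = CommitmentStack()
stack.push_level("1 build TV stand")
stack.push_level("1.1 attach cross-piece to side-piece")
stack.push_level("1.1.1 stick pegs into side-piece")
stack.push_level("1.1.1.1 find the pegs",
                 "1.1.1.2 insert the pegs into the side-piece")
print(stack.height())  # 4

try:
    # 1.1.1 cannot be completed while 1.1.1.1 and 1.1.1.2 are still open.
    stack.discharge("1.1.1 stick pegs into side-piece")
except ValueError as err:
    print("blocked:", err)

stack.discharge("1.1.1.1 find the pegs")
stack.discharge("1.1.1.2 insert the pegs into the side-piece")
stack.discharge("1.1.1 stick pegs into side-piece")
print(stack.height())  # 2
```

In this sketch the commitments lower in the stack are untouched by whatever happens above them, which is the property described next as persistence.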
Joint commitments at the bottom of the stack persist even when those
on top of them are renegotiated or reneged on. For the TV stand, Ann
and Burton committed themselves first to “1 build a TV stand” and then
to “1.1 attach cross-piece to side-piece.” Next they negotiated on 1.1.1
and 1.1.2. When they negotiated the next level, they retained all of the
commitments in the stack below it—both vertically and horizontally. If
Ann had objected to sticking in the pegs and had got Burton to do it,
that would not have changed their commitment to “1.1 attach
cross-piece to side-piece” and “1 build a TV stand.”

Entanglements of Joint Activities


Why are stacking and persistence so important? Because they help
explain how hard it is to extricate oneself from joint activities. Look
at Ann just as she begins negotiating with Burton on “1.1.1 stick pegs
into side-piece.” She is committed to being part of 1, 1.1, and 1.2, and
she is about to add 1.1.1 and 1.1.2. If she reneges unilaterally on 1.1.1,
she is letting Burton down not only on 1.1.1 but on 1.1.2—a double
injury. Even if she reneges on 1.1.1, she is still committed to 1, 1.1,
and 1.2. Plainly, once you get into a joint activity, it is hard to take
unilateral actions.
Just how hard it is, is illustrated by the closing of telephone
conversations (Schegloff and Sacks 1973). Once two parties think they have
finished a conversation, they do not just hang up. They first reach
agreement that they have completed the last topic and then open up a
closing section. Consider the end of a telephone call in 8 (from Svartvik
and Quirk 1980, S.7.2p.1397):
(8)
Topic talk             1   Ned    I don’t know whose car it’ll [be,
                       2   Molly  [uhuh,
                       3   Ned    I don’t think it’ll be Chris’s,
                       4   Molly  m,
                       5   Ned    but. uh I’ll be there directing traffic,
Pre-closing statement  6   Molly  right, . okay?
Response               7   Ned    okay,
Future plans           8   Molly  great, — yeah, well, . see you then, .
                       9   Ned    all right, [see you Friday
                       10  Molly  [that’s wonderful,
Taking leave           11  Ned    right, bye now,
                       12  Molly  bye,
Terminating contact    13  Both   [hang up]

Once Ned and Molly finish the topic about cars in line 6, Molly offers
to start closing the conversation with “right,” and Ned agrees, “okay.”
With that exchange, they begin the actual closing, in which they make
future plans, take leave (with “bye now” and “bye”), then hang up.
Closing a conversation is a joint decision, but once it is made, the two
parties still have work to do. Even routine telephone calls, like calls to
directory enquiries, have closing sections, although the closings are
briefer, reflecting the less intimate activity just completed (Clark and
French 1981; Clark and Schaefer 1987).

Risks in Joint Commitments


To enter a joint commitment is to give up a bit of one’s autonomy.
When I join Helen in juggling six pins between us, I lose some of my
options. I must work closely with her or risk hurting one or both of
us. And when I drive out into the street, I must coordinate with all the
other drivers or risk collision or injury. It is not just that I give up a
bit of my autonomy. I cede to Helen, and to the other drivers, partial
control over what I do. Normally we have good reasons for ceding
control like that. I enjoy juggling with Helen, and I want to get to my
destination safely (see Boyd and Richerson this volume, for the costs
and benefits of cooperation).
Sharing control in joint activities, however, carries risks. One risk
is exploitation. Partners are tempted to exploit the partial control they
have over us. Helen might draw me into a juggling routine that I do not
know or do not want to do. Another driver might cut in front of me and
force me to brake suddenly. Just as I could injure Helen or other drivers
if I do not cooperate, they can injure me by exploiting my cooperation.
Being drawn into unwanted, unforeseen, or regrettable actions has its
moral and emotional consequences. I may get angry at the reckless
driver and feel he was wrong. I may feel embarrassed at not knowing
the juggling routine and hold Helen responsible.
Another risk is overcommitment. Joint commitments, once negotiated,
are generally difficult to renegotiate, forcing the parties to honor their
original commitments. Some joint commitments are impossible to
renegotiate. Once I am on the road with a reckless driver, or in a difficult
routine with Helen, it is too late to change or back out. I must make
the best of the situation, however much I may resent it.
To illustrate these risks, I turn to one of the most famous studies in
social psychology in the last half century—Stanley Milgram’s experiments
on “obedience to authority.” These experiments have caused a great
stir because they are taken as evidence that people will obey authority
blindly even when that causes harm to others. The experiments may
indeed show that. But for us, they are excellent examples of the risks
of joint commitments.

The Milgram Experiments


In 1962, Milgram advertised in the newspaper for paid male volunteers
to come either to “the elegant Yale Interactional Laboratory,” or to a
modest, unaffiliated “Research Associates of Bridgeport [Connecticut],”
for a “study of memory.” The subjects ranged from factory workers to
professors. When a subject arrived at the laboratory, he and another
subject drew straws to see who would be the “teacher” and who the
“learner” in a learning experiment. The second subject was a confederate
of Milgram’s, and always became the learner. He was played by a 47-
year-old accountant. The “experimenter” was played by a 31-year-old
biology teacher dressed in a gray technician’s coat; “his manner was
impassive and his appearance somewhat stern.”
The learner’s job was to memorize a list of word pairs, and the teacher’s
job was to punish him for each wrong response. The learner was strapped
into a chair with electrodes, and the teacher sat in front of an impressive
“shock generator.” The generator had 30 switches labeled 15 to 450
volts and further labeled (in sets of four): Slight Shock, Moderate Shock,
Strong Shock, Very Strong Shock, Intense Shock, Extreme Intensity
Shock, Danger: Severe Shock, and XX. The teacher was instructed to
“move one level higher on the shock generator each time the learner
gives a wrong answer” and to announce the voltage level before each
shock.
Milgram’s interest was in how far subjects would go before opting
out of the experiment. In one experiment, the learner was in a second
room, but could be heard making an escalating series of protests as the
voltage was increased. The experimenter, sitting at a table behind the
subject, responded to the subject’s objections with prods, “using as
many as necessary to bring the subject into line.”

Prod 1: Please continue, or, please go on.
Prod 2: The experiment requires that you continue.
Prod 3: It is absolutely essential that you continue.
Prod 4: You have no other choice, you must go on.

When Milgram described this experiment to various groups of
psychiatrists, college students, and middle-class adults, each group predicted
that no one would reach the maximum of 450 volts (“XX”), and that
the average subject would max out at 135 volts (“Strong Shock”). Their
predictions were quite wrong. In fact, 62 percent of the subjects went
all the way to 450 volts, and the average subject maxed out at 368 volts
(“Extreme Intensity Shock”). We are poor at imagining how we would
behave in this experiment.
Milgram carried out 18 experiments, each with 40 participants. The
experiments varied on such features as where the subject, learner, and
experimenter sat, how they communicated, whether the laboratory was
at Yale or in Bridgeport, and who gave the orders. Only one experiment
had women as subjects, and they complied as often as the men. In most
of the experiments, a majority of subjects continued the shocks to the
maximum. Milgram described them as “obedient to authority”:
With numbing regularity good people were seen to knuckle under to the
demands of authority and perform actions that were callous and severe.
Men who are in everyday life responsible and decent were seduced by
the trappings of authority, by the control of their perceptions, and by the
uncritical acceptance of the experimenter’s definition of the situation
into performing harsh acts. [Milgram 1974:123]

But did these people really “knuckle under to the demands of authority”?
Did they show “uncritical acceptance of the experimenter’s definition
of the situation”?
The Milgram Experiment as Two Joint Activities
The psychology experiment, Martin Orne (1962) argued, is “a very
special form of social interaction” (p. 782). The demands it places on
experimenters and subjects are like the demands placed on the
participants in any social interaction—in any joint activity. The Milgram
experiment is an example par excellence of Orne’s argument.
The Milgram experiment, as viewed by the subject, is really two joint
activities, one embedded within another:
Memory task. This joint activity has two participants, whose roles are
teacher and learner. The goal is for the teacher to teach the learner a
list of word pairs. The basic activity is a series of cycles of joint action:
the teacher gets the learner to learn the word pairs one by one. On each
cycle, the two of them coordinate through scripted projective pairs: the
teacher presents a test word, and the learner responds with the paired
word; the teacher does or doesn’t shock the learner, and the learner
does or doesn’t groan or yell.
Psychology experiment. The larger joint activity has three participants,
whose roles are experimenter and subjects. Their goal is to carry out a
psychology experiment. They also coordinate through projective pairs.
These include the experimenter’s instructions as well as long exchanges
between subject and experimenter.
The subjects believed that the joint activity of interest was the memory
task. But the real activity of interest was the psychology experiment: At
what point would they opt out of it?
In Milgram’s book Obedience to Authority, the first description of his
experiments was this: “A person comes into the psychological laboratory
and is told to carry out a series of acts that come increasingly into conflict
with conscience” (p. 3). But this is a quite misleading characterization
of what went on—and a good example of what Orne was speaking of.
The extensive dialogues between experimenter and subject, quoted by
Milgram, reveal something very different. The subject was not simply
“told to carry out a series of acts.” He negotiated with the experimenter
on almost every act and position he took. These negotiations were often
prolonged and intense, shaping what the subject did.

Mitigation
The experimenter relied on a range of negotiating tactics. One was
mitigation. Although the entire experiment hinged on the harm the
subjects thought their shocks were causing, what constituted harm was
negotiated by the experimenter and subject:
If the subject asked if the learner was liable to suffer permanent physical
injury, the experimenter said: “Although the shocks may be painful,
there is no permanent tissue damage, so please go on.” (Followed by
Prods 2, 3, and 4, if necessary.) If the subject said that the learner did not
want to go on, the experimenter replied: “Whether the learner likes it
or not, you must go on until he has learned all the word pairs correctly.
So please go on.” (Followed by Prods 2, 3, and 4, if necessary.) [Milgram
1974:21–22]

Many subjects were skeptical of the experimenter’s reassurances,
which were belied by the labels on the shock generator, and that led
to extended—sometimes heated—negotiations. These negotiations go
to the heart of the study. If the shocks were not genuinely harmful, the
subject had less reason to abort the experiment.
Another negotiating tactic by the experimenter was to accept
responsibility for any harm done. An exchange with a subject called
Prozi went as follows:
Subject: I mean who’s going to take the responsibility if anything
happens to that gentleman?
Experimenter: I’m responsible for anything that happens to him.
Continue please.
[36 turns intervening]
Experimenter: Continue. Go on.
Subject: You accept all responsibility?
Experimenter: The responsibility is mine. Correct. Please go on. [pp.
74–76]7

As Milgram said about Prozi, “Once the experimenter has reassured the
subject that he is not responsible for his actions, there is a perceptible
reduction in strain” (p. 160). Another subject, called Rensaleer, was
interviewed after the experiment: “When asked who was responsible
for shocking the learner against his will, he said, ‘I would put it on
myself entirely’ ” (p. 51).
Negotiations of responsibility (as with Prozi) also go to the heart of
the study. If the experimenter was fully responsible for harm done,
then subjects had less reason to call off the experiment. But how many
subjects negotiated responsibility? Milgram did not say. Still, Rensaleer
called off the experiment midway, whereas Prozi continued his shocks
to the maximum.
Risks of Exploitation
Other negotiating tactics by the experimenter were patently exploitative.
One was to use disregard in uptake, as described earlier. Here is an
illustration:
Subject: I can’t stand it. I’m not going to kill that man in
there. You hear him hollering?
Experimenter: As I told you before, the shocks may be painful,
but—
Subject: But he’s hollering. He can’t stand it. What’s going
to happen to him?
Experimenter: (his voice is patient, matter-of-fact): The experiment
requires that you continue, Teacher. [p. 73]

In the first exchange, the subject suggests that he may “kill that man
in there” and asks “You hear him hollering?” Although the suggestion
and question are serious, the experimenter disregards both by simply
repeating a point he had made before. In the second exchange, he
disregards all three of the subject’s proposals.
To disregard a proposal is to imply that it is not worthy of consideration.
It may be unimportant. It may be irrelevant. It may be misconceived. It
may be too obvious to deal with. So when the experimenter disregards
“You hear him hollering?” and “What’s going to happen to him?” the
subject can take him as implying that the questions are misconceived or
irrelevant. These interpretations are reinforced when the experimenter
speaks “with detached calm.” Exchanges like this were common in the
transcripts quoted in Milgram’s book and sometimes lasted for 20 to
30 turns.
As for the learner, the experimenter disregarded everything he said,
even when he screamed, “Let me out of here, you have no right to
keep me here. Let me out of here, let me out, my heart’s bothering me,
let me out!” What could the subject conclude except that the learner’s
demands were of no importance or relevance?
Subjects’ decisions were, indeed, influenced by how they negotiated.
In an analysis of the complete unpublished transcripts of one of
Milgram’s experiments, Modigliani and Rochat (1995) found that
subjects who raised questions or made objections early in the session
were significantly more likely to abort the experiment early. “Evidently,”
Modigliani and Rochat argued, “certain forms of verbal resistance can
alter the dynamics of interaction sufficiently to change its future course
and facilitate escape” (p. 119). In another experiment, the experimenter
and subject communicated only by telephone, so their negotiations were
presumably fewer, briefer, and less intense. In the standard experiment
(in Bridgeport), 65 percent of the subjects continued the shocks to the
maximum. In the telephone experiment, only 20 percent did.

Risks of Overcommitment
The participants in the Milgram experiments created tall stacks of
joint commitments. When a volunteer arrived at the lab, he agreed
first to be in the psychology experiment, then to be in the memory
study, and then to enter each part of the memory study. Consider a
hypothetical subject named Sam who is just about to shock the learner
for failing on word pair 14. From Sam’s perspective, the hierarchy of
joint commitments at that moment looks something like this (the
critical line is 1.3.14.2.1):

1. Enter experiment with others at Yale laboratory
    1.1 Arrange roles of teacher, learner, experimenter for memory task
    1.2 Establish procedure for memory task
    1.3 Enter memory task proper
        1.3.1 Instruction on word pair 1
        1.3.2 Instruction on word pair 2
        1.3.3 Instruction on word pair 3
        ...
        1.3.14 Instruction on word pair 14
            1.3.14.1 Exchange word pair 14
            1.3.14.2 Exchange feedback on word pair 14
                1.3.14.2.1 Teacher gives learner feedback, e.g. a major shock
                [1.3.14.2.2 Learner responds to feedback
        [1.3.15 Instruction on word pair 15
        ...
    [1.4 Exit memory task
    ...
[2. Exit experiment with others at Yale laboratory

By line 1.3.14.2.1, Sam has entered into and acted on 99 joint
commitments (those up to 1.3.14.2.1), and he has entered into others to be
acted on later (1.3.15 on), as marked by left square brackets. And the
current stack is five joint commitments high. So, by this moment, Sam
and the experimenter have a long record of joint actions achieved and
a tall stack of joint commitments yet to be achieved. This leaves Sam
at this juncture with four main options:
First, Sam might refuse to deliver the shock. But to do that, he would
have to renege not just on joint commitment 1.3.14.2.1, but on all of
the joint commitments in the stack below it. And to do that unilaterally
would destroy everything he and the experimenter had accomplished
together. Taking this option, then, has costs, and few subjects took it.
Second, Sam might negotiate with the experimenter on a joint exit
from the memory task and the experiment. Many subjects tried to do
this, but the experimenter refused.
Third, Sam might try to reframe the memory task in negotiation with
the experimenter: The shocks are not really so harmful, or the shocks
are really the experimenter’s responsibility. Many subjects tried this
option and succeeded.
Fourth, Sam might simply deliver the shock. This way he would
continue the long record of achievements in their joint actions, and
he would maintain the stack of joint commitments yet to be acted
on. As Milgram showed, a majority of subjects took this option to the
maximum shock level.
Within a hierarchy of joint commitments, therefore, subjects have
sound reasons for continuing—for taking the fourth option. As joint
commitments stack up, they become harder and harder to opt out
of—even with an uncooperative partner.
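
The height of Sam's stack can be read directly off the outline numbering. The following Python sketch is offered only as an illustration; the helper `stack_for` is a hypothetical name, and it assumes that each commitment's label lists its position at every level, separated by periods, as in the hierarchy above.

```python
def stack_for(label):
    """List the commitments stacked beneath (and including) a labeled one.

    Assumes outline labels such as '1.3.14.2.1', in which each prefix
    names a commitment lower in the stack.
    """
    parts = label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


sam_stack = stack_for("1.3.14.2.1")
print(sam_stack)       # ['1', '1.3', '1.3.14', '1.3.14.2', '1.3.14.2.1']
print(len(sam_stack))  # 5
```

On this reading, a unilateral refusal at 1.3.14.2.1 reneges on every entry in the list at once, whereas delivering the shock discharges only the top entry and leaves the rest of the stack intact.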

Morality and Emotion


Subjects in these experiments often reacted with great emotion. As
Milgram describes it:
Many subjects showed signs of nervousness in the experimental
situation, and especially upon administering the more powerful shocks. . . .
Subjects were observed to sweat, tremble, stutter, bite their lips, groan,
and dig their fingernails into their flesh. These were characteristic rather
than exceptional responses to the experiment. [1963:375]

But why these reactions and not others? These reactions reflect the
subjects’ anxiety about hurting the learner and not anger at, or
disappointment with, the experimenter. The subjects apparently took the
joint commitments as faits accomplis and focused instead on the harm
they were inflicting on the learner. These reactions might be expected
from the stacking and persistence of joint commitments.
Not all subjects reacted this way. Many confronted the experimenter
with moral issues, which then became points of negotiation. Different
subjects were quoted as saying, “I’m not going to kill that man in there”
and “You accept all responsibility” and “Surely you’ve considered the
ethics of this thing. (extremely agitated)” (p. 48). Some subjects were
placated in these negotiations, but others were not. When Rensaleer
was urged to go on (at 255 volts) by being told “You have no other
choice,” he responded (p. 51):
I do have a choice. (Incredulous and indignant:) Why don’t I have a choice?
I came here on my own free will. I thought I could help in a research
project. But if I have to hurt somebody to do that, or if I was in his place,
too, I wouldn’t stay there, I can’t continue. I’m very sorry. I think I’ve
gone too far already, probably.

Not only did Rensaleer display anger at the experimenter for trying to
draw him into this joint commitment, but he offered moral reasons
for opting out. And yet Rensaleer apologized for wrecking their session
(“I’m sorry”) and negotiated a joint exit to the experiment. Despite
everything, he took his joint commitments with the experimenter
seriously and found a satisfactory way to discharge them.

Conclusions
Sociality is not a mere abstraction. It is a feature of life that gets
played out in concrete social actions. These actions depend not only
on linguistic acts, as characterized by Schegloff (this volume), but on
extralinguistic acts. These range all the way from the pointing gestures
in Enfield’s (this volume) Laotian women, Goodwin’s (this volume)
stroke victim, Hutchins’s (this volume) ship navigators, Liszkowski’s
(this volume) infants, and Levinson’s (this volume) Rossel Islanders
to the head gestures of Gergely’s and Csibra’s parents and infants, and
the manual transfer of screws between Ann and Burton. Social actions
also take place in material locations, whether that is a ship’s navigation
room (Hutchins), a living room (Enfield, Goodwin), a lab room (Clark,
Gergely and Csibra), or an outdoor meeting area (Levinson).
Whatever the means and settings, people cannot take social actions—
they cannot carry out joint activities—without making commitments
to each other. As I argued, entering into joint commitments has
both benefits and risks. The benefits are obvious—the usual reasons
for engaging in a joint activity. Working together, Ann and Burton
were able to assemble the TV stand quickly and efficiently. But the
risks of joint commitments are just as real. Subjects in the Milgram
experiment, negotiating with the experimenter, were drawn into actions
they did not anticipate, want to do, or approve of. Such is the power
of joint commitments—the guiding force inside Levinson’s interaction
engine.

Acknowledgment
Some of the research reported here was supported by Grant
N000140010660 from the Office of Naval Research. I am indebted
to the participants at the Wenner-Gren Symposium in Duck, NC, in
October 2005 for discussions of the issues presented here. I thank Adrian
Bangerter, Eve V. Clark, Teenie Matlock, Elsie Wang, Nick Enfield, and
Steve Levinson for comments on drafts of this chapter.

Notes
1. I thank Julie Heiser and Barbara Tversky for use of their video recording
of this session.
2. This puzzle has been examined, without resolution, by philosophers
(e.g., Bratman 1992; Grice 1989; Harman 1977; Searle 1990; Tuomela 1995),
computer scientists (e.g., Cohen and Levesque 1991; Grosz and Sidner 1990),
and psychologists (Clark 1996; Clark and Carlson 1982; Tomasello et al. in
press). Still, the schema I describe later is close to a consensus solution to the
puzzle.
3. The term commitment is used in game theory (e.g., Schelling 1960) in a
sense closest to what I am calling public self-commitment.
4. Here I put aside institutionally based illocutionary acts that Searle calls
declarations.
5. For Peter, I used a video recording of a lone individual assembling the
same TV stand that Ann and Burton assembled. I thank Sandra Lozano and
Barbara Tversky for the recording.
6. See pushdown stacks in computer programming.
7. There was no mention in the guidelines that the experimenter could
negotiate responsibility for harm. Nor is this usually mentioned in discussions
of Milgram’s findings. Apparently, the experimenter improvised in other
unspecified ways, too.

References
Austin, J. L. 1962. How to do things with words. Oxford: Oxford University
Press.
Bach, K., and R. M. Harnish. 1979. Linguistic communication and speech
acts. Cambridge, MA: MIT Press.
Bangerter, A., and H. H. Clark. 2003. Navigating joint projects with
dialogue. Cognitive Science 27(2):32.
Bratman, M. E. 1992. Shared cooperative activity. Philosophical Review
101:327–341.
Clark, H. H. 1996. Using language. Cambridge: Cambridge University
Press.
——. 2004. Pragmatics of language performance. In Handbook of
pragmatics, edited by L. R. Horn and G. L. Ward, 365–382. Oxford:
Blackwell.
Clark, H. H., and T. B. Carlson. 1982. Speech acts and hearers’ beliefs. In
Mutual knowledge, edited by N. V. Smith, 1–36. New York: Academic
Press.
Clark, H. H., and J. W. French. 1981. Telephone goodbyes. Language in
Society 10:1–19.
Clark, H. H., and C. Marshall. 1981. Definite reference and mutual
knowledge. In Elements of discourse understanding, edited by A. K. Joshi,
B. L. Webber, and I. A. Sag, 10–63. New York: Cambridge University
Press.
Clark, H. H., and E. F. Schaefer. 1987. Collaborating on contributions
to conversations. Language and Cognitive Processes 2(1):19–41.
——. 1989. Contributing to discourse. Cognitive Science 13:259–294.
Clark, H. H., R. Schreuder, and S. Buttrick. 1983. Common ground
and the understanding of demonstrative reference. Journal of Verbal
Learning and Verbal Behavior 22:245–258.
Cohen, P. R., and H. J. Levesque. 1991. Teamwork. Nous 25:487–512.
Grice, H. P. 1989. Studies in the way of words. Cambridge, MA: Harvard
University Press.
Grosz, B. J., and C. L. Sidner. 1990. Plans for discourse. In Intentions in
communication, edited by P. R. Cohen, J. Morgan, and M. E. Pollack,
419–444. Cambridge, MA: MIT Press.
Harman, G. 1977. Review of “Linguistic behavior” by Jonathan Bennett.
Language 53:417–424.
Lewis, D. K. 1969. Convention: A philosophical study. Cambridge, MA:
Harvard University Press.
Milgram, S. 1963. Behavioral study of obedience. Journal of Abnormal
Psychology 67:371–378.
——. 1974. Obedience to authority: An experimental view. New York: Harper
and Row.
Modigliani, A., and F. Rochat. 1995. The role of interaction sequences
and the timing of resistance in shaping obedience and defiance to
authority. Journal of Social Issues 51(3):18.
Newell, A., and H. A. Simon. 1972. Human problem solving. Englewood
Cliffs, NJ: Prentice-Hall.
Orne, M. T. 1962. On the social psychology of the psychological
experiment: With particular reference to demand characteristics and
their implications. American Psychologist 17(11):776–783.
Schegloff, E. A., and H. Sacks. 1973. Opening up closings. Semiotica
8:289–327.
Schelling, T. C. 1960. The strategy of conflict. Cambridge, MA: Harvard
University Press.
Searle, J. R. 1969. Speech acts. Cambridge: Cambridge University Press.
——. 1975. A taxonomy of illocutionary acts. In Minnesota studies in the
philosophy of language, edited by K. Gunderson, 334–369. Minneapolis:
University of Minnesota Press.
——. 1990. Collective intentions and actions. In Intentions in
communication, edited by P. R. Cohen, J. Morgan, and M. E. Pollack,
401–415. Cambridge, MA: MIT Press.
Svartvik, J., and R. Quirk (eds.). 1980. A corpus of English conversation.
Lund, Sweden: Gleerup.
Tomasello, M., M. Carpenter, J. Call, T. Behne, and H. Moll. in press.
Understanding and sharing intentions: The origins of cultural
cognition. Behavioral and Brain Sciences.
Tuomela, R. 1995. The importance of us: A philosophical study of basic
social notions. Stanford: Stanford University Press.
Part 2

Psychological Foundations
five

Infant Pointing at 12 Months:
Communicative Goals, Motives, and
Social-Cognitive Abilities
Ulf Liszkowski

Human Communication and Infant Pointing


Human communication is special and unique in many ways. It is a
joint activity that rests on communicative intentions to both transmit
information and have a person receive the information on the basis of
a sender’s intention (Grice 1969; Sperber and Wilson 1995). Pointing,
in human sociality, is a special behavior that is naturally used to
communicate, with or without speech (e.g., Kendon 2004; Kita 2003;
Werner and Kaplan 1963). Importantly, communicative pointing is
ostensive, that is, it is produced and comprehended as “for someone.”
It is not an individualistic act that manifests an individual intention
to act on an object directly, like, for example, an arm extension to
touch or grab an object. Instead, the primary function of pointing is to
indicate a referent and locally anchor it in space. Goodwin (2003, this
volume) describes pointing as “situated practice” for which meaning is
construed from its context. Importantly then, for pointing to convey
meaning beyond a vectorial indication in space, interlocutors need to
share a joint context and mutually understand the relations between
each other and the environment. Communicative pointing in a shared
context requires comprehension of the indication as referential and an
understanding of the interlocutors’ relations toward each other and a
referent. It involves a social–cognitive understanding of persons having
attentional states and attitudes toward the environment.
In ontogeny, human infants begin pointing with an extended arm
(often with extended index finger relative to the other fingers) around
one year. Shortly afterward, toddlers already communicate linguistically. But
it has been argued that metarepresentational abilities that are necessary
for human communication (Sperber and Wilson 1995) emerge only
later, around age four (Perner 1991). Apparently, this developmental
gap poses a dilemma to accounts of human communication, because it
may suggest the presence of full-blown human communication in the
absence of a cognitive prerequisite (Breheny in press). A comparative
perspective might seem to support this view, because nonhuman apes
in captivity also point (Leavens et al. 2005) although they presumably
lack necessary social–cognitive abilities (Povinelli et al. 2003) and do
not understand the communicative intent of pointing (Tomasello this
volume). However, apes do not point in the wild or for each other.
They extend their arm only in captivity, and only instrumentally for
a cooperative human keeper, to solve the sole problem of obtaining
unreachable food (Leavens et al. 2005). Thus, it may be that apes do not
point in the human sense to begin with (Tomasello this volume), and so
the comparative perspective may not be analogous to the developmental
gap view.
Instead of following a comparative approach, this chapter focuses
on the ontogenetic question of what young infants do when they
have just begun pointing, and what they understand of other persons.
First, I present a theoretical framework that addresses different sublevels
of human communication and respective levels of social–cognitive
understanding that might be involved in developmentally (and
evolutionarily) simpler forms of human communication. Against this
background, I review the literature on infant pointing to evaluate current
hypotheses and empirical evidence. Then, I present and discuss in
detail three of our own recent empirical studies (Liszkowski et al. 2004,
2006, n.d.). Finally, I close with an outlook on the ontogeny of human
pointing.

A Framework for Communicative and Social Cognitive Abilities


The framework addresses three criteria of human communication and
corresponding social cognitive understanding. First, a necessary criterion
for the type of communication here is its intentional use. A second criterion,
the communicative goal, is about how intentional communication is
used, to distinguish between acts of simple behavior directing versus
influencing others’ psychological states. A third criterion concerns
motives in communication, to assess why intentional communication
is used, that is, whether it is used in an individualistic manner
for one’s own benefit only, or whether and to what extent it is a
joint, cooperative activity. The framework does not replace existing
accounts of human communication but identifies key components,
which is useful in developmentally and comparatively investigating
levels of communicative and social–cognitive abilities.

Intentional Communication. Intentional communication requires a sender
who chooses to execute a behavior with the goal of affecting a recipient
somehow. The sender does not directly influence the recipient by
acting on him with adequate force, but indirectly through some type
of signal. Intentional communication is directed at someone and done
with persistence or flexibility when the goal of affecting the other is not
achieved. It requires at least some kind of awareness of others and may
reveal a rudimentary form of social cognition. As I show, 12-month-olds
point with the intent to communicate.
Communicative Goals. It is possible to distinguish two types of a sender’s
communicative goal. A behavioral communicative goal is to affect a
recipient’s body. A psychological communicative goal is to influence a
recipient’s psychological state. The communicative goal of affecting
a recipient’s body involves simply a manipulation of another’s body
from afar, for example as a tool, like activating a robot. Such a
communicative goal may reveal some understanding of causality and a
social–cognitive understanding of others as animate agents, capable
of self-propelled motion. Ontogenetically shaped behaviors through
processes of ritualization (Tomasello this volume) could in this way be
interpreted as intentional communication. Instead, the communicative
goal of influencing a recipient’s mind is not only about activating a
body. It is about influencing a recipient’s attentional state toward a
referent of communication. It requires an understanding of others as
agents with attentional states. As I show, 12-month-olds point with the
communicative goal of directing others’ attention.
Motives in Communication. Motives, in psychology, refer to classes of
higher-order goals. Here, I distinguish simplistically between two types
of communicative motives. One type of motive for communicating
is about one’s own benefit only, mainly to obtain something.
Such a motive is rather self-centered and social only to the degree
that it involves another person. It does not necessarily require an
understanding beyond that of others as agents of action. In contrast,
other motives of communication may be more about the partner,
for example, to cooperate with or help him. Such motives
are prosocial, because they involve behaviors for the other or for partners
together. Grice (1969), for example, argued that human communication
rests on a cooperative principle. Prosocial motives are not so much
centered on the sender’s perspective alone and are more complex than
a self-centered motive, because they involve consideration of others’
situations too. As I show, 12-month-olds point with prosocial motives,
not only egocentrically.
From a motivational point of view, Grice’s recursive element of a
communicative intention could also be seen as a special motive,1 a kind
of “metamotive,” which emphasizes authorship. For example, a sender
may not only want a recipient to believe what he or she said. In addition,
the sender may also want to emphasize that it was him or her—not
somebody else or the recipient—who made the recipient believe it.
Sperber and Wilson (1995) describe communication also without such
recursive embedding, based on informative intentions. For example, one
may slam the door with the informative intention that the other knows
one entered, without wanting to let the other know that one had this
intention toward his or her understanding. Such a type of communication,
based only on informative intention, is manifest in individual actions,
which are nonostensive and only covertly communicative. Pointing,
instead, clearly is an ostensive and overt communicative behavior and
it may be that ostensive communication is a kind of default (because
infants already communicate in this way). It is possible that a full
understanding of the recursive structure of human communication
arises only later, when children begin engaging in concealment or
deception and communicate covertly and nonostensively by actively
hiding their communicative intent. On the other hand, it is currently
not known whether young infants may also actively emphasize their
authorship as a distinct motive of their communicative act. For example,
with slightly older, verbal children one study (Shwe and Markman 1997)
has shown that 2.5-year-olds emphasize their authorship by repeating
their request when an adult misunderstood their message but gave
them what they had requested anyway. And so it is possible that infant
communicative pointing involves Gricean communicative intent, even
if emphasizing authorship might not always be a strong motive of young
infants’ communicative acts.
Review of Infant Pointing
Despite the recognition of its significance in developmental research,
surprisingly little coherent evidence has been provided on precisely
what communicative and social–cognitive abilities are involved in
infant pointing when it has just emerged. Many now-classic studies
did not experimentally address or systematically test the interpretations
put forward. And so, the lack of experimental evidence has given rise
to opposing claims.
First, in an observational study, Bates et al. (1975) proposed two different
types of infant pointing that emerged together and revealed an
understanding of causality and tool use. Imperative pointing was held to
be an act of object retrieval, using the adult as a tool to obtain an object.
Declarative pointing was interpreted as an act of directing attention to
the presence of an object, without wanting to obtain it. They suggested
that such declarative pointing was ultimately imperative, to use the
object as a tool to obtain adult attention.
Subsequently, on a “rich” interpretation, some researchers have proposed
that pointing at its time of emergence already involves human
communicative and social–cognitive abilities (e.g., Bretherton et al.
1981; Carpenter et al. 1998; Leung and Rheingold 1981; Tomasello
1999). It has been emphasized as part of a cluster of other joint attention
behaviors such as gaze and point following, social referencing, giving,
showing and imitating, which emerge in close proximity around one
year and are claimed to involve an understanding of others’ intentions
(Carpenter et al. 1998). The rich view sees pointing as an act of directing
others’ attention, which presupposes a social–cognitive understanding
of attentional states.
Other researchers have put forward a “lean” interpretation and claimed
that pointing and other joint attention behaviors initially do not involve
an understanding of other people’s psychological states but instead lead
to such understanding later in development (Carpendale and Lewis
2004; Gomez et al. 1994; Moore and Corkum 1994). On that view, joint
attention behaviors emerge largely independently of each other through
reinforcement of objects or others’ emotional responses. Thus, infants
would point for reward, like obtaining objects or adult attention to the
self, without an understanding of others’ intentionality.
Infant imperative pointing, on a lean account, simply involves the
communicative goal of directing behavior, without understanding of
psychological states. Following the framework, it would be a simple,
ritualized act. The view is indirectly supported by comparative data
because children with autism and apes in captivity point imperatively
but presumably lack an understanding of others’ psychological states
(Baron-Cohen 1989; Tomasello and Camaioni 1997). Franco and
Butterworth (1996), however, showed that pointing and reaching serve
different functions in ontogeny, which may cast doubt on the claim that pointing
is ritualized from reaching. And typically developing humans request
with an understanding of others’ psychological agency, so it is at least
theoretically possible that 12-month-olds point imperatively in this
way too, with the communicative goal of directing others’ attention,
not only their behavior. Infants’ social–cognitive abilities in imperative
pointing have not been tested experimentally, and we simply do not
know (for older children see Marcos and Bernicot 1994; O’Neill 1996).
Motivationally, however, a lean interpretation for imperative pointing
might be appropriate if it is mainly motivated for one’s own benefit, to
obtain things, rather than to cooperatively engage in joint activities.
Infant declarative pointing, on a lean account, has been argued to
become intentionally communicative only later in development, around
15 months (Desrochers et al. 1995), involve the goal of directing others’
attention only at 24 months (Moore and D’Entremont 2001), or be
motivated rather egocentrically, to obtain attention to the self. Recently,
Moore and D’Entremont (2001) claimed to have found evidence that
infants do not point declaratively to direct others’ attention and instead
only request adult attention to the self. In their experiments a caregiver
attended to an event or the infant’s face, or to one of two events in
the infant’s view, before the infant pointed. They found that young
infants pointed regardless of a person’s focus of attention. However, in
following the framework here, Moore and D’Entremont’s interpretation
is not convincing because infants point with a communicative goal for
some motive. Therefore, if the communicative goal is achieved (e.g., the
other is attending) it is still plausible to point if the motive has not
been satisfied. For example, if one intends to express and share interest
(Tomasello and Camaioni 1997), it is entirely plausible to point at
something the other person already is looking at. This is presumably what
happened in Moore and D’Entremont’s experiments. This interpretation
is corroborated by Brooks and Meltzoff’s (2002) findings that in a gaze-
following paradigm, infants first followed the adult’s attention and then
sometimes pointed at what the adult was already attending to. Even in
linguistic declaratives sender and recipient are often already mutually
well aware of a proposition uttered (see Bates et al. 1975). And so,
one question is whether infants point with the communicative goal of
directing others’ attention. In addition, from a motivational perspective,
it has remained somewhat unclear what infants want when they point
out things or events in the world. This chapter therefore reports recent
experimental results from our lab, which investigated in detail infants’
communicative goals and motives of pointing.
A somewhat intermediate position assumes that imperative pointing
can be interpreted on a lean account, and declarative pointing on a
rich account (Camaioni 1993; see also Brinck 2004). Developmentally,
Camaioni (1993) hypothesized imperative pointing to emerge shortly
before declarative pointing. Recently, Camaioni et al. (2004) reported
that infants who had just begun pointing a week earlier (by parental report)
pointed more frequently in an imperative than declarative context (in
the lab). In addition, the frequency of declarative but not imperative
pointing was developmentally associated with understanding others’
intentions. However, the difference measured was in the frequency of
pointing, not the number of infants, and the effect remained at 18 months.
Therefore it is possible that the differences were caused by motivational
rather than cognitive factors, with infants being more prone to request
proximal interesting toys than point at distal mobiles. Currently, there
is no direct empirical evidence showing that a cognitive ability present
in one type of pointing is absent in another. If there were such evidence,
we would need to know about the transition of imperative pointing
without understanding of others’ psychological states to imperative
pointing with such an understanding. Further, in a longitudinal study,
Carpenter et al. (1998) found the reverse developmental pattern with
declarative gestures emerging before imperative gestures.2
Different theoretical positions and the lack of unequivocal empirical
evidence have given rise to different interpretations of infant pointing
when it has just emerged. The next two sections address infant pointing
with regard to the three criteria of the framework. Positive experimental
evidence of three of our own studies is presented, which shows that
infant pointing at 12 months is intentionally communicative, involves
the communicative goal of directing others’ attention and reveals
prosocial motives like sharing and helping.

Pointing at 12 Months: Intentional Communication


On the basis of gaze alternation with a recipient, Bates et al. (1975)
attributed communicative intent to infants’ pointing. Gaze alternation
has been used as a criterion for intentional communication in subsequent
research. But Bates et al. (1975) also reported pointing
without gaze alternation. They interpreted this type of pointing as
noncommunicative and as a precursor to communicative pointing. One
question is whether pointing, when it first emerges, already involves
communicative intent.
Gaze alternation may not be the best criterion to assess communicative
intent. Infants may be just checking on the other person’s behavior.
At the same time, the absence of gaze alternation cannot positively
reveal absence of communicative intent. Infants may simply assume
adults’ understanding, or rely on auditory instead of visual information.
Other criteria for pointing as intentional communication are whether
it is done for somebody and whether persistence or flexibility in signal
use occurs when the recipient does not react appropriately. There is
little doubt that this is the case for imperative pointing (infants repeat
requests and often increase their vocalizations or cry when they do
not obtain the desired object). Our studies presented below show that
there is also persistence in 12-month-olds’ declarative pointing. For
example, when an adult ignores infants’ declarative pointing, infants
repeat it within a trial and are overall less likely to point on further
occasions than when the adult shares attention and interest with the
infant. Thus, both imperative and declarative pointing at 12 months
involve the intent to communicate.
However, infants also point for self. Recently, Delgado et al. (1999)
observed that infants as young as 12 months point even when left
alone in a room. But all these infants already pointed communicatively
(mean age = 18 months), which casts doubt that pointing for self is
only a precursor to communicative pointing. Further, Bruner (1983)
reported an infant’s point for self in the absence of a visible referent
and interpreted it as the infant locating “in his ‘present’ space an object
recalled from memory” (p. 76). And DeLoache et al. (1985) observed
infants using pointing sometimes as a form of an early mnemonic
strategy. Thus, it is possible that pointing also serves the function of
directing one’s own attention. Therefore, I would suggest that pointing
for self coexists with rather than leads into communicative pointing.

Pointing at 12 Months: Communicative Goals and Motives
I have proposed that studies on communicative behavior should
consider both the communicative goal and its motives. This section
reports our studies on how infants intentionally communicate when
pointing (what their communicative goal is) and why (what their motives
are). A “rich” account would predict that infants should distinguish
conditions in which they successfully direct a recipient’s attention
to their referent from those in which the recipient’s attention is not
directed, or when he refers to something else. Moreover, following the
framework, infants’ motives to communicate may go beyond the mere
use of others’ behavior for their own immediate benefits. Understanding
others as persons with psychological states provides a basis for prosocial
motives like, for example, sharing interest with or helpfully providing
information for others. The next three studies provide first positive
experimental evidence for just such a view on infant pointing.

Twelve-month-olds Point to Share Attention and Interest


In a first study, we tested whether 12-month-olds point declaratively at
interesting events to direct others’ attention, and the potential motives
thereof (Liszkowski et al. 2004). Seventy-five infants participated in the
final sample. We elicited pointing with interesting events, like puppets
appearing or lights flashing behind a large screen at a distance. On each
of ten trials, if the infant pointed, a female experimenter (E) reacted
consistently in one of four ways across four experimental groups. Four
hypotheses about what infants want when they point were tested. First,
on the hypothesis that infants pointed for themselves (see above), E
neither attended to the infant nor to the event (Ignore condition). Second,
on Moore and D’Entremont’s (2001) hypothesis that infants do not want
to direct attention and just want to obtain attention to themselves,
E never looked at the event and instead attended to the infant’s face
and emoted positively to it (Face condition). Third, on the hypothesis
that infants just wanted to direct attention and nothing else, E only
attended to the events (Event condition). Fourth, on our hypothesis that
infants want to share attention and interest, E responded to an infant’s
point by alternating gaze between the event and the infant, emoting
positively about it (Joint Attention condition).
The overall finding was that infants point to share attention and
interest. First, infants preferred the Joint Attention condition and pointed
on significantly more trials in that condition compared with each of
the other three conditions, see Fig. 5.1. Second, when E emoted positively,
infants repeated their pointing within a trial to the same referent
significantly more often when E attended only to them (Face) than
when she also attended to their referent (Joint Attention), see Fig. 5.2.
In other words, when E did not attend to what infants attended to,
infants repeated their pointing in an apparent attempt to direct E’s
attention. Third, when E only attended to the event and did not look
and comment back to the infant (Event condition), infants also repeated
their pointing within a trial and, in addition, looked significantly more
to E than in any other condition, see Figs. 5.2 and 5.3. Just directing
attention without sharing was not satisfactory either. Instead, infants
expected some sort of a comment from E, presumably indicative of
sharing the interest in the event that both attended to.
Results neither supported Moore and D’Entremont’s (2001) hypothesis
that infants only want attention to the self nor a hypothesis that they
only want to direct attention. Instead they showed that infants point to
direct others’ attention to share interest. This type of pointing involves
two components: (1) directing a person’s attention to an event (the
communicative goal) and (2) receiving a comment indicative of sharing
the event (the motive); neither of these alone is sufficient.

Figure 5.1. Pointing across trials: Mean proportion of trials in which infants
pointed at least once (and so E reacted at least once).

Figure 5.2. Point repetitions within trials: Mean number of points per trial with
at least one point.

Figure 5.3. Looking behavior across trials with a point: Mean number of looks
to E per trial with a point.

In a follow-up study (Liszkowski et al. n.d.), we investigated the two
components, directing attention and receiving a comment, in more
detail. Following the design of the previous study, we were interested
in which of E’s reactions would satisfy infants’ desire to share attention
and interest. Therefore, we systematically violated infants’ expectations
of E’s attention and his comment, such that E either did not share the
infant’s attention (i.e., he misunderstood what the infant pointed at)
or he did not share the infant’s interest (i.e., he commented neutrally
[“uninterested”] about the referent). We controlled the referent of E’s
attention, and his attitude about it as expressed in his comment (see
Table 5.1 for a summary). First, we wanted to know whether infants
would be satisfied when E simply oriented behaviorally in the general
direction of the referent without actually referring to it (a barrier
obstructed his line of sight to infants’ referent). Second, we wanted
to know whether the adult needed to share the interest and emote
positively, or whether a neutral comment would suffice. Eighty infants
participated in the final sample. In a Joint Attention condition E attended
to the infant’s referent and emoted positively about it (but never named
it). In a Misunderstanding condition, E reacted in the same way except
that a barrier obstructed his line of sight to the infant’s referent and
E mistakenly referred to an insignificant piece of paper attached to
the barrier (see Fig. 5.4). In the Uninterested condition, there was no
barrier and E reacted as in Joint Attention, except that he commented
neutrally about the referent, stating his disinterest in it. The No Sharing
condition involved again the same barriers as in the Misunderstanding
condition, and E reacted neutrally, as in the Uninterested condition, to
an incorrect referent.

Table 5.1. Design of Study 2.

Reference        Attitude
                 Positive interest    Neutral disinterest
Shared           Joint Attention      Uninterested
Misunderstood    Misunderstanding     No Sharing

We used the same measures as in Study 1. Table 5.2 summarizes the main
results. First, as in Study 1, infants preferred the Joint Attention condition,
pointing on significantly more trials in that condition than in each of
the other three. Infants also looked to E significantly more often in all
conditions other than Joint Attention, presumably because they were a
bit puzzled about those reactions. Second, when E emoted positively but
did not refer to infants’ referent (Misunderstanding condition), infants
were not satisfied and repeated pointing within a trial significantly more
often than when E attended to their referent. The point repetitions in
the Misunderstanding condition also differed qualitatively from those
in Joint Attention and were accompanied by significantly more looks
to E and more vocalization during the second point. Third, when E
attended to the referent but the comment was not the preferred one
(Uninterested condition), infants did not repeat their pointing within a
trial and pointed overall on significantly fewer trials.

Figure 5.4. Schematic setup of Study 2 with barriers. Background: cloth sheet
with window openings to protrude 5 puppets. In front: barriers that obstruct E’s
line of sight to infant’s referent, and three stimuli that are activated electronically.

Table 5.2. Summary of main results.

                    Mean proportion    Mean proportion      Mean number of
                    of trials with     of trials with       looks to E in trials
                    a point            point repetitions    with a point

Joint attention     +                  -                    -
Misunderstanding    -                  +                    +
Uninterested        -                  -                    +
No sharing          -                  -                    +

Note: '+' indicates statistically higher means than '-'.

In line with the first study, these results show, first, that infants point
referentially to direct others’ attention to interesting events. When a
recipient does not refer to their indicated referent, infants attempt to
redirect his attention by repeated pointing, accompanied by increased
vocalization and looks to the recipient. Importantly then, it was not
enough when E emoted positively and oriented behaviorally in the
general direction of an object. Instead, infants wanted E to attend and
refer specifically to what they point at. Thus, infants’ communicative
goal is not simply to direct a person’s behavior, as if to request only a
behavioral reaction. Second, results show that in this context infants
prefer a positive over a neutral comment about a jointly attended
event. When a recipient was not interested, infants pointed for him on
fewer trials. This shows that infants were not satisfied with any type of
comment and so did not simply try to obtain any kind of information.
Third, in contrast to the first study, in which infants repeated pointing
within a trial when E did not provide any comment (Event condition),
infants in the Uninterested condition of the current study did not repeat
pointing within a trial when the comment was not the preferred one.
Thus, infants’ motive was not simply to request a positive comment,
as if imperatively requesting an object (then they should have repeated
pointing). Instead, infants’ pointing may be an offer to mutually engage
about an event and positively share interest in it with an interested
partner—and when the partner is not interested, offers cease.
Pointing to share attention and interest reveals infants’ active role in
informational exchange and sensitivity to the social context in which
pointing is embedded. The two studies show that infant pointing is a
joint communicative act, to comment and point out something for the
other and share attitudes about it with the other. Moreover, they reveal
protoconversational structures like “repairs” in turn taking (see Schegloff
this volume), which may be interpreted as helping a cooperative partner
to understand a message. Pointing to share attention and interest is
not only motivated egocentrically to obtain something for self. Instead,
sharing is a motive of aligning self and other in some way. Infants align
with others’ facial gestures from birth (Meltzoff and Moore 1977) and
with others’ visual experience around 12 months (Brooks and Meltzoff
2002). Pointing to share reflects infants’ motive of aligning by actively
offering their own experience to a recipient, so that he hopefully aligns
with them.

Twelve-month-olds Point to Provide Information for Others


In a third study (Liszkowski et al. 2006), we further explored infants’
communicative goal of directing attention, and a prosocial motive
thereof. We investigated whether infants point for others only in
response to externally exciting events, or also in response to a person’s
psychological relation to an otherwise uninteresting object. Specifically,
we were interested in whether infants would also point to help others
by providing information about the location of an object that was
searched for. Adults frequently point at things others are looking for,
without direct interest in the objects, simply to help others. However,
the presence of a helpful motive in infants this young has not previously
been investigated.
In the main experiment, E demonstrated on each of twelve trials an
action to infants that always involved one of two objects (the target).
Then, both objects (target and distractor) disappeared out of E’s but
not infants’ view in the same way. For example, they were dropped
accidentally, displaced on a shelf behind E, or used up (e.g., water)
while replica objects (another water bottle) remained visible to the
infants. After the disappearance, E attempted to repeat the action and
began searching for the target object. In three phases she first looked
herself, then emphasized her search with an unspecific verbal cue
(“where is it?”), and then explicitly asked the infant using the object
label (“[Name], where is the [X]?”). Sixteen 12-month-olds and sixteen
18-month-olds participated in the final sample.
Strikingly, 12-month-olds pointed in such a situation too (see Fig.
5.5), even when potentially interesting sounds or movements of the
referent were absent, or when there was no displacement at all (e.g.,
when the objects were used up). Thus, infants’ communicative goal of
directing an adult’s attention was not driven only by externally exciting
events but was instead a function of the adult’s psychological relation to
the referent, that is, her aim to find it. Indeed, infants pointed significantly
more to the target E was looking for than to a distractor simultaneously
displaced in the same way (see Fig. 5.6). Thus, infants did not generally
point at things that disappeared, as if to share their interest in the
disappearance. Instead, they selectively provided information for E about
the location of the relevant object she was looking for. Importantly,
requestive accompaniments of pointing like reaching or whining or
repeated pointing after E had retrieved the object for herself were rare.
Thus, infants did not request the object for themselves or attempt
to obtain it. Further, E’s actions with the objects were not particularly
interesting per se, without specific effects (e.g., stapling papers), and
infants were never involved in them (they simply watched E). Finally,
infants pointed mostly before E verbally asked about the object, so that
it was not simply in response to a naming game.

Figure 5.5. Still-frame showing a 12-month-old pointing for the experimenter
to one of two objects on the shelves behind her.

This is the first study to show that infants point simply to provide
information for others. Informative pointing is not so much about a
sender’s as about a recipient’s relation to an object. It is prosocially
motivated by helping a partner communicatively in finding what she
is looking for. Previous studies have not shown a prosocial motive of
helping in children this young, and it may be that humans are especially
predisposed to help others simply by communicating, without great
physical effort. The early emergence of a motive for helping by informing
may be seen as part of human pedagogy (see Gergely and Csibra this
volume) and as an ontogenetic precursor to uniquely human forms of
instructing and teaching.

Figure 5.6. Mean percent of trials with the first point to target or distractor.

Toward a New View on Infant Pointing


Taken together, the three studies did not support lean interpretations of
early infant pointing. New experimental evidence has been presented
for infants’ communicative goal of directing others’ attentional states,
not only their bodily behavior. Findings show that infants point for
various motives. Bates et al.’s (1975) dichotomy of pointing may
have been too narrow and the motive of declarative pointing may have been
underspecified, both empirically and theoretically. The studies presented
here emphasize that infant communicative pointing involves prosocial
motives. Results clarify that the motive of infant “declarative” pointing
is to share interest, that is, align the interest about a referent with a
partner. Moreover, first evidence for another, new motive of infant
pointing has been shown: to helpfully provide information for others. It is
possible that other motives may underlie infant pointing, for example
an interrogative motive, to find out about people’s attitudes or discern
ambiguous situations. Here, positive evidence has been presented for
infant pointing as a full human communicative act with underlying
prosocial motives, even before language has emerged. Like speech acts of
reference (Searle 1969), infant pointing too involves both reference and
attitude about the referents. The next section discusses the implications
of these new results for our understanding of infants’ social–cognitive
abilities.

Pointing at 12 months: Social-Cognitive Abilities

As laid out, human communicative pointing involves two main
components: understanding an indication as being about a referent,
that is, understanding reference, and understanding the interlocutors’
relations toward each other and a referent, that is, understanding people’s
attitudes about referents. First, to understand the referential character
of pointing, one has to understand that people have attentional states
toward the environment that can be followed and directed. Second, to
comprehend why something has been indicated, one has to understand
the interlocutors’ relations toward each other and a referent, that is,
their attitudes.

Pointing and Understanding Others’ Attention


The communicative goal of directing others’ attention presupposes
some type of understanding of others having attentional states. Results
presented here show that when infants direct others’ attention by
pointing, they understand whether their partner does or does not attend
to their referent. Attention has external behavioral manifestations:
for example, eyes, head, and body orientation, and often also facial
expressions, indicate a person’s attentional state. However, findings do not support
the view that infants understand only these behavioral manifestations
without understanding the recipient’s attention to a referent. Results
showed that when E oriented behaviorally to the correct direction of
infants’ referent, infants were not satisfied when E did not also attend
to the referent. Further, a lean interpretation that infants only want
to direct the other’s bodily orientation would be unsatisfactory: which
motive should underlie such a request for “object-directed orienting
behavior” in a declarative or informative pointing context? Instead,
infants understand the recipient’s relation to a referent as attentional,
as orienting to process environmental information. Thus, infants refer
to things for the other so that she attends to (i.e., “mentally processes”)
these.
Infants accumulate first-person experience in gaze following over the
first year and from around 10 to 12 months choose to look specifically
at what others attend to (Brooks and Meltzoff 2002; Flom et al. 2004).
Roles are reversed when infants point to direct the adult’s attention to
what they attend to. Just as their own attention can be aligned with
that of others to an object, infants understand that others’ attention
can be aligned with their own. Whereas studies on early gaze following
are sometimes explained by sophisticated behavioral cue reading, the
present findings of 12-month-olds’ goal of actively directing others’
attention (and their motives) reveal an understanding of attention that
clearly goes beyond surface reading of behavior.
Directing and understanding others’ attention is rooted in ongoing
interactions. But do infants understand that others’ attentional states
also go beyond the immediate present? Do they understand that
referents of attention are retained as information, and how good are
they at remembering a person’s information state? In particular, it may
be difficult to understand mismatches between current and prior, or
others’ and one’s own information states. Informative pointing involves
an understanding that the searcher is looking for something currently
not present to him but to the informer. Thus, informative pointing in
the second year of infancy may suggest that one-year-olds understand
others as people with information states who may sometimes lack
information. Whereas such understanding has usually been held to
emerge only after infancy, the current interpretation is corroborated by
the recent finding that 12-month-olds discern what is new for someone
else (Tomasello and Haberl 2003). Our own work in progress on infants’
pointing indeed supports the view that infants understand that a person
may need some but not other information, as evident in their selective
providing of new and relevant information for that person.

Pointing and Understanding Others’ Attitudes


Attitudes, that is, the psychological relations between interlocutors
and a referent, are made overtly manifest in communicating about the
environment. Pointing to share attention and interest is about
interlocutors’ interest in a referent. By sharing interest the infant expresses
his or her own relation toward a referent, undifferentiated as it may
be, and experiences others’ attitudes about it. Studies on infant social
referencing have shown that 12-month-olds link a person’s comment
selectively to his or her referent (Moses et al. 2001). Social referencing
is more about a referent, to discern its ambiguity, than it is about
sender and recipient. Recently, it has been suggested that infants
attach others’ comments only to the referent, as its valence, and do
not link them to the adults as their attitude about the object (Egyed et
al. 2004). However, if infants pointed to request valence information
about an object, one would expect no difference in pointing regardless
of whether the information about the object was positive or not. But
results showed that infants pointed on fewer trials when the adult
commented neutrally, although each presented object was new and
all occurred in different locations. Instead, what was constant across
situations was the adult’s attitude. More likely then, infants linked
the adult’s neutral comment to his attitude of disinterest in engaging
with the infant about the referent—presumably detecting the failure
to share and so accumulate common ground (see also Enfield this
volume). Importantly, infants differentiated between reasons for the failure
to share interest. They distinguished a recipient’s need for clarification
(Misunderstanding condition) from his disinterest in sharing a referent
(Uninterested condition), which supports the interpretation that infants
understand something about people’s attitudes.

Infant Pointing and the Development of Social Cognition


The three studies presented yield a new view on infant pointing that
reveals social–cognitive understanding of people’s attention and attitudes
in interaction, before language has emerged. This social–cognitive
understanding, together with prelinguistic communicative behaviors,
equips infants to subsequently acquire symbolic communication and
accumulate psychological knowledge.
The new findings of infants’ communicative and social–cognitive
abilities might contribute to reevaluating the supposed divide between
early forms of human preverbal communication like pointing, and
late-emerging social cognition such as “Theory of Mind” (ToM; see Breheny
in press). On the one hand, the new view on infant pointing suggests
more social–cognitive understanding than may have been assumed
previously. At the same time there has been overreliance on the false-belief
task as the watershed of social–cognitive understanding although
its validity has long been questioned (Bloom and German 2000). Along
these lines, for example, it would be difficult to argue that some deaf
adult Nicaraguan home signers, who lack specific linguistic structures
and presumably for that reason fail false-belief tasks (Pyers this volume),
do not have a ToM, given that they are socially competent adults who
participate in complex daily social life (e.g., economic activities) and
are not autistic members of a society meaningless to them.
From another point of view (e.g., Levinson, Sperber, and Tomasello in
this volume), it has been suggested that understanding others’ minds lies
at the very roots of human sociality and communication. As Tomasello
and Rakoczy (2003) put it, shared intentionality at around one year
is “the real thing” (p. 124) and developing an understanding of false
beliefs “the icing on the cake” (p. 122). Results presented here show that
infants have the social–cognitive understanding of, and the motivation
to engage in, social communicative interactions, which then enable the
development of linguistic communication and further social–cognitive
developments. The emergence of language boosts many cognitive
developments, and so too ToM (Astington this volume). Recently, Harris
(2005) has argued that the influence of language on ToM is mediated by
particular discourse practices like role taking and children’s role play.
In extension, based on the new view on infant pointing, discourse and
role taking as mediating factors in ToM development may be linked
to infant pointing as socially situated practice of early psychological
perspective taking.

The Ontogeny of Infant Pointing


Little is known about how exactly infants come to point. Broadly, two
learning mechanisms have been proposed to play a role: ritualization
and imitation. Ritualization may possibly account for young infants’
earliest forms of object requests, which are shaped by adults’ repeated
interventions to infants’ action attempts (e.g., reaching; Vygotsky
1978). Empirically, however, pointing and reaching are developmentally
dissociated, with a dramatic increase in frequency of pointing at around
12 months, although the frequency of reaching remains stable (Masataka
2003). Further, it is not clear how a ritualized, individualistic action
could transform into a flexibly used referential communicative act,
especially not for motives like offering as opposed to requesting.
Imitation alone, however, may not be sufficient to lead to pointing
either. First, adults usually request objects from infants with an open
hand, palm up, or with grasping movements, and only rarely model imperative
pointing to infants. However, infants seem to point imperatively
from early on. Second, imitation involves an understanding of the
actor’s intention to produce a goal in a specific way. But declarative
pointing is not an individualistic act on an external goal. Therefore,
it may be particularly demanding to imitate this communicative act
in the same way as simple object-directed actions, because the goal
of a communicative act is not object- but person-centered. Third, it
may be that infants simply copy adults’ arm movements, which in
combination with adults’ responses leads to ritualized communicative
pointing. However, findings showed that infants use pointing in a
flexible way with various motives from early on, which goes beyond
simple mimicking of an adult’s extended arm and index finger.
Two other accounts of the emergence of pointing identify behavioral
antecedents in infants’ earlier object-related activities. First, Butterworth
(1998) emphasized the role of object exploration with the “pincer grip”
and the index finger in the process of singling out objects in the
environment, and Masataka (2003) has suggested that selective index
finger extension in response to arousing stimuli at eight months is
related to exploration and self-regulation of attention. It might then
be that such behaviors lead to “distal” object exploration, to support
attention, in the form of extended arm and index finger. However,
such pointing needs to become communicative somehow, and, again,
results show that infants point communicatively from very early on,
so that it is not clear whether communicative pointing is preceded by
a stage of pointing for self.
Second, Werner and Kaplan (1963) proposed that activities toward
objects, for example “reaching-to-touch” or “turning-for-looking,”
are quasi-referential and “only a short step away from full-fledged
pointing” (p. 79), revealing an increasing intentionality of infants’ own
attentional processes. However, in addition to infants’ own attentional
processes, Werner and Kaplan (1963) emphasized the importance of a
social context motivated by sharing, claiming that “referential behavior
emerges through sharing of contemplated objects in an interpersonal
context” (p. 79). The social context of sharing may enable infants to
attend to things in a specific way, that is, not only as “things of action”
on which one can individually act, but instead as “objects of regard”
(“objects of contemplation”), which can be regarded jointly. Thus, in
answering how infants come to point (and apes maybe not) it may
be that it is this type of attentional stance toward the environment,
mediated through social interaction, which enables infants to conceive
of the environment independently of physically acting on it, and as a
source for mental processes that can be shared with others. This specific
type of interaction between infants’ attention behavior and the social
context in which it develops may be at the roots of reference and lead to
unique forms of human communicative pointing at twelve months.

Acknowledgments
I am thankful for interesting discussions with all contributors to this
volume, and for helpful comments by Nick Enfield and Steve Levinson.
Thanks for comments on an earlier version to Franklin Chang, Malinda
Carpenter, and Mike Tomasello.

Notes
1. A communicative intention entails not only S’s intent that: R understands
that (R think X), but instead S’s intent that: R understands that (S intends that:
[R think X]). The intention that “R think X” is embedded in the intention that
R understands S’s intention toward R’s understanding.
2. Carpenter et al. (1998) used the number of infants as a dependent
measure instead of the frequency of pointing. The pattern of their results still
holds when distal gestures only are analyzed (no “shows” and “gives”; M.
Carpenter personal communication, December, 2004).

References
Baron-Cohen, S. 1989. Perceptual role taking and protodeclarative
pointing in autism. British Journal of Developmental Psychology 7(2):113–127.
Bates, E., L. Camaioni, and V. Volterra. 1975. The acquisition of performatives
prior to speech. Merrill-Palmer Quarterly 21(3):205–226.
Bloom, P., and T. P. German. 2000. Two reasons to abandon the false
belief task as a test of theory of mind. Cognition 77(1):283–286.
Breheny, R. in press. Communication and folk psychology. Mind and
Language.
Bretherton, I., S. McNew, and M. Beeghly-Smith. 1981. Early person
knowledge as expressed in gestural and verbal communication: When
do infants acquire a “theory of mind”? In Infant social cognition:
Empirical and theoretical considerations, edited by M. E. Lamb and L.
R. Sherrod, 333–373. Hillsdale, NJ: Erlbaum.
Brinck, I. 2004. The pragmatics of imperative and declarative pointing.
Cognitive Science Quarterly 3(4):429–446.
Brooks, R., and A. N. Meltzoff. 2002. The importance of eyes: How
infants interpret adult looking behavior. Developmental Psychology
38(6):958–966.
Bruner, J. 1983. Child’s Talk. New York: W. W. Norton.
Butterworth, G. 1998. What is special about pointing in babies? In The
development of sensory, motor and cognitive capacities in early infancy:
From perception to cognition, edited by F. Simion and G. Butterworth,
171–190. Hove: Psychology Press.
Camaioni, L. 1993. The development of intentional communication:
A re-analysis. In New perspectives in early communicative development,
edited by J. Nadel and L. Camaioni, 82–96. London: Routledge.
Camaioni L., P. Perucchini, F. Bellagamba, and C. Colonnesi. 2004. The
role of declarative pointing in developing a theory of mind. Infancy
5(3):291–308.
Carpendale, J., and C. Lewis. 2004. Constructing an understanding of
mind: The development of children’s social understanding within
social interaction. Behavioral and Brain Sciences 27(1):96–97.
Carpenter, M., K. Nagell, and M. Tomasello. 1998. Social cognition, joint
attention, and communicative competence from 9 to 15 months of
age. Monographs of the Society for Research in Child Development, no.
255, 63(4):1–176.
Delgado, B., J. C. Gómez, and E. Sarriá. 1999. Non-communicative
pointing in preverbal children. Paper presented at the 9th European
Conference on Developmental Psychology, Spetses, Greece, September
4.
DeLoache, J. S., D. J. Cassidy, and A. L. Brown. 1985. Precursors of
mnemonic strategies in very young children’s memory. Child
Development 56(1):125–137.
Desrochers, S., P. Morissette, and M. Ricard. 1995. Two perspectives
on pointing in infancy. In Joint attention: Its origins and role in
development, edited by C. Moore and P. Dunham, 85–101. Hillsdale,
NJ: Erlbaum.
Egyed, K., I. Kiraly, and G. Gergely. 2004. Object-centered versus agent-
centered interpretations of attitude expressions. Paper presented at
the International Conference on Infant Studies, Chicago, May 6.
Flom, R., G. O. Deak, C. G. Phill, and A. D. Pick. 2004. Nine-month-olds’
shared visual attention as a function of gesture and object location.
Infant Behavior and Development 27(2):181–194.
Franco, F., and G. Butterworth. 1996. Pointing and social awareness:
Declaring and requesting in the second year. Journal of Child Language
23(2):307–336.
Gómez, J. C., E. Sarriá, and J. Tamarit. 1994. The comparative study of
early communication and theories of mind: Ontogeny, phylogeny,
and pathology. In Understanding other minds: Perspectives from autism,
edited by S. Baron-Cohen, H. Tager-Flusberg, and D. J. Cohen, 397–
426. Oxford: Oxford University Press.
Goodwin, C. 2003. Pointing as situated practice. In Pointing: Where
language, culture, and cognition meet, edited by S. Kita, 217–241.
Mahwah, NJ: Erlbaum.
Grice, H. P. 1969. Utterer’s meaning and intentions. Philosophical Review
66:147–177.
Harris, P. 2005. Conversation, pretence, and theory of mind. In Why
language matters for theory of mind, edited by J. W. Astington and J.
Baird, 70–83. Oxford: Oxford University Press.
Kendon, A. 2004. Gesture. Visible action as utterance. Cambridge:
Cambridge University Press.
Kita, S. (ed.). 2003. Pointing: Where language, culture, and cognition meet.
Mahwah, NJ: Erlbaum.
Leavens, D. A., J. L. Russell, and W. D. Hopkins. 2005. Intentionality as
measured in the persistence and elaboration of communication by
chimpanzees (Pan troglodytes). Child Development 76(1):291–306.
Leung, E. H., and H. L. Rheingold. 1981. Development of pointing as
a social gesture. Developmental Psychology 17(2):215–220.
Liszkowski, U., M. Carpenter, A. Henning, T. Striano, and M. Tomasello.
2004. Twelve-month-olds point to share attention and interest.
Developmental Science 7(3):297–307.
Liszkowski, U., M. Carpenter, T. Striano, and M. Tomasello. 2006.
Twelve- and 18-month-olds point to provide information. Journal
of Cognition and Development 7(2).
Liszkowski, U., M. Carpenter, and M. Tomasello. n.d. Infant pointing:
Reference and attitude. MS submitted for publication.
Marcos, H., and J. Bernicot. 1994. Addressee co-operation and request
reformulation in young children. Journal of Child Language 21(3):677–692.
Masataka, N. 2003. From index-finger extension to index-finger
pointing: Ontogenesis of pointing in preverbal infants. In Pointing: Where
language, culture, and cognition meet, edited by S. Kita, 69–84. Mahwah,
NJ: Erlbaum.
Meltzoff, A. N., and M. Moore. 1977. Imitation of facial and manual
gestures by human neonates. Science 198(4312):75–78.
Moore, C., and V. Corkum. 1994. Social understanding at the end of
the first year of life. Developmental Review 14(4):349–372.
Moore, C., and B. D’Entremont. 2001. Developmental changes in
pointing as a function of attentional focus. Journal of Cognition and
Development 2(2):109–129.
Moses, L. J., D. A. Baldwin, J. G. Rosicky, and G. Tidball. 2001. Evidence
for referential understanding in the emotions domain at twelve and
eighteen months. Child Development 72(3):718–735.
O’Neill, D. K. 1996. Two-year-old children’s sensitivity to a parent’s
knowledge state when making requests. Child Development 67(2):659–
677.
Perner, J. 1991. Understanding the representational mind. Cambridge, MA:
MIT Press.
Povinelli, D. J., J. M. Bering, and S. Giambrone. 2003. Chimpanzees’
“pointing”: Another error of the argument by analogy? In Pointing:
Where language, culture, and cognition meet, edited by S. Kita, 35–68.
Mahwah, NJ: Erlbaum.
Searle, J. R. 1969. Speech acts. Cambridge: Cambridge University Press.
Shwe, H. I., and E. M. Markman. 1997. Young children’s appreciation
of the mental impact of their communicative signals. Developmental
Psychology 33(4):630–636.
Sperber, D., and D. Wilson. 1995. Relevance: Communication and cognition.
2nd edition. Oxford: Blackwell.
Tomasello, M. 1999. The cultural origins of human cognition. Cambridge,
MA: Harvard University Press.
Tomasello, M., and L. Camaioni. 1997. A comparison of the gestural
communication of apes and human infants. Human Development
40(1):7–24.
Tomasello, M., and K. Haberl. 2003. Understanding attention: 12- and
18-month-olds know what’s new for other persons. Developmental
Psychology 39:906–912.
Tomasello, M., and H. Rakoczy. 2003. What makes human cognition
unique? From individual to shared to collective intentionality. Mind
and Language 18(2):121–147.
Vygotsky, L. 1978. Mind in society: The development of higher psychological
processes. Cambridge, MA: Harvard University Press.
Werner, H., and B. Kaplan. 1963. Symbol formation: An
organismic-developmental approach to language and the expression
of thought. New York: Wiley.
six

The Developmental Interdependence of Theory of Mind and Language

Janet Wilde Astington

My aim in this chapter is to provide a developmental view of the
relation between theory of mind (ToM) and language in early
childhood and—briefly—before this period during infancy and after it
during the school years. First, however, I address the issue of what ToM
is and how culture specific it might be. Then I consider its development
and the relationship with language for children in Western cultures.
Finally, I address a fundamental paradox at the heart of the matter—that
is, communicative exchanges in infancy and toddlerhood appear to
depend on a ToM that is not developed until later in the preschool
years.
Both ToM and language are complex multifaceted systems that
undergo enormous development throughout childhood, but particularly
during the second to fifth years of life. Importantly, the relation between
them changes as the systems mature, with developmental changes in
one promoting further advances in the other. Initially, intersubjectivity
developing during infancy facilitates the beginnings of language
acquisition. Subsequently, during the toddler and preschool years, the
language system comes to serve two different functions: interpersonal
social communication and intrapersonal mental representation. I argue
that it is because the same system is used for both these purposes that
metarepresentational abilities develop. Such abilities then facilitate the
more mature understanding and use of complex language that develop
during the school years.

Psychological Foundations

ToM

ToM may not be the best term to use but it has become widely accepted
and is probably not easily replaced now. We can get away with it if we
do not take it too literally. As is well known, Premack and Woodruff
(1978) introduced the term into the psychological literature when they
asked whether the chimpanzee has a ToM, which they defined as a
system that imputes mental states to make inferences about behavior.
Developmental psychologists picked up the term and quickly applied it
quite broadly, using it to characterize nine-month-olds’ communicative
abilities (Bretherton et al. 1981) as well as five-year-olds’ ability to metarepresent
belief (Wimmer and Perner 1983). However, the term was
soon claimed by a particular camp (Astington et al. 1988) that took
the “theory” notion seriously. In its most precise use, ToM is a domain-specific,
psychologically real structure, comprising an integrated set
of mental-state concepts employed to explain and predict people’s
actions and interactions, that is reorganized over time when faced
with counterevidence to its predictions (e.g., Gopnik and Wellman
1994). From this theoretical perspective, ToM development in children
is analogous to theory development in science (the so-called “theory–theory”
view). On this view, ToM is both a cognitive structure leading
to certain abilities as well as a theoretical perspective explaining the
development of these abilities. The term was also used by simulation
theorists, who disagreed with the theory–theory view, claiming that
mental-state concepts are not theoretical postulates but are derived
from experience (Goldman 1989; Gordon 1986; Harris 1992). And it
was used by modularity theorists, who argued that the theory is not
developed by a process of theorizing but is innate and matures (Baron-Cohen
1995; Fodor 1992; Leslie 1994).
Currently, however, ToM is often used with either much narrower
or much broader scope—all the way from designating false-belief
understanding in particular, to social understanding in a most general
sense. It is accepted (or at least used) even by those who do not endorse
the theory-theory, simulation-theory, or modularity-theory perspectives.
I will use it here as a broad term for a multifaceted system. On this
catholic view, ToM encompasses:
social perception in late infancy
mental-state awareness in toddlers and preschool children
metarepresentational ability in older preschool children
multiply embedded recursive and interpretive abilities in school-age
children.

Importantly, new abilities do not replace earlier ones during the course
of development but are added to them, so that the full set constitutes
ToM in late childhood and adulthood.

Cultural Diversity in ToM


I will focus on ToM development primarily in typically developing,
middle-class children in North America, Europe, and Australasia
(“Western” for want of a better word) because this is where most of the
data have so far been gathered. I do, however, acknowledge the need
for a wider view. ToM is acquired within a cultural context, in which
“mind” itself is a culturally constituted entity, not a natural kind, but
the influence of cultural context has not been widely investigated. Most
psychologists studying children’s ToM development assume that it is
a universal development, or at the very least, that there is a universal
core to ToM that is acquired in the early years (Harris 1990; Wellman
2002). But this conclusion may be premature. In any event, we need
more detail concerning what exactly is within the universal core and
what is the nature and extent of the cultural diversity built on it.
In addition to the Western data, there are also data from Japan and
China, although gathered at school or preschool from children in literate
urban societies (e.g., Gardner et al. 1988; Jin et al. 2002; Lee et al. 1999;
Naito et al. 1994). In general, Asian children’s ToM development as
reported in these works is quite similar to that described for Western
children, with some variations in timing and sometimes more weight
given to social rules than to individual mental states (Naito 2004).
There are very few data from unschooled, nonliterate populations.
Avis and Harris (1991) reported that Baka children of Cameroon were
successful on a task that assessed their understanding of another person’s
false belief, at about the same age that Western children pass such tasks
(i.e., 4 to 5 years). However, Danziger (this volume) found that Mopan
Maya children in Central America did not understand false belief until
later in childhood. Similarly, Vinden (1996, 1999), working with four
non-Western cultural groups—the Mofu of Cameroon, the Quechua of
Central Peru, and the Tolai and the Tainae of Papua New Guinea—found
that most did not understand false belief until later childhood, with
the Tainae continuing to have difficulty even in adolescence.
The fact that there are so few ToM studies from non-Western,
unschooled populations makes it difficult to draw conclusions about
cultural diversity in ToM. However, rather than more data along these
lines, we may need different data. The research that has been conducted
so far generally takes a “cross-cultural” stance (Vinden and Astington
2000). That is, Western tasks are adapted and used, albeit with great care
taken to ensure that valid translations are employed by native speakers,
and that the objects shown to the children are culturally appropriate.
However, the tasks themselves were developed in a cultural setting
in which we explain behavior by ascribing mental states to the self
and others. Perhaps Western ToM tasks, even testing itself, are not an
appropriate way to assess understanding in other cultures. But what
other ways might there be?
Conceivably, we might discover more by examining studies of language
development, particularly the social context in which children
acquire language, and the particular concepts that are lexicalized within
their language. Language is fundamental because language use is not
an individual process but a joint action among participants based on
common ground, which in broadest terms is a shared cultural background
(Clark 1996). Indeed, Nelson (2005) proposes replacing the whole idea
of “theory of mind” as an individual cognitive achievement, with the
idea that children enter into a “community of minds” in which people
exchange ideas, plans, memories, and so on. Language is crucial for
entry into this community because without language children could not
share in others’ mental life and would remain isolated within their own
perceptually based knowledge. Language acquisition enables children
to participate in the culture and to share the creation of meaning that
is the basis of culture. Children’s own cognitive resources for language
acquisition are crucial, but it is by virtue of acquiring language within
Western culture that children acquire our ToM. In Western society we
take a mentalistic stance to ourselves and other people, including infants
and children (Vinden and Astington 2000). We ascribe mental states
to them in the ways we interact with them and in the lexical terms
we use. Thus, in acquiring language, children acquire our ToM, which
is embedded in our speech practices. Other cultures may have quite
different conceptions of mind, or the concept of mind may not exist in
every culture. That is to say, there could be ways of interpreting social
behavior that do not necessarily rely on ToM (e.g., by relying more on
social roles and rules). This remains an open question, to be kept in
mind as I describe the relationship of ToM and language in Western
culture.

ToM and Language


For both language and ToM, consideration of representation and communication
is fundamental. Language allows people to communicate
their representation of reality, that is, their point of view. ToM involves
an understanding of representation and communication, that is, an
understanding that people have different points of view, that underlie
all of their communicative interactions and that they sometimes may
directly communicate to others. Thus, ToM entails an awareness of
mental states (e.g., attention, perception, belief, knowledge, desire,
intention, and emotion) and the ability to use one’s awareness of
this network of mental states to explain, predict, and interpret the
behavior of self and others—and not just physical actions, but also
speech acts. That is, ToM underlies the ability to interpret human action
and communication.
Language is a similarly complex construct. It is a multifaceted system
that has two basic functions: communication and representation. Many
species represent and communicate, but only humans use one and
the same system for both representing and communicating. Human
language is used as an intraindividual representational system,1 on
the one hand, and as an interindividual communication system, on
the other hand. Thus, children’s language competence includes their
semantic and syntactic representations as well as their pragmatic
ability to express and interpret intended meanings in communicative
exchanges. Further, in considering language in relation to ToM, one
needs to distinguish between the individual and the social: that is,
children’s own linguistic abilities (semantics, syntax, and pragmatics)
and the linguistic environment, which comprises the communicative
exchanges in which children are involved as participants or bystanders.
Obviously these two—individual competence and social context—are
related to one another, but they may relate to ToM in different ways.
The social context will affect children’s own linguistic abilities and,
indeed, their linguistic ability may affect their environment, in terms
of the kinds of communications they receive. Nonetheless, one can
consider the effects of the linguistic environment while controlling for
individual differences in children’s own linguistic competence.

Developmental Relations between ToM and Language


The relationship between ToM and language is quite complicated because
both are complex multifaceted systems and because their development
is intertwined. Development of both begins in infancy and they are
closely connected from the start.
Infant Social Perception
Infants are tuned into people right from birth. They attend to human
faces and voices more than to nonhuman sights and sounds, and they
can soon discriminate the mother’s face and voice from those of others.
They can also imitate human facial movements from very early in life.
Around two months of age infants enter into dyadic interactions of
smiling and vocalizing with the mother or other adult. Later in the
first year, from eight or nine months on, these interactions become
triadic with the addition of an environmental focus, such as a toy or
other object. Over the next few months these interactions develop in
complexity as infants engage in shared reference and joint attention
episodes, following the adult’s head orientation, eye gaze, and pointing
gestures. During this period infants also respond to the adult’s emotional
reaction in a behavior known as social referencing; that is, in a situation
of uncertainty (e.g., a strange new object in the environment) they will
look to the mother’s face and respond in accord with her positive or
negative emotional expression. In addition, infants produce pointing
gestures themselves, to indicate objects of interest as well as to request
objects. ToM is grounded in and built on these early social behaviors,
and these behaviors are also associated with the beginnings of language
acquisition. In a longitudinal study of nine- to 15-month-old infants,
Carpenter et al. (1998) found high correlations of such behaviors with
measures of the infants’ communicative competence, for example, in
producing words and gestures and comprehending language.
There is, however, some debate over how soon infants can attribute
communicative intentions to others. Some researchers (e.g., Liszkowski
this volume; Tomasello et al. 2005) argue that from about nine months
onward infants recognize the other person as an intentional agent,
whose behavior—like their own—is governed by goals and perceptions,
and they can understand the other’s pointing gesture, for example, as
a communicative intention. However, other researchers (e.g., Moore
1998), although agreeing that it is possible and maybe intuitively natural
to make such an interpretation, argue that it ascribes more conceptual
understanding to the infant than is warranted at this stage.
In any case, toward the middle of the second year children are clearly
aware of others’ communicative intentions and use them as a guide
in word learning, for example. Baldwin (1993) showed that by 18
months of age children can use the direction of a speaker’s gaze to
infer the referent of a novel word. When hearing a new word in the
presence of two unfamiliar objects, they will associate the word with
the object the speaker is looking at, even in a condition in which their
own attention is directed to the other object. There is also evidence
that two-year-olds can take account of the listener’s state of knowledge
by appropriately distinguishing between new and given information
in their own communications, for example, when requesting help in
retrieving an object (O’Neill 2005). It is in this sense that older infants
and young toddlers have a ToM—they are sufficiently aware of the
adult’s mental states (attention, intention, knowledge, and ignorance)
that they can make use of them, for example, in learning words or
seeking assistance. It is important to note that this ability is not solely
utilitarian, motivated by their desire to obtain a goal. If the listener
misunderstands their request but they do, fortuitously, get what they
wanted, they still make an effort to repair the communication to achieve
listener understanding (Shwe and Markman 1997).
One could say that at this stage children’s understanding of mental
states is only implicit, as expressed in their behavior. Some researchers
argue that this is necessarily true because such young children have
limited verbal skills and cannot reveal their understanding in any other
way (Chandler et al. 1989; Meltzoff et al. 1999). I agree that these
children do already have some understanding of mental states. This
will, however, become more evident as their verbal skill increases.

Mental-state Awareness in Toddlers and Preschool Children


On my view, children’s understanding of mental states develops
substantially over the next few years as their verbal skills increase,
because language provides children with resources that promote or
permit ToM development. Children develop both language and ToM
as participants in the social world, where participation depends on
communication (Dunn and Brophy 2005; Nelson 2005). It is primarily
in communicative interaction that one learns about what is in another
person’s mind. “Mindreading” is indirect at best—one has to infer
what is in another person’s mind from behavior, gestures, emotional
expressions, and to a large extent, from what the person says. It is in
conversation that children acquire language and learn about people’s
minds—in talking with parents, siblings, and friends, and in listening
to the talk that goes on amongst these others. Furthermore, the efficacy
of conversation in developing social understanding is influenced by the
quality of the relationship between the communicative participants
(Dunn and Brophy 2005). For example, specific features of conversation
(e.g., causal talk) between mother and child at 33 months predicted false-belief
understanding at 40 months, when the conversation occurred in
a warm emotional environment but not when it occurred in a hostile
controlling situation (Dunn et al. 1991).
Tomasello (1999) argues that linguistic symbols are inherently perspectival
and that language use in discourse provides young children with
many opportunities to experience contrasting and shared perspectives,
for example, in disagreements or misunderstandings, with the ensuing
negotiation or clarification sequences. Similarly, Harris (1999) points
out that in conversational exchanges children are frequently exposed
to the fact that different people know different things. He argues
that the experience of exchanging information leads children to an
understanding of people as epistemic subjects, and an awareness
that there are different points of view on the same material world. A
longitudinal study in which mother and child were recorded talking
about a series of pictures provides compelling evidence of such a causal
relation between maternal discourse and ToM development (Ruffman
et al. 2002). Mothers’ use of mental-state terms predicted children’s
ToM understanding one year later, on a range of measures, even when
controlling for the children’s earlier ToM abilities as well as their own
earlier language ability, including their use of mental terms. Harris (2005)
points out that the mother’s use of mental terms may not, however,
be the most important factor in developing the child’s understanding
but, rather, it is an easily countable measure that is likely to covary
with her pragmatic intent to introduce varying points of view into
the conversation. On his view, this is the effective source of variation
in promoting ToM development. In support of this suggestion, Harris
cites two training studies (Hale and Tager-Flusberg 2003; Lohmann and
Tomasello 2003), both of which indicate that conversation, not using
mental terms, but nonetheless emphasizing different points of view,
is sufficient to generate an improvement in children’s performance on
ToM tasks.2
However, other researchers put more emphasis directly onto children’s
semantic development (Olson 1988; Peterson and Siegal 2000). On their
view, it is children’s acquisition of mental-state terms that mediates
their developing understanding of mind. By mediate I mean that the
child’s mental-state concepts are determined by the categories and the
relations among the categories lexicalized in the child’s natural language
(Nelson 1996). Language itself creates categories and distinctions that
do not, indeed cannot exist for us without language (Astington 1999).
I would not claim that children have no understanding of mental
states until they start to talk about them. I do, however, believe that
the ability to talk about them significantly influences and increases
their understanding. On this view, it is by virtue of being a linguistic
creature and growing up within a certain culture that children acquire
an understanding of mind (Astington 1996). This can be construed
as a Whorfian argument, although I do not intend it in the radically
relativist way often associated with Whorf (1956). My argument, like
that of Nelson (1996) and Tomasello (1999), is that as children acquire
language they learn to think in cultural ways. This is a Vygotskian
perspective that sees children internalizing their culture’s construal of
mind through linguistic social interaction.
On this view, ToM exists in the folk ways and speech practices of the
culture and the child would never come up with the theory except by
participating in dialogue with more knowledgeable members of the
culture. Furthermore, within this dialogue the mental-state lexicon
plays a vital role. Not surprisingly, there are strong relations between the
frequency and type of mental-state terms used in family conversations
and children’s use. Children whose parents use more mental-state terms
correspondingly talk sooner and more frequently about mental states
(e.g., Moore et al. 1994; Ruffman et al. 2002). Siblings are also influential:
four-year-olds with an older sibling both heard and produced more
cognitive terms than four-year-olds with a younger sibling (Jenkins
et al. 2003). In using mental terms as they comfort or explain things
to young children, parents will refer to the child’s desires, beliefs, and
emotions as desires, beliefs, and emotions (e.g., You want candy but
we don’t have any. You thought it was in the cupboard. Now you feel
sad.). Importantly, parents use the same linguistic terms to talk about
other family members (e.g., Molly wants candy too. I think grandma
will bring some when she comes. Then you’ll both feel happy.). That
is, children’s own experience is referred to using the same terms that
are applied to others, such that a child is able to map other people’s
experiences onto his or her own and so come to discern mental states
in self and other (Astington 1996).
There is, however, a great deal to debate here (see Baldwin and Saylor
2005; Montgomery 2005). How can children acquire the concepts from
the language? Some researchers assume that children have to work
out word-referent relations as they do in learning terms for objects,
but in this case mapping mental terms onto mental concepts. The
acquisition of terms for objects depends on the abstraction of perceptual
invariants from sensory experience. Beckwith (1991) proposes a theory
of “nominalist bootstrapping” in which the acquisition of an abstract
term is assisted by linking the term to appropriate external correlates,
that is, perceptible entities or events associated with the term’s use. Some
support for Beckwith’s view is found in studies of children’s acquisition
of the meaning of mental terms such as know, guess, lie, and promise,
which show that children first focus on tangible correlates of these verbs
(Astington 1988; Johnson and Wellman 1980; Wimmer et al. 1984).
However, Johnson (1982) argues that young children do not focus on
perceptible features, rather they focus on single salient aspects, which
often, but importantly not always, are perceptible features.
A particular problem for the view that children first focus on tangible
correlates in acquiring the meaning of mental terms is that although
beliefs, desires, intentions, and emotions may be revealed in facial
expressions, talk, and behavior, they are expressed in a great variety of
ways. There is no one-to-one relation between particular mental states
and specific patterns of behavior. Baldwin and Saylor (2005) make an
alternative proposal in suggesting that mental terms may serve as an aid
to analogical reasoning and inductive inference, extending Gentner’s
(e.g., Gentner and Ratterman 1991) ideas to the mentalistic domain. That
is, a mental term invites children to compare different behaviors that
otherwise they would not attempt to align, thus promoting inferences
about nonobvious commonalities across particular mental states, such
as belief, desire, attention, and intention.
However, other researchers argue against the view that the child’s
task is to match words and referents and assume that what children
are doing is learning how to use mental terms in appropriate contexts
(Montgomery 2005). Nelson (1996) points out that young children’s use
of mental terms does not, at least at first, indicate that they understand
the concepts to which those terms refer. She argues that meaning
must be extracted from use (Levy and Nelson 1994). At first, the words
children use are tightly tied to particular discourse contexts. Children
produce these words in specific contexts, modeled on adult usage, before
they have any real meaning for the terms, that is, before they have the
concept that the adult term refers to. This is what Nelson calls “use
before meaning” and she argues that meaning arises from use. Adults use
terms in contexts that are relevant to the child’s own interests, and the
child uses the same form in the same contexts. This early use is entirely
pragmatic, a sort of meaningless repetition, that nonetheless establishes
the term in the child’s productive vocabulary. This leads to the child’s
noticing the term when adults use it in other contexts and allows the
child to compare the ways in which self and others use the term. By this
stage, the child has competent productive use of the term but still may
fail comprehension tests designed to probe for an understanding of the
precise meaning of the term. For example, children do not make precise
distinctions between the terms know, guess, remember, and forget until
the early school years (Johnson 1982), although these terms are part
of their productive vocabulary in the preschool years. The final stage,
when the child has full understanding of the relational and contrastive
meanings of the term, is achieved when terms within a domain are
related to one another within a lexical-semantic system, independent
of actual contexts and situations (Nelson and Kessler Shaw 2002).
Thus, as different people express their own thoughts in conversation
children come to understand that people have different points of view,
that is, that they think and know, and like and want different things.
Children can acquire this understanding, even if the perspective-taking
discourse does not explicitly use mental-state terms (Harris 2005).
However, these terms are often used in conversation, so children also
acquire them from their participation in conversation. They use the
terms that they hear others use, at first without real understanding.
Following this pragmatic use, the terms acquire meaning for the child
and take their place within a coherent conceptual system, which
develops during the later preschool years and on into the school-age
years.

The Development of Metarepresentational Abilities


The crucial question now is how the child progresses from conversation
about mental states to the acquisition of a coherent conceptual system
of mental states used in the interpretation of human behavior. A most
important point is that although children acquire mental terms from
their participation in conversation, they are not passive recipients of
mentalistic concepts, but are actively involved in their construction.
Although the social environment is significant, conceptual development
is critically dependent on children’s own cognitive resources. The
challenge is to find the appropriate balance between internalist, individualistic
explanations and externalist, enculturation explanations of
ToM development (Astington and Olson 1995). Much progress has
been achieved in this area over the past 20 years, but there is still work
to be done.
On the one hand, from an internalist perspective, theorists have
described the innate foundation for ToM (Leslie 1994), the domain-general
cognitive resources (e.g., executive functions: Carlson and Moses
2001; Hughes 1998) that are called into play in acquiring a ToM, and the
sequential developments in children’s representational abilities (Perner
1991). On the other hand, although completely externalist views are rare
in this area (perhaps Rogoff et al. 1993 provide an example), there are a
number of social-constructivist viewpoints that recognize the important
role of social interaction in ToM development (e.g., Carpendale and
Lewis 2004; Fernyhough 1996; Garfield et al. 2001). However, although
researchers may focus on the individual child or the social interaction,
it would be useful to combine insights from both perspectives. This
might lead to a more detailed and precise account of the mechanism
whereby the child’s cognitive abilities, operating within social activities,
lead to an understanding of mind.
On my view, progress will be made if we focus on the two different
functions—communication and representation—that language
simultaneously serves. Language is first used pragmatically, in social interaction,
and becomes a tool for verbal representation (Nelson 1996). The
essence of Nelson’s argument that “use before meaning” leads to the
acquisition of “meaning from use” is that language that has been used
in social interaction becomes established as a device for individual representational
activity—an obviously Vygotskian perspective: “Language
thus takes on an intrapersonal function in addition to its interpersonal
use” (Vygotsky 1978:27).
Another important Vygotskian concept is the “zone of proximal
development.” This is the idea of readiness for experience—it is the
area in which the child is able to claim, as his or her own, concepts
that are expressed in social interaction. For a while, understanding rests
between child and adult, supported by the social situation, until the
child has independent operation of the concept. The zone of proximal
development is important with regard to ToM development. Western
culture’s folk psychology recognizes three basic aspects of the human
mind—cognition, motivation, and affect—that interact in behavior
regulation. In acquiring a ToM, children come to understand concepts
of cognition (perception, knowledge, and belief), motivation (goals,
desires, and intentions), and affect (emotional states, such as happy,
sad, angry, etc.), and they learn how to use these concepts in predicting,
explaining, and interpreting their own and other people’s actions.
However, despite the fact that infants and children are immersed in
talk about cognitions, motivations, and emotions from the start of
life, there is a regular sequence in which children themselves talk
about mental states and understand their relation to behavior. When
they are 18 months to two years old children start to use perception,
emotion, and desire terms (e.g., see, look, happy, sad, like, love, want),
whereas cognition terms (e.g., know, think, remember) are not produced
until some months later (Bartsch and Wellman 1995; Bretherton and
Beeghly 1982). In English, desire terms are used in simpler syntactic
constructions than belief terms, which might explain the sequencing,
but in languages where this is not the case, children still talk about
desire before they talk about belief. For example, in Chinese, desire
and belief terms are expressed in equally simple constructions, and in
German, in equally complex constructions; nonetheless, both Chinese
and German children talk about desire before belief, and succeed on
tasks assessing their understanding of desire before they succeed on
comparable tasks assessing understanding of belief (Perner et al. 2003;
Tardif and Wellman 2000).
Correspondingly, in a recent meta-analysis of studies conducted in
English, Wellman and Liu (2004) showed that on comparable tasks,
children correctly judge people’s desires and can predict actions based
on desires before they correctly judge people’s beliefs and predict actions
based on beliefs. Wellman and Liu (2004) also showed that true belief
is understood before false belief, and ignorance is understood before
false belief. Furthermore, they developed a scaled series of ToM tasks for
children aged three to six years and showed that there is a consistent
developmental progression in children’s understanding of different types
and aspects of mental states (diverse desires, diverse beliefs, knowledge
access, false belief, and emotion based on belief). The scale is proposed as
an instrument for capturing the developmental progression in children’s
acquisition of mentalistic concepts, which cannot be captured within
a single type of task, such as the false-belief test.
However, children’s false-belief understanding (i.e., their ability to
reason, within an experimental setting, about the behavioral consequences
of holding a false belief: Wimmer and Perner 1983) is the
aspect of ToM development that has been most extensively investigated
over the past 20 years. Why this focus? The fundamental tenet of ToM
is that people act to fulfill their desires in light of their beliefs; that
is to say, desire is as important as belief in determining human
action. Indeed, in our everyday social interaction, we are more aware
of and more explicitly consider people’s desires. It may be that our
own experience of desire is more directly tied to physiological states;
thus, desires are more salient than beliefs (Jolly 1988). It is also the case
that beliefs are more generally shared than desires, so that we more
often need to attend to motivational states in interpreting actions. As
Premack and Woodruff (1978:515) note: “It seems beyond question that
purpose or intention is the state we impute most widely.” However,
as commentaries on the article pointed out (Bennett 1978; Dennett
1978; Harman 1978), it is the attribution of false belief that provides an
unequivocal demonstration that an individual has a ToM, in the sense
Premack and Woodruff intended it, that is, the individual is actually
imputing mental states to another in predicting the other’s behavior.
Wimmer and Perner (1983) developed the false-belief task on the basis
of these commentaries. They emphasized that individuals who possess
a ToM have an ability for metarepresentation, that is, they not only
represent the situation but also represent their own and the other’s
different relationships to this situation.
Western children generally achieve success on standard false-belief
tasks around four years of age, with individual variation from three to
five years. The age of success can also be varied by changing the demand
characteristics of the task. However, despite a cottage industry of false-belief
task manipulations, the hypothesis that success is dependent on
an underlying conceptual advance still holds (Wellman et al. 2001).
Clarity is needed regarding the nature of this conceptual advance and
how it relates to the developmental progression in children’s acquisition
of other mentalistic concepts. 3
Over the last ten years a large number of studies have established
that successful performance on false-belief tasks is strongly related
to children’s own language ability. These studies have used a range
of tasks (false-belief prediction, false-belief explanation, deception,
appearance–reality, and emotion prediction); different measures of
language (general language ability, receptive vocabulary, grammatical
complexity, and narrative speech); with different populations (typically
developing children, deaf children, children with autism, and children
with specific language impairment); and with different languages in
addition to English. My colleagues and I have recently conducted a
meta-analysis of some of these data (Milligan et al. n.d.). The analysis
includes almost 9,000 typically developing children in 106 studies
and shows a strong relation between language ability and false-belief
understanding. The overall effect size weighted by sample size is .43. In
other words, approximately 18 percent of the variance in false-belief task
performance is accounted for by children’s language ability. Secondary
analyses investigating the effect of different types of language ability
showed no significant difference between children’s semantic and
syntactic knowledge, which respectively accounted for 23 percent and
29 percent of the variance in false-belief understanding.
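
The step from an effect size of .43 to roughly 18 percent of the variance is the square of a correlation. The following is a minimal sketch of that arithmetic, assuming the reported effect size can be treated as a correlation coefficient r. The function names are illustrative, and the correlations back-calculated for the semantic and syntactic measures are derived here from the reported percentages; they are not figures taken from the meta-analysis.

```python
import math

def variance_explained(r):
    """Proportion of variance accounted for by a correlation r (r squared)."""
    return r ** 2

def implied_correlation(proportion):
    """Correlation magnitude implied by a proportion of variance explained."""
    return math.sqrt(proportion)

# Overall effect size reported in the text, treated here as a correlation.
overall_r = 0.43
print(round(variance_explained(overall_r) * 100, 1))  # 18.5 (about 18 percent)

# Back-calculated values (an assumption of this sketch, not reported figures).
print(round(implied_correlation(0.23), 2))  # semantic knowledge: 0.48
print(round(implied_correlation(0.29), 2))  # syntactic knowledge: 0.54
```

On this reading, the semantic and syntactic measures each relate to false-belief performance at a broadly similar strength, which is consistent with the report that they do not differ significantly.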
Most false-belief tasks are verbal tasks, leading some researchers to
argue that language ability limits task performance, which they say
provides sufficient explanation for the correlation between language
and false belief. However, many researchers make the stronger argument
that language is causally related to the development of false-belief
understanding. The fact that both semantics and syntax are
equally strongly related to false-belief understanding in the meta-analysis
(Milligan et al. n.d.) supports the argument I am making that
communication and verbal representation are both of importance. That
is, the semantic ability is indicative of the amount of conversational
experience children have been exposed to (Huttenlocher et al. 1991),
whereas syntactic ability shows their own facility with linguistic
structures. Needless to say, it is the same language system that is used
in communication and verbal representation so that children’s semantic
and syntactic abilities are related. But the perspectival understanding
that is acquired from conversation (Harris 2005; Tomasello 1999) is
internalized in a way that facilitates false-belief task performance.
In conversation, different knowledge states can be expressed by
different participants, for example: “The chocolate is in the cupboard.”
“No, it’s in the drawer.” If one of these sentences is true, then the other
is false. However, when each sentence is embedded under a mental verb
as a sentential complement, both of the resulting complex sentences can
be true: “He thinks the chocolate is in the cupboard” and “I know it’s in
the drawer.” In a standard change-of-location false-belief task (Wimmer
and Perner 1983) an object is moved from one place to another while
the story protagonist is off the scene (e.g., Maxi’s chocolate is moved
from cupboard to drawer). To predict correctly where Maxi will look for
the chocolate when he returns, the child needs to be able to represent
Maxi’s knowledge state, which is contradicted by the evidence now
given in the visual display. This can be achieved by using the syntactic
construction in which the mistaken representation is embedded in a
true report: “Maxi thinks that the chocolate is in the cupboard.” But is it
possible for the child to make the necessary causal connections without
the embedded representation, for example, by predicting that Maxi
will look in the cupboard because that is where he put the chocolate
before he went out? Perner (1991) denies this possibility, arguing that
the critical point is the fact that Maxi takes his belief as being the
way the world is, and so the child needs to represent Maxi’s view of
the situation as his [Maxi’s] representation—that is, the child has to
metarepresent Maxi’s belief.
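
The logic of the change-of-location task can be made concrete with a toy model. The sketch below is purely illustrative and is not a model of children's cognition: the class and function names are hypothetical, and the only assumption built in is that an agent's belief about an object's location is updated when the agent is present for the move. A prediction that consults Maxi's representation gives the cupboard; a prediction that consults the actual state of the world gives the drawer.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    present: bool = True
    beliefs: dict = field(default_factory=dict)  # object -> believed location

@dataclass
class Scene:
    locations: dict = field(default_factory=dict)  # object -> actual location
    agents: list = field(default_factory=list)

    def move(self, obj, place):
        """Move an object; only agents who are present update their beliefs."""
        self.locations[obj] = place
        for agent in self.agents:
            if agent.present:
                agent.beliefs[obj] = place

def predict_search(scene, agent, obj, metarepresent):
    """Predict where the agent will look for the object.

    With metarepresent=True the prediction is read from the agent's own
    representation of the situation; with False it is read from reality.
    """
    return agent.beliefs[obj] if metarepresent else scene.locations[obj]

maxi = Agent("Maxi")
scene = Scene(agents=[maxi])
scene.move("chocolate", "cupboard")  # Maxi puts the chocolate away
maxi.present = False                 # Maxi leaves the scene
scene.move("chocolate", "drawer")    # the chocolate is moved in his absence
maxi.present = True                  # Maxi returns

print(predict_search(scene, maxi, "chocolate", metarepresent=True))   # cupboard
print(predict_search(scene, maxi, "chocolate", metarepresent=False))  # drawer
```

The two calls differ only in which record is consulted, which is the point of the task: the correct prediction requires keeping Maxi's representation separate from the current state of the world.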
J. de Villiers (2005) makes a similar claim, but takes a strong position
of linguistic determinism in proposing that such metarepresentational
abilities crucially depend on mastery of the syntax of complementation.
Importantly, this syntactic development does not apply to all object
complements (i.e., it does not hold for desire [want + infinitive] in
English), nor does it apply to all tensed object complements [that +
finite verb]; for example, it does not hold for [want-that] in German, and
[pretend-that] in English, because these verbs take irrealis complements
(i.e., about future or imaginary events). Rather, it applies only to belief
and communication verbs, which take realis complements (i.e., about
actual events). De Villiers proposes that there is a point-of-view marker
on the complement clause, for belief and communication verbs, that
is specified by the verb itself. De Villiers argues that desire verbs (e.g.,
want) and belief verbs (e.g., think) develop along radically different
trajectories. Complements of belief verbs are never treated as irrealis. At
an early stage, before children understand that these verbs open up a
new point-of-view domain, belief complements are treated as true realis
clauses. A crucial stage comes with the realization that complements
embedded under the verb think can be false compared to the world.
De Villiers claims that this comes about via analogy with the verb say.
The two verbs are used in the same discourse contexts and are alike
syntactically and so they are placed in the same subclass. It is obvious
that say takes false complements; that is, children have evidence that
what people say sometimes does not correspond to the way things are
in the world. Children then extend this understanding to complements
of the verb think. Thus, syntax provides a bootstrap from the overt
evidence of falsity for complements of say, to the possibility of false
complements for think.
Perner (Perner et al. 2005) argues against de Villiers’s linguistic
determinism theory, pointing to a large number of studies showing a
correspondence between the age at which children understand differences
in point of view in the context of conflicting desires and the
age at which children understand differences in point of view in the
context of false beliefs. He argues that understanding point of view
cannot be derived from an understanding of the particular syntactic
structure associated with belief verbs, because desire verbs do not share
this structure. Perner’s objection is supported by the fact that children
use complement constructions well before they understand false belief,
indeed, almost as soon as they start to produce mental verbs, that is, at
two years of age (Bartsch and Wellman 1995; Bloom et al. 1989). However,
Diessel and Tomasello (2001) show that this early use is formulaic and
argue that it does not provide evidence of mastery of complement
syntax (it is an example of pragmatic “use before meaning”). In support
of de Villiers, comprehension of complements is not mastered until a
year or two later, when it correlates with children’s performance on
false-belief tasks—indeed, predicts this performance in a longitudinal
study (de Villiers and Pyers 2002). Further support is provided by Pyers’s
(this volume) data from two groups of deaf adults using Nicaraguan
Sign Language (NSL), a recently developed sign language. The younger
group, who use a more syntactically complex form of NSL that includes
complement structures, all passed a false-belief test given nonverbally.
The NSL used by the older group is less systematic and complex, and
all but one in this group failed the false-belief test.

Interpretive Abilities in School-age Children


Children’s ToM develops further during the early school years. One of
the main developments is the ability to understand doubly embedded
representations, that is, to be aware not just that people have beliefs
(and false beliefs) about the world but that they also have beliefs about
the content of others’ minds (e.g., about others’ beliefs), and similarly,
these too may be different or wrong. Such beliefs about beliefs are
referred to as second-order beliefs. Perner and Wimmer (1985) showed
that around seven years of age children are able to represent and reason
from second-order beliefs: X believes that Y believes that p. The task they
devised is quite complex, but even with a simplified version children
are not reliably correct until age six (Astington et al. 2002). In addition,
other tasks have been developed to assess children’s ability to deal
with higher-order representations involving intentions and emotions
as well as beliefs, for example, Happé’s “strange stories” (1994) and Baron-Cohen
et al.’s “faux pas” stories (1999). Another development that
occurs around six or seven years of age is the recognition that different
people may make legitimate but different interpretations of the same
external reality, which requires more than understanding the possibility
of true versus false beliefs (Chandler and Lalonde 1996).
Understanding second-order representations and recognizing interpretive
diversity are abilities that likely underlie the more mature
understanding and use of complex language that develop during the
school years. Two studies in my lab examine this possibility. Filippova
(2005) developed a new measure to assess children’s understanding of
four different types of irony (counterfactual and hyperbolic criticism
and praise). She assessed six- to ten-year-olds’ performance on this
measure and compared it with their performance on a composite ToM
measure composed of higher-order belief and intention tasks. She found
strong correlations between the ToM composite and all types of irony,
although the correlations were attenuated after partialing out age,
language ability, and working memory. Comay (n.d.) examined five- to
seven-year-olds’ narrative productions in relation to their performance
on second-order false-belief and interpretive diversity tasks. The stories
are coded for length, mental-state references, the use of evaluative
narrative enhancers, and the level of intentional structure. There are
interesting relations between the ToM scores and the narrative measures,
which hold when controlling for language ability and story length.
It may be that narrative competence mediates between language and
ToM development, and that individual differences in narrative skill
in children are related to a more general understanding of mind and
intentional action. It seems likely that children’s stories both serve to
indicate a level of psychological understanding and, at the same time,
foster further development in this area.

A Paradox at the Heart of the Matter


In the previous section, I described the interdependent development of
language and ToM from infancy to the school years. I argued that during
the preschool years ToM development is promoted by language, first
used in social interaction and then internalized as a representational
device. I suggested that participation in conversation leads to awareness
of mental states, while subsequently, children’s own syntactic and
semantic abilities facilitate metarepresentational interpretations of
human behavior. There is, however, a fundamental paradox at the
heart of this argument. Children communicate with others from a
very young age, even before they can talk, and they respond to others’
communications. Certainly, by two or three years of age, they can
engage in meaningful conversational exchanges. But the expression
and interpretation of meaning fundamentally depend on inferring
the mental states of one’s interlocutor, as Grice (1957) first argued, and
as many researchers have subsequently made clear (e.g., Sperber and
Wilson 1986). Thus, the paradox: two- and three-year-olds are proficient
communicators, who must infer their interlocutor’s mental states, and
yet they fail simple tests designed to assess their understanding of belief
and intention.
How can we resolve this paradox? Perhaps children understand others’
beliefs and intentions long before this is apparent in their performance
on standard tests, such as the false-belief task. There is, indeed, increasing
evidence that children do have some awareness of others’ mental
states before the age of two, as I reported earlier in this chapter in the
section on infancy, and as reported in other chapters (e.g., Gergely
and Csibra, and Liszkowski in this volume). For example, Liszkowski
shows that 12-month-olds use pointing to provide information for
others. Moreover, Onishi and Baillargeon (2005), using violation-of-expectation
methodology, compared 15-month-olds’ looking times in
four conditions in which an actor, holding a true or false belief about
the location of a toy (because the actor did or did not see it moved),
searches for the toy in its actual location or in the empty location.
The researchers showed that infants looked longer in conditions that
violated the expectation that an actor’s search is premised on the actor’s
beliefs, not on the actual situation. From these findings, they infer that
15-month-olds understand false belief.
Is it the case then, as some researchers have argued (e.g., Baron-Cohen
1995; Fodor 1992; Leslie 1994), that the essential aspects of ToM are
present by late infancy if not earlier, and that developmental psychologists
are wrong to focus on the preschool years as a critical period in
ToM development? These researchers contend that performance on
standard false-belief tasks requires the development of linguistic and
computational resources that are not in place until the preschool years,
but preschoolers’ correct performance simply marks the advent of these
resources and tells us nothing about the child’s understanding of false
belief. I disagree. It is unproductive to argue that the ToM abilities that
really matter are innate or that they develop at such-and-such a particular
age. What is needed is a description of ToM development that begins
in infancy and continues through the preschool and into the school-age
years. Such a description would go a long way toward resolving
the paradox referred to above, without dismissing the importance of
the developmental changes that occur toward the end of the preschool
years.
Karmiloff-Smith’s (1992) theory of representational redescription (RR)
may provide a useful way of thinking about earlier and later competences
in ToM and the role of language in facilitating the development of the
latter from the former. Her theory resists the antagonistic stance of
nativist versus constructivist theories and integrates the two in a theory
of developmental change. According to the RR theory, information
that is implicit in a cognitive system, either innately or as a result of
the system’s activity, can over time become explicit knowledge to that
system. The redescriptive process allows for the re-representation of
knowledge in a new representational format. That is, knowledge that
is at first implicit in procedures can become explicit and available to
conscious access.
With regard to the acquisition of ToM or social understanding, it is
important to remember that children are participants in the social
world right from birth. What they are acquiring in developing ToM
is an understanding of patterns of actions in which they are already
participants. “There are several aspects of the child’s commonsense
psychology in which the knowledge is initially in the structure of the
infant’s interaction with conspecifics, rather than solely in the child’s
perception and representation of the world” (Karmiloff-Smith 1992:122).
This knowledge, embedded in the interaction, is stored internally and
is available as a data structure for the redescriptive process.
There is ample evidence of the implicit level of understanding seen
in infant and toddler social interaction, including imitation, joint
attention, social referencing, and so on. There is evidence of intermediate
stages, in which knowledge is redescribed but not available to conscious
access and verbal report; for example, young three-year-olds look to the
correct location in a false-belief task but give an incorrect verbal response
(Clements and Perner 1994) and they use different intonations in setting
the scene and acting within pretend scenarios (Nelson 1996). By five
years of age the metarepresentational level is available to conscious
access and verbal report, as in standard false-belief tests. Although this
suggestion is little more than a promissory note, it is offered as a way in
which it might be possible to resolve the paradox. Much detail remains
to be filled in, but that must be left for another time.

Acknowledgments
I thank Chris Moore and Lisa Dack for helpful comments on drafts of
this chapter, and the Natural Sciences and Engineering Research Council
of Canada for support of my research.

Notes
1. Representation, in this sense, is essentially equivalent to verbal
thought; used in this way it includes only part of the full scope of
the term representation.
2. It is likely that both perspectival conversations and mental terms are
important. In the Lohmann and Tomasello (2003) training study that
Harris (2005) cites, even though perspective-shifting discourse and
use of mental terms had independent effects on the development
of false-belief understanding, the largest training effect occurred in
a condition that combined perspective-switching discourse with
mental term use.
3. Changes in children’s concepts of desire and intention are correlated
with the development of false-belief understanding. Before this
time, children see desires and intentions as motivational states
that are not clearly distinguished from one another and that are
tied to actions and speech acts. They understand that people may
have different desires or intentions, that each person acts to fulfill
his or her own desires and intentions, and that people are happy if these are
fulfilled, and so forth. After the conceptual advance marked by
false-belief understanding children recognize the consequences
of incompatible, conflicting desires (Perner et al. 2005), they can
distinguish between desire and intention, and understand intentional
causation (Astington 2001).

References
Astington, J. W. 1988. Children’s understanding of the speech act of
promising. Journal of Child Language 15:157-173.
——. 1996. What is theoretical about the child’s theory of mind? A
Vygotskian view of its development. In Theories of theories of mind,
edited by P. Carruthers and P. K. Smith, 184-199. Cambridge:
Cambridge University Press.
——. 1999. The language of intention: Three ways of doing it. In
Developing theories of intention: Social understanding and self control,
edited by P. D. Zelazo, J. W. Astington, and D. R. Olson, 295-315.
Mahwah, NJ: Erlbaum.
——. 2001. The paradox of intention: Assessing children’s metarepresentational
understanding. In Intentions and intentionality: Foundations of
social cognition, edited by B. F. Malle, L. J. Moses, and D. A. Baldwin,
85-103. Cambridge, MA: MIT Press.
Astington, J. W., P. L. Harris, and D. R. Olson (eds.). 1988. Developing
theories of mind. New York: Cambridge University Press.
Astington, J. W., and D. R. Olson. 1995. The cognitive revolution in
children’s understanding of mind. Human Development 38:179-189.
Astington, J. W., J. Pelletier, and B. Homer. 2002. Theory of mind
and epistemological development: The relation between children’s
second-order false-belief understanding and their ability to reason
about evidence. New Ideas in Psychology 20:131-144.
Avis, J., and P. L. Harris. 1991. Belief-desire reasoning among Baka
children: Evidence for a universal conception of mind. Child
Development 62:460-467.
Baldwin, D. A. 1993. Infants’ ability to consult the speaker for clues to
word reference. Journal of Child Language 20:395-418.
Baldwin, D. A., and M. Saylor. 2005. Language promotes structural
alignment in the acquisition of mentalistic concepts. In Why language
matters for theory of mind, edited by J. W. Astington and J. A. Baird,
123-143. New York: Oxford University Press.
Baron-Cohen, S. 1995. Mindblindness: An essay on autism and theory of
mind. Cambridge, MA: Bradford Books–MIT Press.
Baron-Cohen, S., M. O’Riordan, V. Stone, R. Jones, and K. Plaisted.
1999. Recognition of faux pas by normally developing children with
Asperger syndrome or high-functioning autism. Journal of Autism &
Developmental Disorders 29:407-418.
Bartsch, K., and H. M. Wellman. 1995. Children talk about the mind. New
York: Oxford University Press.
Beckwith, R. T. 1991. The language of emotion, the emotions, and
nominalist bootstrapping. In Children’s theories of mind, edited by D.
Frye and C. Moore, 77-95. Hillsdale, NJ: Erlbaum.
Bennett, J. 1978. Some remarks about concepts. Brain and Behavioral
Sciences 1:557-560.
Bloom, L., M. Rispoli, B. Gartner, and J. Hafitz. 1989. Acquisition of
complementation. Journal of Child Language 16:101-120.
Bretherton, I., and M. Beeghly. 1982. Talking about internal states: The
acquisition of an explicit theory of mind. Developmental Psychology
18:906-921.
Bretherton, I., S. McNew, and M. Beeghly-Smith. 1981. Early person
knowledge as expressed in gestural and verbal communication: When
do infants acquire a “theory of mind”? In Infant social cognition, edited
by M. E. Lamb and L. R. Sherod, 333-373. Hillsdale, NJ: Erlbaum.
Carlson, S. M., and L. Moses. 2001. Individual differences in inhibitory
control and children’s theory of mind. Child Development 72:1032-1053.
Carpendale, J. I. M., and C. Lewis. 2004. Constructing an understanding
of mind: The development of children’s social understanding within
social interaction. Behavioral and Brain Sciences 27:79-151.
Carpenter, M., K. Nagell, and M. Tomasello. 1998. Social cognition, joint
attention, and communicative competence from 9 to 15 months of
age. Monographs of the Society for Research in Child Development, No.
255, 63(4).
Chandler, M. J., A. S. Fritz, and S. M. Hala. 1989. Small scale deceit:
Deception as a marker of 2-, 3- and 4-year-olds’ early theories of
mind. Child Development 60:1263-1277.
Chandler, M., and C. Lalonde. 1996. Shifting to an interpretive theory
of mind: 5- to 7-year-olds’ changing conceptions of mental life. In
The five to seven year shift: The age of reason and responsibility, edited
by A. J. Sameroff and M. M. Haith, 111–139. Chicago: University of
Chicago Press.
Clark, H. H. 1996. Using language. Cambridge: Cambridge University
Press.
Clements, W. A., and J. Perner. 1994. Implicit understanding of belief.
Cognitive Development 9:377-395.
Comay, J. n.d. Individual differences in narrative perspective-taking and
theory of mind: A developmental study. Ph.D. dissertation, Department
of Human Development and Applied Psychology, University of
Toronto.
de Villiers, J. 2005. Can language acquisition give children a point
of view? In Why language matters for theory of mind, edited by J. W.
Astington and J. A. Baird, 186-219. New York: Oxford University
Press.
de Villiers, J. G., and J. E. Pyers. 2002. Complements to cognition: A
longitudinal study of the relationship between complex syntax and
false-belief understanding. Cognitive Development 17:1037-1060.
Dennett, D. C. 1978. Beliefs about beliefs. Brain and Behavioral Sciences
1:568-570.
Diessel, H., and M. Tomasello. 2001. The acquisition of finite complement
clauses in English: A corpus-based analysis. Cognitive Linguistics
12:97-141.
Dunn, J., and M. Brophy. 2005. Communication, relationships, and
individual differences in children’s understanding of mind. In Why
language matters for theory of mind, edited by J. W. Astington and J.
A. Baird, 50-69. New York: Oxford University Press.
Dunn, J., J. Brown, C. Slomkowski, C. Tesla, and L. Youngblade. 1991.
Young children’s understanding of other people’s feelings and beliefs:
Individual differences and their antecedents. Child Development
62:1352-1366.
Fernyhough, C. 1996. The dialogic mind: A dialogic approach to the
higher mental functions. New Ideas in Psychology 14:47-62.
Filippova, E. 2005. Development of advanced social reasoning: Contribution
of theory of mind and language to irony understanding. Ph.D. dissertation,
Department of Human Development and Applied Psychology, University
of Toronto.
Fodor, J. A. 1992. A theory of the child’s theory of mind. Cognition
44:283-296.
Gardner, D., P. L. Harris, M. Ohmoto, and T. Hamasaki. 1988. Japanese
children’s understanding of the distinction between real and
apparent emotion. International Journal of Behavioral Development
11:203-218.
Garfield, J. L., C. C. Peterson, and T. Perry. 2001. Social cognition,
language acquisition and the development of the theory of mind.
Mind & Language 16:494-541.
Gentner, D., and M. J. Ratterman. 1991. Language and the career
of similarity. In Perspectives on language and thought: Interrelations
in development, edited by S. A. Gelman and J. P. Byrnes, 225-277.
Cambridge: Cambridge University Press.
Goldman, A. I. 1989. Interpretation psychologized. Mind & Language
4:161-185.
Gopnik, A., and H. M. Wellman. 1994. The theory theory. In Mapping the
mind: Domain specificity in cognition and culture, edited by L. Hirschfeld
and S. Gelman, 257-293. New York: Cambridge University Press.
Gordon, R. M. 1986. Folk psychology as simulation. Mind & Language
1:156-171.
Grice, H. P. 1957. Meaning. Philosophical Review 66:377-388.
Hale, C. M., and H. Tager-Flusberg. 2003. The influence of language on
theory of mind: A training study. Developmental Science 6:346-359.
Happé, F. G. E. 1994. An advanced test of theory of mind: Understanding
of story characters’ thoughts and feelings by able autistic, mentally
handicapped and normal children and adults. Journal of Autism and
Developmental Disorders 24:129-154.
Harman, G. 1978. Studying the chimpanzee’s theory of mind. Brain
and Behavioral Sciences 1:591.
Harris, P. L. 1990. The child’s theory of mind and its cultural context. In
The causes of development, edited by G. Butterworth and P. E. Bryant,
43-64. London: Harvester Wheatsheaf.
——. 1992. From simulation to folk psychology: The case for development.
Mind & Language 7:120-144.
——. 1999. Acquiring the art of conversation. In Developmental
psychology: Achievements and prospects, edited by M. Bennett, 89-105.
Philadelphia: Psychology Press–Taylor & Francis.
——. 2005. Conversation, pretense, and theory of mind. In Why language
matters for theory of mind, edited by J. W. Astington and J. A. Baird,
70-83. New York: Oxford University Press.
Hughes, C. 1998. Executive function in preschoolers: Links with theory
of mind and verbal ability. British Journal of Developmental Psychology
16:233-253.
Huttenlocher, J., W. Haight, A. Bryk, M. Seltzer, and T. Lyons. 1991.
Early vocabulary growth: Relation to language input and gender.
Developmental Psychology 27:236-248.
Jenkins, J. M., S. Turrell, Y. Koguchi, S. Lollis, and H. S. Ross. 2003. A
longitudinal investigation of the dynamics of mental state talk in
families. Child Development 74:905-920.
Jin, Y., J. Jing, R. Morinaga, K. Miki, X. Su, X. Chen, and S. Source. 2002.
A comparative study of theory of mind in Chinese and Japanese
children. Chinese Mental Health Journal 16:446-448.
Johnson, C. N. 1982. Acquisition of mental verbs and the concept of
mind. In Language development: Syntax and semantics, edited by I. S.
Kuczaj, 445-478. Hillsdale, NJ: Erlbaum.
Johnson, C. N., and H. M. Wellman. 1980. Children’s developing
understanding of mental verbs: Remember, know and guess. Child
Development 51:1095-1102.
Jolly, A. 1988. The evolution of purpose. In Machiavellian intelligence:
Social expertise and the evolution of intellect in monkeys, apes, and
humans, edited by R. W. Byrne and A. Whiten, 363-378. Oxford:
Oxford University Press.
Karmiloff-Smith, A. 1992. Beyond modularity: A developmental perspective
on cognitive science. Cambridge, MA: MIT Press.
Lee, K., D. R. Olson, and N. Torrance. 1999. Chinese children’s
understanding of false beliefs: The role of language. Journal of Child
Language 26:1-21.
Leslie, A. M. 1994. ToMM, ToBy, and agency: Core architecture and
domain specificity. In Mapping the mind: Domain specificity in cognition
and culture, edited by L. Hirschfeld and S. Gelman, 119-148. New
York: Cambridge University Press.
Levy, E., and K. Nelson. 1994. Words in discourse: A dialectical approach
to the acquisition of meaning and use. Journal of Child Language
21:367-389.
Lohmann, H., and M. Tomasello. 2003. The role of language in the
development of false-belief understanding: A training study. Child
Development 74:1130-1144.
Meltzoff, A. N., A. Gopnik, and B. M. Repacholi. 1999. Toddlers’
understanding of intentions, desires, and emotions: Explorations of the
Dark Ages. In Developing theories of intention: Social understanding and
self control, edited by P. D. Zelazo, J. W. Astington, and D. R. Olson,
17-41. Mahwah, NJ: Erlbaum.
Milligan, K. V., J. W. Astington, and L. A. Dack. n.d. Language and theory
of mind: Meta-analysis of the relation between language and false-belief
understanding. Manuscript submitted for publication.
Montgomery, D. E. 2005. The developmental origins of meaning for
mental terms. In Why language matters for theory of mind, edited by J.
W. Astington and J. A. Baird, 106-122. New York: Oxford University
Press.
Moore, C. 1998. Social cognition in infancy. Monographs of the Society
for Research in Child Development, No. 255, 63(4):167-174.
Moore, C., D. Furrow, L. Chiasson, and M. Patriquin. 1994. Developmental
relationships between production and comprehension of mental
terms. First Language 14:1-17.
Naito, M. 2004. Is theory of mind a universal and unitary construct?
International Society for the Study of Behavioural Development Newsletter
45(1):9-11.
Naito, M., S. Komatsu, and T. Fuke. 1994. Normal and autistic children’s
understanding of their own and others’ false belief: A study from
Japan. British Journal of Developmental Psychology 12:403-416.
Nelson, K. 1996. Language in cognitive development: The emergence of the
mediated mind. New York: Cambridge University Press.
——. 2005. Language pathways into the community of minds. In Why
language matters for theory of mind, edited by J. W. Astington and J.
A. Baird, 26-49. New York: Oxford University Press.
Nelson, K., and L. Kessler Shaw. 2002. Developing a socially shared
symbolic system. In Language, literacy, and cognitive development: The
development and consequences of symbolic communication, edited by E.
Amsel and J. P. Byrnes, 27-57. Mahwah, NJ: Erlbaum.
Olson, D. R. 1988. On the origins of beliefs and other intentional states
in children. In Developing theories of mind, edited by J. W. Astington, P.
L. Harris, and D. R. Olson, 414-426. New York: Cambridge University
Press.
O’Neill, D. K. 2005. Talking about “new” information: The given/
new distinction and children’s developing theory of mind. In Why
language matters for theory of mind, edited by J. W. Astington and J.
A. Baird, 84-105. New York: Oxford University Press.
Onishi, K. H., and R. Baillargeon. 2005. 15-month-old infants understand
false beliefs. Science 308(April 8):255-257.
Perner, J. 1991. Understanding the representational mind. Cambridge, MA:
Bradford Books–MIT Press.
Perner, J., M. Sprung, P. Zauner, and H. Haider. 2003. Want that is
understood well before say that, think that, and false belief: A test
of de Villiers’ linguistic determinism on German-speaking children.
Child Development 74:179-188.
Perner, J., and H. Wimmer. 1985. “John thinks that Mary thinks that. . .”
Attribution of second-order beliefs by 5- to 10-year-old children.
Journal of Experimental Child Psychology 39:437-471.
Perner, J., P. Zauner, and M. Sprung. 2005. What does “that” have to
do with point of view? Conflicting desires and “want” in German.
In Why language matters for theory of mind, edited by J. W. Astington
and J. A. Baird, 220-244. New York: Oxford University Press.
Peterson, C. C., and M. Siegal. 2000. Insights into theory of mind from
deafness and autism. Mind & Language 15:123-145.
Premack, D., and G. Woodruff. 1978. Does the chimpanzee have a
theory of mind? Behavioral and Brain Sciences 1:515-526.
Rogoff, B., P. Chavajay, and E. Matusov. 1993. Questioning assumptions
about culture and individuals. Behavioral and Brain Sciences 16:533-534.
Ruffman, T., L. Slade, and E. Crowe. 2002. The relation between
children’s and mothers’ mental state language and theory-of-mind
understanding. Child Development 73:734-751.
Shwe, H. L., and E. M. Markman. 1997. Young children’s appreciation
of the mental impact of their communication skills. Developmental
Psychology 33:630-636.
Sperber, D., and D. Wilson. 1986. Relevance: Communication and cognition.
Cambridge, MA: Harvard University Press.
Tardif, T., and H. M. Wellman. 2000. Acquisition of mental state language
in Mandarin- and Cantonese-speaking children. Developmental
Psychology 36:25-43.
Tomasello, M. 1999. The cultural origins of human cognition. Cambridge,
MA: Harvard University Press.
Tomasello, M., M. Carpenter, J. Call, T. Behne, and H. Moll. 2005.
Understanding and sharing intentions: The origins of cultural cognition.
Behavioral and Brain Sciences 28:675-735.
Vinden, P. G. 1996. Junin Quechua children’s understanding of mind.
Child Development 67:1701-1716.
——. 1999. Children’s understanding of mind and emotion: A
multicultural study. Cognition and Emotion 13:19-48.
Vinden, P. G., and J. W. Astington. 2000. Culture and understanding other
minds. In Understanding other minds: Perspectives from developmental
cognitive neuroscience, edited by S. Baron-Cohen, H. Tager-Flusberg,
and D. J. Cohen, 503-519. Oxford: Oxford University Press.
Vygotsky, L. S. 1978. Mind in society. Cambridge, MA: Harvard University
Press.
Wellman, H. M. 2002. Understanding the psychological world:
Developing a theory of mind. In Blackwell handbook of childhood cognitive
development, edited by U. Goswami, 167-187. Oxford: Blackwell.
Wellman, H. M., D. Cross, and J. Watson. 2001. Meta-analysis of theory
of mind development: The truth about false-belief. Child Development
72:655-684.
Wellman, H. M., and D. Liu. 2004. Scaling theory-of-mind tasks. Child
Development 75:523-541.
Whorf, B. L. 1956. Language, thought, and reality. Cambridge, MA: MIT
Press.
Wimmer, H., S. Gruber, and J. Perner. 1984. Young children’s conception
of lying: Lexical realism—moral subjectivism. Journal of Experimental
Child Psychology 37:1-30.
Wimmer, H., and J. Perner. 1983. Beliefs about beliefs: Representation
and constraining function of wrong beliefs in young children’s
understanding of deception. Cognition 13:103-128.
seven

Constructing the Social Mind: Language and False-Belief Understanding
Jennie E. Pyers

One of the central goals of this volume is to map out the uniquely
human aspects of sociality. As is evidenced in the included
chapters, human sociality is a complex conglomerate of behaviors and
experiences, and in this chapter, I propose that its uniqueness is found
in the interdependence of these behaviors and experiences. In particular,
I argue that adult social cognition cannot develop without access to
a rich and complex language. Drawing from theory-of-mind (ToM)
data collected from typically developing preschoolers, orally taught
deaf children, and deaf signers who are exposed to an emerging sign
language, Nicaraguan Sign Language (NSL), I show that language is the
prerequisite foundation on which humans build a mature understanding
of other people’s minds.
First, I provide background about the nature of ToM, focusing on
children’s acquisition of false-belief understanding. Second, I review
several theoretical proposals, each isolating a different aspect of language
as the facilitating force behind children’s emerging understanding of
the mind. Finally, I point to significant ToM impairments in language-delayed
children and adults exposed to an emerging language to argue
that a rich and complex language must be in place for false-belief
understanding to develop in humans.
Psychological Foundations

ToM
The term theory of mind, although first introduced by primatologists, now
functions as a catchall phrase for the general understanding that human
behavior is motivated by the unobservable intentions, desires, and
thoughts of the individual (Premack and Woodruff 1978). In humans,
ToM development begins in the first days of life and extends into
early childhood, including milestones like understanding that human
action is goal directed, and appreciating that people can have different
desires and preferences. In the first year of life, early intersubjective
social behaviors, such as paying attention to human faces, imitating,
following eye gaze, social referencing, pointing, and joint attention,
reflect infants’ understanding that humans are informational agents and
possess different knowledge states. At 12 months, infants informatively
point to objects that have been displaced out of their interlocutor’s sight,
to help the interlocutor find the object. This intersubjective activity
clearly indicates that infants monitor the mental state of others, and
that they understand that their knowledge states can differ from that
of someone else (Liszkowski this volume).
One feature of ToM that stands out in the literature as a momentous
achievement in children’s social development is an understanding that
thoughts and beliefs can actually be wrong—that people can have a
“false belief.” In the first three years of life, children egocentrically
operate in the world, believing that what they think is actually true and
that their thoughts and beliefs correspond with those of everyone else.
Three-year-olds are self-perceived omniscient forces in their world. In
their fourth year, children suddenly see that each person has thoughts
and beliefs distinct from those of any other person, and importantly that those
mental states can be falsified. For example, your mother may think you
are asleep in your bed, but you are really under the covers reading a
book. In this instance, your mother has a false belief that you are asleep.
The reality of the world, that you are awake reading in bed, counters
her representation of you sleeping.
False-belief understanding is a fundamental turning point in
children’s development because it marks a transition to higher-level
metarepresentational abilities (Perner 1991). With this ability, children
are able to represent another representation and, instead of simply
understanding that people have different viewpoints, they can now represent
a belief, assess whether it is true or false, and use the representation to
predict human behavior.
Across different cultures, children arrive at an understanding of false
belief, on average, sometime between four and five years of age (Wellman
et al. 2001). Their understanding of false belief can be assessed by a
variety of measures, but two tasks stand out as “classic” false-belief tests:
an “unseen-displacement” measure,1 and an “unexpected-contents”
measure.2
In the unseen-displacement task, children are told a narrative about
an object displaced while a character is not watching (Wimmer and
Perner 1983). For example, they hear a story of a boy, who puts a piece
of chocolate in the cupboard, then goes outside to play. While he is
outside, his mother finds the chocolate in the cupboard and moves it
to the refrigerator. When he returns to the kitchen to eat his chocolate,
the children are asked the test question: “Where will he first look for
his chocolate?”
Children with a mature understanding of belief report that the boy
will first look in the cupboard, because they recognize that the boy
did not observe the displacement of the chocolate and therefore he
still believes that it is in the cupboard. However, children who are still
struggling with this concept believe that everyone else, including the
boy, shares their knowledge that the chocolate is now in the refrigerator.
These children ignore that the boy had no perceptual access to the
displacement of the chocolate, and they report that he will first look
in the refrigerator.
To correctly answer the false-belief question, children have to
understand that the boy did not see the displacement of the chocolate, and
therefore does not know that it is in the refrigerator. They have to
represent that the boy will have a false belief that the chocolate is still
in the cupboard, and that his false belief will guide his actions, even
when his actions will not satisfy his desire to retrieve the chocolate.
In the second classic false-belief measure, the “unexpected-contents”
task, children are asked to recall their own false belief as well as predict
the false belief of a friend (Perner et al. 1987). Here, children are
presented with a familiar container, like a box of raisins. They are first
asked to report what they think is in the box. Given an understanding
that raisin boxes typically contain raisins, children report that there
should be raisins in the container. After they report their belief about
the contents, it is revealed that there is a different object, a key, inside
the box.
The first test question asks children to recall what they initially said
was in the container. Surprisingly, three-year-olds consistently report
that they had always known the key was in the box, and that they had
previously said there was a key in the box. Now that the three-year-olds
have a representation of the key in the raisin box, they cannot retrieve
and report their original belief that there were raisins in the box. Only
after age four do children recall that they first said there were raisins
in the box but now know that there is a key in the box.
Children’s ability to predict what a friend will say is in the raisin box
is probed in a second question: “What will your friend think is in the
box?” Children who understand false belief will correctly report that
a friend will be fooled by the container and say that there are raisins
in the box. Those still struggling with ToM claim that a friend will
definitively know that a key is in the raisin box.
Both of these tasks assess children’s understanding of false belief, each
in different ways. The unexpected-contents task asks them to report
their own as well as a friend’s false belief. The unseen-displacement task
does not directly ask children about the character’s false belief; it requires
them, instead, to use their representation of the belief to correctly
predict the ignorant character’s action. Measuring children’s ability to
predict action based on a false belief is an indirect means of tapping
into their capacity to represent that belief. Numerous studies have used
these tasks to tap children’s false-belief understanding. Importantly,
all of the results show that by the age of five children have acquired
a metarepresentational tool that affords more accurate predictions of
human behavior.

False-belief Understanding and Language


The saturation of the ToM literature with studies on false-belief
development tellingly recognizes the significance of this metarepresentational
transition in the cognitive development of preschoolers. One of the
striking characteristics of a mature understanding of false belief is that
it emerges so late in development. Many of the core features of ToM
emerge in the first two years of life. Why would false-belief understanding
emerge so much later?
During the preschool years children experience an array of
developmental changes, all of which could be related to their emerging
understanding of belief. Children’s language, in particular, changes dramatically
over the fourth year of life. The correlation between their expanding
linguistic ability and their acquisition of a ToM has not gone unnoticed
in the literature, with strong arguments supporting the facilitative role
of language in the development of false-belief understanding. Three
different aspects of children’s linguistic experience have each been proposed
as the single causal force behind a mature understanding of belief:
conversational interaction, the acquisition of mental verbs, and the
comprehension and production of complex syntax.
Conversation
Conversation, by definition, is a means of social interaction mediated
by language. It encompasses a general level of pragmatic, semantic,
and syntactic skill. However, some specific features of conversational
interaction have been singled out as facilitating children’s ToM
understanding.
One proposal argues that language is the medium by which we make
unobservable mental states explicit; it is a direct index of our thoughts
and beliefs (Harris 1996). Thus, through conversational interaction,
children are exposed to the explicit thoughts and beliefs of others, and
over the course of conversation they observe differences between their
own and others’ mental states. By this account, children also monitor
what people say, and assess whether or not statements are true. For
example, if a child’s mother says, “it is raining outside,” and the child
looks outside and sees a beautiful sunny day, that child can learn that
what people say, and what people think, may not always be true.
An alternative to this account states that the interactive nature of
conversation, in particular experience with conversational breakdowns,
forces children to consider why their conversational intention was
misunderstood (Tomasello 1999). Successful repair of conversational
breakdowns involves assessing the misunderstanding, then
accommodating the conversational partner’s knowledge state, so that the
conversational intent is understood. Thus, conversation provides
children with real-world practice in reading the thoughts and beliefs
of other people; this solidifies their understanding that each person’s
knowledge is different.
A final account proposes that children exposed to extensive
conversational interaction learn that the information shared in conversation
adjusts according to the knowledge base of the conversation partner
(Happé 1993). During social linguistic interaction, children observe
others being selective about what information is included during
conversational interaction, and monitor the correlation between the
extent of the shared information and the knowledge state of the
interlocutor. Thus, it is through conversation that children are tuned into
the differential knowledge states of others.
Although each of these accounts highlights a different aspect of
conversational discourse as the key facilitator in children’s ToM
understanding, they all share a common conclusion: It is only through human
linguistic interaction that children acquire the capacity to represent a
false belief.
Mental Verbs
A series of developmental studies have noted a spurt in three-year-olds’
mental-state vocabulary; young three-year-olds increasingly incorporate
words like think and know into their existing vocabulary (Bartsch and
Wellman 1995). These studies note that the timing of this spurt occurs
prior to children’s mastery of false belief and that children who pass
the false-belief measure demonstrate high levels of production and
comprehension of mental-state terms (Ziatas et al. 1998). There are
several interpretations of the role that mental-state terms play in the
acquisition of a mature ToM.
The first argument states that mental-state language indexes the
richness of children’s social and linguistic interactions. Bartsch and
Wellman (1995) propose that children’s mental-state vocabulary
is an indicator of the frequency and quality of their conversations
about the mind. They suggest that caregivers use conversation rich
with mental-state terms to scaffold children’s ToM understanding by
explicitly labeling the unseen internal states that drive human action.
This argument is supported by a longitudinal study that found that
children’s false-belief performance was predicted by their mothers’ use
of mental-state terms three months earlier (Ruffman et al. 2002). A
mother’s use of mental-state language was argued to indicate her role
as a mediator for her children, making linguistic links between the
mental states of her children and of others. This mediation process
encourages children to draw on their own experiences to simulate the
belief states of others.
A second proposal falls in line with the “Specificity Hypothesis”
defined by Gopnik and Meltzoff (1997). For three-year-olds, words
like think and know explicitly label the mental states that they cannot
otherwise perceive. The language of mental states, when first acquired,
encourages children to pay particular attention to the very concept
labeled, namely the unseen thoughts and beliefs that they have previously
ignored. If mental states were not explicitly encoded in the language,
children might not have the tools to represent their own and others’
internal states, let alone to understand that thoughts and beliefs could
be false. According to this proposal, comprehension, and not necessarily
production, of mental-state terms would be sufficient for children to
understand that linguistic utterances are explicit representations of
unobservable internal states—representations that can be used to gauge
the difference between their own and others’ beliefs.
Finally, Olson (1988) argues that children’s, not parents’, production
of mental-state language opens up the possibility to consider false beliefs.
According to his proposal, the act of labeling thoughts and beliefs
leads to a meta-awareness that all linguistic utterances are explicit
representations of internal beliefs, and that those internal processes
can actually affect human action. For the preschooler, the use of mental-state
language engenders a new consciousness of belief states.

Syntax
A third, somewhat controversial, proposal claims that the rich complex
syntax of human language provides the representational means to
understand a false belief. According to one account, complex language
is the necessary tool with which children can encode representations
of the world (Plaut and Karmiloff-Smith 1993). Syntax allows children
to encode and retain a representation even in the face of a real-world
situation that may falsify that representation. For children, an observed
reality carries more representational weight than a belief that has not
been encoded by language.
An alternative proposal argues that mastery of a specific syntactic
structure, embedded sentential complements, underpins children’s
mastery of the mind (de Villiers and de Villiers 2000). Both mental
and communication verbs take a full embedded proposition as their
complement as in (1):

(1) Bonnie thought she saw the Loch Ness Monster.


The embedded proposition in this sentence is: “she saw the Loch
Ness Monster.” De Villiers and de Villiers note that children’s mastery
of these sentence structures precedes their acquisition of a mature
understanding of false belief, and also argue that these structures enable
the representation of a false belief.
Sentence (1) describes Bonnie’s false belief, and the embedded
proposition specifies her belief. Any interpretation of the embedded
proposition must be relative to Bonnie’s perception of the world, not
relative to the real world in which there is no Loch Ness Monster.
When evaluating the proposition according to the true state of affairs,
the embedded construction is false. There is no Loch Ness Monster;
therefore, it cannot be true that she saw the Loch Ness Monster.
But it is true, according to Bonnie’s representation of the world, that
she did think she saw the Loch Ness Monster. This sentence structure
embeds a false proposition, “she saw the Loch Ness Monster,” under a
true proposition, “Bonnie thought,” while still maintaining the truth
value of the entire sentence. For the de Villierses, the embedding of a
false proposition in a true sentence provides a direct linguistic parallel
to the cognitive problem of false beliefs.

Summary
Each of the accounts recognizes that language plays a crucial role
in children’s timely acquisition of a ToM, but each differs on which
aspect—conversation, semantics, or syntax—serves as the springboard.
These questions are still being teased apart; but in collaboration with
other researchers, I have sought to argue that complex language is the
necessary prerequisite for the human-specific understanding of false
belief. I turn to studies of false-belief understanding in three different
populations—typically developing children, language-delayed deaf
children, and adult native learners of an emerging sign language—to
support this argument.

Language and False-belief Understanding in Typically Developing Preschoolers
Most studies of false-belief understanding in typically developing children
are cross-sectional correlational studies that demonstrate a relationship
between false-belief understanding and language, but are unable to
determine the direction of the relationship. Although I argue for the
dependence of false-belief understanding on language development, it
is just as likely for language to be dependent on cognitive development.
According to Fodor’s (1975) proposal, concepts must be firmly in place
before language can encode them; thus, talk about mental states should
not appear until children can richly represent mental states. Here, I
present several studies with typically developing children that attempt to
experimentally determine the direction of causality in the relationship
between language and false-belief understanding.
There are two ethical means of determining the causal direction of
the relationship. First, researchers can track the natural development
of children longitudinally, measuring both their language and their
ToM development over time. This allows researchers to observe
which aspects of language emerge before or after children acquire a
mature false-belief understanding. Second, researchers can perform
an intervention. In the unethical case they can deprive children of
language, and observe how they subsequently perform on false-belief
measures. More ethically, training studies divide three-year-old failers of
false-belief understanding into separate groups, and each group is given
a different training intervention. After the intervention, the groups are
retested on false-belief understanding. If any one group improves in
performance, the improvement can be causally attributed to the specific
training intervention.
Focusing on the importance of a rich mental-state vocabulary, Hughes
and Dunn (1998) conducted a longitudinal study measuring children’s
mental-state vocabulary and their false-belief performance across time.
They found that the production of mental verbs preceded false-belief
understanding and predicted how well children performed on the
false-belief test three months later. Importantly, the reverse was not
true: false-belief performance did not precede fluent use of mental-state
vocabulary. This longitudinal study demonstrated for the first time
that an aspect of language seemed fundamental to children’s emerging
understanding of the mind.
The study, however, neither measured children’s understanding of
complex syntax nor examined the syntax produced with the observed
mental-state terms. Without such measures this study could not eliminate
the possibility that children were producing embedded sentential
complements under their mental-state terms. Children’s mental-state
vocabulary could have served as a proxy for their syntactic ability, and
syntactic ability could have been the underlying, unmeasured, predictor
of children’s false-belief performance. Although this longitudinal study
provided strong support for the importance of mental-state vocabulary
as a foundation on which children can build a mature understanding
of false belief, it did not exclude the possibility that complex syntax
plays a role as well.
A second longitudinal study directly compared the individual
contributions of semantic and syntactic ability to children’s emerging
ToM. Over the course of a year, this study tracked children’s general
semantic and general syntactic abilities as well as their understanding
of false belief (Astington and Jenkins 1999). The striking results showed
that children with better general syntactic proficiency performed better
on false-belief tests three months later than did children with lower
syntactic skills. By demonstrating that general syntactic ability, above
and beyond general semantic ability, accounted for a significant amount
of the variance in false-belief performance, this study demonstrated that
syntax played a far greater role than semantics in children’s emerging
understanding of the mind. But, because these are general syntactic and
semantic measures that do not isolate a specific syntactic structure, for
example complementation, or a specific semantic feature, for example
mental-state terms, we cannot conclude that complementation or
mental-state verbs are, or are not, the features of language that support
the acquisition of false-belief understanding. What is clear, however,
is that general syntactic ability is laying down the path for children to
discover the complexities of the human mind.
To isolate the importance of complementation, de Villiers and
Pyers (2002) longitudinally tracked the development of preschoolers’
mastery of embedded sentential complements and of ToM. Over the
course of one year, they tested a cohort of preschoolers every three
months on their comprehension of complementation and on their
false-belief understanding. In this study, preschoolers’ performance
on false-belief understanding was predicted by how well they had
mastered complementation three months prior to their false-belief test.
Thus, children who mastered complementation but failed false-belief
understanding were more likely than children who struggled with both
to pass the false-belief test three months later. This study provided
the first strong evidence for the causal role of complementation in
children’s understanding of false belief. Although mastery of mental
verbs was not measured in this study, mastery of complementation
was separated into complementation under communication verbs and
complementation under mental verbs. Further analyses revealed that
it was specifically comprehension of complement structures under
communication verbs, not mental verbs, that predicted the children’s
false-belief performance. The results from this study support the strong
linguistic determinism argument that mastery of syntax, and not of
mental-state verbs, promotes a mature understanding of the mind.
The importance of complementation independent of mental-state
verbs was underlined in a large training study (Lohmann and Tomasello
2003). In this study, children who failed false-belief understanding
were divided into five groups receiving five different types of training
exposure: experience with deceptive objects without linguistic support;
experience with deceptive objects accompanied by discourse about
perspective shifting, but with no complementation or mental state
verbs; exposure to complementation under mental verbs, without
experience with deceptive objects; or full training with deceptive objects
using discourse about perspective shifts and complementation under
communication verbs or under mental verbs. After training, children
were retested on their false-belief understanding. Only children who
received training on sentential complements, on deceptive objects with
discourse about perspective shifting, or on the full training battery
significantly improved their false-belief performance. For the purposes
of this chapter, these results emphasize two important points. First,
sentential complements without experience with deceptive objects can
facilitate children’s false-belief understanding. Second, there was no
difference in the posttraining false-belief performance between those
children who received the full training with sentential complements
under communication verbs and those who received the full training
with sentential complements under mental-state verbs. These results
support the claim that the syntax of complementation promotes an
understanding of false belief, and that complementation does not
have to occur with mental-state terms to achieve the improved ToM
performance.
When designed to determine the causal direction of the relationship
between false-belief understanding and language, studies with typically
developing children repeatedly demonstrate that timely acquisition
of false-belief understanding is dependent on language development.
In particular, those studies that include syntactic measures show that
syntax, independent of the semantics of mental-state terms, plays a
causal role in children’s developing ToM.

False-belief Understanding in Language-delayed Deaf Children
If false-belief understanding is dependent on language for its timely
acquisition, then we should also observe ToM impairments in language-delayed
children. In this section, I present several studies that confirm
that language-delayed deaf children exhibit significant delays in their
ToM understanding. However, these studies differ in which aspect of
language they attribute the ToM delays to: conversation or syntax.
Several studies have used language-delayed deaf children to support
the argument that rich communicative interaction alone facilitates
the development of false-belief understanding. They propose that deaf
children born to hearing families do not have rich communicative
interactions, and therefore do not experience the differing thoughts
and beliefs that are made observable during conversational interaction.
According to these studies, lack of conversational experience directly
leads to impairments in children’s false-belief understanding (Peterson
and Siegal 1995, 1998, 1999, 2000; Woolfe et al. 2002).
Although each of the studies found significant impairments in deaf
children’s understanding of false belief, only one included a language
measure, and none included measures of communicative experience.
Woolfe et al. (2002) matched deaf native signing children, exposed to
British Sign Language (BSL) from birth, to deaf children exposed late to
BSL, on their general BSL syntactic ability. They found that even when
matched on syntactic skills, native signing children outperformed the
late learners on the false-belief measures. They used this finding to make
the case that native signers’ experience with fluent communicative
interactions in their homes, and not their mastery of the syntax of BSL,
was the prerequisite linguistic experience for a mature understanding of
the mind. Although a measure of general syntactic ability was included
in this study, it did not test for complement structures in BSL and
therefore cannot eliminate the possibility that the native signers
understood complement structures whereas the late learners did not.
Furthermore, although the study concluded that communicative
interaction facilitates false-belief understanding, it did not include an
explicit measure of conversational interaction or competence. Without
explicit measures of conversational experience, mental-state verbs, or
complementation, one cannot determine on which of these features
of language false-belief understanding depends.
De Villiers and Pyers (2001) built a measure of complementation
into a cross-sectional study examining false-belief understanding
and language in orally taught deaf children who were only exposed
to English. Orally taught deaf children are educated using an aural
and oral method that emphasizes lipreading and auditory training to
learn spoken language. These children do not acquire a natural sign
language early in their lives. As a result of their oral training, these
children are significantly language delayed. The children in this study
also exhibited impairments in their false-belief understanding, passing
even minimally linguistic measures around the age of seven—three years
after typically developing children pass false-belief tests (de Villiers and
Pyers 2001). Furthermore, their ToM delay significantly correlated with
their ability to produce embedded complement structures. All children
who passed the false-belief test also produced sentential complements.
And no children passed the false-belief test without demonstrating
that they could produce sentential complements. Although this study
did not include a measure of mental-state vocabulary, when combined
with the studies of typically developing children, the argument for the
strong relationship between syntax and false-belief understanding is
bolstered.
The results from both typically developing and language-delayed
deaf children illuminate the facilitative role of language in the human
understanding of the mind. But we know little of what would happen
in the rare case of adults with significant language delays. Such a case
would clarify the nature of the causal role of language, providing answers
to open questions: Does language facilitate the timely acquisition of
false-belief understanding, or is it the necessary prerequisite for a mature
ToM?

False-belief Understanding in Learners of an Emerging Language
In one scenario, rapid acquisition of false-belief understanding is
facilitated by normal language development; but in the face of extreme
language delay, false-belief understanding could emerge with enough
social interactive experience. In an alternative scenario, false-belief
understanding may be completely dependent on the acquisition of
a full and rich language. Here, language plays a deterministic role in
the understanding of false belief, such that a human with little or
no language could never acquire a mature ToM, regardless of their
level of social contact. This section summarizes the results of a recent
study I conducted examining the relationship between language and
false-belief understanding in learners of an emerging sign language in
Nicaragua (Pyers 2004).

NSL
Twenty-five years ago, when the Nicaraguan government opened the
doors of the school for special education in Managua, deaf Nicaraguan
children from all class backgrounds were, for the first time, afforded a
public education, triggering a set of events that led to the birth of a new
language in Nicaragua, NSL. On the buses and playgrounds, children
converged on a set of lexical signs that served as a common basis of
the new language that has since become more complex. In analyzing
the emergence of this language, Senghas (1996) made an important
observation: one of the significant predictors of a signer’s linguistic
sophistication was the year he or she entered the school for special
education. The language of the children who entered in the late 1970s
and early 1980s was less complex than that of children who entered
the school ten years later. Contrary to the model of typical language
development, in the Nicaraguan deaf community, the younger you were
the better your linguistic skills. The younger signers who entered the
school after 1986 are referred to as the second cohort. Deaf Nicaraguans
who entered the school before 1986 are called the first cohort.
Two significant grammatical differences have been observed in the
language of the two cohorts. First, second-cohort signers are more
consistent and regular than first-cohort signers in their use of spatial
modulations to mark the arguments of a verb. The first-cohort signers
use space unsystematically, placing the subject of a verb equally as often
on their left side as on their right side (Senghas and Coppola 2001).
Second, the first-cohort signers use holistic expressions of manner and
path when talking about motion events, whereas the second-cohort
signers componentially break down the manner and path information
into two different signs (Senghas et al. 2004).
These two important differences in the complexity of the two
cohorts’ language underscore that NSL is undergoing rapid change and
systemization in its grammar. What is unique about this population with
respect to our interest in the relationship between language and social
cognition is that older signers, who have more years of social exposure
in the world, exhibit less linguistic complexity in their language; and,
younger adolescents, the second-cohort signers, have fewer years of
experience but richer linguistic knowledge. If language is a facilitative
and not a deterministic tool in children’s acquisition of false-belief
understanding, social experience could compensate for what language
cannot provide. Thus, we should observe no differences between signers
from each cohort in their ToM knowledge. Conversely, if language is
a prerequisite for a mature ToM, without which children could not
acquire an understanding of belief, then we may observe ToM delays
in the older, first-cohort signers.

False-belief Understanding
When examining the relationship between language and cognition,
it is important to rule out the language of the task as a potential
contributor to failure. To avoid this confound, a minimally verbal
picture-completion version of the unseen-displacement measure was developed
and administered to eight first-cohort signers, with a mean age of 27
years, and eight second-cohort signers with a mean age of 17 years
(Pyers 2004).
The results revealed a striking difference in the false-belief performance
of the two cohorts. Seven of the second-cohort signers, but only one
of the first-cohort signers, passed the minimally verbal false-belief
measure. For typically developing children, there is a positive correlation
between age and false-belief understanding; in this population there
was a negative correlation—the younger signers outperformed the older
signers. This, however, did not indicate that as the Nicaraguan signers
aged they lost their ability to represent a false belief. Rather, there was
a third factor that separated the two groups in their performance on
the false-belief task.
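
As a rough illustration of how lopsided this split is, the pass counts given above (seven of eight second-cohort signers, one of eight first-cohort signers) can be placed in a 2 x 2 table and checked with a two-sided Fisher exact test. This calculation is not reported in the source and is not presented as the analysis used in Pyers (2004); the function name and the choice of test are assumptions made for this sketch only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper written for this illustration; it sums the
    hypergeometric probabilities of all tables (with the same margins)
    that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the first row contributes k of the col1 "passes".
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= observed * (1 + 1e-9)  # small tolerance for float ties
    )


# Rows: second cohort, first cohort. Columns: passed, failed (counts from the text).
p_value = fisher_exact_two_sided(7, 1, 1, 7)
print(round(p_value, 4))  # 0.0101
```

With only eight signers per group, a figure of this kind can only suggest that the split is unlikely to be a chance outcome; it says nothing about which factor separates the groups.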
Information gathered in background interviews showed that the two
groups did not differ from each other in terms of family demographics,
education level, or employment history. They did differ, however, in
the year when they first learned the sign language. The first-cohort
signers were exposed to NSL in its earliest form, when the language was
first emerging. As children, the first-cohort signers contributed to the
creation of this language, moving it away from a conglomeration of
unsystematic gestures imported from an array of home-sign systems.3
As these first-cohort signers hit adolescence, a new group of children
entered the school. The language of the first cohort served as the input
to the second cohort. The second-cohort children took their somewhat
irregular and unsystematic linguistic input and produced language that
was more systematic and regular (Senghas 2003). The language of the
second cohort is more sophisticated than that of the first cohort, and
this linguistic difference seemed to underlie the differential performance
on the false-belief measure.

Relevant Linguistic Differences between the First and


Second Cohort
In this section, I summarize the relevant linguistic differences that could
contribute to the two cohorts’ differential false-belief performance.
There is little evidence showing significant differences between the
two cohorts in the quantity of conversational interaction. But there are
striking and theoretically relevant differences between the two groups
in their production of mental-state terms and multiclausal embedded
structures.
Although the study had no direct measure of communicative experience
or competence, it did provide some observations from participants’
self-reports that indicated little difference between the two cohorts in
their communicative experiences. Both first- and second-cohort signers
reported that they actively engaged in fluent conversation with their deaf
peers; yet, only the first-cohort signers struggled with an understanding
of false belief. Both the first- and second-cohort signers reported not
understanding the hearing Spanish speakers around them and not being
understood by the hearing Nicaraguans. Recall that Tomasello (1999)
proposes that children have to experience misunderstanding others
and being misunderstood to observe that mental states can differ. Daily,
both cohorts experience the communicative misunderstandings that
Tomasello argues are the building blocks to a mature ToM.4 Although
self-reports gathered about communicative experience identified no
differences between the two cohorts, more explicit measures of quality
of conversational interaction would need to be developed to fully rule
it out as a factor in the acquisition of false-belief understanding.
Mental-state terms and complement structures were elicited from
both cohorts using a well-established procedure (de Villiers and Pyers
2002; de Villiers and Pyers 2001; Gale et al. 1996). The Nicaraguan
passers and failers significantly differed in the number of mental verbs
they produced. Furthermore, there was a strong correlation between
production of mental-state terms and performance on the false-belief
measure. The Nicaraguan passers of false-belief understanding produced
significantly more complement structures under mental-state terms
than the failers. Unfortunately, the sample size was too small to tease
apart the relative, independent contributions of mental-state vocabulary
and of syntax on false-belief performance. Regardless, the fact that
the younger second-cohort signers outperformed the older first-cohort
signers in both false-belief and linguistic complexity further supports the
dependence of ToM on the acquisition of a full and rich language.

Discussion
When the results from typically developing children, language-delayed
children, and learners of an emerging language are pieced together,
the picture that emerges is one in which complex social cognition
is contingent on the acquisition of complex syntax. Without a full
and rich language, humans fail to consider a person’s false belief in
predicting their actions.
A recent study, however, appears to challenge the proposal that false-belief
understanding is dependent on language acquisition. Using a
violation-of-expectancy method, Onishi and Baillargeon (2005)
monitored 15-month-old infants’ eye gaze and showed that it revealed
the infants’ early implicit understanding of false belief. They argued that
the task demands of the traditional false-belief measures are too high,
and cannot capture this early implicit understanding. The results from
the Nicaraguan signers, however, counter their argument, because the
first-cohort signers failed a false-belief task in which the linguistic and
cognitive demands were minimal for an otherwise normal adult. Instead,
the Nicaraguan results are quite parsimonious with the interpretation
that Perner and Ruffman (2005) provide for the performance of these
infants. The precocious false-belief performance of 15-month-olds,
as measured by their eye gaze, may reflect an innate “core theory”
about the behavior of conspecifics; but only experience in the world,
specifically linguistic experience, provides children with the sufficient
means to build the kind of false-belief representation that could be
called on to reliably predict human behavior.
The strong dependence of false-belief understanding on complex
syntax is also evident in the emergence of NSL. The language is undergoing
rapid change both in its lexicon and in its syntax, and the very
structure, argued to trigger the children’s understanding of false belief,
appears early in the emergence of NSL. The first and second cohort differ
in their use of complement structures not only with mental verbs, but
also with desire verbs; second-cohort signers produce significantly more
complements with desire verbs than do the first-cohort signers (Pyers
2004). Evidently, the second cohort of early exposed child learners
introduced the systematic use of these structures into the language and
readily uses them in adult narratives. Notably, it only takes one cohort
of child learners for this complex structure to emerge; by the second
generation of child learners,5 this stepping-stone to a mature ToM is
present and available in the language.
How important is false-belief understanding to the human experience?
Although first-cohort signers struggle to accurately predict human
behavior when a false belief is involved, they lead otherwise normal
lives, living with extended families, raising children, holding down
jobs, and even navigating the Managuan bus system. How can all of
this “know-how” be maintained without a representation of false belief?
False belief, although considered the most momentous achievement
in ToM development, is not the sole internal state that drives human
behavior. Humans also act in the world motivated by their desires
and emotions. An understanding of desires and an understanding of
emotions, neither of which seem to rely on language to develop, can be
tapped to explain much of human behavior. And, the first-cohort signers
show no impairments in their understanding of desires and emotions as
driving forces of human action (Pyers 2004). The Nicaraguan failers of
false belief draw on their understanding of the physical world, of desires,
and of emotions to explain why people make mistakes in the world
(Pyers 2004). Although tapping into desires and emotions to explain
mistakes that result from false beliefs does not completely capture the
essential cause of the mistake, it may provide enough explanatory
strength for the first-cohort signers to operate somewhat effectively
in the world.
One question that remains open for now is whether there is a
critical period for acquiring complementation or for developing an
understanding of false belief. Now that NSL has complementation,
would it be possible for the first-cohort failers to acquire this linguistic
structure as adults and have it improve their performance on the false-belief
test? We do not know whether the adult first-cohort signers could
acquire complementation or, if they could, whether false-belief
understanding would still be available to them in adulthood. As the
second-cohort signers enter adulthood and increase their social and
linguistic interactions with the first cohort, we may observe important
changes in the language and perhaps cognitive capacities of the first
cohort.

Conclusion
In three different populations with varying language capabilities, we
observe that false-belief understanding is causally linked to the use and
mastery of complex syntax. Language is a fundamental and unique
characteristic of humans and, I argue, underpins human sociality.
Language itself serves two important functions in the development of
human social interaction (Astington this volume). First, as an interactive
tool, language communicates information. The communicative experience
is a core feature of human development, such that, without exposure
to a rich linguistic system, deaf children adopt gestures to meet
their communicative goals. Under the pressure of communicative
needs, these gestures begin to adopt language-like features, revealing
the dependence of efficient communication on the complex properties
of language (Goldin-Meadow this volume). Second, language is not
only a communicative tool, but also a means of verbally representing
or misrepresenting the events of the world. The data reviewed in this
chapter demonstrate that the representational capacity of language
opens up a whole new domain of social cognition in humans.
False-belief understanding is an important turning point in children’s
development, but its role in adult social interaction is not fully
understood. For example, how does the ability to consider a false belief shape
the coordination and establishment of joint commitment assumptions
(Clark this volume)? How does false-belief understanding shape our
ability to engage in efficient linguistic communication? False-belief
understanding may be dependent on language, but it is very likely that
rich communicative interaction depends on a mature understanding of
false belief. Further work on the interactional consequences of delays
in social cognition still needs to be done. Developmental research has
shown a strong bidirectional relationship between social cognition and
language through children’s early years; it is likely that this relationship
would continue in later development.
Language and social cognition have evolved in humans to support
and enhance each other throughout development. False-belief understanding
is just one domain of social cognition in which this relationship
plays out, with cognition dependent on language. But earlier in
children’s development, social cognition, specifically joint attention,
supports their early language learning (Tomasello and Farrar 1986).
Social cognition promotes language development, and language triggers
more advanced social cognition, and the interplay continues throughout
human development. And, it is in the developmental interdependence
of language and cognition that we find the roots of human sociality.

Notes
1. This measure of false belief is sometimes referred to as the “changed-
location” task.
2. The common term for the “unseen-displacement” measure is the
“Smarties task.” The familiar container first used in this study was
a tube of Smarties, a type of candy similar to M&M’s.
3. Home-sign systems are the gestures created by deaf children for use
in their homes to communicate with their families (Goldin-Meadow
1982). Each individual family can develop their own unique home-sign
system, which would be unintelligible to other families with
home-sign systems.
4. One possibility is that misunderstandings could be attributed
to breakdowns in linguistic form, not meaning. Deaf children’s
speech is difficult for adults with normal hearing to understand,
and the children are often asked to repeat themselves. Frequent
misunderstandings at the level of form may lead deaf children to
assume that all bids for clarification result from their speech not
being understood. Under this assumption, deaf children would
rarely experience communication breakdowns as encounters with
different mental states; instead, they would interpret them as failures
to understand the form of the utterance. This kind of interpretation
could actually work against children’s timely acquisition of ToM.
5. The use of generation in this context essentially refers to cohort. I
chose the word generation here because I want to emphasize that
the language has to go through a second round of child learners.
But, these learners do not necessarily have to be members of the
second cohort. For example, one first-cohort passer of false-belief
understanding had a significantly older deaf sibling. This same
first-cohort signer patterns like second-cohort signers on most of
the language measures. Although temporally a member of the first
cohort, this signer’s language use and cognitive capacities look much
like second-cohort signers. Her language sophistication could be
attributed to her early regular exposure to the raw materials of
NSL.

References
Astington, J. W., and J. M. Jenkins. 1999. A longitudinal study of
the relation between language and theory-of-mind development.
Developmental Psychology 35(5):1311–1320.
Bartsch, K., and H. M. Wellman. 1995. Children talk about the mind.
London: Oxford University Press.
de Villiers, J. G., and P. de Villiers. 2000. Linguistic determinism and
the understanding of false beliefs. In Children’s reasoning and the
mind, edited by P. Mitchell and K. Riggs, 191–228. Hove: Psychology
Press.
de Villiers, J. G., and J. E. Pyers. 2002. Complements to cognition: A
longitudinal study of the relationship between complex syntax and
false-belief-understanding. Cognitive Development 17(1):1037–1060.
de Villiers, P. A., and J. Pyers. 2001. Complementation and false-belief
representation. In Research on child language acquisition, edited by M.
Almgren, M. J. Ezeizabarrena, and I. Idiazabal, 984–1005. Somerville,
MA: Cascadilla Press.
Fodor, J. A. 1975. The Language of Thought. New York: Crowell.
Gale, E., P. de Villiers, J. de Villiers, and J. Pyers. 1996. Language and
theory of mind in oral deaf children. In Proceedings of the 20th Annual
Boston University Conference on Language Development, edited by A.
Stringfellow, D. Cahana-Amitay, E. Hughes, and A. Zukowski, 213–244.
Somerville, MA: Cascadilla Press.
Goldin-Meadow, S. 1982. The resilience of recursion: A study of a
communication system developed without a conventional language
model. In Language acquisition: The state of the art, edited by E. Wanner
and L. R. Gleitman, 51–77. New York: Cambridge University Press.
Gopnik, A., and A. Meltzoff. 1997. Words, thoughts, and theories.
Cambridge, MA: MIT Press.
Happe, F. G. E. 1993. Communicative competence and theory of mind
in autism: A test of relevance theory. Cognition 48(2):101–119.
Harris, P. 1996. Desires, beliefs, and language. In Theories of theories of
mind, edited by P. Carruthers and P. K. Smith, 200–222. New York:
Cambridge University Press.
Hughes, C., and J. Dunn. 1998. Understanding mind and emotion:
Longitudinal associations with mental-state talk between young
friends. Developmental Psychology 34(5):1026–1037.
Lohmann, H., and M. Tomasello. 2003. The role of language in the
development of false belief understanding: A training study. Child
Development 74(4):1130–1144.
Olson, D. R. 1988. On the origins of beliefs and other intentional states
in children. In Developing theories of mind, edited by J. W. Astington,
P. L. Harris, and D. R. Olson, 414–426. Cambridge: Cambridge
University Press.
Onishi, K. H., and R. Baillargeon. 2005. Do 15-month-old infants
understand false beliefs? Science 308:255–258.
Perner, J. 1991. Understanding the representational mind. Cambridge, MA:
MIT Press.
Perner, J., S. R. Leekam, and H. Wimmer. 1987. Three-year-olds’ difficulty
with false belief: The case for a conceptual deficit. British Journal of
Developmental Psychology 5(2):125–137.
Perner, J., and T. Ruffman. 2005. Infants’ insight into the mind: How
deep? Science 308:214–216.
Peterson, C. C., and M. Siegal. 1995. Deafness, conversation and theory
of mind. Journal of Child Psychology and Psychiatry and Allied Disciplines
36(3):459–474.
——. 1998. Changing focus on the representational mind: Deaf, autistic
and normal children’s concepts of false photos, false drawings and
false beliefs. British Journal of Developmental Psychology 16:301–320.
——. 1999. Representing inner worlds: Theory of mind in autistic, deaf,
and normal hearing children. Psychological Science 10(2):126–129.
——. 2000. Insights into theory of mind from deafness and autism.
Mind & Language 15(1):123–145.
Plaut, D. C., and A. Karmiloff-Smith. 1993. Representational development
and theory-of-mind computations. Behavioral and Brain Sciences
16:70–71.
Premack, D., and G. Woodruff. 1978. Does the chimpanzee have a
theory of mind? Behavioral and Brain Sciences 1(4):515–526.
Pyers, J. E. 2004. The relationship between language and false-belief
understanding: Evidence from learners of an emerging sign language in
Nicaragua. Ph.D. dissertation, Department of Psychology, University
of California, Berkeley.
Ruffman, T., L. Slade, and E. Crowe. 2002. The relation between
children’s and mother’s mental state language and theory-of-mind
understanding. Child Development 73(3):734–751.
Senghas, A. 1996. Children’s contribution to the birth of Nicaraguan Sign
Language. Ph.D. dissertation, Department of Brain and Cognitive
Science, Massachusetts Institute of Technology, Cambridge, MA.
Senghas, A. 2003. Intergenerational influence and ontogenetic
development in the emergence of spatial grammar in Nicaraguan
Sign Language. Cognitive Development. Special Issue: The Sociocultural
Construction of Implicit Knowledge 18(4):511–531.
Senghas, A., and M. Coppola. 2001. Children creating language: How
Nicaraguan Sign Language acquired a spatial grammar. Psychological
Science 12(4):323–328.
Senghas, A., S. Kita, and A. Özyürek. 2004. Children’s creation of core
properties of language. Science 305(5691):1779–1782.
Tomasello, M. 1999. The cultural origins of human cognition. Cambridge,
MA: Harvard University Press.
Tomasello, M., and M. J. Farrar. 1986. Joint attention and early language.
Child Development 57(6):1454–1463.
Wellman, H. M., D. Cross, and J. Watson. 2001. Meta-analysis of theory-
of-mind development: The truth about false belief. Child Development
72(3):655–684.
Wimmer, H., and J. Perner. 1983. Beliefs about beliefs: Representation
and constraining function of wrong beliefs in young children’s
understanding of deception. Cognition 13(1):103–128.
Woolfe, T., S. C. Want, and M. Siegal. 2002. Signposts to development:
Theory of mind in deaf children. Child Development 73(3):768–778.
Ziatas, K., K. Durkin, and C. Pratt. 1998. Belief term development
in children with autism, Asperger syndrome, specific language
impairment, and normal development: Links to theory of mind
development. Journal of Child Psychology and Psychiatry and Allied
Disciplines 39(5):755–763.
eight

Sylvia's Recipe: The Role of Imitation and Pedagogy in the Transmission of Cultural Knowledge
György Gergely and Gergely Csibra

Historically, imitation has frequently been proposed as the central
mechanism mediating the reproduction, spread, intergenerational
transmission, and stabilization of human cultural forms, population-
specific behavioral traditions found in groups of nonhuman primates,
or both (Baldwin 1894; Bandura 1986; Blackmore 2000; Byrne and
Russon 1998; Dawkins 1976; Dennett 1995; Donald 1991; Meltzoff
1996; Tomasello 1999; Tomasello et al. 1993; Whiten 2000; Whiten
and Custance 1996). In this chapter, we provide a critical reappraisal
of the dominant role classically attributed to imitation in cultural
reproduction.
We argue that the properties of alternative social learning mechanisms
(such as emulation or imitation) reflect the specific demand characteristics
that different kinds of cultural products impose on their cultural
transmission process to ensure their reproducibility. We distinguish
between cultural forms whose functionally relevant aspects are
cognitively “transparent” versus “opaque” for the observational learner
and discuss the inherent relation between these properties on the one
hand, and emulation versus imitation on the other hand. We argue how
the emergence of different cultural environments with predominantly
cognitively transparent versus opaque cultural forms may have led to the
selection and specialization of suitable alternative social transmission
mechanisms.

Social Transmission of Behaviors in Nonhuman Primates
"Simple" (Goal-driven) Teleology and Tool use in Primates
Many field researchers (e.g., Boesch and Boesch 1993; Byrne this volume;
Byrne and Russon 1998; Goodall 1986; McGrew 1996, 2004; Nishida
1987) have documented socially transmitted population-specific
behavioral skills in primates (such as nut cracking or termite fishing
in chimpanzees or manual techniques of leaf gathering in mountain
gorillas). It is possible that these behavioral skills involve no more than
chance discovery and “blind” associative processes resulting in habitual
behavior sequences leading to rewarding outcomes. However, there are
several lines of evidence suggesting that these goal-directed and socially
transmitted behavioral traditions are likely to involve a rudimentary
understanding of at least some aspects of teleological relations (Gergely
and Csibra 2003).
For example, from objects (such as sticks or stone flakes) scattered
around a locally visible goal, primates seem able to pick and choose
as their “tool” the one whose physical properties seem most affordant
to ensure goal attainment. Similarly, apes can make simple functional
modifications in the causal-physical properties of objects used as tools
to make them more affordant in relation to the visible properties of a
locally present concrete goal (Boesch and Boesch 1993; Goodall 1986;
Tomasello and Call 1997).
Tomasello (1996) argues that while observing goal-directed object
manipulations of other animals, apes can learn something about the
affordance properties of objects. Byrne (this volume) documents that
mountain gorillas can learn relatively complex manual actions (leaf
gathering skills) to achieve visible goals through observing others. He also
reports spontaneous idiosyncratic but functionally efficient variations of
the modeled manual skills in animals crippled by snare wounds. These
gorillas with severely maimed hands seem able to significantly modify
the observed manual means actions in a functionally appropriate
manner adjusting them to the morphological constraints of their hand
deformities.
Nonhuman primates’ teleological abilities are not restricted to tool
use. Uller (2004) replicated with juvenile chimps the looking time results
first demonstrated by Gergely et al. (1995) in human infants, showing
the teleological ability to evaluate the relative efficacy of different means
actions of another agent in relation to a visible goal. Tomasello et al.
(2005) summarize a series of new studies indicating that chimpanzees
have a rudimentary understanding of intentional actions of others in
terms of goals and perceptions.
In sum, different lines of evidence converge to suggest a simple level
of teleological understanding of means–ends relations in nonhuman
primates. These include the comprehension of the relative efficacy of
means actions of others as well as the relative degree of affordance of
objects used as tools in relation to visible goals.

Cognitive Limitations on Nonhuman Primates' Simple Teleology and Functionalist Understanding of Objects as Tools
Remarkable as it is, the level of teleological understanding exhibited
by apes shows severe limitations when compared with the systematic
inferential and predictive use of teleological reasoning in human infants
(Csibra et al. 2003; Gergely and Csibra 2003). These limitations are
also severe in comparison with the rather sophisticated functional
understanding of tools and of tool manufacturing and maintenance
procedures that was present in our hominid ancestors, as evidenced by
the archeological record, already roughly two million years ago (Mithen
1996; Schick and Toth 1993; Semaw 2000).1
1. Nonhuman primates’ teleofunctional conceptualization of objects as
tools is activated by perceptual access to concrete goals at specific locations.
In apes, teleological thinking about objects as tools seems induced only
in the perceptual presence of specific and concrete goals that provide
direct access to their affordance requirements (and when the animal is in
an unsatisfied motivational state to attain such goals). Importantly, to
represent objects as tools by interpreting their causal-physical properties
as affordances, it seems necessary for apes to have direct perceptual
access to the relevant functional properties of the goal object. Their
capacity to interpret physical object properties teleofunctionally as
affordance properties seems a transient and unstable conceptual ability
that is triggered only under restricted and specific input conditions.
It seems that only when these conditions are satisfied can primates
evaluate objects from a functional point of view, choose and use them
as tools, or modify their affordance properties functionally in relation
to the visible properties of concrete goals.
2. Lack of stable functional representations of objects as tools in terms of
affordance properties. These restricted input conditions on the activation
of teleological thinking impose serious limits on the cognitive abilities
of apes to functionally categorize and represent objects as tools. Such
functional representations tend to be transient and local, involving
only short periods of functionalist insight about objects as potential
tools that is likely to be forgotten as soon as the goal is satisfied or
abandoned, or the goal object is lost sight of. This is indicated by the
temporally and locally restricted use of objects as tools by apes who
tend to discard their tools after their goal has been satisfied and who
(unlike our hominid ancestors) do not routinely keep, store, or carry
tools for long distances with them. Similarly, although apes show some
ability to functionally modify tools in the perceptual presence of a
goal object, they hardly if ever make tools, modify their functional
properties, or engage in maintenance activities in locations other than
their direct application.2
3. The goal concept of apes is restricted to objects that afford direct
reinforcement. Primates tend to interpret actions teleologically only in
relation to a restricted and small set of specific types of goals that provide
direct reinforcement such as food or sex. In contrast, no such restrictions
seem to apply to the wide range and types of goals that human infants
can attribute to actions, which do not have to be (and very often are
not) tied to reward contents. There is convincing experimental evidence
(Csibra et al. 2003) indicating that in human infants goals are identified
and attributed to actions by an abstract representational and interpretive
system (the “teleological stance”) specialized for interpreting behaviors
as goal directed actions. By one year of age (and even earlier) human
infants apply the “teleological stance” in a general and systematic
manner to interpret and attribute the outcome of any observed behavior
as the goal of the agent’s action whenever considerations of efficiency
justify the action as the optimal means available to achieve the
goal within the particular physical constraints of the situation (Gergely
and Csibra 2003).

Demand Characteristics of Primate Cultural Traditions for Social
Transmission Mechanisms: Cognitive Transparency and Teleological
Emulation
The classical view among many primatologists has been that the existence
of population-specific behavioral traditions in primate groups implies
that these cultural forms are socially transmitted through imitation (e.g.,
Boesch and Boesch 1993; Byrne and Russon 1998; Goodall 1986; McGrew
1996; Nishida 1987). Several researchers point out, however, that the
time it takes to learn population-specific traditions by individuals in ape
or monkey communities turns out to be much longer than would
be expected if the mechanism of transmission involved imitation. The
same is suggested by the slow rate of spread of such behavioral routines
within the population (see Galef 1990; Tomasello 1996; Tomasello
and Call 1997; Visalberghi and Fragaszy 1990). These considerations
(together with experimental difficulties in demonstrating clear-cut cases
of imitative copying in primates; Call and Tomasello 1994; Tomasello
and Call 1997) gave rise to the idea that instead of “blind” imitation,
the dominant social–cognitive learning mechanism mediating cultural
transmission of primate traditions is some form of emulation learning
(Tomasello 1996).
Tomasello proposed that emulation learning takes place when “by
observing the manipulations of other animals individuals may learn all
kinds of ‘affordances’ of the environment that they would be unlikely
to discover on their own” (1996:321). One of the important features
distinguishing emulation from imitative copying is that in emulation
learning the animal selectively attends to the interesting outcome (the
concrete goal) that the other’s manipulations bring about, although
it apparently pays no attention to the particular behavioral means
that the other performs to bring about the outcome. As Tomasello
puts it, this kind of “social learning operate(s) without the individual
organism paying any attention whatsoever to the actual behavior of
other organisms” (1996:322). Having observed the desirable outcome,
the emulating animal tries to reproduce it on its own by applying
the action schemes available in its motor repertoire to manipulate the
tool and the goal object. Eventually, this process leads to success in
finding and learning an efficient procedure that produces the outcome.
This may happen either by rediscovering the same means action or
manner of tool use that the observed model originally performed or
by hitting on some behavioral means other than the one observed, but
that nevertheless also affords the attainment of the outcome. As a result,
the degree of fidelity of cultural transmission of the observed skill is
characteristically lower in emulation than would be expected
if its reproduction were mediated by imitative copying.
The social reproduction of goal-directed skills in primates often
produces relatively low-fidelity idiosyncratic variants of the actually
observed actions that, nevertheless, retain their functional efficacy
in relation to the goal. An intriguing example is the significant but
functionally appropriate modifications observed in the manual food-
processing skills of mountain gorillas with maimed hands (Byrne this
volume). Sumita et al. (1985), who looked closely at the spread of nut
cracking in a captive group of chimpanzees, reported that even in
normal primate populations many idiosyncrasies can be observed in
the manner that different individuals perform nut cracking.
Nagell et al. (1993) presented chimpanzees with a new rakelike
tool being used by a human demonstrator either in a more efficient
(upside down) or a less efficient (canonical position) manner to retrieve
a small out-of-reach object. Instead of blindly imitating the model,
these chimpanzees used the physically more efficient method in both
cases independently of whether that means action had or had not been
modeled to them (see also Call and Tomasello 1994).
Recently, Horner and Whiten (2005) provided evidence of rationally
selective omission of irrelevant behavior by chimpanzees learning
how to obtain a food reward by observing the actions of a model.
These animals reproduced only the causally relevant behaviors from
the sequence of actions modeled in which some of these actions were
functionally relevant whereas others were irrelevant for achieving the
goal. Importantly, this rational teleological selectivity was observed
only in one of the experimental conditions in which—because of the
use of a transparent box—the causal role mediating the effect of the
actions inside the box was directly visible to them.
Above, we argued that the characteristics of primate cultural tool
use, tool modification, and production of goal-directed manual skills
indicate some basic level of teleofunctional understanding of means–
ends relations in these animals. Now, given the variability of socially
transmitted forms and their selectivity in relation to their causal and
functional relevance reviewed above, we hypothesize that the kind of
emulation mechanism that mediates the social reproduction of primate
cultural skills is also based on and exploits the simple teleological
understanding of visible means–ends relations that primates possess.
We shall refer here to this cognitively enriched notion of emulative
observational learning as “teleological emulation.”
Because of the cognitive constraints on primate teleology, the behavioral
traditions it creates are restricted to skills whose concrete goal
is typically locally present and visually observable. This makes the
means–ends structure of the observed goal-directed manual skills
and tool use cognitively transparent for the primate learner in terms
of its own simple teleological interpretive capacities. The demand
characteristics represented by these cultural conditions of cognitive
transparency favored teleological emulation rather than blind imitation
as the dominant social–cognitive learning mechanism specialized to
mediate the within-group spread and intergenerational reproduction
of primate cultural products.
In short, our proposal is that as long as the causal-physical and means–
ends structure of the cultural skills modeled are directly observable—and,
therefore, cognitively transparent—to the primate learner, teleological
emulation provides a sufficient social–cognitive transmission mechanism
to ensure (and account for) the type of functionally relevant variability
of transmitted contents that characterizes the relatively low fidelity
cultural reproduction of primate behavioral traditions.

Can Apes Imitate, and if so, Why Don't They?


Given that, as the evidence suggests, primate behavioral traditions are
culturally reproduced by teleological emulation rather than imitation,
one may wonder (as many do) whether this is because
apes simply lack the capacity to imitate. This seems not to be the
case, however, as under some specific conditions apes can clearly be
induced to imitatively copy the observed behaviors of others. There
is certainly agreement that enculturated apes (like Kanzi or Chantek)
brought up by humans (Call and Tomasello 1996; Tomasello and Call
1997) do learn to imitate at least some new behaviors demonstrated to
them, although it is unclear which aspect(s) of their rich human cultural
environment is instrumental in activating this otherwise practically
dormant capacity. Recently, Horner and Whiten (2005) demonstrated
experimentally that when observing a series of actions performed on
a nontransparent box, chimpanzees blindly imitated a target action
(poking a rod into the opaque box) when the potential causal role it may
have played inside the box to enable the release of food reward remained
unobservable and, therefore, cognitively opaque to them. Importantly,
in another condition in which the causal-functional irrelevance of the
very same action in attaining the food was clearly observable because
the box was transparent (and so the chimpanzees could see that all the
rod did was hit a barrier that was spatially separated from the location
of the food and therefore it was clear that it played no causal role in
releasing the food), the chimps selectively (and rationally) omitted
this action from their subsequent attempts to get the food (cf. Call and
Tomasello 1995).
It seems, therefore, that apes do not simply lack the ability to
imitatively copy observed behaviors,3 but that this capacity for blind
imitation is activated only under conditions in which the relevance of
observed actions is cognitively opaque. This might explain why, in spite
of being able to imitate, apes hardly ever do so during the social learning of
the population-specific cultural skills they observe. In our view, this is
so because the primate behavioral traditions present in their natural
cultural environment typically involve perceptual access to visible
goals and means actions whose causal-functional structure is therefore
cognitively transparent to the observer’s teleological understanding.
In sum, we have argued that the conditions of cognitive transparency
characteristic of the cultural traditions of primate groups activate
teleological emulation as the dominant social–cognitive mechanism
mediating their cultural transmission. In contrast, imitation is a
mechanism of social transmission that is specially suited for (and
may be selectively triggered by) the demand characteristics of cultural
environments that contain cultural products whose causal, functional,
or intentional nature is cognitively opaque to the learner who can
therefore only acquire them through imitative copying.

Demand Characteristics of Human Cultural Forms for
Social Transmission Mechanisms: Cognitive Opacity
and Imitative Learning
In contrast to population-specific primate traditions, it seems to be a
central characteristic of human culture that many of its products are
cognitively opaque to the learner in a variety of ways. As a consequence
of this distinguishing feature, teleological emulation could not ensure
the cultural transmission and maintenance of such human cultural
forms that clearly necessitate the involvement of some form of imitative
learning for their successful cultural reproduction.

Arbitrariness and Conventionality of Human Cultural Forms


The most obvious cases of cognitive opacity in human culture are
provided by the arbitrary and conventional properties of most referential
devices used in human communicative systems (such as words, symbols,
or gestures). Such cultural forms could simply not be learned and
culturally transmitted without relying on the learner’s capacity to
imitate. For example, the acoustic–phonetic features of most words of
human languages necessitate imitative learning for their acquisition. It
is clear that no teleological efficiency considerations of causal-physical
affordance properties of phonemic strings could ever provide the
learner with cognitive “insight” into why a stone is referred to by the
word “stone” in English rather than some alternative phoneme string
such as, say, the (Hungarian) word “kö” that could (and does) do the
referential job equally well. Given the arbitrary relation between the
conventional sign and its referent, the relevance of its culturally shared
use is ensured by the conventionality of the linguistic form, rather than
by its causal-physical affordance properties. The only way, therefore,
that one can learn the vocabulary of the language of one’s culture is
through imitation; there is just no other way to do it.

Cognitive Opacity and Fidelity of Cultural Transmission: Sylvia's
Recipe
Perhaps one of the most curious and hard-to-explain aspects of human
culture is the sufficiently high-fidelity social transmission, and relative
resistance to modification and change, as a result of which many
cognitively opaque cultural forms tend to be protected from the danger
of entropy and eventual extinction from culture over the generations
(Blackmore 2000; Boyd and Richerson 1985; Dawkins 1976; Dennett
1995; Sperber 1996, this volume). This seems true in the case of many
human cultural forms in spite of (1) their cognitive causal or functional
opacity to both their users and learners, as well as (2) their apparent
lack of any clear locally adaptive value for the particular members of the
culture using, transmitting, and maintaining them. This has certainly
been a hot topic in the discussions of the role of imitation and other
possible mechanisms ensuring the fidelity of cultural transmission
and the stabilization of cultural forms in different models of cultural
evolution (Boyd and Richerson 1985) such as memetics (Dawkins 1976;
Dennett 1995), evolutionary psychology (Tooby and Cosmides 1992;
Barkow et al. 1992; Barrett et al. 2002), comparative approaches to culture
(Byrne et al. 2004; Whiten 2000) or cognitive cultural epidemiology
(Sperber 1994, 1996; Sperber and Hirschfeld 1999, 2004; for a review
see Pléh 2003).
Let us illustrate this most important human-specific aspect of social–
cultural inheritance by a (true) anecdote: The first author of this chapter
was having dinner with our friends the Watsons. During dinner, he
told them about our new theory of the human-specific adaptation for
“pedagogy” that, in his view, could explain the curious characteristics
of social transmission of relevant cultural knowledge in humans (see
later). In the discussion that followed, Marilyn Watson (an educational
psychologist; Watson and Ecken 2003) suddenly said, “Well, that makes
sense of my colleague Sylvia’s recipe for ham.” She went on to relate
this story. Sylvia, a fine educational researcher, was also a very good
cook. She had a very special way of doing a ham roast. One aspect of
her preparation was quite unique. She began by cutting a section off
both ends of the ham. One day, while her elderly mother happened
to be visiting, she set out to make her special ham for dinner. As her
mother watched her remove the end sections, she exclaimed “Why
are you doing that?” Sylvia said, “Because that’s the way you always
began with a ham.” Her mother replied, “But that is because I did not
have a wide pan!”
There are a few morals of this story we would like to call attention to:
First, unlike her mother, Sylvia had plenty of large cooking pans that
could easily accommodate even a pretty large ham in one piece. In spite
of this, however, for many years she had continued to practice the habit
of cutting off the two sidepieces of the ham before cooking it. (Maybe
her children are also doing the same today.) Second, she did so without
ever spontaneously reflecting on the functional rationale (or lack of
it) for this curious procedure that remained cognitively opaque to her
during all these years. It was only by the happenstance of her mother’s
visit and comments that she came to possess a cognitive insight into
this matter, finally understanding (learning) what the original reason
was for the cultural habit that she had (socially) inherited from her
mother. Third, the specific habit survived in the family culture for all
those years in this cognitively opaque form even though the conditions
rationalizing the procedure as functional had long been absent.

Imitative Learning as a Human-specific Adaptation
for Cultural Transmission

Teleological Emulation Versus Rational Imitation: The Selective
Interpretive Nature of Imitative Learning in Human Infants
In a delayed imitation paradigm, Meltzoff (1988) demonstrated a novel
goal-directed action to 14-month-old infants: A female model illuminated
a magic light box by leaning forward from the waist and touching the top
panel of the box with her forehead. A week later, when given the chance
to manipulate the box themselves, 67 percent of the infants reenacted
the novel “head action.” In contrast, no infants in a baseline control
group, who had not seen the action demonstrated, performed it.
Meltzoff’s (1988) finding that 14-month-olds readily imitate the
unusual and apparently less than optimal head action seemed rather
unexpected from the point of view of our own theory of the one-
year-old’s teleological stance or naive theory of rational action (Csibra and
Gergely 1998; Gergely and Csibra 2003). In a series of violation-of-
expectation studies (e.g., Csibra et al. 1999, 2003; Gergely et al. 1995),
we have shown that already nine- and 12-month-olds can attribute
goals to observed actions and evaluate the efficiency of the means
act in relation to the goal and the physical constraints of the actor’s
situation. When seeing the goal and the actor’s situational constraints,
these infants could infer, on the basis of the principle of rational
action, what the most efficient available means action to the goal
would be and expected that the actor “ought to” perform that particular
means action (and not others) to achieve the goal (Gergely and Csibra
2003). On that ground, one would predict that in the Meltzoff (1988)
task infants should perform the most efficient goal-directed action (the
“hand action”) available to them, instead of imitating the cognitively
opaque and less rational head action.
To clarify this situation, Gergely et al. (2002) performed a modified
version of the Meltzoff (1988) task. They hypothesized that “if infants
noticed that the demonstrator declined to use her hands despite the fact
that they were free, they may have inferred that the head action must
offer some advantage in turning on the light. They therefore used the
same action themselves in the same situation” (Gergely et al. 2002:755).
To test this idea, Gergely et al. ran two groups of 14-month-olds varying
the situational constraints under which the model demonstrated the
very same head action to illuminate the magic box. In the “hands-
occupied” condition, the model’s hands were occupied in a salient
and natural manner when she performed the head action. (She first
pretended to be chilly and wrapped a blanket around her shoulders
visibly holding it tight with both her hands, and only then did she bend
forward to touch the box with her forehead.) Another group of infants
were tested in the “hands-free” condition, in which after pretending to
be chilly and wrapping the blanket around her shoulders, the model
liberated her hands and placed them visibly on the table at the two
sides of the box. She then leaned forward and touched the box with
her forehead. This hands-free condition, therefore, basically replicated
the original demonstration situation of the Meltzoff (1988) study in
which the model’s hands were also free.
As Fig. 8.1 shows, when the model’s hands were occupied, 14-month-
olds were much less likely to imitate the head action (21%) when they
returned a week after the demonstration. Instead, they illuminated the
box by touching it with their hand (an emulative response), as this was
a more rational and efficient means action available to them, but not
to the model (teleological emulation). In contrast, when the model’s
hands were free, but in spite of this she still used her head to light up
the box, the majority of the 14-month-olds (69%) imitated her head
action, replicating Meltzoff’s (1988) original finding (see Gergely et
al. 2002).4

Figure 8.1. Selective imitation of the modeled “head action” (C) by 14-month-
olds in the “hands occupied” (A) vs “hands free” (B) demonstration conditions
(Gergely et al. 2002).

A further unexpected but informative result of this study was that in
both conditions all the infants tested performed the prepotent emulative
hand action irrespective of whether they imitated the head action or
not.5 Moreover, all the infants in the hands-free condition who imitated
the head action did so only after they had first performed the emulative
hand action, which, in all these cases, actually succeeded in illuminating
the box! In other words, even after they had experienced that the
effect can be brought about by the simpler hand action as well, the
majority of infants in the hands-free condition were still motivated
to reenact the model’s demonstrated (although less efficient) head
action. This indicates that the novel response imitatively learned from
the demonstration of a human model is retained by infants (even
for several months, see Meltzoff 1995) in spite of the availability and
production of more readily accessible and rational response alternatives
that also produce the same effect. This clearly suggests that imitative
learning of novel actions is a qualitatively different process in humans
than the imitative copying of new and reinforcing behavior of observed
conspecifics that has been demonstrated in several other animal species
(see Galef 1995; Galef et al. 1986; Heyes 1993; Heyes and Dawson 1990;
Heyes and Galef 1996) in which the initially copied modeled response
soon became extinguished when a more natural or equally or more
efficient alternative action was available to the animal.

Cultural Learning and Human Pedagogy
We now turn to our own interpretation of the nature of human
imitative learning and its role in the transmission and maintenance
of human cultural forms and knowledge. Earlier, we argued that the
demand characteristics of cognitively opaque cultural forms (a central
feature of human culture) necessitate the recruitment of imitation as
a social transmission mechanism to make the cultural reproduction
of cognitively opaque aspects of cultural skills possible. However, the
simple mechanism of blind imitative copying of observed actions is
“relevance blind” as it cannot differentiate which are the (relevant)
aspects of the observed behavior that should be imitated and retained,
and which are the (irrelevant) aspects that should be selectively omitted.
Therefore, blind imitation—without any correction mechanism—would
be a wasteful and error-prone social transmission process that would
represent serious danger for the cultural and cross-generational survival
of cognitively opaque human cultural forms (see Sperber, Boyd this
volume).
In our view, the capacity to blindly imitate observed behaviors of
conspecifics is a cognitively low-level, perceptual-motor mapping ability
that is not unique to humans but is available to (and exploited
in different species-specific adaptive ways by) a variety of nonhuman
species as well. Imitation has, however, evolved to serve uniquely human
functions as a component mechanism recruited by a complex cognitive
system that we call human “pedagogy” (Gergely and Csibra 2005b). We
propose that human pedagogy is a primary species-specific cognitive
adaptation to ensure fast, efficient, and relevance-proof learning of
cultural knowledge in humans under conditions of cognitive opacity
of cultural forms (Csibra and Gergely 2006).
In cultural learning one obvious way to overcome the limitations
imposed by the cognitive opacity of relevance is to acquire the relevant
knowledge content directly from another conspecific who already
possesses it. As new behaviors, especially cultural activities, are often not
transparent as to either their knowledge base or their function, an active
communicative role of the more knowledgeable conspecific may greatly
assist the efficiency and viability of cultural knowledge acquisition.
We propose (Csibra and Gergely 2006) that Mother Nature’s “trick”
to make fast and efficient cultural transmission of cognitively opaque
relevant knowledge possible was precisely along these lines: humans
have evolved complex and specialized cognitive resources—that we call
“pedagogy”—that form a dedicated communicative system in which
the participants are inclined to teach and to learn new and relevant
cultural information to (and from) conspecifics. In particular, we suggest
that human individuals who possess cultural knowledge are naturally
inclined not only to use but also to ostensively manifest their knowledge
to (and for the benefit of) naive conspecifics, whereas the latter are
naturally motivated to acquire such knowledge by actively seeking
out, attending to, and being specially receptive to such communicative
manifestations of knowledgeable others. Through pedagogy, then, fast,
efficient and “relevance-proof” transfer of cultural knowledge—even
when its content is cognitively opaque, arbitrary or conventional—
becomes achievable.
Because of the design specifications of pedagogical knowledge transfer,
the relevance of knowledge acquired is neither a (statistical) function
of repetition of invariant contingencies and reinforcement nor assured
by innate triggering stimuli (as in imprinting), and it is not guaranteed
by innate content fixation either (as in the case of evolutionarily
selected prewired information structures). Rather, in pedagogy it is the
very fact that a knowledgeable conspecific (a “teacher”) ostensively
communicates his or her cultural knowledge by manifesting it for the
novice (the “learner”) that ensures the (cultural) relevance of the
knowledge content transmitted. Because the learner is predisposed to
interpret ostensive communicative signals of the teacher as evidence
for the novelty and relevance of the knowledge content manifested,
this allows for fast learning of the knowledge communicated without
any further need to test its relevance before acquiring it. Furthermore,
because the relevance of knowledge in pedagogical transmission is
presumed, it also allows for the acquisition of knowledge contents that
are not only arbitrary, conventional, or functionally nontransparent
but that sometimes do not seem to (or actually do not) have any direct
and perceivable adaptive value at all.
We further propose (Csibra and Gergely 2006) that the human-specific
pedagogical inclination to teach each other (i.e., to transmit relevant
and new information to conspecifics) is complemented by a special kind
of human-specific receptivity to benefit from such teaching. Human
infants are equipped with specialized cognitive resources that enable
them to learn from infant-directed teaching: (1) they show very early
sensitivity to ostensive signals that indicate teaching contexts (including
eye contact, contingent reactivity, infant-directed speech, and hearing
one’s own name), (2) they tend to interpret directional cues (such as
gaze shift or pointing) produced in pedagogical contexts as referential
actions to identify the referents about which new information will be
manifested, (3) they expect the “teacher” to ostensively manifest by
his or her behavioral demonstration the relevant and new information
to be acquired about the referent, and (4) they are ready to fast map
such information to the object of reference (see Csibra and Gergely
2006 for a review of the developmental evidence supporting the very
early presence of these capacities in infancy). Finally, the infant’s
“pedagogical stance” contains the implicit assumption and expectation
that the information revealed about the referents in such ostensive
communicative teaching contexts will not only be new and relevant
but will consist of publicly shared and universal cultural knowledge
that is generalizable and shareable with other members of the cultural
community.6

Imitative Learning in the Service of Human Pedagogy: The Role of
Ostensive-communicative Cues
It is noteworthy that most developmental studies of imitative learning
typically present their target behaviors in a rich ostensive pedagogical
cuing context. For example, in a paradigm like Meltzoff’s (1988), a
female model demonstrates the novel means action by first establishing
eye contact with the infant, perhaps also addressing him or her by
name (ostensive communicative cues); she then shifts her eye gaze or
points to the object to be manipulated (referential cues). This may be
followed by an ostensive referential speech act (e.g., “Look, I’ll show you
something!”) accompanied by knowing looks and smiling, and it is only
then that the actual novel means act is manifested for the infant.
We hypothesize that in human infancy, initially, imitative learning
is triggered (and certainly strongly facilitated) by the presence of such
ostensive pedagogical cues that accompany the behavioral manifestations
of relevant cultural information by others (Gergely and Csibra
2005b). Furthermore, we argue that the interpretive selectivity guiding
what aspect of the modeled behavior is going to be imitatively learned
is directed and constrained by the implicit assumptions of the infant’s
“pedagogical stance” that the ostensive cues produced by the other
activate in the infant. In particular, when taking the “pedagogical
stance,” the infant interprets the ostensive cues addressed to him or her
as indicating that the other is about to manifest “for” the infant some
significant aspect of cultural knowledge that will be new and relevant
to him or her and that, therefore, should be fast learned.7
Let us illustrate how human pedagogy works by applying it to the
selective imitation finding of Gergely et al. (2002; see also Gergely and
Csibra 2005b). First, we assume that the 14-month-old interprets the
ostensive cues of the model as indicating her communicative intent
to manifest culturally relevant and new information for him or her to
acquire. Second, this pedagogical cuing context induces in the infant a
specific attentional and interpretive attitude that drives him or her to
apply his or her existing knowledge structures and explanatory “modes
of construal” (Gergely and Csibra 2003; Keil 1995, 2003; Kelemen 1999a,
1999b) to inferentially identify what aspect of the manifested behavior
conveys new and relevant information. Third, the pedagogical cuing
context triggers a special receptive learning mode in the infant to fast
learn what he or she has inferred to be new and relevant information
in the manifested action of the demonstrator.
Take the case of the hands-occupied condition. Clearly, the novel
outcome including the manifested affordance property of the object
(its illuminability on contact) is new information previously unknown
to the infant, so it is going to be retained in memory and reproduced
through action. But what about the particular behavioral means (the
head action) performed by the model to achieve the goal? Taking
the teleological stance and relying on the principle of rational action
(Gergely and Csibra 2003), the infant can infer that given the visible
physical constraints on the actor (her hands being occupied), the act of
touching the box by her forehead does qualify as a sensible, justifiable,
and physically efficient means action in this situation to bring about
the goal. So, because the physical-causal efficiency of the head action is
cognitively transparent (i.e., justifiable, expectable or even predictable
given that the actor’s hands are occupied), the fact that the actor used
her head (and not her hands) to touch the box does not qualify as
part of the new information that is being conveyed by the manifested
action. Therefore, it is predicted that the infant is not going to imitate
the head action in the hands-occupied context condition, but will
reproduce the novel information (i.e., will illuminate the box) by the
most efficient means available to him or her given his or her own
situational constraints: He or she will use his or her (free) hand to light
up the box (teleological emulation).
In the hands-free condition the situation is clearly different, however.
The novel goal involving the newly experienced affordance of the magic
box is new information here, too, so it is going to be retained and
reproduced. In contrast, when teleologically evaluating what the most
efficient means action would be under the given situational constraints,
on the basis of the fact that the actor’s hands were free to be used, the
infant must have identified the available “hand action” as the most
efficient (and, therefore, expectable) means that the model “ought to”
perform. Unexpectedly, however, the demonstrator chose not to use
her free hands, but instead manifested the unusual head action to bring
about the goal. We hypothesize that this perceived mismatch between
the predictable and the actually performed means action “marked” the
head action as also forming part of the new and relevant information
that the other’s ostensive manifestation conveyed. As a result, both
the new goal and the new means were retained and imitated by the
infant.
But is it really the case that the presence of pedagogical ostensive
cues indicating a communicative intent by the demonstrator to teach
is necessary to trigger the kind of inferences on the infant’s part that
can account for the selective imitation of the head action in the two
context conditions? To find out, we have recently run a new version of
the Gergely et al. (2002) study. Half of the subjects were presented with
the head action in either the hands-free or the hands-occupied context
conditions both introduced by the same pedagogical ostensive cues as
before. The rest of the 14-month-olds participated in an “incidental
observation” condition in which they observed the very same head
action performed in either the hands-free or the hands-occupied
condition, but without being exposed to any ostensive-communicative
cues by the model. Our findings (Király et al. 2004) show that the
ostensive context does make a qualitative difference, as we expected. In
the “pedagogical cuing” situation we replicated the pattern of selective
imitation reported by Gergely et al. (2002): significantly more imitation
of the head action in the hands-free than in the hands-occupied
condition. However, in the incidental observation
situation there was no significant difference in the degree to which the
head action was imitated in the two context conditions. Furthermore,
as predicted, in the hands-free condition, we found significantly
more imitation of the head action in the pedagogical ostensive cuing
context than in the incidental observation condition (Király et al. 2004;
Gergely and Csibra 2005b). Thus, the pedagogical cuing context proved
necessary to induce the relevance-guided selective imitation of the head
action in the hands-free condition. This pattern of results, therefore,
provides support for our hypothesis that the presence of pedagogical
ostensive cues plays a central role in triggering the infant’s interpretation
of the model’s behavior as a communicative manifestation of relevant
knowledge to be acquired.8
Finally, as argued in more detail elsewhere (Gergely and Csibra 2005b),
it should be pointed out that neither the finding of relevance-guided
selective imitation nor the causal role that pedagogical ostensive cues
play in inducing it seems easily accommodated by recent alternative
theories of cultural learning that attribute a general innate tendency
to human infants to imitate the observed actions of conspecifics that
is driven by a species-specific drive to “identify” with others who are
recognized as “just-like-me” (Meltzoff 1996, 2002) or by a human-
specific “motivation to share psychological states with others” with
whom they identify (Tomasello et al. 2005:1).9 This is so because the
presence of a human model—that would presumably automatically
trigger identification in all of our conditions—would predict an equal
amount of imitation of the demonstrated novel head action across
conditions (hands free vs. hands occupied) and across presentation
contexts (pedagogical cuing vs. incidental observation) if “imitative
learning. . . relie[d] fundamentally on infants’ tendency to identify with
adults” (Tomasello 1999:82; Tomasello et al. 1993), and if this human-
specific motive for identification activated in infants a general “inbuilt
drive to ‘act like’ their conspecifics” (Meltzoff 1996:363).

Where Does Cognitive Opacity of Cultural Skills Come From?
A Brief Just-so Story
As our hypothesis asserts that pedagogy is a primary human-specific
adaptation that does not necessarily rely on other (arguably human-
specific) abilities like language or theory of mind (see Csibra and Gergely
2006), the question of its evolutionary origin inevitably arises.
How and why did pedagogy evolve?
We hypothesize that the conditions that represented selective
pressure for the evolution of pedagogy may have first emerged because
of qualitative changes in the forms of teleological reasoning about
tools during hominid evolution that led to types of tool use and tool
manufacturing practices that made them cognitively opaque for the
observational learner. The simple goal-driven teleology of primates, as
we argued, is severely restricted by being activated only in the presence
of visible goals. Under these restricted input conditions, primates
can confront the teleological question: “What object could I use
to achieve this specific goal?” We know, though, that our hominid
ancestors had already surpassed or qualitatively modified this simple
teleology some two million years ago, when they started to view the
tools that they created as having permanent functions. As evidenced
in the archaeological record, this new level of more stable teleological
conceptualization of objects as tools was manifested in routine behaviors
such as keeping tools instead of discarding them after use, storing them
at specific locations, or prefabricating the tools at one location and
carrying them for long distances for later application at a different
place. We suggest that this momentous change in the application of
teleological reasoning about tools required a reversal of perspective in the
way our ancestors were thinking about tool-goal relations. Unlike simple
primate teleology that could only be triggered by direct perceptual access
to a concrete goal, “inverse” teleological reasoning could be activated
just by the sight of objects that were contemplated as potential tools
even when no specific goal was present. In other words, the sight of an
object itself (without the presence of a goal) could activate the question:
“What purpose could I use this object for?”
Early hominids not only manufactured tools at a distance from
their eventual functional use but they also used tools to manufacture
other tools (“recursive” teleology) in the absence of visible goals. In
both conditions the passive observational learner had no information
about the relevant properties of the goal (that was mentally represented
by the tool maker, but was visually inaccessible to the learner) that
guided and constrained the tool manufacturing process. Therefore,
the observed activity remained cognitively opaque for the learner as
he or she had no basis from which to infer what were the relevant
aspects of the observed activities that should be selectively retained
and reproduced. Cognitive “opacity,” therefore, represented a serious
learnability problem for previous forms of social–cognitive transmission
mechanisms (including emulation) thereby endangering the cultural
reproducibility of such new cultural practices.
We argue that the cognitive opacity of cultural products in early
hominid cultural environments represented evolutionary pressure for
the selection of a new type of social–cognitive learning mechanism
to solve this learnability problem and to ensure fast and efficient
transmission of culturally relevant knowledge.
So in our just-so story it was the emerging cognitive opacity of early
hominid technological culture that eventually led to the selection
of a human-specific communicative system specialized to ensure the
intergenerational transmission of relevant cultural knowledge. This
system, human pedagogy (Csibra and Gergely 2006; Gergely and Csibra
2005b), has provided a specialized cultural learning mechanism that
made relevance-guided selective imitative learning possible. The mutual
design specifications of pedagogy involve specialized cognitive resources
on the part of both participants of the communicative process that ensure
the efficient selective transfer of relevant cultural knowledge. On the one
side, knowledgeable humans (teachers) became spontaneously inclined
to ostensively communicate relevant cultural information by specific
types of behavioral knowledge manifestations. These were designed
“for” the learner to guide and constrain his or her inferential attempts
to identify from the communicative manifestation the new and relevant
cultural contents to be acquired. On the other side, ignorant conspecifics
(learners) developed specific receptivity to pedagogical ostensive cues
and knowledge manifestations and became equipped with specialized
cognitive devices to infer and fast learn the relevant and new cultural
information demonstrated “for” them.

Human Pedagogy as the Evolutionary Roots of Human Sociality
In conclusion, we speculate that human pedagogy—originally selected
for the more restricted domain of hominid cultural learning—may
have provided the basic phylogenetic roots for a much wider range
and multiple forms of human sociality (Enfield and Levinson this
volume). We suggest that the adaptation for pedagogy already (and
maybe for the first time in evolution) exhibited some of the constitutive
elements of human sociality as its mutual design features involve built-
in assumptions (1) about the shared goal of both participants (being that
of the transfer of relevant cultural knowledge) that forms the “common
ground” around which the pedagogical communicative exchange of
relevant information is organized (see also Enfield, Clark, Tomasello,
and Schegloff in this volume), and (2) about the teacher’s cooperative
benevolence and communicative intent (Sperber and Wilson 1986) to
share his or her culturally relevant knowledge with the learner.

Notes
1. Primate teleology seems very limited also in comparison with the
amazingly creative and generative—as well as causally sophisticated—innate
teleological understanding of means–ends relations within the specific domain
of tool use and tool making recently documented in the New Caledonian crow,
which, however, is also restricted to a highly specific task domain (Kenward
et al. 2005).
2. We are aware that our current characterization of the types and range of
restrictions that constrain primates’ functional conceptualization of objects
as tools may need to be qualified or tempered in the future as a function of
increased availability of new and relevant observational or experimental data.
At present, however, we feel that the few sporadic and often anecdotal reports
from field observations (see McGrew 2004, for a recent review) that may at
first seem to contradict our generalizations can be easily accommodated by
our hypothesis. For example, Boesch and Boesch-Achermann (2000) describe
evidence that in the lowland rainforest at Tai where quartzite stones used by
apes to crack hard-shelled nuts are rare, chimpanzees do carry such stones to
known source sites using a minimal distance strategy. Such a strategy, however,
clearly implies prior perceptual access to the specific source location and the
affordance requirements of the particular type of goal object (hard-shelled nuts)
it contains. That is, it is the animal’s prior perceptual access to the specific goal
information that precedes, triggers, and directs the subsequent search for the
nearest object with suitable affordance properties to be carried to the goal site
for being used as a tool to attain the specific goal. No doubt, this remarkable
practice does imply the relatively short-term ability to mentally represent and
actively maintain in working memory the previously perceptually accessed
specific goal information that, nevertheless, still functions as the initial
triggering condition for the teleofunctional conceptualization of the stone
object in terms of its goal.
3. “Blind” imitative behavior copying seems to be a basic competence
available to a variety of different species, sometimes used extensively and
spontaneously in natural environmental conditions as in the case of vocal
imitation in learning species-specific songs and dialects in passerine birds such
as sparrows (e.g., Petrinovich 1988), whereas in other cases experimentally
inducible by presenting pretrained conspecific models performing new behaviors
that result in direct reinforcement, as in budgerigars, rats, or pigeons (see
Galef 1995; Galef et al. 1986; Heyes 1993; Heyes and Dawson 1990; Heyes and
Galef 1996).
4. We have also replicated this finding of selective imitation between the
two context conditions in a situation in which the model was not present
during the testing phase (Gergely et al. 2003).
5. Meltzoff (1988, 1995) presented only frequencies of imitating the target
act and did not comment on the existence of alternative emulative responses
such as the hand action.
6. See Csibra and Gergely (2006) for additional arguments showing how
a variety of early emerging social cognitive capacities—such as imitative
learning (Gergely and Csibra 2005b), social referencing (Gergely et al. 2007),
protodeclarative pointing, or word learning—can be usefully reinterpreted as
examples of cultural learning through pedagogy.
7. Note that these assumptions are directly analogous, if not identical, to
the Gricean pragmatic assumptions of ostensive communication as spelled
out in Sperber and Wilson’s (1986) relevance theory. In our view, however,
pedagogy is a primary adaptation for cultural learning and not a specialized
module dedicated to the economic recovery of speaker’s intent in linguistic
communication that has evolved later as a submodule of the general theory of
mind capacity of humans (Sperber and Wilson 2002).
8. For further supporting evidence of the role of pedagogical cues in
influencing the early teleofunctional construal of the function of new artifacts,
see Casler and Kelemen (2005), and DiYanni and Kelemen (2005).
9. For a critical analysis of this position, see Gergely and Csibra (2005a).

References
Baldwin, J. M. 1894. Mental development in the child and the race. Methods
and process. New York: Macmillan.
Bandura, A. 1986. Social foundations of thought and action: A social cognitive
theory. Englewood Cliffs, NJ: Prentice Hall.
Barkow, J., L. Cosmides, and J. Tooby. 1992. The adapted mind: Evolutionary
psychology and the generation of culture. New York: Oxford University
Press.
Barrett, L., R. Dunbar, and J. Lycett. 2002. Human evolutionary psychology.
Houndmills: Palgrave.
Blackmore, S. 2000. The power of memes. Scientific American
283(4):52–61.
Boesch, C., and H. Boesch. 1993. Diversity of tool use and tool-making
in wild chimpanzees. In The use of tools by human and non-human
primates, edited by A. Berthelet and J. Chavaillon, 158–187. Oxford:
Oxford University Press.
Boesch, C., and H. Boesch-Achermann. 2000. The chimpanzees of the Tai
forest: Behavioural ecology and evolution. Oxford: Oxford University
Press.
Boyd, R., and P. J. Richerson. 1985. Culture and the evolutionary process.
Chicago: Chicago University Press.
Byrne, R. W., and A. E. Russon. 1998. Learning by imitation: A hierarchical
approach. Behavioral and Brain Sciences 21:667–721.
Byrne, R. W., P. J. Barnard, I. Davidson, V. M. Janik, W. J. McGrew, Á.
Miklósi, and P. Wiessner. 2004. Understanding culture across species.
Trends in Cognitive Sciences 8(8):337–386.
Call, J., and M. Tomasello. 1994. The social learning of tool use by
orangutans (Pongo pygmaeus). Human Evolution 9:297–313.
——. 1995. The effect of humans on the cognitive development of apes.
In Reaching into thought, edited by A. E. Russon, K. A. Bard, and S. T.
Parker, 371–403. New York: Cambridge University Press.
——. 2005. The use of social information in the problem-solving of
orangutans (Pongo pygmaeus) and human children (Homo sapiens).
Journal of Comparative Psychology, 109:308–320.
Casler, K., and D. Kelemen. 2005. Young children’s rapid learning about
artifacts. Developmental Science 8(6):472–480.
Csibra, G., S. Bíró, O. Koós, and G. Gergely. 2003. One-year-old infants
use teleological representations of actions productively. Cognitive
Science 27(1):111–133.
Csibra, G., and G. Gergely. 1998. The teleological origins of mentalistic
action explanations: A developmental hypothesis. Developmental
Science 1(2):255–259.
——. 2006. Social learning and social cognition: The case of pedagogy.
In Processes of change in brain and cognitive development. Attention and
Performance, vol. 21, edited by Y. Munakata and M. H. Johnson,
249–274. Oxford: Oxford University Press.
Csibra, G., G. Gergely, S. Bíró, O. Koós, and M. Brockbank. 1999. Goal
attribution without agency cues: The perception of “pure reason” in
infancy. Cognition 72:237–267.
Dawkins, R. 1976. The selfish gene. Oxford: Oxford University Press.
Dennett, D. 1995. Darwin’s dangerous idea: Evolution and the meanings
of life. New York: Simon and Schuster.
DiYanni, C., and D. Kelemen. 2005. Using a bad tool with good intention:
How preschoolers weigh physical and intentional cues when learning
about artifacts. Cognition 97:327–335.
Donald, M. 1991. Origins of the modern mind: Three stages in the evolution
of culture and cognition. Cambridge, MA: Harvard University Press.
Galef, B. G., Jr. 1990. Tradition in animals: Field observations and
laboratory analyses. In Interpretations and explanations in the study
of behavior: Comparative perspectives, edited by M. Bekoff and D.
Jamieson, 74–95. Boulder, CO: Westview Press.
——. 1995. Why behaviour patterns that animals learn socially are
locally adaptive. Animal Behaviour 49:1325–1334.
Galef, B. G., Jr., L. A. Manzig, and R. M. Field. 1986. Observational
learning in budgerigars: Dawson and Foss (1965) revisited. Behavioural
Processes 13:191–202.
Gergely, G., H. Bekkering, and I. Király. 2002. Rational imitation in
preverbal infants. Nature 415(6873):755.
Gergely, G., and G. Csibra. 2003. Teleological reasoning about actions:
The naïve theory of rational action. Trends in Cognitive Sciences
7:287–292.
——. 2005a. A few reasons why we don’t share Tomasello et al.’s
intuitions about sharing. Commentary on Tomasello, Carpenter,
Call, Behne, and Moll’s target article “Understanding and sharing
intentions: The origins of cultural cognition.” Behavioral and Brain
Sciences 28:701–702.
——. 2005b. The social construction of the cultural mind: Imitative
learning as a mechanism of human pedagogy. Interaction Studies
6(3):463–481.
Gergely, G., K. Egyed, and I. Király. 2007. Early mindreading versus
pedagogical knowledge transfers interpreting object-referential
emotion expressions during the second year. Developmental Science,
in press.
Gergely, G., I. Király, and O. Koós. 2003. Developmental changes in
early observational learning. Paper presented at the Biennial Meeting
of the Society for Research in Child Development, Tampa, April
24–27.
Gergely, G., Z. Nádasdy, G. Csibra, and S. Bíró. 1995. Taking the
intentional stance at 12 months of age. Cognition 56(2):165–193.
Goodall, J. 1986. The Chimpanzees of Gombe: Patterns of Behavior.
Cambridge, MA: Harvard University Press–Belknap Press.
Heyes, C. M. 1993. Imitation, culture and cognition. Animal Behaviour
46:999–1010.
Heyes, C. M., and G. R. Dawson. 1990. A demonstration of observational
learning using a bidirectional control. Quarterly Journal of Experimental
Psychology 42B:59–71.
Heyes, C. M., and B. G. Galef Jr. 1996. Social learning in animals: The
roots of culture. New York: Academic Press.
Horner, V., and A. Whiten. 2005. Causal knowledge and imitation/
emulation switching in chimpanzees (Pan troglodytes) and children
(Homo sapiens). Animal Cognition 8:164–181.
Keil, F. C. 1995. The growth of causal understandings of natural kinds.
In Causal cognition: A multi-disciplinary debate, edited by D. Sperber,
D. Premack, and A. J. Premack, 234–262. Oxford: Clarendon Press.
——. 2003. Folkscience: Coarse interpretations of a complex reality.
Trends in Cognitive Sciences 7:368–373.
Kelemen, D. 1999a. Beliefs about purpose: On the origins of teleological
thought. In The descent of mind: Psychological perspectives in hominid
evolution, edited by M. Corballis and S. Lea, 278–294. Oxford: Oxford
University Press.
——. 1999b. Functions, goals and intentions: Children’s teleological
reasoning about objects. Trends in Cognitive Sciences 12:461–468.
Kenward, B., A. A. S. Weir, C. Rutz, and A. Kacelnik. 2005. Behavioral
ecology: Tool manufacture by naive juvenile crows. Nature
433(7022):121.
Király, I., G. Csibra, and G. Gergely. 2004. The role of communicative-
referential cues in observational learning during the second year.
Paper presented at the 14th Biennial International Conference on
Infant Studies, Chicago, May 5–8.
McGrew, W. C. 1996. Chimpanzee material culture: Implications for human
evolution. New York: Cambridge University Press.
——. 2004. The cultured chimpanzee: Reflections on cultural primatology.
New York: Cambridge University Press.
Meltzoff, A. N. 1988. Infant imitation after a one week delay: Long-
term memory for novel acts and multiple stimuli. Developmental
Psychology 24:470–476.
——. 1995. What infant memory tells us about infantile amnesia:
Long-term recall and deferred imitation. Journal of Experimental Child
Psychology 59:497–515.
——. 1996. The human infant as imitative generalist: A 20-year progress
report on infant imitation with implications for comparative
psychology. In Social learning in animals: The roots of culture, edited
by C. M. Heyes and B. G. Galef Jr., 347–370. New York: Academic
Press.
——. 2002. Imitation as a mechanism of social cognition: Origins
of empathy, theory of mind, and the representation of action. In
Handbook of Childhood Cognitive Development, edited by U. Goswami,
6–25. Oxford: Blackwell.
Mithen, S. 1996. The prehistory of the mind. London: Thames and
Hudson.
Nagell, K., R. Olguin, and M. Tomasello. 1993. Processes of social
learning in the tool use of chimpanzees (Pan troglodytes) and human
children (Homo sapiens). Journal of Comparative Psychology
107:174–186.
Nishida, T. 1987. Local traditions and cultural transmission. In Primate
Societies, edited by B. B. Smuts, D. L. Cheney, R. M. Seyfarth, R. W.
Wrangham, and T. T. Struhsaker, 462–474. Chicago: University of
Chicago Press.
Petrinovich, L. 1988. Individual stability, local variability and the
cultural transmission of song in White-crowned Sparrows (Zonotrichia
leucophrys nuttalli). Behaviour 107:208–240.
Pléh, C. 2003. Thoughts on the distribution of thoughts: Memes or
epidemics. Journal of Cultural and Evolutionary Psychology 11:21–51.
Schick, K. D., and N. Toth. 1993. Making silent stones speak: Human
evolution and the dawn of technology. New York: Simon and Schuster.
Semaw, S. 2000. The world’s oldest stone artifacts from Gona, Ethiopia:
Their implications for understanding stone technology and patterns
of human evolution between 2.6–1.5 million years ago. Journal of
Archaeological Science 27:1197–1214.
Sperber, D. 1994. The modularity of thought and the epidemiology of
representations. In Mapping the mind: Domain specificity in cognition
and culture, edited by L. A. Hirschfeld and S. A. Gelman, 39–67. New
York: Cambridge University Press.
——. 1996. Explaining culture: A naturalistic approach. Oxford:
Blackwell.
Sperber, D., and L. Hirschfeld. 1999. Culture, cognition, and evolution.
In MIT Encyclopedia of the Cognitive Sciences, edited by R. Wilson and
F. Keil, cxi–cxxxii. Cambridge, MA: MIT Press.
——. 2004. The cognitive foundations of cultural stability and diversity.
Trends in Cognitive Sciences 8(1):40–46.
Sperber, D., and D. Wilson. 1986. Relevance: Communication and cognition.
Oxford: Blackwell.
——. 2002. Pragmatics, modularity and mind-reading. Mind and Language
17(1):3–23.
Sumita, K., J. Kitahara-Frisch, and K. Norikoshi. 1985. The acquisition
of stone-tool use in captive chimpanzees. Primates 26:168–181.
Tooby, J., and L. Cosmides. 1992. The psychological foundations of
culture. In The adapted mind: Evolutionary psychology and the generation
of culture, edited by J. Barkow, L. Cosmides, and J. Tooby, 19–136.
New York: Oxford University Press.
Tomasello, M. 1996. Do apes ape? In Social learning in animals: The roots
of culture, edited by C. M. Heyes and B. G. Galef Jr., 319–346. New
York: Academic Press.
——. 1999. The cultural origins of human cognition. Cambridge, MA: Harvard
University Press.
Tomasello, M., and J. Call. 1997. Primate cognition. Oxford: Oxford
University Press.
Tomasello, M., M. Carpenter, J. Call, T. Behne, and H. Moll. 2005.
Understanding and sharing intentions: The origins of cultural
cognition. Behavioral and Brain Sciences 28(5):675–691.
Tomasello, M., A. C. Kruger, and H. H. Ratner. 1993. Cultural learning.
Behavioral and Brain Sciences 16:495–552.
Uller, C. 2004. Disposition to recognize goals in infant chimpanzees
(Pan troglodytes). Animal Cognition 7:154–161.
Visalberghi, E., and D. M. Fragaszy. 1990. Do monkeys ape? In “Language”
and Intelligence in Monkeys and Apes, edited by S. T. Parker and K. R.
Gibson, 247–273. Cambridge: Cambridge University Press.
Watson, M., and L. Ecken. 2003. Learning to trust: Transforming difficult
elementary classroom through developmental discipline. San Francisco:
Jossey-Bass.
Whiten, A. 2000. Primate culture and social learning. Cognitive Science
24:477–508.
Whiten, A., and D. Custance. 1996. Studies of imitation in chimpanzees
and children. In Social learning in animals: The roots of culture, edited
by C. M. Heyes and B. G. Galef Jr., 347–370. New York: Academic
Press.
Part 3

Culture and Sociality


nine

The Thought that Counts: Interactional Consequences of
Variation in Cultural Theories of Meaning
Eve Danziger

The Mopan Maya, indigenous swidden agriculturalists of Central
America, can strike outsiders as remarkable in their attitudes to
mental states. In particular, when things go wrong in the Mopan
world, perpetrators of crimes are found responsible and punished
according to the degree of damage or waste that they have caused,
rather than according to the degree to which their crime was committed
intentionally. This extends from full-blown adult wrongdoing at the
village judicial level to the misdemeanors of the private household (for
more on Mopan society, see Danziger 2001; Gregory 1984; Thompson
1930). Children and adults alike are punished according to the outcome
of their doings; the defense of “I didn’t mean it!” is considered irrelevant,
and therefore seldom attempted. And this is not merely an institutional
or interested stance. The same attitude of disregard for the mental states
of perpetrators is found in the everyday gossip of bystanders. In one
instance, for example, I heard from one of his neighbors the story of a
man who had recently committed suicide. The man, it seems, had drunk
poison rather than face the machetes of fellow villagers, who were out
for his blood after his yearly agricultural burn had destroyed their cacao
trees. In the account that I heard, the fact that his fire had burned out of
control and that the destruction was not intended weighed nothing—
either in mitigating the reported anger of those wronged or in the
narrator’s assessment of their likelihood to forgive the perpetrator. The
neighbor found it quite understandable that he would have made the
decision to kill himself under these circumstances. Most remarkable of
all, the nonintentional nature of this crime was something I only elicited
from the narrator through specific questioning. It was not originally
included as a reportable component of the narrative.
The phenomenon is found across the spectrum of Mopan misdemeanor
and reproach; linguistic wrongdoing is no exception. The uttering of
a falsehood is not excused even if the speaker believes at the time
of utterance that his or her statement is true. Children who indulge
in pretend play or fantasy story making are reproved—several adults
have told me that they hid such play from their parents as children.
Meanwhile, adult storytellers react with moral indignation when queried
whether their stories—including those involving talking animals or
other supernatural creatures—are literally true. The person who retells
a Mopan story is held responsible for its truth, and tellers are cautious
in thus committing themselves.
The basis for Mopan belief in a story tends to depend on the degree
of respect that a person holds for the individual from whom he or she
first heard the story. But although social analysts (Foucault 1980; see also
Evans-Pritchard 1976) might be quick to point out that in the Mopan
case, as in all others, the question of what counts as “true” depends
on social and political circumstances, Mopan practitioners do not see
things this way. Utterance truth is considered in Mopan ideology a
straightforward matter of word-to-world fit, and utterances are held
either to succeed or to fail at accurately describing some actual past or
present state of affairs (e.g., I was told that animals did speak in days
gone by). Stories and other statements that are not assessed as true
within this matrix are referred to in Mopan as tus (lies).
There exists no other candidate lexeme in Mopan for the notion of
“lying” or “stating falsehood,” and the translation “lies, lying” is the
only one ever offered for this form by bilingual Mopan speakers. Harshly
or mildly applied, a negative connotation is always present to some
degree in uses of this word. A characterization of another’s utterance
as tus, however, is based exclusively on the perceived truth value of
expressions and not on the intentional or belief states of the speaker.
This is so even when the speaker merely translates the opinions or
repeats the words of another (Danziger 1996b, 2001, 2002). Accordingly,
many cases of expression that might be categorized elsewhere as “errors”
are condemned in Mopan as tus. Accusations of tus are common in
Mopan face-to-face interaction and in gossip, and these accusations
carry the negative charge we have discussed. Mopan social practice is
characterized by a tendency toward caution and reticence in interactions
with others (Gregory 1975), by the liberal deployment of evidential
and quotative particles in speech (Danziger n.d.) and by an accepted
interactional reliance on silence in cases of uncertainty (Danziger 1996a,
2001; cf. Basso 1970).
In the varieties of English that have so far been investigated, a false
statement is properly a “lie” only if the speaker is aware of its falsehood
(Coleman and Kay 1981; Sweetser 1987; for other, more Mopanlike,
possibilities within English, however, consider Brice Heath 1982). It is
precisely the fact that, despite its ubiquity, the gloss “lies” is only a
partially adequate translation for Mopan tus that is of interest for the
present discussion.

Mopan Philosophy of Language


These distinctive Mopan social and interactional phenomena do not
seem to occur as effortful attempts to comply with a superficial or
externally imposed moral ideology. Rather, they emerge as a consequence
of deeply held and themselves largely unconscious beliefs about the
nature of language, mind, and the universe.
Mopan certainly know that others can have false beliefs and mistaken
understandings of the world. Pilot versions of standard tasks (see, e.g.,
Astington this volume) show that Mopan children acquire understanding
of others’ false belief by school age and certainly well before puberty.
And it is quite straightforward in Mopan to describe another’s mental
state, using everyday predicates such as k’at (want), tz’okes (believe),
eel (know) and others. But in Mopan ideology one does not use one’s
understanding of the possibility that others can entertain false beliefs to
excuse falsehood. This is because a separate and sacred morality inheres
in the very relationship of spoken word to actual world. The nature
of the transgression involved in speaking falsehood is cosmological as
well as interpersonal.
The prohibition on the telling of falsehoods is an aspect of tzik
(respect), one of the most important moral forces in the Mopan universe.
Elsewhere (Danziger 2001), I have described how tzik forbids incest,
murder, unruliness, laziness, levity, and, crucially for our purposes, the
telling of lies. To violate these prohibitions is called in Mopan p’a’as,
a term whose translational range goes all the way from “teasing” to
“mockery” to “insult” and “blasphemy.”
This sacred aspect of the word-to-world fit is related to the desirability
of keeping the universe in good order and lends itself to supernatural
inversion. One Mopan man explained to me that he had been taught
as a child not to indulge in pretend play explicitly for fear that his false
utterances should come true as a result of his speech. Formal genres of
Mopan speech cannot be performed, even for the eager anthropologist,
except under the conditions appropriate for nonstaged performances.
This is because the utterances involved are believed to remain efficacious
regardless of context, and using them in the wrong circumstances
creates a word-to-world mismatch that is regarded as both dangerous
and blasphemous (Danziger 2001).
Overall, in Mopan philosophy, linguistic words and expressions are
considered to be related to their signifieds in ways that are ubiquitously
performative, and that transcend the volition of those who use them. It
follows that judgments about the morality of false expression are made
on the basis of perceived word-to-world fit, without consideration of
utterer intentions or belief states. Translated into a different vocabulary
(Searle 1965), we might say that in Mopan philosophy utterances have
locutionary but not illocutionary force.

Ideologies of Intention in Cross-cultural Context


The Mopan case has significant resonance in the ethnographic literature
on non-Western philosophies of language. There is clear evidence from
a number of societies and contexts that an utterer’s mental state is not
always or universally considered relevant to the interpretation of his
or her speech and action. Levinson (this volume) makes mention of
the large body of social theory that has recognized for generations that
legal responsibility in many societies is vested elsewhere than in the
individual and his or her intentions (see also Rosen 1995). Specifically
with regard to language, we know that in Samoa, for example, children
are shown that the mental states only of high-status others are to be
interpreted (Ochs 1982); Samoan politicians are held accountable for
broken commitments even when their failures are the result of others’
bad faith (Duranti 1992). In the Philippines, Ilongot warriors swear
binding oaths rather than making sincere promises (Rosaldo 1982).
DuBois (1987) shows that religious divination across cultures functions
in the absence of utterer intentions, whereas in Guatemala, one simply
shrugs one’s shoulders when asked to consider the motives and mental
states of others (Warren 1995). See Gaskins (this volume) for other
examples drawn from the study of cross-cultural socialization. Robbins
(2001) is particularly helpful in articulating the fact that the particular
ideology that considers intentionality as the key to linguistic production
and interpretation is that of historic “modernity” and as such a cultural
product like all other ideologies of language (see also Lakoff and Johnson
1980, Reddy 1993).1
Like these, the observations from Mopan can readily be directed to
cast doubt on the privileged role of intentionality as it is enshrined in
the philosophy of language in interaction (Grice 1989c; Searle 1965). A
ready riposte to such a cultural critique, however, lies in pointing out
that what a people believe they are doing and what an analyst observes
them to be doing may not be one and the same. If Samoans, Ilongot,
Mopan, and the rest experience their own mental states and appreciate
those of others, then any claimed disregard for such states as factors in
the conduct of interaction can be seen as a kind of false consciousness,
a “folk” understanding that the analyst may freely set aside.
I endorse in what follows the position that cultural ideologies (whether
of the Mopan or of the moderns) have little to do with the conduct
of most interaction. I intend to conclude, however, that there are at
least some grounds for proposing that local philosophies about the
place of mind in linguistic meaning cannot always be disregarded. In
that connection, it will be fruitful to consider the cultural differential
not as an issue related to actual utterer intentions or the lack of them
(contra Nuyts 1994), nor even as a matter of audience belief that such
intentions may exist, but as a matter of the degree to which audiences
are willing to take them into account as a routine part of linguistic
interpretation. This shift of focus will lead us to conclude that local
beliefs and philosophies about language are not always mere linguistic
epiphenomena but may at times play a crucial role in determining the
types of meaning making that can successfully take place in a given
society.

Varieties of Meaning
Far from suggesting that local ideologies about language and meaning
might be irrelevant to the actual practice of interaction, several of
the most influential scientific articulations of the modernist ideology
actually entail the conclusion that speech participants who do not
subscribe to a mentalist theory of interaction should not show normal
patterns of interaction. This is because, in these articulations of the
theory, constant and relatively aware monitoring of the interlocutor’s
intentions is a sine qua non of successful interaction.
Grice’s seminal definition of linguistic meaning, for example, runs as
follows, “For some audience A, U intended his utterance of x to produce in
A some effect (response) E, by means of A’s recognition of that intention”
(1989c:122, emphasis added). In other words, for Grice, U (the utterer)
intends A (the audience) to recognize not just the signification of the
gesture U emits, but also U’s very intention to have that signification
(or another one) recognized. Note how this formulation requires that
U rely on A’s willingness to “recognize” (that is, guess at) whatever
it is that U actually intends. For Grice (1989b, 1982), this issue is far
from trivial. If audiences interpret meaning without regard to utterers’
communicative intentions, their interpretation is of the type he called
“natural” and that he exemplified in classic examples like “those spots
mean measles,” or “those clouds mean rain.” However, meaning that
relies on the “by means of” clause is “non-natural,” and it is this type
of meaning that Grice spent his career examining.
In one illuminating passage, Grice (1989b) discusses Searle’s example
of the U.S. soldier during World War II who, when captured by Italians,
attempts to convince them that he is actually a German officer, and
therefore not to be detained. The American speaks in German to the
Italians, who recognize but do not understand that language. Grice
argues that the case in which the Italians conclude that their prisoner
is a German officer from the very fact that he is speaking German is
qualitatively different from that in which they draw the same conclusion
based on their “recognition” of his intention to utter the specific German
sentence “I am a German officer.” The former case, although inferential,
does not require intention recognition. Only the latter, which does
require it, can be considered a case of non-natural meaning. In both
cases, according to Grice, we are free to suppose that the U.S. soldier
might have had certain intentions in speaking (including deceptive
ones), and even that the Italians might form hypotheses as to what
those intentions could be. Only in the second case, however, and not
in the first, do the hypotheses of the audience about the intentions of
the utterer make a difference to the meaning that the audience takes
from the utterance.
Note that both are cases of inference. The latter, in the prototypical
Gricean way, is an inference from what is assumed to be the speaker’s
desire to get his intention recognized. The former is an inference from
the assumption of “natural” (indexical, associational) compliance
with something—indeed with whatever it is that the U.S. soldier in
this example does not comply with—something like “speak your own
language.” The fact that the American can manipulate the expectation of
such compliance is evidence enough that the association in question is
not purely “natural.” But Grice himself tells us that is not what he means
by “non-natural.” Grice here seems to propose that we understand the
conventional as a variety of the “natural,” on the grounds that neither
involves the crucial diagnostic of the “non-natural”—the intention to
get one’s intention (and not just one’s signal) recognized.
In various other areas of his writings, Grice departs from the strict
natural–non-natural dichotomy by giving consideration also to conventional
meaning (Grice 1989c, see discussion below). But the dichotomy,
with its embedded reliance on audience willingness to interpret the
utterer’s mental state, is among Grice’s most influential contributions.
In certain critical passages (Grice 1989a:30–31), Grice insists that all
interaction be amenable to a mental-state calculus. Certainly, the
dichotomy articulates well with other influential paradigms within the
philosophy of language (Searle 1965), as well as with folk versions of
the modernist philosophy.2

Conversational Implicature
Making full use of his notion of non-natural meaning, Grice elaborated
a theory of conversational coherence that has had enormous influence
on academic accounts of interaction ever since. For Grice (1989a:45–47),
conversations cohere because participants assume that all parties adhere
to a set of “conversational maxims” as follows:
Quantity:
a) Make your contribution as informative as is required (for the
current purpose of the exchange)
b) Do not make your contribution more informative than is
required
Quality: Try to make your contribution one that is true
a) Do not say what you believe to be false
b) Do not say that for which you lack adequate evidence
Relation: Be relevant
Manner: Be perspicuous
a) Avoid obscurity of expression
b) Avoid ambiguity
c) Be brief (avoid unnecessary prolixity)
d) Be orderly
All parties adhere to these maxims under an umbrella rule referred
to as the cooperative principle. Apparent non sequiturs and nonsenses
in talk are resolved, under what I call the “strong view” of Gricean
inference, through intention guessing or mental-state simulation of
the interlocutor. So, for example, on hearing the exchange:
A: I am out of petrol
B: There is a garage around the corner [Grice 1989a:51]

Speaker A ruminates: “B’s remark Y seems irrelevant to my own prior
remark X. But it can’t be [by maxim of relevance]. So what could B have
had in mind that he or she wants me to see would make Y in fact relevant
to X?” Answer: “B thinks, or thinks it possible, that the garage is open,
and has petrol to sell.” (Grice 1989a:51, emphasis added).
To the extent that interpretation of such an exchange relies on quasi-
conscious attempts on the part of the audience to construct hypotheses
about the intentions underlying the utterer’s remarks (in sequences
like “B thinks, or thinks it possible...”), we should conclude that such
exchanges would be uninterpretable under philosophies like that of the
Mopan, in which the utterer’s intentions are considered irrelevant to
utterance interpretation. Exactly because the strong version of Gricean
implicature requires relatively aware strategies of intention-seeking on
the part of the audience, the theory does not allow us to dismiss as false
consciousness any explicit cultural belief system that would consider
such strategies to be inappropriate or irrelevant. (The possibility that
the Gricean procedure is taking place, but at a level inaccessible to
consciousness, will be considered later.)
Some of the more radical predictions of such a cross-cultural state
of affairs include the occurrence of strictly literal and very wooden-
seeming exchanges among individuals in some societies. Such wooden
exchanges are not reported from Mopan or from any other group
of people where a philosophical disinterest in others’ mental states
is found.3 Exchanges in Mopan do indeed seem to make use of the
assumption that utterances comply with some version of relevance,
manner, quality, and quantity. When it is demonstrated that utterances
do not comply, there is also a notion of violation, often explicitly
articulated (Danziger 2005). The failure of these most radical predictions
amounts to a demonstration that Gricean “non-natural” meaning in
its strongest form (that of aware and reflective deployment in everyday
conversation) is not the mechanism by which humans actually conduct
most of their linguistic interactions.

“Natural” Implicature
This conclusion converges with that strand of reasoning within
psycholinguistics and linguistic pragmatics that has recently come to doubt
that the strong Gricean model actually accounts for most cases of
interaction, even in modern societies (Carston 2005). Scientific doubt
as to the validity of the view that intentions are conscious (or at least
readily accessible to consciousness) in the conduct of interaction has
arisen on the one hand from genuine philosophical skepticism as to
the plausibility of such an assumption (Bar-Om in press; Green 2003,
Sperber and Wilson 2002; see also Enfield this volume on the “no
telepathy” assumption, and Schegloff this volume), and on the other
hand from a series of experimental psycholinguistic results that do
not accord with the predictions of the view. When experimenters
carefully separate what utterers know about a state of affairs from
what utterers know that their audience knows (e.g., by placing certain
objects within sight of the utterer but ostentatiously out of sight of
the audience), utterers nevertheless use their egocentric knowledge,
apparently unconsciously, even when designing good-faith utterances
to the audience (see Barr and Keysar 2005 for recent review). But these
sorts of carefully controlled experimental situations do not represent
the norm for human interaction. In most situations, utterers who
use egocentric information in designing their utterances are on safe
interactional ground, because the same information is likely also to be
available to the audience (see also Clark, this volume, on the notion
of common ground).
From this type of perspective certain psycholinguists and philosophers
of language have revisited Grice’s notion of conversational implicature,
and find that normal conversational utterances might best be seen as
other than “non-natural” indicators of the fact that the conversational
maxims hold for a given conversational case (cf. Keller 1998; Sperber
and Wilson 1986). Keller (1998), for example, explains that because
the conversational maxims “must” be obeyed, the occurrence of any
remark B is for the audience a “symptom” of the compliance of this
remark with relevance and the rest. If taken this way, then maxim–
compliant–compatible interpretations of utterances do not require
recourse to guesses about what the speaker must have in mind, but
only to inferences about what is “naturally” entailed by the remark in
its symptomic relation to maxim compliance. (Speaker X ruminates
“Y’s remark B seems irrelevant to my own prior remark A. But it can’t
be [by maxim of relevance]. So what could be the case that would make
B in fact relevant to A?” Answer (from X’s own knowledge): A garage
will have petrol to sell.)4
Under this analysis, a great deal of everyday linguistic interaction
could survive the loss of fully intention-laden philosophies of meaning
among its users. The making of any utterance could be taken
ideologically as a “natural” symptom rather than as a non-natural and
reflexively intended statement that the utterer is in compliance with
the cooperative principle. Many forms of linguistic framing or layering
(Clark 1996; Goffman 1974) such as narrations, re-presentations and
quotations would also be possible under such a view, as long as the
fact of layering was made explicit through evidentials, quotatives, and
so on.
The strength of this kind of analysis lies in pointing out that the
relationship of linguistic utterances to states of psychological commitment
is not necessarily conscious or intentional. Inference and implicature
could take place without “non-natural” meaning. But even when
unconsciously produced, linguistic and other semiotic expressions are
not true associative symptoms or natural expressions of their meanings,
in the sense that spots are of measles and clouds are of rain. If they were,
then linguistic and social deception would be impossible. Violation of
Grice’s maxims should rarely if ever occur; most people should tell the
truth, and interlocutors should rarely mislead or deceive one another.
This is certainly not the case, either among the Mopan or in any other
society of which we have records. Not only is verbal deception a fact
among the Mopan, for example, but as we have seen, the suspicion of
it is widespread. (This is hardly surprising, recall, because any utterance
that turns out to be literally false will be categorized as blameworthy,
and few exonerating circumstances (“I didn’t mean it!”) exist.)

Convention and Conversation


I have argued that it is exactly within a philosophy of language based on
non-natural meaning that local belief systems about language and mind
should make the most difference to the actual conduct of interaction.
The prediction from within the modernist theory is that interactants
who do not constantly calculate one another’s mental states should
be unable to interact in the way we consider characteristic of normal
humanity. But actual interaction is not variable in this way across
cultures. We must conclude that conscious consideration of others’
mental states is not a necessary part of human interaction. However,
an analysis based strictly on “natural” meaning also does not account
for the facts of interaction across cultures.
Let me return now to the possibility that conversation proceeds
everywhere as Grice proposed, but that audience interpretation of the
utterer’s mental state takes place below the threshold of consciousness
for both parties (I call this the “weak” Gricean view). If the process takes
place out of consciousness, then presumably there is no room for local
belief systems to come into play. This view reduces Gricean inference
to equivalence with the operation of conventional meaning.
Several theorists (Burling 1999; Haiman 1998; Wilcox 1999) deal with
the possible origins of conventional meaning by proposing that signs
that were originally fully “natural” (signifiers necessarily associated
with their signifieds) were freed from this necessity and deliberately—
that is, “non-naturally”—staged, possibly for deceptive purposes.
Conventionality arose when ritualized repetition of such stagings
led to habituation: precisely to the loss of conscious psychological
awareness in using them to convey a given meaning. Certainly today,
conventionalized semiosis relies for much of its effectiveness precisely
on the suppression of conscious intention in its use.5 Until conventional
meaning had been established—at the hypothetical moment when all
meanings were either necessary associations (“natural”) or deliberate
stagings (“non-natural”)—it could not yet be said that true human
language had developed. If we seek candidates, then, for interactional
devices that all humans share and that are not widely used by other
animals, then the deployment of convention will be one that we will
be hard-pressed to ignore.
The signifying power of the conventional sign is not motivated by
any “natural” resemblance to or necessary association with its referent.
Neither does it reside in nonce efforts to guess at the other’s mental
state. Instead, it lies exclusively in the fact that there has taken place a
historical convergence on a particular form–meaning pairing in some
community of speakers. This kind of inference from “prior arrangement”
is how a very great deal of human communication actually works (see
Clark, Enfield, Levinson this volume).6
Once conventions of meaning are in place, mutual and conscious
intention guessing is not necessary for interlocutors to decode one
another’s utterances. Non-natural meaning (intention guessing) need
come into play only when the signal in context is insufficient to yield
inferences about what is signified by other means. If the signal is indexical
or iconic, there may be sufficient nonmentalist grounds for inference
about meaning even if it is not conventionalized (the woman who
reaches for the betel basket in Enfield’s example in this volume). But if
the signal is arbitrary, then either convention or intention guessing will
be needed. And if convention is present, intention guessing is moot.
Only if convention is absent and the signal is arbitrary with respect
both to its context and its form, does intention guessing usefully come
into its own as a communicative tool. The type of case in question
now reduces, for example, to the relatively rare case of a person who
has lost access to most of the semiotic repertoire—perhaps an eyebrow
flicker occurs. His friends and family confer: “Was that a flicker? What
do you think he means? Maybe he’s thirsty?” and so on (see Goodwin
this volume). But these conjectures amount precisely to “expansions”
(Ochs 1982; see also Gaskins this volume). And these are the kinds of
interactional exchanges that we know to be cross-culturally limited
in their occurrence. Willingness to use intention guessing (pure non-natural
meaning) is clearly culturally constrained. This can be the case
because actual occasions on which no other semiotic option is available
are in practice relatively limited in their occurrence.
A “weak” version of the Gricean proposal (that conversation proceeds
through unconscious assessment of the interlocutor’s mental
state) is analytically indistinguishable from an analysis in terms of
conventional meaning. Although Grice himself (1989a) limited the
use of “conventional meaning” in his discussions of implicature to
something like “literal meaning,” I am suggesting here and below
that we consider the very fact that interactants comply with the
Gricean maxims as a matter of convention. It is neither “natural” and
automatic—hence, the kinds of cultural variability described by Keenan
(1976) and discussed by Goffman (1983)—nor “non-natural” and fully
conscious. The operation of conventional meaning is equally well (or,
rather, equally poorly) philosophized by culturally local ideologies like
that of the Mopan, who hold that communication does not rely at all
on calculation of mental states, and by those—like that of Grice and
other moderns—who hold that that is all it relies on.

Flouts and Figures


I have reached a position from which I can claim the similarity of
conversational processes among all of humankind, despite the documented
existence of quite divergent cultural philosophies about the
matter. The phenomena of signal motivation, common ground (Clark
this volume, Enfield this volume) and conventional meaning go far
to explain the operation of inferential conversational processes in all
societies, including our own. Before concluding, however, allow me to
pursue the point that there exist certain linguistic and conversational
phenomena that could logically only occur under a cultural theory
that sees linguistic meaning as intention laden—in Gricean terms
“non-natural.” These are the Gricean flouts and their Machiavellian
cousins, the hostile second guessing known as “reverse psychology.” We
should predict that such interactional forms do not occur where cultural
philosophies of language are unconcerned with the “recognizing” of
intentions.
Flouts occur when maxim violation under conditions of mutual
knowledge is deployed by an utterer for the purpose of stimulating in
the audience the making of guesses about what the utterer intends the
utterance to “mean” that would not arise under other mutual mental-state
circumstances. Consider, for example, the case in which U speaks falsehood,
but not only does A know that U’s utterance is false, A also knows
that U knows that A knows (“Juliet is the sun”). In such a case the
audience reasons from the mutual knowledge circumstances that the
utterer cannot intend to deceive, even though the utterance is false.
The audience now attempts to consider what U “may have had in
mind” in making the utterance, and ends by adopting a metaphorical
interpretation of U’s remark (Begres 1992; Grice 1989a; for a different
cultural case see Mitchell-Kernan 1972). Pragmaticists use the notion
of the flout to explain the occurrence of certain forms of verbal art—
including metaphor, fiction, and the kinds of dry wit at which Oxford
philosophers so excel—which rely on violation of the Gricean maxims
for their operation.
In many cases of routinized figurative speech, it has now been experimentally
demonstrated that literal meaning is not computed before
figurative meaning (Gibbs 1994). But where flouts involve novel usages
rather than routinized ones, the strong Gricean process of conscious
reflection about the utterer’s mental state may indeed be in play (Giora
2003). To use these kinds of novel figurative forms, and especially
where delivery is made “deadpan,” without benefit of contextual or
paralinguistic cues as to the nonliteral nature of the remark, utterers
must rely on audiences’ willingness to consciously or semiconsciously
seek out the utterer’s probable mental state in making the utterance.
Audiences from other than mentalist philosophical traditions will not
in principle be willing to do this. We should predict then that these
forms of figurative meaning will be absent among those who in general
hold other than non-natural theories of linguistic meaning.7

Mopan Flouts
If culturally variable ideas about intentionality make any difference
at all in interaction, then, they should do so in the arena of the flout.
Even among the ethnographers of speaking (Hymes 1974), systematic
investigation of cultural distribution of figurative language forms in
relation to cultural philosophies of language has never really been
undertaken. But if local theories affect interaction, we would expect
to find little elaboration of novel metaphor and other literary genres
based on falsehood in societies where nonmentalist cultural theories of
language and communication prevail. If candidates for such phenomena
occur, I predict that they will not be understood as figurative but as
maxim violations.8
Although most of the research into cultural distribution of figurative
language types remains to be done, the prediction so far has some
plausibility as regards the Mopan (Danziger 2001, n.d.). Mopan people
value the performance of poetic and musical texts, and they regularly
enjoy festive displays of sensation and spectacle. Mopan Catholicism
dictates the celebration of annual feast days on which plaster and
ceramic statues of Christian saints are treated to household visits, ritual
baths, or public parades. Traditional Mopan theatrical dances oblige
performers to adopt the personae of deer, jaguar, and monkey. But
Mopan do not talk about these things as cases of metaphoric or symbolic
representation. On the contrary, to those involved, the statues really
are saints; the dancers really are deer. Supernatural consequences attend
on this fact. Specific rituals must be carried out, for example, to placate
the masks used in traditional dances, as a way of averting the danger
of death that is otherwise believed to accompany their use.
There is no institutionalized genre of fiction in the traditional Mopan
verbal repertoire. Mopan verbal humor, for example, centers on
puns (often indecent ones) rather than on improbable narratives.
And we have already seen that the stories that are narrated by Mopan
tellers are all believed to be strictly true (or taken as dangerous lies).
Where fiction has been introduced into the area, it has been taken as
truth and then—once its literal falsehood is revealed—as “lies” (Danziger
n.d.).

Conclusion
If in the future the Mopan (or other ethnographic) data continue to fit
the predictions, it would not mean that there is no universal human
interaction engine. But it could mean that there are fundamentally
different means of functioning within the “engine.”
Where reliance on the modernist cultural ideology in academic
philosophical theories has been overly heavy, we have been scientifically
and not just culturally wrong about two things: (1) how the universal
layers of the human interactional engine work [they are based on
sign motivation, common ground, and conventional meaning, not
on constant calculation of speaker’s nonce mental state] and (2) the
idea that we could consult our own intuitions for reliable information
about what the separable components of the engine are, and about
which pieces of the engine constitute the “basic model” and which
are optional extras.
Although certain parts of the Gricean system (basic assumption of
compliance with the maxims) operate the same way—namely, via
conventional meaning—in both Mopan and moderns, other parts of
“Gricean inference” may not be universal (mutual intention-guessing
and consequent byproducts in regimes of responsibility and in figurative
language preferences). In particular, those areas of Western linguistic
interaction in which the operation of Grice’s non-natural meaning is
most clearly to be observed—those involved in the verbal artistry of
pragmatic flout and Machiavellian reverse psychology—may constitute
specialized cultural traditions of verbal art that should not be allowed
too closely to inform either our investigations into questions of language
origins or our general theorizing about how language relates to mind
at the species level.

Acknowledgments
None of the research would have been possible without the assistance
and hospitality of the Mopan Maya people of the Toledo District,
Belize. Collection of Mopan data was supported at different times by
the Wenner-Gren Foundation for Anthropological Research (Grant
#4850), the Social Sciences and Humanities Research Council of Canada
(Award #452–87–1337), the Cognitive Anthropology Research Group
of the Max-Planck-Institute for Psycholinguistics, and the University of
Virginia. The Department of Archaeology, Belmopan, Belize, provided
help and support during various fieldwork periods. I would also like to
thank Herb Clark, Suzanne Gaskins, Stephen Levinson, and John Lucy
for helpful comments on earlier presentations of this material.
Notes
1. To the anthropological litany can be added a roster of well-known accounts
from European history: The archaic Greeks had different views from our own
about the source of inspiration and the locus of responsibility for human
actions (Friedrich 1977; Snell 1953). The notion of individual subjectivity
developed in the late Middle Ages under the influence of the confessional,
increasing social mobility and Protestantism (Foucault 1978; Morris 1972;
Trilling 1974). Literacy and print allowed messages to remain fixed, and
therefore gave rise to the philosophical distinction between “objective” and
“subjective” for the first time (Olson 1991). And so on.
2. Keller (1998) gives an extended treatment of the general tendency in
Western philosophy to dichotomize in terms of “natural” and “intentional,”
without considering as a separate class the nondeliberate human products.
3. Although demonstrated cross-cultural difference in reliance on formal
marking of the nature of one’s evidence for assertions is perhaps a relevant
phenomenon (see Chafe and Nichols 1986; Irvine and Hill 1992).
4. We can go even further down the mentalist road without arriving at fully
“non-natural” meaning as articulated by Grice. Idealized Mopan Speaker X
could ruminate “Y’s remark B seems irrelevant to my own prior remark A.
But it can’t be [by Maxim of Relevance]. So what could Y (falsely) believe that
would make B in fact seem relevant to A?” What is necessarily missing is the
crucial ingredient of reflexive communicative intention: the belief on the part
of X that Y intended X not only to understand the words and other signals
produced, but also, separately, to understand Y’s intentions in producing them
(see Levinson this volume).
5. Consider the fact that we have all mastered the complex phonological
and phonotactic systems of at least one human language. Such knowledge
is undoubtedly acquired, therefore conventional and not strictly “natural.”
Yet we rarely pause to contemplate whether our next utterance should begin
with an aspirated or an unaspirated consonant. If we did pause to consider
such things with any frequency, the system would cease to operate. This
property of ready habituation means that the other-than-natural provenance
of conventional signifiers is easily obscured in the intuitions of their users.
This fact underlies another candidate human universal: the phenomenon
of ethnocentrism, in which people everywhere believe that their culturally
particular ways of doing things are the only possible ones.
6. Recall from the U.S. soldier discussion that non-natural signals are not
the only kind that can trigger inferences in an audience, although only non-natural
signals trigger inferences about what the utterer intends, rather than
about what the signal itself suggests.
7. There are also relevant predictions for child development from the
suggestion that mentalism in interaction is a cultural speciality. Namely,
even in mentalist societies, children’s acquisition of flout-reliant figures
should be late and may have to be formally taught. The conduct of normal
conversational interaction should on the other hand be much earlier, and
should be independent of children’s mastery of the flout.
8. My observations have centered exclusively on issues in Mopan language
philosophy that relate to the Gricean maxim of quality (“Try to make your
contribution one which is true”). There is reason to believe that quality
has a special status among the maxims (Grice 1989a), but should explicit
local philosophies be found relating to the other maxims, I predict (among
other possibilities) one or all of: quantity—decreased appreciation and use
of hyperbole or litotes; relevance—no proverbs; and manner—absence of
mockery, sarcasm, or irony. Questions of “keying” (Hymes 1974) and framing
(Goffman 1974) will also be crucial.

References
Bar-On, D. in press. Speaking my mind. New York: Oxford University
Press.
Barr, D. J. and B. Keysar. 2005. Making sense of how we make sense:
The paradox of egocentrism in language use. In Figurative language
comprehension: Social and cultural influences, edited by H. L. Colston
and A. N. Katz, 21–42. Mahwah, NJ: Erlbaum.
Basso, K. 1970. To give up on words: Silence in Western Apache culture.
Southwestern Journal of Anthropology 26:213–230.
Begres, S. J. 1992. Metaphor and constancy of meaning. Grazer
Philosophische Studien 43:143–161.
Brice Heath, S. 1982. What no bedtime story means: Narrative skills at
home and at school. Language in Society 11(1):49–76.
Burling, R. 1999. Motivation, conventionalization, and arbitrariness in
the origin of language. In The origins of language: What non-human
primates can tell us, edited by B. J. King, 307–350. Santa Fe: School
of American Research Press.
Carston, R. 2005. Pragmatic inference—Reflective or reflexive? Keynote
address of the 9th International Pragmatics Conference, Riva del
Garda, Italy, July 10–15.
Chafe, W., and J. Nichols (eds). 1986. Evidentiality: The linguistic coding
of epistemology. Advances in Discourse Processes Series, vol. 20, vii–xi.
Norwood, NJ: Ablex Publishing.
Clark, H. H. 1996. Using language. Cambridge: Cambridge University
Press.
Coleman, L., and P. Kay. 1981. Prototype semantics: The English verb
“lie.” Language 57(1):26–44.
Danziger, E. 1996a. Parts and their counter-parts: Social and spatial
relationships in Mopan Maya. Journal of the Royal Anthropological
Institute, incorporating MAN 2(1):67–82.
——. 1996b. Split intransitivity and active-inactive patterning in Mopan
Maya. International Journal of American Linguistics 62(4):379–414.
——. 2001. Relatively speaking: Language, thought and kinship in Mopan
Maya. Oxford Studies in Anthropological Linguistics. New York:
Oxford University Press.
——. 2002. Making up our minds: Metaphor and intentionality from
a Mopan Maya perspective. Paper presented at the “Evaluation and
Personhood” session of the 101st Annual Meeting of the American
Anthropological Association, New Orleans, November 20–24.
——. 2005. Reflexive communicative intention in cross-cultural context:
The fate of the flout. Paper presented at the 9th International
Conference of the International Pragmatics Association (Pragmatics and
Philosophy). Riva del Garda, Italy, July 15–20.
——. n.d. To play a speaking part: Some linguistic preconditions for
fiction.
DuBois, J. 1987. Meaning without intention. Papers in Pragmatics
1(2):80–122.
Duranti, A. 1992. Intentions, self and responsibility: An essay in Samoan
ethnopragmatics. In Responsibility and Evidence in Oral Discourse, edited
by J. Hill and J. Irvine, 24–47. Cambridge: Cambridge University
Press.
Evans-Pritchard, E. E. 1976. Witchcraft, oracles and magic among the
Azande, abridged edition. Oxford: Clarendon Press.
Foucault, M. 1978. The history of sexuality, vol. 1. New York: Pantheon
Books.
——. 1980. Power/knowledge: Selected interviews and other writings, 1972–1977,
translated by Colin Gordon, Leo Marshall, John Mepham, and
Kate Soper; edited by Colin Gordon. New York: Pantheon Books.
Friedrich, P. 1977. Sanity and the myth of honor: The problem of
Achilles. Ethos 5(3):281–305.
Gibbs, R. W., Jr. 1994. The poetics of mind: Figurative thought, language
and understanding. New York: Cambridge University Press.
Giora, R. 2003. On our mind: Science, context and figurative language. New
York: Oxford University Press.
Goffman, E. 1974. Frame analysis: An essay on the organization of experience.
Cambridge, MA: Harvard University Press.
——. 1983. Felicity’s condition. American Journal of Sociology 89(1):1–53.
Green, M. S. 2003. Grice’s frown: On meaning and expression. In Saying,
meaning, implicating, edited by G. Meggle and C. Plunze, 200–219.
Leipzig: University of Leipzig Press.
Gregory, J. R. 1975. Image of limited good, or expectation of
reciprocity? Current Anthropology 16(1):73–92.
——. 1984. The Mopan: Culture and ethnicity in a changing Belizean
community. University of Missouri Monographs in Anthropology,
no. 7. Columbia: Museum of Anthropology, University of Missouri,
Columbia.
Grice, H. P. 1989a. Logic and conversation. Studies in the Way of Words,
22–40. Cambridge, MA: Harvard University Press.
——. 1989b. Meaning. Studies in the Way of Words, 213–223. Cambridge,
MA: Harvard University Press.
——. 1989c. Utterer’s meaning, sentence-meaning, and word-meaning.
Studies in the Way of Words, 117–137. Cambridge, MA: Harvard
University Press.
——. 1982. Meaning Revisited. Mutual Knowledge, edited by N. V. Smith,
223–243. New York: Academic Press.
Haiman, J. 1998. Talk is cheap: Sarcasm, alienation and the evolution of
language. New York: Oxford University Press.
Hymes, D. 1974. Toward ethnographies of communication. Foundations in
sociolinguistics: An ethnographic approach. Philadelphia, PA: University
of Pennsylvania Press.
Irvine, J., and J. Hill (eds). 1992. Responsibility and evidence in oral
discourse. Cambridge: Cambridge University Press.
Keenan, E. 1976. The universality of conversational postulates. Language
in Society 5(1):67–80.
Keller, R. 1998. A theory of linguistic signs. Oxford: Oxford University
Press.
Lakoff, G., and M. Johnson. 1980. Metaphors we live by. Chicago:
University of Chicago Press.
Mitchell-Kernan, C. 1972. Signifying and marking: Two Afro-American
speech acts. In Directions in sociolinguistics: The ethnography of
communication, edited by J. J. Gumperz and D. Hymes, 161–179.
New York: Holt, Rinehart and Winston.
Morris, C. 1972. The discovery of the individual 1050–1200. New York:
Harper and Row.
Nuyts, J. 1994. The intentional and the socio-cultural in language use.
Pragmatics and Cognition 2(2):237–268.
Ochs, E. 1982. Talking to children in Western Samoa. Language in Society
11(1):77–104.
Olson, D. R. 1991. Literacy and objectivity: The rise of modern science.
In Literacy and orality, edited by D. R. Olson and N. Torrance, 149–164.
Cambridge: Cambridge University Press.
Reddy, M. J. 1993[1979]. The conduit metaphor: A case of frame conflict
in our language about language. In Metaphor and thought, edited by
Andrew Ortony, 137–163. Cambridge: Cambridge University Press.
Robbins, J. 2001. God is nothing but talk: Modernity, language and
prayer in a New Guinea society. American Anthropologist 103(4):901–912.
Rosaldo, M. Z. 1982. The things we do with words: Ilongot speech acts
and speech act theory in philosophy. Language in Society 11:203–237.
Rosen, L. 1995. Other intentions: Cultural contexts and the attribution of
inner states. Santa Fe: School of American Research Press.
Searle, J. 1965. What is a speech act? In Philosophy in America, edited by
M. Black, 221–239. Ithaca, NY: Cornell University Press.
Snell, B. 1953. The discovery of the mind: The Greek origins of European
thought. Oxford: Blackwell.
Sperber, D., and D. Wilson. 1986. Relevance: Communication and cognition.
Oxford: Blackwell.
——. 2002. Pragmatics, modularity and mind-reading. Mind and Language
17(1–2):3–23.
Sweetser, E. 1987. The definition of “lie”: An examination of the folk
model underlying a semantic prototype. In Cultural models in language
and thought, edited by D. Holland and N. Quinn, 43–66. New York:
Cambridge University Press.
Thompson, J. E. S. 1930. Ethnology of the Mayas of southern and central
British Honduras, Field Museum of Natural History Anthropological Series
17, no. 2. Chicago: Field Museum of Natural History.
Trilling, L. 1974. Sincerity and authenticity. London: Oxford University
Press.
Warren, K. B. 1995. Each mind is a world. In Other intentions: Cultural
contexts and the attribution of inner states, edited by L. Rosen, 47–68.
Santa Fe: School of American Research Press.
Wilcox, S. 1999. The invention and ritualization of language. The origins
of language: What non-human primates can tell us, edited by B. J. King,
351–384. Santa Fe: School of American Research Press.
ten

Cultural Perspectives on Infant–Caregiver Interaction
Suzanne Gaskins

Virtually all theories of human social development presume universal
outcomes. They aim to explain the development of the human
species as a whole, not the different trajectories of particular subgroups.
When different group outcomes are encountered (e.g., by gender,
socioeconomic status, or culture), they are viewed as superficial to
the underlying, shared developmental trajectory. Yet most researchers
also presume that one of the most significant characteristics of our
species is its long period of dependency during which many skills and
understandings necessary for successful adult functioning must be
learned (Bruner 1972). Much of this learning is informal in nature,
taking place during children’s everyday interaction with others. Finally,
it is also presumed (at least tacitly) that during this period of dependency
the content to be learned and the everyday environment of that
learning are both culturally organized and culturally variable.
None of these three presumptions is particularly controversial on
its own, but they stand in an uneasy relationship to one another. The
idea of universal yet experientially influenced developmental outcomes
becomes suspect once the range of cultural variation in experience during
childhood is acknowledged. A viable argument about development
can only be made if at least one of the three presumptions (universal
developmental outcomes, influence of experience on development,
or cultural variation in experience) is discarded. The three viable
solutions become that (1) development is dependent on experience
but relevant experiences are universally present in all cultures (i.e.,
cultural variation is insignificant); (2) universal development is largely
genetically determined, and therefore experience, including culturally
structured experience, is irrelevant (i.e., experience is insignificant);
or (3) learning-dependent development is variable in its process and
outcomes (i.e., outcomes are not universal).
Much of the research on early social interaction has adopted the first
solution, claiming that universal developmental outcomes arise from
children’s common interaction with social interlocutors (e.g., Bruner
1983; Gergely and Csibra this volume; Tomasello 1999; Trevarthen
1987). The particulars of interactions in Euro-American homes are
typically used to build the common model of social interaction, and
its characteristics are then drawn on to make specific arguments about
universal mechanisms of social development. This chapter reviews
the available evidence that this Euro-American–derived model of
social interaction is not universally applicable. On the basis of this
review, I conclude that this first solution is not viable: a model of
social interaction based solely on Euro-American interaction norms
cannot provide an adequate foundation for understanding human
infant social interaction. In the conclusion, the viability of the two
remaining solutions is discussed.

Cultural Variation in Infant Social Interaction


In Euro-American middle-class culture, infants are most often raised in
nuclear families with few siblings and with primary care being given by
the mother (or shared serially by mother and hired caregiver). Caregivers
and infants are often alone together and interact frequently just for the
pleasure of the interaction. Caregivers look and talk to their infants often,
holding the infant away from their bodies to allow them to look at the
infant face to face. They consider even very young infants as legitimate
social partners, often interpreting any behavior or sound from them as
meaningful. During the first year they play simple, turn-taking social
games like “peek-a-boo.” Caregivers call infants’ attention to objects
in the environment (or in books) by pointing to and naming them.
They also play with objects with the infant, simultaneously talking
about the activity. Caregivers make many social accommodations during
these interactions, including changing their speech to “motherese”
(baby talk). As the children begin to talk, caregivers work to understand
the sometimes unclear or incomplete utterances by interpreting the
children’s intentions, asking for clarification, and expanding on what
the infant has said. The overwhelming pattern is one of adult focus on
and accommodation to the infant.
In many other cultures, this pattern of social interaction does not
occur. It is perhaps unique in the world in the degree of engagement
between mother and infant. For any culture, the specific ways infants
are treated by their caregivers, and expected to respond, cannot be
understood without also understanding what deeper cultural factors are
organizing their everyday behavior. As shown below, where different
cultural factors are at work, such as the structure of the everyday life
world, the specific cultural beliefs about young children, and the general
social organization of the culture (Gaskins 1999), there will be different
patterns of interaction. The evidence provided from other cultures
demonstrates that the patterns of social interaction with infants that
have been presumed to be universal are quite culturally variable; they
are not appropriate candidates for universal learning mechanisms for
social development.

Patterns of Infant Social Engagement


Although infants may be biologically prepared at birth (or before) to
attend to and learn about social engagement, the particular patterns they
will be learning to express are culturally specific and quite varied. Infants
begin to learn the appropriate social rules through their coparticipation
in daily activities. There are three dimensions of engagement that infants
have to learn: expressing inner experience, influencing or responding
to another person’s behavior, and learning about and communicating
information about the world. Examples of culturally different patterns
for each of these dimensions given below demonstrate the dramatic
range of infant social experience.1

Expressing Inner Experience


The first dimension of engagement is expressing positive and negative
inner experience. Cultures differ in their expectations about when
expression of inner experience is acceptable for infants and when it
is not. They also differ in who is listening and in whether and how
someone responds. Euro-American newborns are allowed to express
inner experience at any time, and they usually receive a response that
acknowledges and interprets the experience and attempts to effect an
appropriate change in the environment. If a newborn cries, for example,
the caregiver typically responds to that cry, usually picking up the
baby, looking at the infant, vocalizing, and trying to fix whatever is
wrong. If a young infant smiles, the caregiver typically returns the smile
and verbalizes to the infant. These responses change subtly before the
infant reaches six months. For example, although holding is the most
frequent response for crying at four months, looking and talking are
more common by seven months (LeVine et al. 1994).
Yucatec Mayan infants experience some of these responses, but there
are also important differences. Yucatec Mayan caregivers are more
likely to respond consistently to negative expressions than to positive
ones (Gaskins 1990). The response is likely to involve holding but not
verbalization. Caregivers are particularly good at understanding the
source of the negative expression and therefore can quickly fix whatever
is wrong. They hold their young infants most of the time (90% of the
time at three months) and pay close attention to them. Their adeptness
at reading body cues means that often infants’ problems are addressed
before they escalate into overt negative expression of experience. This
is an intentional cultural strategy to shape the infant’s expression of
inner experience to produce a quiet baby, with little outward expression
of inner experience (either positive or negative). Yucatec Maya infants
at three months cry about half as often as Boston infants (Richman et
al. 1988). Colic is unknown; a cry that cannot be stopped is taken as
a sign of illness and is of great concern. Yucatec Mayan infants also
hardly ever laugh out loud. These patterns of caregivers’ responses are
more or less consistent during the infants’ first two years.
The Gusii of East Africa show a pattern of expression and response
that is similar to the Yucatec Maya. They also respond more to negative
expressions than to positive ones. LeVine (LeVine et al. 1994) has called
this the “squeaky wheel” response and argues that this is an adaptive
rule where there is a lot of infant illness and death. In the first year, if
there is a verbal response at all, it is likely to be a soothing one. Soothing
responses drop sharply at 12 months. Throughout the first year, there
is also constant body contact (100% at three months and 90% at ten
months), and Gusii caregivers respond consistently to cries, day or
night. Like Yucatec Maya infants, Gusii infants cry about half as often as
Boston infants (Richman et al. 1988). Unlike the Maya, Gusii mothers
do not often look at their infants, in line with a cultural norm of gaze
avoidance (see below).
Responsiveness need not involve much effort to understand or respond
to the infant’s intent. Although Italian mothers were very responsive
to ten-month-old infants’ negative and positive expressions of inner
experience (New 1988), many of these responses did not actually comply
with the infants’ expressed desire or need. Through the subordination of
the Italian infants’ needs to the ongoing families’ routines, the infants
learn that although included in the ongoing events, they are not always
the most important participants. Similarly, the rural African American
caregivers of Trackton feel that they know better than the infants what
the infants need, so caregiving reflects the caregivers’ goals rather than
contingent responses to what the infants are saying (Heath 1983).
Where infant intent is important, cultures nonetheless differ in
their approaches. Yucatec Maya caregivers seek to understand and
respond to their infants’ intent to express negative experience, but
they base their response on direct evidence from the infants themselves.
In contrast, Euro-American caregivers have conversations with their
preverbal infants in which they carry on both sides of the conversation,
imagining what the infants would “want” to say. They are also willing
to take incomplete or unclear utterances by young children and try
to figure out what their intent is. These routines put the burden of
understanding (and the freedom to invent an understanding) on the
adult. In other cultures, such as the Kaluli (Schieffelin 1990) and Samoan
(Ochs 1988) cultures, infant intent is relevant only when the infant
can make that intent clear, because it is culturally inappropriate to
make an assertion about another person’s intent. Infant expressions and
early verbal utterances with unclear intent are not “translated” because
caregivers in these cultures are unwilling to guess or interpret what a
child is trying to communicate. Rather, the children are encouraged
to make their communications clearer, or they are just ignored (Ochs
and Schieffelin 1984).
Thus, in this first domain of expressing internal experience, we see that
infants by the end of their first year learn a set of culturally motivated
patterns about how and when to express their inner experience and
what kind of response to expect. Infants in different cultures learn a
different set of patterns, many of which are clearly in place by three
months. Over the first year, these patterns come to guide their behavior
in social interactions that communicate their internal feelings to another
person.

Influencing Another Person


The second dimension of engagement is influencing another person. The
Euro-American interaction rules transmitted to infants are quite complex.
First, the most common actions from either infant or caregiver are verbal
or otherwise distal, rather than through body contact. Infants come to
expect that praise, encouragement, and leading questions will be used to convince them
to do something. Infants’ and caregivers’ roles are reciprocal—that is,
infants are seen as having a legitimate right to influence the caregiver
and vice versa. Under these circumstances, infants learn to pay attention
to the feedback provided by caregiver facial and verbal responses to their
behavior. Although caregivers have a high motivation to get infants
to conform to culturally valued behaviors, they also respect infants’
individual preferences. During some interactions, infants’ individual
preferences are given priority; during others, the caregivers’ desire
for cultural conformity takes priority. Finally, infants learn that they
have varying degrees of influence depending on the circumstances.
During playtime a caregiver may give the infant undivided attention
and primary control over what happens; during feeding the caregiver
will assume some control but be willing to negotiate; and during a car
ride, the caregiver will assert primary control, disregarding the infant’s
feelings about riding in a car seat. Because these rules are complex,
because the caregiver is often ambivalent about ignoring the infant,
and because the infant is allowed, and even encouraged, to negotiate,
these early interactions can be very difficult and emotionally draining
for both infant and caregiver.
Yucatec Maya infants have a very different set of rules to learn
about influencing other people. First, infants have the cultural right
to influence only those events that are directly related to the infant’s
own internal experience and well-being. The infant may always ask for
food, warmth, sleep, comfort, or physical closeness, and caretakers will
provide these things quickly and cheerfully, even when inconvenient.
The most minimal gesture may be effective (a reach, a single sound, or
even a glance), because the infant is being carefully monitored at all
times. Because there is a general acceptance of each infant’s individual
nature as being “just his (or her) way of being,” caregivers rarely try to
influence their infants in this domain. When care of the infant is at
odds with other responsibilities, caregivers may try to end an interaction
by distracting the infant, but if the infant objects, the caregiving will
continue until the infant is satisfied. The infant is irrelevant as an actor
in most other household events (even when included as an observer),
and the infant therefore neither attempts to interact nor is actively
engaged.
This pattern leads to minimal negotiation between infant and caregiver.
Infants quickly learn that the world is cleanly divided into events they
have primary influence over and events they have no influence over.
Many attempts to influence the action of another person, therefore, are
both successful and free of negotiation and emotional expression. The
motivation to pay close attention to caregiver responses is minimized,
because the only relevant condition is that the infant is expressing a
personal need or desire. If infants feel that their legitimate rights to
influence others are being ignored, they will complain vigorously and
persistently. Conversely, they almost never object to or resist any activity
not relevant to their immediate internal needs and desires.
Caregivers mediate infants’ actions in the world only out of concern
for the infant’s well-being and safety. One-year-olds are not considered
to have any responsibility for understanding a dangerous situation.
The caregiver simply removes either the infant or the dangerous object
from the situation, often with no explanation; the infant rarely objects.
Gradually, during the second year, a command or threat accompanies
these interventions until, finally, caregivers expect infants to obey
just a verbal response. Because caregivers do not think the child can
understand, they do not try to reason with the child, and they act with
little emotion or impatience.
A different pattern of social interaction is found with young infants
in a rural African American community in the southern United States.
They learn early on that even though they will be engaged regularly in
social interaction, they will often have little influence on how it unfolds
(Heath 1983). As mentioned earlier, caregivers reason that they know
a lot more than infants about what the infants need, so there is little
need to take into account any specific information an infant might be
trying to communicate. Also, infants are often engaged in nonverbal
social games but only at the caregiver’s whim, not the infant’s. Even
sleeping infants may find themselves awakened by a sibling who has
returned home from school ready to play with the infant. From about
14 months on, boys learn that although most of the time they are
ignored as conversational partners, they can be engaged by anyone in
the public space in a teasing routine where they will have to respond
to a series of verbal challenges. The only way they will be left alone
is if they can demonstrate that they understand the adult’s taunts by
responding cleverly to them. The explicit goal is to help the young
boys learn how to read between the lines of someone’s speech and to
defend themselves. For these infants, the motivation to understand
other people’s communications and intentions is extremely high.
The previous examples illustrate cultural variation in the general
organization of infant social interaction, but cultures vary as well in
some specific rules of interaction often assumed by developmental
psychologists to be universal. For instance, face-to-face interaction
with sustained eye contact with infants, which is often presumed to be
universal, is not a prevalent activity in many cultures around the world.
Yucatec Maya mothers and infants when nursing may sometimes glance
at each other, but these looks are usually not coordinated or sustained.
Both spend more time looking at other people or nearby animals or
simply staring out the door or into space. LeVine and his colleagues
explain compellingly how culturally inappropriate such mutual gaze
would be for Gusii in East Africa (LeVine et al. 1994), giving five different
reasons Gusii mothers would want to actively avoid eye contact with
their infants. As a result, Gusii mothers spend on average between 1%
and 12% of their interaction time looking at their three- to ten-
month-old infants (whereas in a comparison sample, Boston mothers
spent between 40% and 45% of their interaction time looking at their
infants). Other cultures are not specifically concerned about avoiding
eye contact but limit it nonetheless by orienting the child away from the
mother, toward other social participants in the household (Martini and
Kirkpatrick 1981). In cultures in which infants are tied to the mother’s
back high enough to see, the child shares the mother’s visual perspective
out on the world and is encouraged very early to observe and engage
in the social world beyond the mother.
In many cultures in which infants are encouraged to engage socially
with others, there is an additional social routine to help them enter
into conversations. Many languages have a word (or morpheme) that
is added to a sentence and that tells someone to repeat the sentence,
word for word (e.g., much like our word say in “Say ‘I want candy.’”).
For example, Schieffelin (1990) reports that once Kaluli children
begin to use understandable words, caregivers use such a form of direct
language instruction to tell children to repeat the caregiver’s sentences
to another person. Such usage is thought to encourage the child to be
assertive in an interaction, thus “hardening” the child’s speech. These
sentences are not simplified for the child and reflect what the caregiver
thinks should be said, not what the caregiver thinks the child wants to
say. The Kwara’ae, of the Solomon Islands, believe that such repeating
routines are a useful teaching technique beginning at six months, an
age at which they believe the infants can understand language even
though they cannot yet talk (Watson-Gegeo and Gegeo 1986). The
Samoans (Ochs 1988), the Quiche [K’iche’] Maya (Pye 1986) and the
Yucatec Maya (Lucy 1993) also use such a term. For the Yucatec Maya,
this technique is used for at least two distinct purposes. First, it helps
children negotiate interactions with other people, often carrying an
extra implicit message (from the caregiver) of instruction or criticism
to the listener. Second, it teaches young children to memorize verbal
passages verbatim so that they can reliably run errands and deliver
messages (similar to the Samoans). Using this technique, a two-year-old
becomes a reliable messenger.
A second specific rule of interaction that has been identified as
universal and developmentally relevant is the use of motherese. Ferguson
(1978) proposed that there are 17 different characteristics that make
up a universal type of distinctive speech to young children. Contrary
to Ferguson’s and others’ predictions, many of these characteristics
are not found in some cultures in which talk to children has been
closely studied. The Kaluli avoid such “bird talk” as it recalls the mythic
identification of birds with the souls of dead children (Schieffelin 1983).
Heath (1983) observed that rural African Americans’ speech to children
was quite similar to speech to adults; further, they criticized children
who used “baby talk.” The Samoans (Ochs 1988), the Quiche [K’iche’]
(Pye 1986), and the Gusii (LeVine et al. 1994) are also reported not to use
motherese. These observers argue that, in general, child-directed speech
sounds quite similar to adult-directed speech. Some of the characteristics
of motherese may be culturally specific compensations for engaging in
talk with preverbal infants; in those cultures in which there is not a lot
of talk with preverbal infants, like the ones mentioned above, there
would be less need to use such characteristics with older infants.
In addition, in many cultures, the burden of understanding is on
children, not adults, because of rules of social rank. In Samoa not only
must the children work to be understood, but they must also explicitly
ask for clarification or simply wait when they do not understand
someone else. Very young children are explicitly trained how to pay
attention to what others are doing, how to deliver a message verbatim,
and how to report what they see, all skills they will need to enter as
a low-ranking person into the social system (Ochs 1988). Even more
demanding are those cultures in which speech is often used in ways
that do not express direct meaning and children are expected to be
able to read the nonverbal and contextual signs to understand the
deeper meanings. As discussed above, in the rural African American
community of Trackton, boys are intentionally trained through public
teasing and direct challenges to be able to react to subtle changes in
meaning, beginning soon after their first birthday (Heath 1983). And
among the Gusii, young children are expected to interpret the adults’
indirect style of speaking about certain topics, using the nonverbal clues
that accompany such discreet talk (LeVine et al. 1994).
We see that infants from different cultures have vastly different rules
to learn about influencing other people and about reading other people’s
intentions. In their first two years, infants learn who to look at, who are
appropriate social partners, what are appropriate behaviors (and in which
specific contexts), and what are legitimate topics. As they begin to talk,
they learn to incorporate verbal interaction into these understandings.
In some cultures they are excused from paying much attention to other
people’s intentions, whereas in others they are not considered social
partners until they do. There are also significant differences in cultural
attitudes about infants learning these rules. Infants in some cultures
have more pressure on them to master some rules quickly; others do
not. In some cultures they receive direct instructions about social rules,
sometimes accompanied by praise and punishment for their mastery
and failures; in others they are simply guided through interactions in
their daily activity; and in still others they are ignored as social partners
until they can master the rules on their own. Many of the underlying
assumptions and many specific patterns of interaction (such as social
games, face-to-face interaction, and motherese) experienced by Euro-
American infants that have been assumed to be universal are not.

Gaining and Exchanging Information about the World


The third dimension of engagement is gaining and exchanging
information about the world through social interaction. Euro-American
infants are encouraged to engage with the physical world assertively,
manipulating and experimenting with objects. Much of this exploration
of the physical world is done collaboratively with a caregiver who acts
as both goal setter and mediator between the infant and the world.
Caregivers offer objects designed for play that offer a challenge to the
infant. They talk a lot about what they are doing together and encourage
their infants to talk. And they facilitate the infants accomplishing the
goal by either simplifying the task overall or breaking it down into
subtasks that the infants can do, a process of “scaffolding” (Wood et al.
1976; see also Vygotsky 1978, 1987). The activity is often accompanied
by praise and encouragement, sometimes with the additional claim that
the infant has done the task “all by yourself!”
This focus on caregiver mediation of object interaction turns learning
about the world into a highly verbal, shared social activity, with two
active participants who consistently need to maintain joint attention
and to negotiate their roles. It gives the infant a great deal of information
about the world, much of which comes from the caregiver, and it conveys
to the infant not only that the physical world is interesting but that
exploration of it is best done with social support. It gives the caregiver
responsibility for finding activities for the infant and for setting goals.
It also links the pleasure of learning and mastery with the pleasures of
controlling someone else’s attention, pleasing someone important, and
receiving praise (including false praise). Thus, Euro-American infants
become dependent on a social audience—the “Mommy, watch me”
phenomenon. This pressure can become so strong that they fail to find
intrinsic pleasure in an activity, relying heavily on the rewards of social
attention and praise.
This pattern demonstrates the ambivalent goals of Euro-American
caregivers. Although they say that teaching their infants to be independent
is their primary goal, they actively work to shape their infants’
behavior to match culturally valued activities at a rate that at least
matches, and hopefully surpasses, the average. To accomplish this,
they must strip the infant of true independence. Likewise, infants,
who strenuously assert their rights to do things independently, also
seek the facilitating scaffolding and praise of the caregiver. This complex
agenda requires both parties to expend a great deal of energy, causes
great tension, and often results in an escalation of emotion on both
sides. To recover, both caregivers and infants feel they need time when
their engagement with each other is minimized.
Japanese caregivers (almost exclusively mothers) interact with their
infants as much as, if not more than, Euro-American caregivers, but their
goals as parents and their rules of engagement are somewhat different.
Instead of independence, they see the development of a mutually
dependent social relationship as their most important socialization
goal. Thus, Japanese mothers devote much of their daily energy toward
satisfying their infants’ basic needs to “convince” the infant to become
interdependent (Doi 1973). This accomplished, infants are expected to
comply voluntarily with their mothers’ wishes to please them. Similarly,
in terms of learning about the world, Japanese infants are taught that
interactions with people are more relevant than interactions with
objects. Whereas Euro-American mothers and infants spend more
time interacting with objects, Japanese mothers and infants spend
more time in strictly social interaction or in focusing on the social
meaning of objects (Bornstein 1989). Further, in an otherwise indulgent
environment, they are taught to act on the world in ways that take other
people’s feelings and needs into account (Clancy 1986). So, although
Japanese infants and caregivers have an intense pattern of interaction,
it is built on the ideals of an indulgent mother and a compliant infant
who attends to the intentions and feelings of other people.
The Yucatec Maya expectations about infants’ learning and exchanging
information about the physical world are quite different from either of
these two cultures. Yucatec Mayan caregivers recognize that infants like
to explore their physical surroundings and are interested in manipulating
objects. As soon as they can sit, infants are put on a cloth placed on the
ground, and from there, they are allowed to crawl and then walk with
a great deal of freedom. (There is always a caregiver watching carefully,
ready to remove the infant or the object when there is a danger.)
While infants are still immobile, they are given whatever discarded
household items happen to be nearby on the floor to explore. These are
offered in a minimal manner, usually placed next to the infant with no
talk, no smile, and no demonstration. Even with this paucity of objects
and minimal social engagement, 12-month-old Yucatec Maya infants
spend as much time in object play (35.7%) as Euro-American infants
do (36.7%). But in their play, they spend much less time exploring the
characteristics and uses of an object. In addition, even though
they spend almost as much time in social interaction overall as Euro-
American infants do, they pay almost no attention to other people
when they are playing with objects. They do not talk to others and
they do not look to others for advice or guidance in their exploration
of the physical environment (Gaskins 1990).
There are a number of specific interaction routines, like social games,
pointing to and naming objects, and demonstration of and playing with
objects, that have been seen as important insofar as they demonstrate
specific social understandings and have potential developmental
consequences (e.g., Bruner 1983; Gergely and Csibra this volume;
Tomasello 1999). All of these interaction routines presuppose that
caregivers and infants have abundant optional, playful, and
non-need-based interaction. This is not true in many cultures, especially those
in which adult workload is high and it is considered inappropriate for
adults to play with children. In the Kaluli (Schieffelin 1990), Samoan
(Ochs 1988), Gusii (LeVine et al. 1994), and Yucatec Maya (Gaskins
1996) cultures, parents do not play with children. In all of these cultures,
in which parents are busy with economic and household responsibilities,
play is inappropriate for someone of high status, and children
are expected to accommodate to the adult world, rather than adults
accommodating to an artificial child’s world. In all of these cultures, and
in many others, the amount of time spent during a child’s first two years
with caregivers in social games, pointing and naming, and mutual play
with objects is dramatically lower than in Euro-American homes
and in some cases virtually nonexistent.
In general, across a wide range of cultures, children gain and exchange
information in quite variable ways. Euro-American infants are at one
end of the continuum in terms of the amount and complexity of social
interaction that accompanies exploration of the world in everyday
activities. Routines such as pointing and naming and mediation of
play with physical objects are not common in many cultures. For many
infants, exploration of the physical world happens independently of
social interaction and occurs more often through observation
than manipulation.

The Role of Others in Social Interaction


In many cultures, mothers are not the only or primary caregivers of
infants, so I should not confine my theoretical attention to mothers.
In many cultures in which non-nuclear families are the norm, there are
multiple adults and children who are regular caregivers for the infants
in the household. Such complex families also have more activities to
watch and more people to interact with. For example, one 12-month-
old Yucatec Mayan infant observed was in almost daily contact with
over 25 members of her family, all of whom provided some care: mother
and father, three siblings, two grandmothers, one grandfather, two
great-grandmothers and one great-grandfather, one great-aunt, eight aunts
and uncles, and at least eight cousins. Under these circumstances, one
cannot understand the infant’s social world by looking only at the
interaction with a single “primary caregiver.”
Other adults might be expected to provide social interaction that is
culturally structured like the infants’ mothers’ interaction, although
there might be gender and generation differences. Yucatec Mayan
fathers, for instance, who are pulled into a caregiving role only for
short intervals, often take the opportunity to engage in much more
focused interpersonal interaction than other caregivers. They may talk
directly to the infants face to face and play physically with them, but
they usually sustain this interaction for no more than a few minutes. If
their caregiving responsibility lasts longer than that, they may become
distracted or bored and actually pay less attention to the infants following
their initial burst of interaction than is the cultural caregiver norm.
It has been argued that children, who may not yet understand the
cultural norms for social behavior or not yet be held to them, are likely
to interact with each other in ways that differ from adult patterns
(Weisner and Gallimore 1977). Those who are invested in the argument
that children develop social knowledge through playful interaction
have sometimes assumed that in the absence of maternal play, children,
whether as companions or as caregivers, would provide the proposed
necessary social experiences in their shared play activities. Heath
(1983) reports just this sort of behavior for the rural African American
community of Trackton, where infants are considered playthings by
their siblings and neighbors. But LeVine et al. (1994) found that this
assumption of playful routines by children was not true for the Gusii.
The type of behavior that child caregivers directed toward infants
reflected their role as responsible caregivers, not as potential playmates.
Similarly, Yucatec Maya child caregivers’ primary efforts are to soothe
and comfort the infants, not to stimulate them. Young children who are
not directly responsible for caregiving may not provide infants much
playful interaction either, because they are often directly ordered to leave
the infants alone in an attempt to minimize the infants’ stimulation and
the likelihood that they will cry. It is not until the toddler begins joining
the other children at play in the yard that they are likely to be true
play partners, and then only to the extent that they can enter into the
older children’s play activity.
Thus, in most cultures, to understand infants’ social worlds completely,
one must look beyond mother–child interaction and recognize the
varied world of social partners available to the infants. One should not
assume that all adults will act like the mother nor that all children will
act differently. The limited information currently available suggests
that, in many cultures, legitimate caregiving responsibility requires
strict adherence to cultural norms about behavior toward the infant
no matter the age or status of the caregiver.

Conclusions about Cultural Diversity in Infant Social Interaction

We can see in these ethnographic cases drawn from a wide range of
cultures that social interaction with infants is not universally of one
kind. Most of the evidence presented is not new, but it has often been
ignored by researchers who study child development from a psychological
perspective. Different standards for data collection (experiments vs.
observation in natural settings) and different targets of analysis (infant
capacity, i.e., what infants can do vs. infant expression, i.e., what infants
do do) explain some of this neglect of research on other cultures. But
it is also the case that these data are difficult to reconcile with popular
claims about development based on the argument of universal patterns
of social interaction.
Even if ethnographic data of everyday life are not sufficient for
psychologists’ purposes for demonstrating the development of infants’
capacities for social interaction, they suggest that researchers must
demonstrate (rather than assume) the claim that relevant infant social
experience is universally shared by infants growing up in these vastly
different worlds. As part of a revised research agenda, it must be clarified
whether an infant needs only minimal exposure to a key experience
(some sort of “low-threshold model”) to trigger development or needs
to be exposed through sustained everyday experience over time. In
addition, it should be clarified whether a particular experience is uniquely
important or rather only one facilitating experience from among a set
of possibilities. Claims about development of social capacities over time
also need to clarify if they are theoretically tied to specific ages of the
infant, represent a logical order of acquisition, or are merely possible
clusters of behavior. The more limited all of these claims, the more
likely they will be found to be universal across cultures, but the less
explanatory power they will have.
To collect more adequate data on infant social development and
interaction in other cultures is difficult. Each culture will have its own
unique distribution of characteristics that can potentially influence
social interaction patterns with infants. Thus, it is not as simple as
comparing the world’s cultures as an undifferentiated Other with
middle-class Euro-American culture. For example, many peasant agricultural
cultures give primary importance to adult work, have multiple caregivers
for infants, place a higher priority on infant safety than on infant
stimulation, expect the child to accommodate to and participate in the
adult world, and expect young children to be obedient and responsible.
However, they do not all agree on more general social organizational
principles, including the importance of social rank, the appropriateness
of expressive or assertive social behavior, the practice of interaction
taboos, the legitimacy of asserting other people’s intentions, or the
balance between the importance of one’s responsibility to the group
needs versus responsibility to one’s own individual needs. In all cultures
the primary goal of parents’ socialization practices is the child’s mastery
of the culturally specific rules about how to interact with other people.
If the end goals differ from culture to culture, then the socialization
practices can be expected to differ also, even between cultures that
share many other beliefs.

Alternative Developmental Theories of Social Interaction

The information about cultural variation in infant experience presented
in this chapter leads us to question the viability of a theory of
development driven by universal social experience, but there are two
other possibilities to consider: (1) universal developmental patterns
are primarily driven by biological maturation (rendering culturally
structured experience irrelevant), and (2) culturally varied experiences
produce variable developmental outcomes. To evaluate these two
options, it would be necessary to determine from a variety of cultures
whether infant developmental outcomes are similar or vary.
Given the amount of information available about existing practices
of infant socialization across cultures, it is surprising that comparable
information about infant capacities and behavior is not available.
Ethnographic studies involving early socialization have devoted much
more attention to adult behavior than to infant behavior. Thus, we
simply lack the most basic descriptive data about infants. We do not
know whether children everywhere show the same social behaviors, and
at the same ages. For instance, we do not have much data from other
cultures on the onset of pointing, which can be construed as a social
developmental milestone demonstrating a capacity for joint attention
(see Tomasello, Liszkowski this volume). Furthermore, we do not know
much about how infants in other cultures use these capacities once they
acquire them or in what ways basic capacities come to be integrated into
more complex and culturally structured social interaction patterns.
It seems likely, given the need of human infants to be cared for and to
learn a great deal from their social environment, that some basic social
capacities must be biologically present at birth with others developing
later as a result of maturation. But the pregiven set may well be small,
because encoding fixed behaviors genetically seems at odds with the
generalized flexibility characteristic of human adaptation. So the list
probably would include only those capacities that are necessary and
sufficient for entry into some social world.
It seems equally likely that, given the amount of cultural variation in
adult social interaction patterns (e.g., Levinson this volume), those basic
capacities will be supported and expanded in culturally specific ways
through participation in everyday social routines, beginning early in
the child’s life. Thus, for instance, even if joint attention is reliably first
evidenced by pointing in all infants sometime around their first birthday
(making it a candidate for a universal and biologically determined
social capacity), it is likely to serve different communicative functions
for infants in different cultures, depending on the caregivers’ cultural
understanding about the nature of interaction with infants. In some
cultures in which extended conversations and play with objects are
highly valued activities with infants, pointing will become a significant
means of socially manipulating attention. In other cultures with limited
opportunities for infants to participate in conversations, it will be a
social tool that is rarely used.
Thus, a universally present social capacity need not yield universal
characteristics and functions. Because infants are biologically social
creatures, they are likely to have some universal social capacities that
mature over the course of the first two years. At the same time, because
they are also biologically cultural creatures, these innate social capacities
will be applied to culturally structured ends. That is, humans are not
designed to be “Social” in general but to be “social” in a particular,
culturally constrained way, just as they are not designed to learn
“Language” in general but one particular “language”—the generic
capacity being manifestly in the service of acquiring highly specific
outcomes (cf. Goldin-Meadow this volume). Thus, whatever social
capacities infants gain as they mature will be both amplified and
constrained by the particulars of the cultural community in which they
live, with all the complexity described above. This process will be more
pronounced as children grow older and become more integrated into
the social world around them, until they become full adult members of
the culture. But even as early as the end of their first year (and probably
before), infants do not demonstrate “raw” expression of capacity but,
rather, an expression of capacity already heavily mediated by the specific
social–cultural environment.
Articulating the specifics about the balance between universal and
variable social interaction in infants will have to wait until there is
much more systematic information available about non-Western
infants’ behavior. That information needs to be both culturally valid
and comparable across cultures, a combination that presents significant
methodological challenges. But it is an important step forward to
recognize that we already have evidence that social interaction with
infants varies significantly across cultures and to question, as I have
here, the modeling of developmental theories of social interaction that
rely heavily on the specific practices of Western cultures.
Note
1. Many of these examples come from work that is 20 or more years old, when
much of the work on cultural differences in infant socialization was done. The
insights into the range of variation in infant social interaction that these older
descriptive studies provide do not diminish with age. Moreover, because this
evidence has been around for quite a while, it is even more remarkable that it
has not been incorporated more thoroughly into developmental theory.

References
Bornstein, M. H. 1989. Cross-cultural developmental comparisons:
The case of Japanese-American infant and mother activities and
interactions. What we know, what we need to know, and why we
need to know. Developmental Review 9:171–204.
Bruner, J. 1972. The nature and uses of immaturity. American Psychologist
27:688–704.
——. 1983. Children’s talk. New York: W. W. Norton.
Clancy, P. 1986. The acquisition of communicative style in Japanese.
In Language socialization across cultures, edited by B. Schieffelin and
E. Ochs, 213–250. New York: Cambridge University Press.
Doi, L. T. 1973. The anatomy of dependence. Tokyo: Kodansha
International.
Ferguson, C. 1978. Talking to children: A search for universals. In
Universals of human language, vol. 1, edited by J. Greenberg, C. A.
Ferguson, and E. A. Moravcsik, 203–224. Stanford, CA: Stanford
University Press.
Gaskins, S. 1990. Mayan exploratory play and development. Ph.D.
dissertation, Department of Education (Educational Psychology),
University of Chicago.
——. 1996. How Mayan parental theories come into play. In Parents’
cultural belief systems: Their origins, expressions, and consequences, edited
by S. Harkness and C. M. Super, 345–363. New York: Guilford.
——. 1999. Children’s daily lives in a Mayan village: A case study of
culturally constructed roles and activities. In Children’s engagement in
the world: Sociocultural perspectives, edited by A. Göncü, 25–61. New
York: Cambridge University Press.
Heath, S. B. 1983. Ways with words: Language, life, and work in communities
and classrooms. Cambridge: Cambridge University Press.
LeVine, R., S. Dixon, S. LeVine, A. L. Richman, P. H. Leiderman, C. H.
Keefer, and T. B. Brazelton. 1994. Child care and culture: Lessons from
Africa. Cambridge: Cambridge University Press.
Lucy, J. A. 1993. Metapragmatic presentationals: Reporting speech with
quotatives in Yucatec Maya. In Reflexive language: Reported speech and
metapragmatics, edited by J. A. Lucy, 91–125. New York: Cambridge
University Press.
Martini, M., and J. Kirkpatrick. 1981. Early interactions in the Marquesas
Islands. In Culture and early interactions, edited by T. M. Field, A.
M. Sostek, P. Vietze, and P. H. Leiderman, 189–213. Hillsdale, NJ:
Erlbaum.
New, R. S. 1988. Parental goals and Italian infant care. In Parental
behavior in diverse societies, edited by R. LeVine, P. M. Miller, and M.
M. West, 51–64. San Francisco: Jossey-Bass.
Ochs, E. 1988. Culture and language development: Language acquisition
and language socialization in a Samoan village. Cambridge: Cambridge
University Press.
Ochs, E., and B. Schieffelin. 1984. Language acquisition and socialization:
Three developmental stories and their implications. In Culture and its
acquisition, edited by R. Shweder and R. LeVine, 276–320. Chicago:
University of Chicago Press.
Pye, C. 1986. Quiché Mayan speech to children. Journal of Child Language
13:85–100.
Richman, A. L., R. A. LeVine, R. S. New, and G. A. Howrigan. 1988.
Maternal behavior to infants in five cultures. New Directions for Child
Development 40:81–97.
Schieffelin, B. 1983. Talking like birds: Sound play in a cultural
perspective. In Acquiring conversational competence, edited by E. Ochs and B.
Schieffelin, 177–184. London: Routledge and Kegan Paul.
——. 1990. The give and take of everyday life: Language socialization of
Kaluli children. Cambridge: Cambridge University Press.
Tomasello, M. 1999. The cultural origins of human cognition. Cambridge, MA:
Harvard University Press.
Trevarthen, C. 1987. Universal co-operative motives: How infants
begin to know the language and the culture of their parents. In
Acquiring culture: Cross-cultural studies in child development, edited by
G. Jahoda and I. M. Lewis, 35–90. London: Croom Helm.
Vygotsky, L. S. 1978. Mind in society: The development of higher psychological
processes. Cambridge, MA: Harvard University Press.
——. 1987[1934]. Thinking and speech. New York: Plenum Press.
Watson-Gegeo, K., and D. Gegeo. 1986. Calling-out and repeating
routines in Kwara’ae children’s language socialization. In Language
socialization across cultures, edited by B. Schieffelin and E. Ochs, 17–50.
New York: Cambridge University Press.
Weisner, T. S., and R. Gallimore. 1977. My brother’s keeper: Child and
sibling caregiving. Current Anthropology 18:169–190.
Wood, D., J. Bruner, and G. Ross. 1976. The role of tutoring in problem
solving. Journal of Child Psychology and Psychiatry 17:89–100.
eleven

Joint Commitment and Common Ground in a Ritual Event
William F. Hanks

In this chapter, I explore human sociality from the viewpoint of
language and communicative practices. It is a truism that speaker,
addressee, and other participant roles are social relations, and that
a key part of sociality is the disposition to interact with others, and
the fact of doing so. Whether one agrees with Schegloff (this volume)
that interaction is “the primordial site” for sociality, it is certainly a
privileged locus of observation for anyone concerned with human
social life. Language and talk are pervasive in ordinary social life,
across the spectrum from inner speech to face-to-face dialogue, service
encounters, talk around the workplace, doctors’ visits, worship, hanging
out, telephone conversations, and a seemingly endless variety of other
settings. The sheer diversity of contexts in which communicative practice
takes place requires that any human language be flexible enough to
adapt to widely disparate and changing circumstances. It must also
combine in systematic ways with gesture, gaze, physical contact, the
spatial and perceptual field of talk, background knowledge, and other
modalities, which codetermine the referents and conveyed meanings of
utterances (Hanks 2005). The multimodal language of interaction must
provide the semiotic resources for participants to integrate and manage
a vast amount of disparate contextual information (Kendon 1992; see
also Goodwin, Keating, and Levinson in this volume). Moreover, they
must do so in concert with others, attending as best they can to each
other’s understanding of what is going on.
One of the striking features of talk is that much of what is conveyed
is tacit and must be inferred from context, including beliefs the parties
ascribe to one another. The actual form of utterances and communicative
gestures radically underdetermines what they convey. This in turn
implies that, to resolve inferences and conveyed meanings, parties to
talk must attend to one another and to other aspects of the setting. Over
a series of important studies, Clark (1992, this volume) has shown that
this mutual orientation works through common ground, participatory
commitments, and joint activities. These and other features of joint
action provide fine-grained evidence of the forms and conditions of
sociality in practice.
The study of interaction faces several methodological challenges.
The path from observation to recording to transcription and analysis is
fraught with selectivity and discontinuities. The more densely speech
is embedded in gesture, space, time, and other dimensions of context,
the more difficult it is to delimit it. Yet any recording or transcription
is, by definition, a delimiting of its object (Bucholtz 2000; Cicourel
1992; Ochs 1979). Thanks to the work of Goodwin (2000), Haviland
(1993), Kita (2003), and others, it is widely recognized that speech and
gesture are inextricably connected, but it is less clear how to capture
the connections between speech and the broader context (a problem
addressed by Goodwin and Keating this volume; cf. Cicourel 1992).
The fact that much of what counts as context is tacit only compounds
the difficulty. Like parties to talk themselves, analysts of talk are forced
to make inferences about the inferences of others, often relying on
background knowledge nowhere explicitly signaled in the speech–
gesture stream. Conversation analysts have made great advances in the
empirical study of such phenomena as sequential organization, repair,
and the formulation of persons and places in ordinary talk (Schegloff
1987). Yet the intelligibility of talk to its participants relies on other,
less formally definable factors. These include intersubjective relevance
(perceived or inferred), the history of interactions between the parties,
the nonverbal setting and other features of context that appear nowhere
on a transcript, no matter how exacting or comprehensive it is.
In this chapter, I focus on a small part of this broader set of problems.
Although I examine a single extended example of interaction between
speakers of Yucatec Maya, the guiding questions can be stated more
generally. How do interactants integrate the various dimensions of context
into joint action that is both intelligible and effective (at whatever
level these are defined in situ)? How do they produce common ground
and mutual knowledge, even when they are separated by significant
gaps or asymmetries in their respective knowledge? The example we
will analyze is a ritual interaction, of a type called in Maya tíich’k’àak’
(illumination). In what follows, I call this divination.1 The parties are a
ritual specialist or “shaman,” a patient and a group of spirits whom the
shaman addresses with the aid of divining crystals, also called sáastúun
(light stones). Divination involves both ordinary conversation and ritual
speech, as well as combinations of the two.
Although ritual speech may appear far removed from questions
of elemental sociality in talk, it provides an opportunity to examine
integration and the production of common ground in detail. Like many
therapies, divination is an interactive process between an expert, a
nonexpert patient, a technical apparatus, and other consulting experts,
in this case spirits (cf. Cicourel 1992, 2001). It combines three distinct
interactive frames: (1) the patient–shaman interaction, which occurs in
ordinary Maya and may include other copresent parties, such as family
members of the patient; (2) the shaman–spirit interaction in the ritual
registers called réesar (prayer) and chíikó’ob (signs); and (3) the three-
way interaction between patient, shaman, and spirits, which combines
ordinary and ritual speech with the esoteric language of spirits (both
verbal and visual). How are the distinct frames and the participant roles
they define integrated in the interactive process?
The product of divination is a coherent understanding of the problem
at hand, derived by joint action among all parties, and formulated by
the shaman. Divination is a form of diagnosis. It is achieved in the face
of radically partial information and major discrepancies between what
the participants know about themselves, one another, and the problem
at hand. Although there is extensive common ground between patient
and shaman, starting with their copresence and common language,
much of what the shaman does remains opaque to the patient. The
patient is often a stranger whose biography is unknown to the shaman.
The shaman is familiar with his spirit helpers and intimate with the
technology embodied in his altar, whereas the patient is not. The
shaman has a reputation of which patients are aware, whereas many
patients arrive anonymous. The knowledge divide is paired with the
complementarity of speech acts the two can legitimately perform: the
patient expresses distress, uncertainty, and assent, whereas the shaman
induces the patient’s expression, performs prayer, and asserts diagnosis.
The shaman authoritatively “presents” the patient at the altar, but the
patient performs no such ritual act.2 The asymmetry of their respective
knowledge bases corresponds to the complementarity between their
speech act roles. How then do they arrive at a joint understanding of
the problem?
I argue that the integration is produced through a combination of
linguistic, semiotic, and perceptual resources combined over the time
course of the episode. The divinatory setting provides a publicly available
built space and a stable interactive field in which different situations
are embedded. Through indexicals, descriptive categorizations, gaze,
and gesture, the shaman repositions himself among the three frames
cited above. He makes many subtle inferences about things unseen and
overcomes the gaps in common ground with the patient, by what I call
induced commitments. Although the patient never comes to understand
what the shaman is doing, he or she is drawn into the process as an
active coparticipant. The key to the shaman’s success is his ability to
induce the patient to engage in the process, to experience recognition
and to assent to statements about his own life, including ones he cannot
verify. The patient must commit both to the process and to the validity
of its results. The shaman cannot merely assume either commitment,
but must bring them about. Although I concentrate in this chapter
on the shaman’s methods in doing this, it is clear that the patient
also faces the reciprocal task of maximizing the shaman’s engagement
in his complaint. What we see in such interactions is the meticulous
management of sociality as a way of transforming the experience of the
patient. Analogous processes are at work any time interactants must
produce (and not only presuppose) common ground, and especially
whenever one party seeks to convince the other(s) despite significant
gaps in their relevant knowledge. What makes divination a revealing
example of sociality in action is that it foregrounds factors and processes
that are present in most ordinary talk but that are often backgrounded
or conflated.

The Divinatory Setting


The clinical episode typically starts when the patient arrives at the
shaman’s home, alone or accompanied, and presents a condition in
need of treatment. In the example discussed here, a man in his forties
has arrived with his wife (see Fig. 11.1), complaining of aches and pains.
The two talk with the shaman in hopes of learning what the problem is
and fixing it. During this preliminary phase, the three are in a face-to-
face interactive situation in which each can see, hear, and understand
the other. The focus of interaction is the problem, of which the shaman
forms a preliminary unstated assessment based on the patient’s speech
and appearance. In his expert gaze, the patient’s clothing, body posture,
breathing, facial expressions, skin tone, and myriad other features are
potential indices of the problem and its likely causes (cf. Goodwin 1994).
Relatively little is explained and there is no physical exam, but a vast
amount of observational information is gathered by the shaman and
subsequently used as a basis for inferences about the patient.

Figure 11.1. DC (the shaman) and patient’s wife (number 200).

The setting unfolds in the back room of the shaman’s house, in
front of his altar, a table on which there are numerous saints’ images,
candles, flowers, tins of herbal medicines, and his divining crystals
kept in a gourd filled with blessed water (see Fig. 11.2). The altar is on
the East wall of the room, facing East, and the saints’ images all face
West, looking into the room. In the shaman’s practice, these details
are part of a vast universe of background knowledge, both declarative
and procedural (Hanks 1990, 1996). The altar is a complex instrument
with which he engages daily and of which every part has an elaborate
history and rationale. The patient can see all of the objects and is
aware that they are significant, but has minimal knowledge of what
is signified, most of which is esoteric or explicitly kept secret. Like a
hieroglyph, the altar and the speech that occurs at it are known to be
meaningful but their meaning is mostly unknown to the patient. He
can also see in the shaman’s behavior that the altar is his intimate space,
but can only postulate the history behind this intimacy. He bases his
postulation on the shaman’s displays in the present and reputation
from past performances and the opinions of others (most or all of
whom are not present). The patient also has a personal history that
brings him to this place at this time, but the shaman can only infer
it. Whereas the shaman has a reputation based on years of practice,
the patient is often an anonymous stranger on a first visit. An expert,
the shaman can monitor both the patient and the signs in the space
with acuity far beyond the patient’s ability to monitor. In this sense,
the situation is by design asymmetric from the outset. It takes place
on the shaman’s ground.

Interaction in the Divinatory Setting


At a moment of his own choosing and without preliminaries, the
shaman turns his body away from the patient and toward the altar.
Sitting before it, he places the crystals directly in front of himself, lights
a candle and prepares to perform. At this point, the setting has shifted,
and the different footings of the participants are amplified into what
will become two different situations. The shaman rests both arms on
the altar and keeps his right index finger in tactile contact with the
crystals, his gaze forward. He draws a breath and begins to pray. This is
the onset of phase 2 in Table 11.1. The patient sits behind him, on the
other side of the room, quiet or conversing sotto voce with others present,
while the shaman prepares for réesar (prayer).3 Copresent overhearers
typically continue to interact among themselves and include family
members or other patients waiting their turn. No longer interacting
with the shaman, the patient is an overhearer who can see the shaman
from behind and knows approximately what he is doing but not how he
does it. For the shaman, the patient has become an overhearer, a third
person object that he will present to spirit addressees using prayer and
the crystals. Although his gaze, manual engagement and attention are
all focused on the altar, the shaman maintains subsidiary awareness of
the patient’s physical presence behind him and monitors his audible
expressions (speech, breathing).

Figure 11.2. DC’s (the shaman) altar (number 465).

After a brief pause, the shaman begins to perform. This consummates
the division of the setting into two separate situations, one in which
he is interacting with divine addressees through the crystals, and the
other in which the patient and any overhearers are copresent but
excluded from detailed understanding of the process. This is now phase
2 in Table 11.1. The two phases correspond to distinct situations not
only because they are defined by separate interactions but because
the initial asymmetries in the relevant knowledge of patient and
shaman in phase 1 are crystallized into different kinds of cognitive and
physical monitoring in phase 2. Shaman, patient, and overhearers are
still physically copresent, but the shaman is engaged in an interactive
practice opaque to the others. With the help of the crystals, he can
monitor the patient in ways that he or she could never monitor him
in return (Goffman 1972).

Table 11.1. Phases in a divination and treatment¹

Phase     Activity                              Onset
Phase 1   Dialogic lead in                      [unrecorded]
Phase 2   Opening prayer: Divination            [AV.07.18.18]
          Asks patient’s name and town          [AV.07.20.00]
          Sign of Cross ends opening prayer     [AV.07.21.07]
Phase 3   Examine crystals                      [AV.07.21.23]
          Supplementary prayer                  [AV.07.21.34]
          Render diagnosis                      [AV.07.22.08]
Phase 4   Closing prayer: Divination            [AV.07.27.56]
Phase 5   Dialogic transition to treatment      [AV.07.28.27]
(remaining 18 minutes of episode omitted)
¹ Onset numbers refer to data: [Audio VideoTape.Number.Minutes.Seconds].

The heart of phase 3 is the use of the crystals for diagnosis. The
crystals are a source of enlightenment and understanding, which is
what the Maya term tíich’k’àak’ (illumination) refers to, and why the
crystals are called “light stones.” They embody the shaman’s ability
to monitor in unique ways. They are clear round spheres similar to
marbles and the shaman holds them, one at a time, in front of the lit
candle. He addresses the spirits in speech, and they respond in two
ways, images in the crystals and words that pass through his mind.
Neither the images nor the silent words are perceivable by anyone
else, but they provide vital information to the shaman, who receives
them as chíikó’ob (signs). Combined, these signs provide additional
premises for inferences as to the patient’s private experiences, which the
shaman articulates in the subsequent diagnosis. With the crystals and
their signs, the shaman delves into the private details of the patient’s
body and living circumstances in ways unknown to all but himself.
Seated behind him, the patient has become an object for whom his
actions are mostly “black boxed.” The new participants are what the
shaman calls yuntsiló’ob (lords), ‘íik’ó’ob (winds) or espiritus (spirits),
whom he addresses in the special register réesar (prayer). He makes
them present with the following prayer, in which the bold portions
are primary performative expressions.
1. Opening prayer in divination
1.1 [AV.07.18.19] #Por la señal de la santa cruz del nuestro enemigo
By the sign of the Holy Cross from our enemy
1.2 librenos señor Dios Pàadre en el nombre del Padre, del Ihoh
free us lord god father in the name of the father, of the son
1.3 de los espiritu santo.
Of the holy spirits
1.4 #Pádre mio sáss insíipi ‘uch in kuta tuchun amèesa a tíiptik
My father forgive my sins I sit before your altar that you show
1.5 tó’on usáasi le yóok’o kàa’ ‘impresentartik usuhuy kristal
us the light of the world, I present the holy crystals
1.6 le yùum balan ‘ìik’ó’o xan be’òora tuyóo’l amèesah,
of the jaguar lords now atop your altar,
1.7 #’úuch ink’áatk e poder bakan ti yùun t’uh balam ti xóoh balam
I request the power of lord Tup Balam, of Xoh Balam
1.8 ti ‘ah balan túun ti piris túun balam ti síit’bolon túun ‘ah k’ìin chan
of Ah Balan Tun, of Piris Tun Balam of Sit Bolon Tun, Ah Kin Chan
1.9 ti ‘ah k’ìin coba xan, ‘ah k’ìin kolonte’ tz’íib. [AV.07.18.55]
of Ah Kin Coba, Ah Kin Kolonte Dzib
1.10 # táam bakáan inki’t’anké’ex xan tuchi’ ‘umèesah Cristo Jesus
I am sweetly speaking at the edge of the altar of Christ Jesus
1.11 bin xan ‘in yúum,
also my lord,
1.12 # ‘utíal immáansik tunoh ak’abé’ex xan untúul u’ìiho bin Jesu Cristo
So that I send to your right hand a son of Jesus Christ
1.13 up’atmáha’ tulú’umi le k’ebáan.
whom he left on the earth of sin.
1.14 # le bakáan tuyóo’la xan timmáansih kwèenta bakan ti asàantoilé’ex
That is also why I make account to your saints also
1.15 ak’uchbalé’ex xan tuséeblakil xan tasàanto krìistalé’ex ‘utíal
that you all come quickly to your holy crystals in order that
1.20 arébisàartiké’ex ten ulú’umi le kwèerpóoh.
You review for me the earth of the body
1.21 #’utí’al aréebisàartké’ex ten uk’í’ik’el, ‘utí’al aréebisàartké’ex ten
So that you review for me the blood, so that you review for me
1.22 yíik’al (.) ‘utíal a presentarké’ex ten takrìistalé’ex. [AV.07.19.29]
the breath, so that you present him for me in your crystals.4
This kind of prayer is a register distinct from ordinary Maya.5 It is
almost intelligible to an ordinary Maya speaker, because it shares many
features with ordinary Maya, yet it is made strange and even opaque by its delivery
and the esoteric knowledge it expresses. It is chanted in rhythmic groups, each
delivered with a single extended exhalation. The breath group is without
internal pauses, in a single rhythm and with an overall intonation
contour: high pitch on the first few syllables, flat intonation for the
subsequent ones, and a rise in pitch at the end. Although there are
variations in speed of articulation, prayer performed for ritual treatment
is always more regular in pace, and usually faster, than ordinary speech.
For the most part, it is clearly articulated. The morphosyntax includes
the distinctive verbal auxiliary ‘úuch ([it] has occurred; 1.4, 1.7), use of
verbal affixes including ki’ (sweetly; 1.10) and compound verbs ‘ok’oh-
’óoltik (weep-desire [it]) and k’áat-máatik (request-borrow [not shown]).
These elements are available in ordinary Maya, and the syntax follows
regular word order, but their use and conveyed meanings are peculiar to
prayer. There are several phrases that are similarly restricted. In prayer,
presentar (to present) is used as a verb of speaking “I hereby present”
(1.5) and “you present in your crystals” (1.22). It has the special meaning
of presenting a patient for diagnosis and labels the divinatory act as
a whole. Similarly, rébisàartik (to review, go over) designates the act
that spirits are asked to fulfill, “turning their gaze” on the patient and
showing what they see. The expression máansik cuenta (to spread the
word) here means to state the patient’s problem. These expressions
would be recognizable and appear coherent to any speaker of Maya, but
their usage is esoteric and patients know it. Even those with long-term
exposure to réesar typically say they do not understand it.
Unlike ordinary conversational Maya, the réesar register is marked
by the near absence of deictics referring to objects in the immediate
situation. The shaman uses first person singular throughout this portion
of the event, but also refers to himself in the third person subsequently.
The interactive setting is never construed as “here,” but always “there on
the earth of sin” (1.13) and the patient, although copresent, is demoted
to an indefinite third person, “a child of Jesus Christ whom he left on the
earth of sin.” Compared with the terse definiteness of deictic construal in
ordinary speech, prayer is euphemistically vague, with the consequence
that it would require even more inferential work for a patient overhearer
to resolve the references. At the same time, certain evidential particles
used occasionally in conversation are abundant in prayer. These include
bin (reportedly; 1.11, 1.12) and bakáan (apparently; 1.7, 1.14), both of
which index that the discourse is based not on current evidence but
on supposition or secondhand knowledge. This further cuts the prayer
talk away from the current situation and would undermine any attempt
to verify its epistemic basis. The recurrent use of these particles, along
with xan (also; 1.9, 1.11, 1.15), also contributes to the driving, repetitive
rhythm of the performance.
These aspects of the prayer register are critical to the shaman’s ability
to engage the patient. Rather than being in a different language (such
as liturgical Latin for English speakers), Maya prayer is familiar and
almost intelligible to nonspecialists. The phonology, morphology, and
syntax display that this is the same language the patient speaks, giving
it the appearance of clarity. Yet any patient who actually attempted
to grasp it would confront opacity and have a hard time resolving
the references. The repleteness of the symbols on the altar, including
the crystals themselves, and the obvious systematicity of the shaman’s
speech and gestures invite interpretation. They reinforce the conviction
that something meaningful is happening. At the same time, and for the
same reason, they thwart understanding and oblige the patient to take
“on faith” that the shaman is engaged in an effective practice.
The shaman explicitly states his purposes in prayer, in terms almost
understandable to the patient who witnesses the prayer: “in order to pass
the word, in order that you review for me the earth (flesh), the blood,
the nerves (spirit).” From the shaman’s perspective, this statement of
purpose formulates the motive of the divination for the spirit addressees;
it says what he wants them to provide for him. This is preparatory to his
subsequent statement of an explicit question. The reference to blood,
flesh, and nerves indexes that this is a divination for a physical malady
and not, say, a lost object. Like an ordinary speaker clarifying his or
her purpose in posing a question, the shaman here appeals to the spirit
addressees’ ability to grasp his intention in addressing them. He also
displays his aim to the patient. This has two intended consequences.
First, the patient is reassured that the shaman is doing what he (the
patient) has requested. This is important because shamans are known
to play tricks and an individual’s credence must be developed. Even
though he does not fully understand what is happening, the patient
must be comfortable that it is a divination for his condition, and not
some other action.
The second intended consequence of this part of the prayer is that
the shaman engages the spirits on a proper footing (Goffman 1983).
Although the patient lacks the special knowledge to judge the properness
of performance, he can recognize that what the shaman is doing is a
statement of intention. For the shaman, the spirits are fully intentional
beings whose intervention he must request and induce through prayer.
Extending the logic of social interaction, he attributes purpose to the
spirits, and the ability to grasp his own purpose (cf. Levinson this
volume). Like the Gricean speaker, the shaman aims to induce his spirit
interlocutors to understand him, by means of grasping his intention to
get them to understand. Moreover, when images appear in the crystals,
he treats them as the perceivable signs of Gricean intentions on the
part of the spirits: to interpret the signs, which are extremely vague,
he must assume that they are responses provided “on purpose” to his
own intended inquiries (Hanks 2001). This dovetailing of motives is
necessary to secure the sense of all signs that move from the spirits
to the shaman.6 It implies a robust theory of mind that links the two
parties (cf. Astington, Levinson, and Pyers in this volume).
The crystals also provide a medium for intentionality in the more
abstract sense of aboutness relations. The visual perturbations that occur
in the crystals are not seen as mere perceptibles, but as signs in the
Peircean sense: “representamens” that stand for something in some way.
They are expressed by the spirits and they stand for the patient’s plight.
Typically they are iconic signs, with a wisp of smoke standing for a force,
its motion standing for the motion of the force in the world, and its
location standing for the source and trajectory of the motion. It is as if
the crystals provide a real-time image of the symptoms or their causes,
in which both are represented as movement. Like the video displays
described by Keating (this volume), the crystals produce counterparts
of the objects they represent, but unlike online videography, the source
of the images is a spirit agent rather than another person. Also unlike
Keating’s examples, the crystals provide images of the copresent patient,
not of the distant interlocutor. It is critical to the shaman’s success that
the counterpart relation between the images in the crystals and the
experience of the patient be secured (Sweetser and Fauconnier 1996). If
this counterpart relation fails to emerge or breaks down, the divination
itself will fail.
The next step in inducing the patient to enter into the divinatory
frame is to get him to provide a key piece of information: his name
and provenience. After the opening prayer, the shaman must actually
present the patient, and this requires the proper name and home town.
Of all the information the shaman can infer, this must be provided by
the patient. The interaction is shown in (2). When the shaman asks
the patient his name (2.3), he responds with first name only (2.4), a
minimal form, just as he states home town with the minimal “here”
(2.10). Both elicit calls for greater specificity by the shaman (2.5, 2.11),
prolonging the patient’s speaking engagement in the process. Once
satisfied, the shaman returns to réesar (2.13–2.16), but when he reaches
the point of restating the name, he hesitates and asks for yet another
confirmation (2.16). This time the patient’s wife, a ratified overhearer,
is the one to respond (2.17), and she and the shaman interactively
complete his naming of the patient in the prayer, which resumes in
réesar register (2.20). One effect of this sequence is to enhance the
participation of patient and wife in the prayer itself, which sustains their
participatory commitment to its success. Whether or not the shaman
actually remembers the name, it is strategically effective to request it,
because in addressing the patient and his wife, he keeps them focused
on his words. When he repeats the name in prayer, the patient witnesses
himself interpolated into the prayer.
2. The asking of the patient’s name [AV.07.19.54] (DC = shaman)
2.1 DC # bín inkín máans uk’àaba tunoh ‘ak’abé’e ‘inyuúun, ká arébisàarté’ex
Now I will pass his name to your right hand my lord, that you review him
2.2 ten tuséeblakil ((glance in direction of patient, over
right shoulder, downward))
for me right away
2.3 máax ‘ak’àaba’=
What’s your name?
2.4 Patient =Míiguel.
Michael.
2.5 DC ((shift gaze closer to patient)) Míiguel máax?
Michael who?
2.6 Patient Miguel Hupche’. [AV.07.20.05]
Michael Hupche.
2.7 DC hnng? ((shifting gaze still more in direction of
patient))
2.8 Patient Miguel Hupche’.
2.9 DC ((shifts gaze back to divining crystals)) (.) kux tuún
akàaha?
What about your home town?
2.10 Patient wayne’.
It’s here
2.11 DC wayileche’?
you’re from here?
2.12 Patient hnng. [AV.07.20.10]
2.13 DC ‘inyúum, ‘ink’áatk epodèer ti ‘adyòosila’ páadreh, ‘ikil
immèentk e
My lord I request the power of your gods father as I
do this
2.14 rébisàar be’òoráa (0.1) tukàahi ‘óoxk’utzcàa’ ikil int’ank
u’ahkáaka’ lú’umi
review now, in the town of Oxkutzcab as I address
its earth guardians
2.15 xan be’òora ‘uprésentatik ten e kwèerpo xan kukalantik
tulú’umi k’ebá’an.
Now they present to me the body that they guard on
the earth of sin.
2.16 (0.2) Míiguel,
2.17 Wife (0.2) Hupche’.
2.18 DC ((shifting gaze to right, toward patient’s wife))
Miguel=
2.19 Wife =Hupche’.
2.20 DC Hu`upche’. ‘uk’àaba’ leti e kwèerpóo. (.) le bin
kimpresentartik tunoh
Hupche is the name of the body, which I present to
your right hands
2.21 ak’abé’ex. (1.0) réebisàarté’ex ten ulú’umil. (.)
rébisàarté’ex uk’íik’el.
review for me the earth, review the blood
2.22 rébisàarté’ex yíik’al. (.) chikbesé’ex ten takristalé’ex.
[AV.07.20.50]
review the breath, make a sign to me in your
crystals.
2.23 k’ubé’ex ten. le k’ohá’ani kupadasertik e kwèerpóo’.
Deliver to me the sickness the body is suffering
2.24 # ‘é’esé’ex ten ‘inyúum, (1.5) yéete le podèer p’atá’an té’e
xan tumèen xan
show me my lord with the power left there by
‘udyòosi yoók’ol kàab.
The lords of the earth.
2.25 #tuk’aba’ Cristo Jesus, Dyos Padre ‘espiritu santu. .
[AV.07.21.07]
in the name of Christ Jesus, god the father, holy
spirit.

In the course of drawing the patient further into the prayer by asking
his name, the shaman also integrates the two interactive situations
into a single frame. One way he achieves this is by shifting between
prayer addressed to spirits (2.1–2.2, 2.13–2.16, 2.20–2.25) and ordinary
dialogue addressed to copresent persons (2.3–2.12, 2.16–2.19), with
alternating intervals devoted to each. The orientations of his body keep
the two distinct, facing toward the altar in prayer versus turning his head
in the direction of the patient when addressing him or his wife. The
two orientations correspond to two complementary faces of divination
as a genre, which addresses both patients and spirits. Regardless of
orientation, the shaman’s right index finger never loses contact with the
crystals.7 His first bid for the name occurs at 2.2, in the course of a
normal breath group in prayer, marked by the quick head turn and
ordinary question intonation. When he must subsequently cite the
name in prayer, he seems to have forgotten it. After a two second pause
he states the first name only and pauses again, which the wife hears
as a request for completion (2.17). This response by the wife to a short
pause is clear evidence that she is attending closely to the prayer and
its timing.
When he returns to prayer at 2.20, the full name is syntactically
integrated into an utterance addressed to the spirits, yet delivered
with the prosody of ordinary Maya (2.20–2.25). The phrasing (“the
name of the body, I present it to your right hand, review the earth,
the blood, the spirit”) indexes the register as prayer, but the ordinary
prosody and the absence of spirit names make it readily intelligible to
a Maya speaker. This suggests a second strategy of integration: starting
with his bid for completion in 2.16 through 2.23, the shaman’s speech
is intermediate between the two situations, just as it is intermediate
between the registers that index them. He is, as it were, in both frames
at once. The result is to bind the two situations together and secure
the counterpart relation between the copresent patient and the body
presented to the spirits. Even if the patient does not understand all
that is going on, he understands that the shaman is presenting him,
and the need for full name and home town officializes this fact, further
reinforcing his credibility.
What ensues over the next 3 minutes and 14 seconds is the pivotal
three-way interaction in which the shaman, the patient and the spirits
coparticipate to derive a diagnosis that the patient will ultimately ratify.
This segment of the clinical episode is pivotal because the patient’s
participation is transformed, from an attentive overhearer called on
to give precise public information (name and town), into an agent in
his own diagnosis. Drawn in by the near intelligibility of the register,
the semiotic density of the altar and the shaman’s gestures, and the
occasional question, the patient builds up a conviction in the plausibility
of the process. This basic conviction is transformed further in subsequent
phases, but it is the basis of his ultimate ratification of the diagnosis.
Part of this is shown in phase 3.
3. Phase 3: Combined dialogue with spirits and patient
3.1 DC #tuk’aba’ Cristo Jesus, Dyos Padre ‘espiritu santu. .
[AV.07.21.07]
in the name of Christ Jesus, god the father, holy
spirit
3.2 ((taking two crystals, one at a time, in right hand and
squinting at them))
3.3 (10.8) trés dòos.
Three two
3.4 # (5.0) kí’ichpam kó’olebi sáasil ak’ab ‘indio mayab’
yum papal kòol chak
Beautiful Lady Sasil Akab Indio Mayab Lord Papal
Kol Chak
3.5 yun chiri’ chabo’ sàanto chabol señor yúun sala’ . (1.0)
sáasikunten (.)
Lord Chiri Chabo Holy Chabol Lord Sala, enlighten
for me
3.6 ‘usàanto kristal le yum balan ‘ìik’ó’o tuk’aba’ Crìisto.
[AV.07.21.48]
the holy crystals of the lord jaguar spirits in the
name of Christ
3.7 ((taking one crystal in right hand, squinting at it))
3.8 (6.0) kwatro sìinko. ((nods twice; takes another
crystal in right hand))
four five
3.9 (9.0) ‘àah. Pwes (2.0) leti e k’oháani yàan tech a’.
Oh, so this illness you’ve got, [AV.07.22.05]
3.10 (2.0) ‘ump’ée ‘ìik’ kumáan tutohil ‘a’estòomagóoh. (1.0)
Kuyahkúuntk
a wind is passing directly in your stomach. It pains
3.11 ‘a’estòomagóoh kuprovokarkech. (1.5) ump’éeh ‘ìik’.
(1.0)
your stomach, it makes you nauseous. A wind
3.12 Le ‘ìik’ (1.0) kukrusàaro’o’, (2.0) má’ má’alo’ bá’al i’
(2.0)
the wind crossing there, it’s not a good thing.
3.13 kutz’ó’okol ubin té’el o’, ku’áatakartik xan apú’uch.
[AV.07.22.37]
after it goes there, it attacks your lower back
3.14 Patient: ‘impú’uch. =
My lower back
3.15 DC = hàah. (1.5) Kuluk’ul té’e tapú’ucho’, (.) kubin
taserebro, tapòol.
Yeah. It leaves your back there, it goes to your brain,
your head.
3.16 Patient hnng.
3.17 DC (2.0) ‘Esyaskèe, (1.0) bey utsikbatik ten té’ela’.] (1.5)
‘esyaske tech
So that’s what it’s telling me here. So you’re the one
who
3.18 awohe xane’, bix yúuchu tech . ((toss of gaze over right
shoulder towards patient behind, onset at bix)).
who knows how it happens to you.
3.19 Pero tene’ mináang—táan intzolik tech (.)
but me. There’s no—I’m explaining to you
3.20 leti e’ bá’a= hé’e [AV.07.22.54]
the thing here-
3.21 Patient = kyúuchl en túunee, (1.0) ku(ya’a) le impòol yéetel
impachk’abe’. =
So it’s happening to me, my head and back hurt
3.22 (patient uncrosses arms and gazes down)
3.23 DC =hàah
Yeah
3.24 Patient: yéet innak’. [AV.07.23.03]
and my belly
3.25 DC hàah yét anak’.
Yes and your belly.
3.26 Patient: ‘estòomagóoh, múnk’amik misbá’ah.
Stomach. Can’t hold anything down.
3.27 DC: hàah. (1.0) hé’ebix, (1.0) tuyá’aka’, bey, ‘awá’ak ten
xan yúuchu tech.
Yes, just, as it says here, so, you tell me it happens
to you.
3.28 (2.0) kutzikbatik-
It tells it-
3.29 Patient -hnng [AV.07.23.17]
Coming to the end of the opening prayer at line 3.1 (which corresponds
to 2.25 above), the shaman takes two crystals in his right hand, one at
a time, leaning forward and squinting as he studies them. After ten
seconds he utters the numbers “three two.” Five seconds later, he returns
to another breath group of réesar, takes another crystal in his right hand,
scrutinizes it for six seconds and utters “four five,” nodding twice as he
speaks. After nine more seconds of silence he begins to formulate the
problem in ordinary Maya for the patient, starting with the patient’s
stomach, with his nausea (3.11) and moving to the back ache he suffers
(3.13). At the mention of his back, the patient repeats “my back.” The
shaman takes this repetition as confirmation of his initial statement of
symptoms, and adds that there is headache as well (3.15), which the
patient confirms with sublexical hnng. At this point the shaman asserts
that this is how he sees it “right here” (in the crystals), that the patient
is the one who knows how it actually happens. The latter statement
effectively calls forth a more explicit confirmation by the patient that
what he feels is what the shaman has said (3.21, 3.24, 3.26). In these
turns at talk, the patient restates the initial diagnosis and ratifies its
accuracy. He has become a principal in the diagnostic process.
Several aspects of this sequence bear on both the management of
the patient’s beliefs and the problem of integrating the two frames of
interaction into one. At the outset, when the shaman states numbers
(3.3, 3.8), it is flagrantly opaque what he is counting and why. The breath
group of réesar that intervenes between the two number statements is
itself mostly unintelligible, spoken softly and rapidly and citing spirits
unknown to all but adepts. The pacing of the shaman’s turns at talk
has slowed down and there are long silences between his utterances.
He nods twice in certainty (3.8), although there is no clue what he is
certain of or why. At this point the patient is maximally excluded from
understanding the expert practice of the shaman, but is attending to
a process he believes to be meaningful. He is simultaneously a ratified
witness of the performance, and its main object.
After another long pause, the shaman proceeds to make statements
about the condition that precisely formulate the patient’s felt symptoms,
starting with what hurts. The inferences that lie behind these statements
are based at least partly on his observation of the patient’s expressions
of pain, but he never appeals to observational evidence of a sort the
patient would recognize. By telling the patient what he feels, he elicits
confirmation from him, a new step toward inducing the patient’s
conviction in the accuracy of the process. The patient is called on to
endorse statements by another about his own pain. At this point, he
“unloads” the symptoms: “my head, my upper back, my belly ache,”
and the shaman once again asserts that just as it says (in the crystals),
so it occurs to the patient (3.27).
By the time the patient ratifies this last statement (3.29), he has
common ground with the shaman regarding his physical symptoms,
if not their causes. He has seen that the shaman can tell him where he
hurts, and he has collaborated in the process by ratifying the shaman
and repeating the same symptoms in his own words. The back and
forth between the two synchronizes them (cf. Enfield this volume)
and sustains their joint engagement. The elicitation of confirmation
builds the patient’s conviction that the shaman can perceive things no
“ordinary” interlocutor would. Along the way, he develops what Clark
(this volume) calls a “joint commitment” to the divinatory project. All
of this elicited commonality supports the counterpart relation between
the crystals, whatever the shaman sees in them, and the patient’s body.
It also enhances the shaman’s credibility for the case at hand, which sets
the stage for the next step. Having provisorily convinced the patient
that he is on the right path, the shaman goes on to make more bold
and difficult claims about him. The joint convictions and commitments
established so far carry over into the tenuous terrain to be crossed. For
the sake of brevity, this segment of the transcript is omitted but can be
summarized as follows.
Over the course of 3 minutes and 24 seconds, the shaman makes
a series of statements about the patient. As he is speaking he looks
intently at the crystals, thus splitting his attention between listening
to the patient and watching for the signs of the spirits. Not only is his
pain caused by “spirit, wind” affecting his body, but it is connected
to his job and the land he owns. The patient continues to register the
shaman’s claims with consensual responses, and the shaman proceeds
to build up a more detailed picture of the patient’s life. He tells him
approximately,
you have lands, where you have a traditional house [apsidal, palm roof].
Somewhere on the land there is a cave, an abandoned limestone pit,
or other hole leading underground. In that hole the evil spirits reside
and from it they emanate, striking you and any other animate in your
household [human or animal] who happens to be in the wrong place at
the wrong time. This spirit is very old and it demands offerings. It is evil
and endangers your entire household.

Throughout this remarkable series of assertions, the patient nods
and consents with monosyllabic responses. By the end, the shaman
has told him that the spirit moves past the corner of his house, using
the semantically precise term moy (“end,” that is, the rounded end
of an apsidal-shaped human dwelling) to denote its path. The man’s
short responses synchronize with the shaman and he clearly experiences
recognition of what he is being told (cf. Enfield, and Goodwin in
this volume on synchronization). Once he has been fully engaged, the
shaman unilaterally turns away without comment, and returns to the
closing prayer of the divination proper.
The shared conviction in the accuracy of the process, and the joint
commitment to the divinatory project took shape as the shaman
made claims about the patient’s physical symptoms. Stomach, back,
and head pains could be verified by the patient, whereas what follows
is mysterious and beyond direct verification. The register is ordinary
Maya throughout and it is vital that the patient experience it as fully
intelligible, however obscure its rationale. To be interactively effective,
however, the shaman’s statements require more than intelligibility.
They need to be buttressed by the patient’s conviction that they are
true, even in those aspects that remain unknowable. This conviction
picks up where common knowledge and understanding leave off. It
resolves unintelligibility with a leap of faith. In ordinary conversation,
Maya speakers are skeptical and challenge one another regularly. It is
an article of common sense that people make claims that turn out to
be untrue (cf. Danziger this volume). Yet in this setting, the shaman
must induce the patient to suspend skepticism and endorse nonintuitive
propositions. The sequential development of this conviction is key.
As Clark (this volume) points out, interaction builds participatory
commitments among the parties. In this case, the credibility of the early
claims about symptoms is carried forth by the cumulative participatory
commitments of the parties, enacted in their coengagement in talk. For
this to occur, the shaman must successfully manage the integration of
the two situations in the divinatory setting. Integration feeds common
ground, and common ground enhances integration.
The de facto division between what the shaman and patient know
is partly offset by the shaman’s demonstration that he has access to
esoteric knowledge. When the shaman formulates the patient’s pain, the
patient must experience recognition sufficient to warrant investing his belief
in the grander claims to come. From the outset, the crystals play a
pivotal role, because they provide the counterpart signs through which
the shaman sees the patient’s symptoms and their broader causes. The
spirits, signing through the crystals, are the ultimate “principal” of the
shaman’s statements. Like the shaman himself, the crystals participate
in both situations. In the ordinary interaction with the patient and his
wife, they stand for the spirits as ratified overhearers. Their constant
presence throughout the episode is indexed by the shaman’s finger in the
holy water, with which he never loses contact. In the ritual interaction,
the crystals are a means of address to the spirits and the patient is in the
role of overhearer. The crystals are therefore the mediating technology
that links the two situations and their respective participants.
The work of integration is observable in the details of talk. The two
situations are articulated in this phase using the same semiotic resources
we saw earlier. The dual orientation of the shaman’s body indexes
two different footings, with arms and torso to the altar (situation 1)
and head occasionally glancing toward the patient (situation 2). The
language is precise. When interacting with the patient, the shaman
invariably refers to the signs in the crystals (situation 2) using the
immediate deictic forms in a’ (3.9, 3.17). This is ordinary practice in
which the a’ forms are used to index objects saliently more accessible
to the speaker than to the addressee (Hanks 1990, 2005). By contrast,
when he refers to the patient’s body directly, he uses deictics in o’ (3.12,
3.13, 3.15), just as would be expected in ordinary usage when a speaker
refers to an addressee’s body. This alignment is maintained throughout
the episode, including the extended dialogue not reproduced here, in
which there are four more tokens of a’ and 12 more tokens of o’. The
juxtaposition of the two deictic construals maintains the counterpart
relation between the images in the crystals and the actual body and
home of the patient. After the early series of statements about the
patient’s body construed in o’, the shaman elicits explicit confirmation:
“this’s how they’re telling me here-a’. So you know too how it happens
to you” (3.18). All of this is spoken in ordinary Maya, and from the
perspective of the shaman–patient interaction. On cue and without
hesitation, the patient reconfirms (3.21), and the two proceed to jointly
reformulate the symptoms, taking turns. By line 3.29, the patient has
effectively assumed responsibility for the statement of his symptoms,
which enhances both his participatory commitment to the process and
his growing conviction in its accuracy.
By the end of phase 3, the shaman has spelled out the whole scenario,
complete with the underground spirits in the man’s living space. The
patient has come to presuppose the truth of the diagnosis and is worried
about the prognosis. At the end of this segment, the shaman links the
evil in the earth directly to the man’s pain, saying it is the cause of the
dóolor (pain) the man páadesertik (suffers). The patient assents, pauses,
and asks if there is any cure. Example 4 picks up at this utterance (4.1)
and continues through the end of the divination (4.26).
4. Phase 3: Shaman and patient dialogue
4.1 Patient (4.0) héuts’áa uts’àake’? [AV.07.26.41]
Is there medicine for it?
4.2 DC yàanH (.) k ts’áak uts’àak. (1.0) tumèen ts’óok k ‘ilik.
(1.0)
Yes, we will give the medicine, because we’ve seen it
4.3 chéem bá’ax e’, kinwáaik tech, yéetel uchan (.) tùukuli
(1.0) ká ‘úutzak tuláakah
Only thing, I’m telling you, with (proper) thoughts
so everything gets better
4.4 wá yàan apàalal e’, kanáanteh, (1.0) wá yàan abyèenes,
if you have kids, guard them, if you have
possessions
4.5 ti bá’alóo bèey tawotoch o’, kanáanteh, (1.5) wá
kanáant afáamilyàah.
Like your house, guard it, or guard your wife.
4.6 (1.0) kyanta tech tyèempoh, (1.5) kak’ahóohtik upáahta
awutzkíintik
(When) you have time, you see you can improve
4.7 alú’umo’, (1.5) héetz’lú’umteh. (1.5) ká páahtak u-
‘alìibreta ti e
your land-o’, do a Hets Luum, so you can be free
from
4.8 bá’al hé’ela’. (2.0) ká lúu’sá’aki’ =
this thing here-a’, so it is banished
4.9 Patient =hm
4.10 DC yéete hetz’lú’ume’, kuyutztah. (2.0) uyutztah.
With the Hets Luum, it’ll get better, it’ll get better.
4.11 Patient hm
4.12 DC hàah. (2.0) lelo’ ‘ástah le k’ìin ((quick glance towards
patient))
Yeah. That one, until the time when
4.13 kupáahtal amèenke’. Be’òoráa’ bíin inkín ‘il wáa
kintz’ahkech (3.0)
you can do it. Now I’ll see if I can cure you.
4.14 kup’a’ta penyèente elo’ . (1.0) leti e tzikba mèená’an ten
a’,
That will be left pending. This talk made to me-a’,
4.15 letie’ kintzikbatik techa’. [AV.07.27.48] (8.0)
is this that I’m telling you-a’.
Lines 4.1–4.15 bring phase 3 to closure. Like all of the shaman–patient
interaction, this is in ordinary Maya with standard pronunciation and
little technical language. The shaman’s speech is deliberately slow and
punctuated by relatively long pauses (1–3 s) between clauses. It is vital
to his current purpose that the patient understand what he is saying
and take seriously the danger posed by his situation. The pacing gives
him time to absorb the meaning while conveying that the shaman is
speaking deliberately. Having assured him he would be cured, he warns
him to guard his children, if he has any, to guard his possessions like
his home, if he has any, and his wife, who was copresent. The unstated
message is that they are all in danger because of the movement of
the spirit in the yard. By shifting the focus away from the man’s own
physical symptoms to his domestic sphere, the shaman evokes the
patient’s concern and responsibility for his nuclear family. At the same
time, he refocuses attention on the land, which was the object of his
most far-reaching and unverifiable claims. To the extent that he succeeds
in linking the man’s fear for his family with the diagnosis of malevolent
spirits in the land, he has further reinforced the patient’s conviction in
the truth of the divination. At this point, it is a truth sustained not by
understanding what is known but by fear of what is unknown.
The tie to the land serves as a preparatory condition for the shaman’s
recommendation of the ultimate treatment, which he formulates in 4.7
and 4.10. The term héetz’lú’um (fix earth) is the name of an expensive,
time- and labor-consuming ritual in which a domestic space is cleansed
of malevolent spirits. The cost is such that the patient would have to
save money to pay for it. The shaman knows this and is careful to assure
the man that he need not do it right away, but should do it when he
can. Thus, the broader dangers and their treatment are left as a future
horizon to which the man will orient. The symptoms of being struck
by a spirit are sufficiently diffuse and common that this diagnosis will
be available as an explanatory framework for any number of problems
that may subsequently arise in the extended household.
The shaman’s final turn at ordinary talk shows a new pattern of deictic
usage (4.12–4.15). The two tokens of o’ (4.12, 4.14) both refer to previous
talk, rather than referring to the patient’s home or body. This is typical
of closing sequences in Maya interaction, in which anaphoric deictics
resume the preceding talk, as in “OK, that’s it.” Such usage
construes the preceding talk as common ground,
readily accessible to both parties. In the last two sentences the shaman
uses a’ forms, and this is also a shift (4.14–4.15). The first sentence, “this
explanation made to me-a’ (by the spirits),” maintains the pattern of
construing the crystals and their signs as a’. But the last sentence breaks
the pattern by also construing the shaman–patient interaction as a’:
“this is what I’m telling you-a’.” We would have expected “that is what
I’m telling you-o’,” because the reference is to the preceding talk. This is
the first token in the entire episode in which the shaman construes his
ordinary interaction with the patient using an a’ deictic. The parallel
construal of the ritual situation and the ordinary face-to-face situation
equates the two indexically. Similarly, the precise syntactic parallelism of
the two statements reinforces the identity between them. Like his earlier
management of the counterpart relations between signs and bodies,
this is a way of solving the integration problem. But this utterance goes
beyond aligning and comparing. It simply collapses the two situations
into a single construal. This move culminates the work of integration
by merging the two situations into a single indexical ground.
After eight seconds of silence, the shaman unilaterally turns away
from the patient toward the altar and shifts into réesar register,
with the expectable prosody and particles. He thereby redivides the
indexical ground into two situations, excluding the patient from the
ritual interaction with spirits. The first breath group (4.16–4.18) uses
expressions recognizable to any Maya speaker, but whose meaning
would be obscure to nonspecialists. The second breath group (4.19–4.24)
would be almost entirely unintelligible and unfamiliar to the patient,
because it consists mostly of secret spirit names. It is delivered sotto voce
and very fast, obviously not intended for the patient’s ears. The brief
toss of holy water is clearly meaningful, but like so much else on the
altar, its meaning is unknown. After the final sign of the cross, there is
silence for 13 seconds, during which the patient sits alone and absorbs
what has transpired. At least for the moment, his physical suffering,
his concern for his family, and his participatory commitment to the
ongoing divination have converged on a conviction. What we have
been told really is what is happening. Quietly, he asks his wife “So what
will we do (now)?”
5. Phase 4
4.16 # ‘in yúum, ‘úuch ink’áatk e podèer ti xan tuláakal asàanto
bakan
My lord I’m requesting the power of all of your saints
4.17 tasíih ubèeyntisyòon bakan inlíi’slaht’áantikó’o tuláakal e a
sàanto
you gave the blessing, I raise them all up all of your saints
4.18 tinhúuntartó’ob xan tuchun amèesah, [AV.07.28.07]
Who I gathered together at the foot of the altar
4.19 # ((sotto voce)) ‘utía’al insutik yíik’a bakan yun ht’úup
balam xóo’ baláam
So that I return the spirits of Lord Tup Balam Xo
Balam
4.20 ‘ah balam túun ((toss holy water with right index
over right shoulder))
Ah Balam Tun
4.21 p’iris túun balam síit’ bolon túun ‘ah k’ìin chan ‘ah
k’ìin koba’ xan ‘ah k’ìin
Piris Tun Balam Sit Bolon Tun Ah Kin Chan Ah kin
Coba and Ah kin
4.22 kolon te’ tz’iíb. Táan bakan insutkó’o ((glance up at
images on altar))
kolonte Dzib. I am returning them
4.23 tuk’àaba’ Crìisto Jesus ((making sign of cross over
crystals with right hand))
in the name of Christ Jesus
4.24 tuk’àaba’ dyòos pádreh ‘espiritu sàanto. ((soft))
[AV.07.28.22]
in the name of god the father holy spirit
((places the crystals back on raised portion of altar,
clears throat))
4.25 Patient (13.0) ((sotto voce to wife)) bix túun hé’ kmèenk e’
So what’ll we do?
4.26 Wife ((unintell.))

Conclusion: Integration and Induced Commitment


In this chapter, I have examined the local integration of multiparty
talk in a ritual context. The example is exotic but the problems it
poses are relevant to much, if not all, of human interaction (Schegloff
1987). Integration is always a problem because communicative practice
takes place in changing, often fragmented circumstances, in which
participants engage. Multiple situations must be coordinated any time
there is mediation, like the telephone, computer displays, or televideo
(Hutchins, Keating in this volume), when expert and nonexpert
interactions are combined, or when the access of the parties to the
setting is significantly asymmetric. Gaps in common ground are in play
any time one party has relevant knowledge the other lacks, especially
when this knowledge is consequential to the one who lacks it. Divination
gives the problem a special salience, by degree and by design. The
shaman’s challenge is not only to bring about shared knowledge of the
case, but to induce commitment on the part of the patient. He must
bring the patient to accept what he cannot know or even understand.
A similar challenge faces Western doctors, lawyers, mechanics, expert
witnesses, and any party whose goal is to “bring the other on board”
with a description.
The resources for integration and shared commitment are multiple.
The built space in which our divination took place is a stable, publicly
available and semiotically replete setting. The two situations are
embedded in it, and this anchors them in a broader framework.
Although shaman and patient have different understandings of this
space, both know that what is happening is a divination in which
speech and gestures are especially meaningful. They speak a common
language, with a common repertoire of indexical construals, epistemic
particles, basic lexicon, and grammar. Even the esoteric register of
prayer is recognizably Maya and almost intelligible to the patient, in
fragments. Both know from encounters with church and state that the
proper name and the town of residence jointly officialize individual
identity. The shaman uses these shared semiotic resources to integrate
his esoteric engagement with the spirits with his ordinary engagement
with the patient and his wife. His strategies are familiar from ordinary
talk: sequential juxtaposition to synchronize the two situations (Enfield
this volume), counterpart relations by which an object in situation 1 is
the counterpart of an object in situation 2, precise use of indexicals to
maintain distinction and coordination between the two situations, and,
finally, equation of the two (4.14–4.15). The shaman reads the patient’s
body in the presence of his pain and of the signs in the crystals. He
meticulously develops the counterpart relations between the two so
that the wisp of smoke in the crystal is the counterpart of the wind in
the patient’s body. As he manages this multistranded integration, he
induces the patient to participate and commit to the process, which
gives rise to a conviction about his own suffering: this is what I am
suffering.
The patient’s conviction encompasses both what he knows and what
he does not know and cannot verify. It is “participatory” in Clark’s sense
(this volume): the patient is committed to the process and convinced
of its outcome, insofar as the shaman is committed and convinced. As
a basis for interaction, such reciprocal commitments are stronger than
common knowledge or understanding. They induce a leap of faith
whose extent is greater than what is known. The semiotic resources on
which the shaman depends are mostly mundane, even if his use of them
is masterful. He establishes commonality with the patient, naming his
physical symptoms without being told. Thus, the patient experiences
recognition, which further confirms the shaman. He is drawn into the
process and his participatory commitment to it is carefully cultivated
by the shaman. By getting him to restate his symptoms in his own
words, the shaman makes the patient a principal in the diagnosis. His
participatory commitment builds through the event from the low level
of backchannel to the high point of jointly producing the speech acts
of naming, stating symptoms, and formulating the cause. Analogous
kinds and degrees of participant involvement are part of all human
interaction. What the present case illustrates is that sociality need not
entail common knowledge, but still gives rise to joint commitment.

Notes
1. On ritual speech compare DuBois (1986) for a Mayan example, and
Zeitlyn (1995) for a contrasting case.
2. See Bourdieu (1991a, 1991b) for discussion of “authorized” speech and
its relation to performative effectiveness. See also Chafe (1993).
3. Yucatec Maya consonant phonemes are /p, t, k, p’, t’, k’, b, s, x, h, tz, ch, tz’,
ch’, m, n, w, y, l, r/, where /’/ = glottal stop following a vowel and glottalization
following a consonant, /b/ = voiced bilabial implosive, /x/ = voiceless alveo-
palatal fricative, /h/ = voiceless glottal fricative, /tz(‘)/ = (ejective) voiceless
alveolar affricate, and /ch(‘)/ = (ejective) voiceless palatal affricate. Syllable
nuclei are made up of combinations of five vowels (i, e, a, o, u), three tones
(high ´ , mid [no accent], low ` ), length, and glottalization. Length is indicated
by the doubling of a vowel, and glottalization is indicated by an intervocalic
glottal stop ‘ . The canonical vocalic patterns are /i, e, a, o, u/, /íi, ée, áa,
óo, úu/, /ìi, èe, àa, òo, ùu/, and /í’i, é’e, á’a, ó’o, ú’u/. However, short vowels
with tones also occur and are derived either by grammatical processes or by
paralinguistic ones. Glottalization is also realized as creaky voice or even by
eliminating the glottal stop completely. The latter case results in a long vowel
with high- to mid-falling pitch but remains distinct from the (nonglottalized)
high tone series /íi, ée, áa, óo, úu/, which is pronounced variably with rising or
falling pitch. Spellings of place-names such as Oxkutzcab are orthographically
unmodified from their Spanish spellings.
4. Transcription of Audio Visual Tape 07, recorded in Oxkutzcab, Yucatan,
Fall 1991. In Don Chabo’s home, in the altar room. A man and woman have
arrived requesting a treatment for the man, who is in physical distress. Present
are also Hanks and Peter Thompson, a filmmaker who recorded the episode.
The layout of the room is shown in Fig. 11.2. The recording starts at the onset
of the divination. Ritual speech delivered in unmeasured breath groups,
inhalations marked by cross hatch (#).
5. On register and stylistic differentiation, see Eckert and Rickford 2001,
and Irvine 2001.
6. The earliest treatment of the dovetailing of participants’ motives, and
the reciprocity of perspective that it implies, is Schutz (1967).
7. DC’s care to not lose touch with the crystals is motivated by the idea
that once prayer is started, “the thread should not be broken,” that is, the
connection to the spirits should not be allowed to lapse until the end of the
event.

References
Bourdieu, P. 1991a. Authorized language: The social conditions for
the effectiveness of ritual discourse. In Language and symbolic power,
edited and introduced by J. B. Thompson, 107–116. Cambridge, MA:
Harvard University Press.
——. 1991b. Description and prescription: The conditions of possibility
and the limits of political effectiveness. In Language and symbolic power,
edited and introduced by J. B. Thompson, 127–136. Cambridge, MA:
Harvard University Press.
Bucholtz, M. 2000. The politics of transcription. Journal of Pragmatics
32(10):1439–1465.
Chafe, W. 1993. Seneca speaking styles and the location of authority.
In Responsibility and evidence in oral discourse, edited by J. Hill and J.
Irvine, 72–87. Cambridge: Cambridge University Press.
Cicourel, A. 1992. The interpretation of communicative context:
Examples from medical encounters. In Rethinking context: Language
as an interactive phenomenon, edited by A. Duranti and C. Goodwin,
291–310. Cambridge: Cambridge University Press.
——. 2001. Le raisonnement médical. Paris: Seuil.
Clark, H. H. 1992. Arenas of language use. Chicago: University of Chicago
Press.
DuBois, J. 1986. Self-evidence and ritual speech. In Evidentiality: The
linguistic coding of epistemology, edited by W. Chafe and J. Nichols,
313–336. Norwood, NJ: Ablex.
Eckert, P., and J. R. Rickford (eds.). 2001. Style and sociolinguistic variation.
Cambridge: Cambridge University Press.
Goffman, E. 1972. The neglected situation. In Language and social context:
Selected readings, edited by P. P. Giglioli, 61–66. New York: Penguin.
——. 1983. Footing. Semiotica 25:1–29.
Goodwin, C. 1994. Professional vision. American Anthropologist
96(3):606–633.
——. 2000. Gesture, aphasia and interaction. In Language and gesture,
edited by D. McNeill, 84–98. New York: Cambridge University
Press.
Hanks, W. F. 1990. Referential practice, language and lived space among the
Maya. Chicago: University of Chicago Press.
——. 1996. Exorcism and the description of participant roles. In Natural
histories of discourse, edited by M. Silverstein and G. Urban, 160–220.
Chicago: University of Chicago Press.
——. 2001. Exemplary natives and what they know. In Paul Grice’s
Heritage, edited by G. Cosenza, 207–234. Turnhout: Brepols.
——. 2005. Explorations in the deictic field. Current Anthropology
46(2):191–220.
Haviland, J. B. 1993. Anchoring, iconicity and orientation in Guugu
Yimithirr pointing gestures. Journal of Linguistic Anthropology 3:3–45.
Irvine, J. 2001. “Style” as distinctiveness: The culture and ideology of
linguistic differentiation. In Style and sociolinguistic variation, edited by
P. Eckert and J. R. Rickford, 21–43. Cambridge: Cambridge University
Press.
Kendon, A. 1992. The negotiation of context in face to face interaction.
In Rethinking context, edited by A. Duranti and C. Goodwin, 323–334.
Cambridge: Cambridge University Press.
Kita, S. (ed.). 2003. Pointing: Where language, culture, and cognition
meet. Mahwah, NJ: Erlbaum.
Ochs, E. 1979. Transcription as theory. In Developmental pragmatics,
edited by E. Ochs and B. Schieffelin, 43–72. New York: Academic
Press.
Schegloff, E. A. 1987. Between micro and macro: Contexts and other
connections. In The micro-macro link, edited by J. C. Alexander, B.
Giesen, R. Münch, and N. J. Smelser, 207–234. Berkeley: University
of California Press.
Schutz, A. 1967. The phenomenology of the social world. Northwestern
University Studies in Phenomenology and Existential Philosophy. Evanston,
IL: Northwestern University Press.
Sweetser, E., and G. Fauconnier. 1996. Cognitive links and domains:
Basic aspects of Mental Space Theory. In Spaces, worlds, and grammar,
edited by G. Fauconnier and E. Sweetser, 1–28. Chicago: University
of Chicago Press.
Zeitlyn, D. 1995. Divination as dialogue: Negotiation of meaning with
random responses. In Social intelligence and interaction: Expressions
and implications of the social bias in human intelligence, edited by E.
Goody, 189–205. Cambridge: Cambridge University Press.
twelve

Habits and Innovations: Designing Language for New, Technologically
Mediated Sociality
Elizabeth Keating

My goal in this chapter is to show how a new technology is used as a
resource for communication, and how in the process of organizing
its use, people alter the communication system itself. This illustrates
an important process whereby social actors are not only shaped by
cultural practices but reshape cultural practices through cooperative
interaction, and the role of tools in motivating and mediating change.
How do interactants transform aspects of a symbolic system? How do
microlevel innovations relate to building new shared conventions while
cultural ideas retain “fidelity” over time (Sperber this volume) in spite
of adaptations to local resources and conditions?
In this chapter, I look at a new interactional context (see Figs 12.1
and 12.2)—in which deaf signers use visual language to communicate
in cyberspace—for what it can tell us about how new conventions
emerge as adaptations to new constraints and possibilities. Humans use
various multimodal semiotic systems to maintain as well as build new
realities and meaningful relationships across interactions. New sites of
“interindividual territory” (Volosinov 1986) impact social interaction
through people’s actions within them, as commonly shared procedures
are used to create new representations, subjectivities, and sequences. The
success of new procedures relies on close attention to recipient design,
which emerges as a key factor (see also Enfield this volume), as well as
collaboration in the management of new contextually based indexical
relationships, understanding and exploiting representational properties
of the body, and jointly managing properties of the interactional space
within which activity occurs.

Figure 12.1. Old technology: text typing tool.

Macro Versus Micro


Scholars from a range of disciplines have tried to understand and theorize
the link between “macrolevel” (widely shared social phenomena) and
“microlevel” actions (contingent and creative acts related to particular
situations), considered a “long standing problem in social science”
(Frumkin and Kaplan 2002). Even in the field of distributed artificial
intelligence (DAI), researchers are trying to develop agent theories to
apply to large sized multiagent systems, to understand “what is essential
for the emergence of structure from micro-interactions” (Schillo et
al. 2001). How new processes emerge from the routine practices and
typifications of everyday life, and how interactive and organizational
procedures transform microevents into macrosocial structures (Cicourel
1981; see also Foucault 1980) is of interest to anthropologists, sociologists,
and others who study human behavior.

Figure 12.2. New technology: Face-to-machine with signs.

The nature of the relationship between actions by individuals in
specific settings and the constitution and reconstitution of social
institutions is complicated because of a number of factors. Individual
actions are varied, complex, and depend on context and audience for
interpretation. Social actors do not follow rules so much as respond
to and choose from a set of contingent expectations that vary with
context and participant frameworks. Effects at the system or society
level are at times neither intended nor predicted by what happens at
the individual level (Giddens 1984). The outcome of a momentary
interaction is “something none of the parties can plan in advance” but
is a “contingent product” (Levinson this volume). Another problem
is how unreliable the replication of human practices by individuals
can be, even when replication is the goal, and thus how unreliable the
transmission of culture among members of a society is (Sperber this
volume). It may be that we do not imitate others so much as incorporate
observed outcomes of their actions (Tomasello this volume) into our
own planning processes.
Although research in face-to-face interaction and the ethnography
of communication has challenged long-standing macrosociological
assumptions, theories, and methods (Collins 1981:81; Knorr-Cetina
1981:1; Levinson 2005; Schegloff 2005) and called for a rethinking of a
system level approach concerning social institutions and sociocultural
change, it is not clear how to evaluate those aspects of shared practices or
institutions that are not directly observable in interaction (see Schegloff
1987). An exclusively microinteractional orientation to understanding
culture and society has been criticized as actor focused, reductionist,
trivial, situated, and subjective. The dichotomy between macro and
micro itself has been characterized as relevant only within
scholarly discussions or only for analysis (Alexander and Giesen
1987:1). Yet it seems clear that the emergent and dynamic nature of a
single, contingent interaction is dependent on and complementary to a
cross-event set of experiences with established categories or routines for
interpretation that maintain an orderliness that characterizes human
sociality and without which local intersubjectivity would not be possible
(Hanks this volume). The close examination of microinteractions
(Schegloff 1968, this volume) or investigation of relationships between
language and cross-event, nonlinguistic problem solving (Levinson
2003, 2005) reveals culturally based habits and routines with historical
dimensions that groups share, “a past which survives in the present
and tends to perpetuate itself into the future by making itself present
in practices structured according to its principles” (Bourdieu 1977:82).
Institutional structures influence the actions of humans through
regularized systematized procedures (Duranti 2003) that are seen as
noncontingent, but these are also adaptable to new situations, as the
deaf signers I discuss below show.
When new technologies or tools are introduced that disrupt conventional
procedures or require new ones, we can investigate the collaborative
production of new shared systems and the means for reestablishing
coherence, as well as look at which resources from past interactions
are marshaled into new ones. The Internet and computer-mediated
communication (CMC), in addition, call into question significant
aspects of what “shared” might mean, for example, in the new ability to
communicate identically, yet individually, with (and receive individual
responses from) large numbers of people simultaneously, as well as the
ability to replicate with much greater accuracy.

The Development of Tools


Tools are an important part of “culture” and the emergence of (or
roots of) early human sociality is understood to lie in tools (Richerson
and Boyd 2004; see also Boyd and Richerson, Gergely and Csibra,
and Tomasello in this volume). Tools shape ways of doing and ways
of thinking, communicating, and imagining (Hutchins this volume;
Vygotsky 1978). Humans have historically invented many types of
cultural tools, both material (cooking equipment, farming implements)
and symbolic (language, stories). Tool-based hunting created a novel
set of adaptive pressures, and cooking technologies may have led to
increased brain size (some anthropologists claim that tubers—and the
ability to cook them—prompted the evolution of large brains, smaller
teeth, modern limb proportions, and even male–female bonding;1
others cite animal products as essential in the evolution of the large
human brain, see Aiello 1995). Our various tools can be thought of as
making up a “cognitive technology” (see also Goodwin, and Hutchins
in this volume). Tools and technologies engender not only types
of recognizable interactions, but become elements in macrosocial
economies or ecologies of power and knowledge (see Goodwin this
volume), when societies limit who can use tools or when. As an example,
recently in the business school at the University of Texas students were
given laptop computers, considered an advanced tool to enrich the
learning environment. Students were expected to bring their laptops to
class. However, many professors, as they walk into class, often say “please
close your laptops,” reconstituting the traditional classroom structure
and interrupting students’ engagement in interactions in spaces other
than those under the control of the teacher. Stability in form and intent
is produced in spite of locally new conditions. The teacher mediates
new adaptations to local resources and conditions within the classroom
through a conventionally recognized routine, a directive issued by
one culturally authorized to control classroom activities. In the site I
discuss below there are challenges to the organization of interaction
that foster innovations and new cultural and linguistic practices. Local
contingent actions (such as those discussed below) impact the use and
the reproduction of shared systems of understanding (see Danziger
and Gaskins in this volume for discussions of local meanings). New
affordances for action and new environments for action created by new
technologies influence subsequent interactions as the tool is explored
and exploited by users who challenge, question, and build coherence
through the establishment of new procedures and shared reciprocal
perspectives.

Communication Technologies
New technology leads to new permutations in what Suchman (1993)
has called the production of a coherent relation between a normal
order of events, and this can have revolutionary, unplanned effects.
Although Alexander Graham Bell was initially interested in creating a
technology that would help deaf people learn to speak, his invention of
the telephone actually disadvantaged deaf people communicatively and
left them out of one of the most important communication advances
of the last century, the ability to collapse spatial constraints and to
carry the human signal across vast distances. The deaf community did
not have telephone technology until nearly 100 years after the hearing
community, when an acoustic coupler that could link a telephone to
a teletype machine and produce text over private phone lines became
available, invented by a deaf physicist, Robert Weitbrecht (see Fig.
12.1 for the recent version of this text-based technology), who had
to overcome powerful opposition from the telephone company. An
earlier telephone device for deaf people, invented by a peer of Bell, the
Telautograph (messages were handwritten at one end of a wire with a
pen-lifting mechanism and were reproduced automatically on the other
end through the use of a stylus and a wide sheet of paper) was never
adopted (for discussion of aspects of language use in deaf communities,
see Goldin-Meadow, and Pyers in this volume).
The technology that has resulted in deaf people being able to project
a sign-language signal across vast distances, a simplified video camera
for web interfaces (webcam) and software to transmit digital images,
was itself originally designed for remote monitoring of a coffee pot. The
webcam tool first emerged, so the account goes, in the early 1990s in
the offices of the computer science department of Cambridge University,
designed by researchers who wanted to be able to have a remote view
of their communal coffee pot to see if it was empty and whether a trip
via several flights of stairs was worthwhile. They trained a digital camera
on the faraway coffee pot and the image was sent over the network.
Webcam technology or “prosthetic” eyes are now used not only for
viewing locations remote from the viewer, such as towns worldwide
(where they are sometimes mounted on buildings and streamed
through the Internet), but also for human interaction (Keating 2005). The
deaf community discovered the potential for vastly improved complex
visual-language transmission (over the previous text typing system), but
early adopters had to adapt signing to new spatial properties.
Webcam-recorded, computer-mediated space is radically different
from “real” space in terms of affordances for sign-language interaction.
The field of vision of the webcam lens is restricted in size compared with
human vision, for example, but less restrictive in terms of place. The
body, an important resource for displaying and assessing social meanings
and for organizing joint action (see Goodwin 2000, this volume) and
displaying joint commitments to activities (Clark this volume), is
constrained in how it can display itself in computer-mediated space,
although it is enabled to project itself around the world. Computer-
mediated space affords communication but restricts communicative
actions. Building a common ground (Enfield this volume) requires new
efforts and experiments by signers. Interactants using sign language
via webcam and computer must coordinate and understand a number
of different and related representations, including mechanical effects
of actions versus human ones. In terms of sign language, a language
dependent on the use of space, differences between three-dimensional
(3D) face-to-face interaction and two-dimensional (2D) space are
significant and meaningful. In creating comprehensible sign in 2D
space, interactants recruit procedures from existing or established
communicative environments and practices and adapt them. They
alter language features, manipulate and organize a visual interaction
space of computer-mediated images, icons, and texts, their own and
others’ bodies in space, and utilize a new metaperspective of their own
signing. They learn to recognize properties of 2D space, as well as how
the computer represents 3D space in 2D. They modify their signing
speed and code choice, and adjust important spatial reference and person
reference forms (Keating and Mirus 2003). They work together over
multiple interactions to develop reciprocity of perspectives and an
independent understanding of how individual actions are transformed
by the technology.

Disruptions of Habitual Practice


Some of the enormous challenges of visual human–computer
interaction and face-to-machine space are illustrated by the difficulties
that hearing users have regularly reported with video in computer-mediated space.
Although image transmission offers all of us, deaf and hearing, the
possibility of more complex symbolic interaction by adding many visual
communication features (gaze, facial expressions, head shakes, etc.),
research has consistently reported the dissatisfaction of (hearing) users
and their view of the “limitations” placed on meaningful, engaged
communication (Debourgh 1999; Simonson et al. 2000) with video
transmission. Webcams in 1999 were described as “a rare and nonpreferred
mode of Internet interaction” among hearing people (McKenna and
Bargh 1999:265), and in 2005 that was still the case. Students who use
interactive, video-based systems for distance education or work-related
activities report problems in satisfactorily fostering copresence and
engagement (Mazur 2000). They experience lack of recognition (through,
e.g., eye gaze), problems with getting a turn at talk, and exclusion. The
use of voice activation to gain the attention of an instructor in a virtual
classroom instead of nonverbal signals (hand raising) violates politeness
or appropriateness norms. For hearing participants who have been
studied in contexts using video transmission, video-mediated gaze is
often experienced as “ineffectual” (Heath and Luff 1993; Keating 2000)
compared with face-to-face interaction, because people are unable to
alter their perception of the other through their own bodily orientation,
that is, the perception of the image in front of them provided by the
computer screen is not affected when they move their own body (Heath
and Luff 1993). Video transmission can distort the appearance of a
person’s bodily conduct and affect participants’ ability to monitor the
conduct of the other peripherally (Heath and Luff 1993). Research has
shown that for hearing users CMC evidences a decrease in elaborations
of plans and more metalanguage related to problems encountered using
the new medium (Condon and Cech 1996; see also Keating 2000). In
addition to these challenges for hearing communicators, as mentioned
previously, differences between 3D and 2D space make some signs
incoherent for deaf communicators. Yet signers show great flexibility
and adaptability in resolving new challenges.

Innovations and Adaptations


One effective stimulus for adaptation arises when the technologically
mediated transformation from 3D to 2D space results in the
failure of some signs to be recognized or coherent, for example, signs
in which meaning is conveyed through a spatial representation that
is flattened or disappears in 2D space (e.g., a movement from signer
toward interlocutor). This leads signers to adapt some signs as well as
to invent new grammatical spatial relationships, for example, the use
of certain reference terms. Signers innovate as a result of learning about
visual properties of 2D sign space. They show flexibility by altering
hand position, head position, and path of movements. For example,
one signer turns his head to the side to make clear the hand position in
relation to the nose from the side view. Another signer repositions his
hand while signing THREE so the thumb can be more clearly seen in
2D space.2 Another person signs MEXICO with an upward movement,
rather than a movement toward the camera or interlocutor, so that
this crucial change in spatial relationship can be more clearly
seen in 2D space. Other examples are the sign BABY, usually signed
at waist level, moved upward and signed at shoulder height (see Fig.
12.3), and the sign for NOW moved upward to the shoulder from the
chest area. This adaptation makes the signs visible in the reduced size
of space transmitted by the webcam and computer. Similarly, the sign
SON, usually signed with contact on the opposite hand at waist level,
was signed with contact on the biceps and shoulder raised. The video
frame in Fig. 12.4 shows the sign PROBLEM (usually signed at chest
level directly in front of the speaker’s body) being signed far outside
the usual sign space, in front of the webcam, which is situated on top
of the computer. The sign is produced in another person’s sign space
to position the sign for optimal transmission by the webcam. The sign
space is adapted as well as where the sign is produced.

Figure 12.3. Rose signing BABY at chin height rather than waist height.

Signers experiment with established language forms and other habitual
forms of representation, and they also quickly exploit properties of the
new medium to create new forms and new ways to communicate or
represent ideas. Signers notice, for example, that the web camera lens
reproduces objects that are brought closer to it as larger (unlike a mirror
or real space). They exploit this property to make new meanings, by at
times positioning their hands quite close to the webcam (see Fig. 12.5),
creating larger hands and larger signs. This new property
of technologically mediated representation can be used for emphasis,
for clarity, and in response to other-initiated repair. For example, in the
case of emphasis, a YES sign made near the camera appears much bigger.

Figure 12.4. Sign made is oriented to webcam on top of computer.

Designing language output for a recipient now includes new possibilities
and properties and habituating to new reciprocal perspectives. In 3D
interaction, by contrast, to signal emphasis signs are produced in a
larger space (e.g., the sign “yes” made with the whole arm rather than
the wrist). This concept is transformed and reinvented in 2D space as a
larger body part rather than a larger movement. In face-to-face interaction
this would not work, but with the computer, although the hand gets
bigger it does not get any closer to the addressee. The constraint of
the arc of a movement that can be made for emphasis is transcended
through recruiting new properties of the space (how size of object can
be manipulated by distance from the camera lens). Computer-mediated
signing space has brand-new interesting properties for sign production.
The alteration of size of objects by the camera lens is also exploited to
increase the size of fingers during fingerspelling (the manual alphabet
used, e.g., for names of people, streets, and places), a form of recipient
design to enable letters to be easier to perceive and parse (and to reduce
repairs).
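
The size effect that signers exploit can be illustrated with simple perspective geometry. The following is a minimal sketch, assuming an idealized pinhole camera; this is an assumption made for illustration, not a description of the webcams used in the study. The function name, the focal length, and the hand size and distances are all hypothetical values.

```python
def apparent_size(object_size_cm: float, distance_cm: float,
                  focal_length_cm: float = 0.4) -> float:
    """Size of an object on the image plane under an idealized pinhole model.

    All parameters are illustrative assumptions: image size is taken to be
    object size times focal length, divided by distance from the lens.
    """
    if distance_cm <= 0:
        raise ValueError("distance must be positive")
    return object_size_cm * focal_length_cm / distance_cm


# Hypothetical values: a hand about 18 cm long, signed at a normal
# distance from the lens and then moved close to the lens.
hand_cm = 18.0
far = apparent_size(hand_cm, 60.0)   # hand at a usual signing distance
near = apparent_size(hand_cm, 15.0)  # hand held close to the webcam

# Under this model the image grows in proportion to how much the distance
# shrinks: a quarter of the distance gives roughly four times the size.
print(f"enlargement factor: {near / far:.1f}")
```

In this simplified model, bringing the hand closer enlarges only the image that the addressee sees on screen. This is consistent with the observation above that the hand gets bigger without getting any closer to the addressee.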

Figure 12.5. Rose signing O-K near webcam location.

The nature of interactional space and the body’s relationship to the
other is significantly different in computer-mediated space. To adjust to
this new reality, interlocutors also change how they point to communicate
deictic relationships, not in terms of their actual position vis-à-vis their
interlocutor’s image but in terms of the position of the webcam and
how their sign relative to what they are pointing at will be reproduced
on the screen in “virtual” space. For example, one participant raises her
thumb and begins to point directly behind her (at her husband), but
then turns her hand so that her thumb is pointing directly to the side,
where her husband is in the 2D world of the screen. In the computer-
mediated image, she is pointing at her husband, when actually he is at
least two feet behind her. A new idea of how the recipient views space
differently from the signer is evident here. In other examples, signers
point directly at the webcam when signing YOU rather than pointing
at their interlocutor as they would do in face-to-face interaction. Signers
look at the screen, but then point at the webcam, and sometimes they
both look at and point to the webcam, another case of redesigning
language with recipient design in mind—in this case the webcam lens.
Names (as in “vocatives”), usually only used in ASL to refer to people
not present, are recruited to disambiguate addressees when there is
more than one possible addressee (e.g., when more than one person is
sharing the addressee space directly in front of the webcam as in Fig.
12.2). Gaze is a less effective marker of addressee here because of the
reduced clarity of signers’ gaze direction and how gaze is represented
in technologically mediated space. The webcam lens also has reduced
peripheral vision capabilities compared with humans. Participants on
the same “side” of the interaction must sit side by side, which reorganizes
important aspects of signed conversational interaction.
A key tool in signers’ efforts at adaptation is a technological property
that gives them a new type of feedback on how understandable their
communication is, and whether they need to adjust their signing, for
example, if their signs are being made outside of the viewable space.
They see the effects of technological mediation on their signs through
a mirror image of themselves (see Fig. 12.6). In effect they see how the
webcam’s “eye” represents them and their language production and
a copy of the visual representation that their interlocutor perceives.
This provides them with an immediately testable means for learning
reciprocity of perspectives. Signers can judge the effect of certain
relationships in reduced size, 2D space and experiment with the efficacy
of new forms of hand position, torso, and face orientation to produce
understandable communication. Because the other’s perception of
oneself is available, self–other mapping is facilitated (see Astington this
volume), as linguistic behavior must take into account the perspective
of the recipient. In successful interaction, a simulation of the other’s
simulation of oneself is involved (Levinson this volume); here such
an understanding is technologically enhanced. The meaning of this
mirror image for social interaction cannot be taken for granted. It is a
novelty not at first oriented to by participants, as excerpt 1 shows. In
excerpt 1, Ben notices that Ned’s signing space is out of web camera
range (see Fig. 12.7). Thus, he cannot see Ned’s signs. Ben breaks down
the solution of this problem, a reciprocal-perspectives problem, into
a sequence of moves for Ned to understand and repair the problem.
The problem to be solved is that Ned is situated so that not enough of
his torso is visible to the webcam lens, and therefore a good part of his
signing space is invisible to Ben.

Excerpt 1
01 Ben moves both his arms downward.
02 Ben: CAMERA Ben moves his wrist downward
03 A-LITTLE-BIT
04 Ned: WHICH? YOURS OR MINE?
05 Ned leans toward the computer and tilts the camera downward
06 Ben: ASK (? unclear fingerspelling)
07 Ned: Ned readjusts the camera position
08 Ben: OK FINE STAY

Figure 12.6. The signer (right) has a mirror image view of his own actions.

Figure 12.7. The top image shows the signer’s sign space is not in camera
range (note difference between the two participants).

In this interaction Ned shows his lack of understanding of both the
affordances and constraints of the new tool by asking which webcam must
be adjusted: “YOURS OR MINE?” (line 04). He has not yet habitualized
or incorporated the use of a “mirror image” or representation of his
own sign production available on the screen in front of him.
As in the case with Ben, in judging their optimal body position
vis-à-vis the webcam signers still rely on habitual or conventional
procedures for monitoring recipient design and comprehensibility. When
prompted by their co-conversationalist, however, about problems in
communication, they learn to utilize the machine’s view of their own
actions. The presence of a mirror image of the self is also exploited in
terms of enabling complex new participation frames (e.g., face to back).
Two copresent interlocutors can converse when not actually face to face
(in 3D space) by using the onscreen image, as when a woman signs with
both her husband and their friend on the computer screen, although her
husband is actually behind her, a position that would usually exclude
him from participating with her in signed interaction. From behind
her, he can see (on the computer screen) what she is signing (as in a
mirror), and she can see him. Signers also exploit a new ability or
affordance to exclude their face-to-machine partner from communication.
They can remove their hands from the parameters available to the
webcam “sight” to participate in private side conversations with others
offscreen.
Interactants also exploit novel multimedia properties of computer-
mediated interaction. They use text messages concurrently with sign
language (see Fig. 12.8). Through text messaging signers can subvert
the limited participation constraints of the machine space in terms
of number of locations they can communicate with. Because only
two video signing spaces are currently available (one-to-one pictures),
others can and do participate via text messaging with video
coparticipants.

Figure 12.8. Text messaging and signing.

Some interactants solve the reduced dimension spatial representation
problem with the move from 3D to 2D by switching codes, in this case
to a linguistic code not so dependent on spatial relationships. Signed
English was used among native and fluent ASL signers, especially in
the early period of webcam technology in which the quality and speed
of the transmission of images was poor. In signed English, signs are
ordered according to the syntactic rules of a spoken language, and
signed English does not take advantage of or depend as much on spatial
resources as ASL. Englishlike signing has, for example, English word
order and includes articles or modals. In the example below, Rose (a
fluent ASL signer) signs to her friend Teri (a person who is fluent in and
highly favors ASL) a question containing a fingerspelled English modal
(“did”) and the signed form of the English preposition “to” in her first
formulation of a question in “D-I-D YOU GO TO P-T-A” (line 02).3 In
fingerspelling “D-I-D,” Rose puts her hand close to the webcam. Teri is
busy with her infant and misses the utterance. Rose repeats her question
about the PTA meeting, this time using a more ASL-like construction:
YOU GO P-T-A (with eyebrow grammar to signal a yes–no question),
line 03. This shows how signers introduce conventionalized language
forms and structure into unconventional spaces. Meanings produced
in face-to-face interaction through facial grammar, such as eyebrow
movement for questions, are often produced manually in webspace,
for example, the use of the bent index finger for signaling a question
(see Keating et al. n.d.).
Excerpt 2 (underlining indicates Englishlike grammatical constructions)
01 Rose: SORRY INTERRUPT YOU. I NOT-KNOW THAT YOU NURSE BABY D-
02 (laughs) YES. D-I-D YOU GO TO P-T-A. UNDERSTAND ME?
03 YOU GO P-T-A? D- NOT YET?

Using signed English ameliorates the problem of the lowered capacity
of the medium to support the properties of a 3D signing space, but in
turn can result in the unintended consequence of the production of
a locally understood concept of “formality” (or type of institutional
frame) because Englishlike signing is related in the U.S. deaf community
to formality.
Lucas and Valli (1992) were surprised to see the use of Englishlike sign
in conversations between ASL signers in an experimental face-to-face
interview situation they conducted (1992:63), and they attributed this
to an association among U.S. signers between Englishlike signing and
formality and accommodation. Choice of language features regularly
constructs differences in context, just as context can shape the choice
of language features (Duranti and Goodwin 1992). This is one means
at work linking macrosystems and situated microinteractions.
Not only must signers adapt aspects of language but they must adapt
their notion of what a social partner is. This is shown through explicit
means in new socialization practices. Theories of the properties of the
social “other” or theories of mind are built on imitation, gaze following,
pointing, joint attention, shared reference, and so on (Astington this
volume; Liszkowski this volume). An important part of human sociality
is how unskilled members of a group learn new skills and appropriate
actions from skilled members (Boyd and Richerson, and Gergely and
Csibra in this volume). In Fig. 12.9, a deaf mother kneels on the floor
to establish joint attention with her two-year-old son. After gaining his
attention, she transforms the field of joint attention to include new
participants not physically present in the room. She directs her son’s
attention toward the computer screen by pointing. She supports his
arm as he mimics her action, and she models affect by smiling at the
computer screen. She orients him to a figure within a ground, “a boy”
on the screen. She models an appropriate next action, a greeting, by
waving at the screen image. She moves her attention back and forth
from the screen to engage in face-to-face interaction with her son.
She is organizing the child’s perception and actions through eye gaze
and language, by pointing, by relating the two spaces, and by interpreting
the computer-mediated image as a social being and recipient. Here a
single interaction introduces a protocol for future interactions with
technologically mediated selves and interpretations with new concepts
of what human sociality entails. Tomasello (this volume) stresses
the importance of “sharing and helping” or “shared intentionality”
(see also Liszkowski this volume) or the ability to understand others
intentionally in the attainment of goals between humans, described also
as joint commitment by Clark (this volume). The mother’s intervention
is relevant to the success of the interaction (see, e.g., Gaskins this
volume).
The actions in Fig. 12.9 (waving, pointing) are not new to the child
(see Liszkowski this volume), but old actions have new results. The
mother’s waving in real space is reproduced simultaneously in a small
window on the screen. Next to that window is another window of a
mother and child (the friends). Although these windows are separate
and side by side, the interactants appear to understand what is going
on in the other window. What is new to the child and to the adults
is the understanding of how to transform their habitual behaviors
of organizing and understanding interaction to respond to new
interactional demands and which procedures produce coherence within
this very new, side-by-side window environment. These procedures are
developed and shared.
New types of human interaction are linked through language use to
established culturally relevant systems of meaning. Not only do people
create based on moment-by-moment challenges and experiences with
new tools, but they imagine and plan radically new possibilities in new
discursive spaces.

Figure 12.9. Modeling individual aspects of a joint activity.

The early CMC literature is replete with ideas about new affordances,
and a new form of human sociality or society based on a particular
property of the technologically mediated interactions in chat rooms
and other contexts, that is, anonymity or disguise. Many internet
users initially expressed what is now interpreted as a utopian vision
of a new sociality in which “old” cultural distinctions, habits of
interpretation, and practices could be transformed and an opportunity
existed for transcending established economies of power and status
based on categories such as age and gender. Computer-mediated space
was envisioned as a space with potential for interactants to become
“liberated” from cultural conventions and category interpretations and
“master tropes” such as gender, age, ethnicity, and so forth, permitting
for the first time a kind of creative, selective self-presentation and
social engagement among equals (equality being culturally valued in
the United States). However, social interactions via computer and the
Internet in, for example, online communities, instead now regularly
utilize recognizable procedures for communicating categories that are
conventionally used in interpretation of actions (see, e.g., Barnes 2003).
Age and gender categories as well as affect as resources for interpreting
communicative acts were “missing” in the experience of CMC users (as
key codes and interpretation resources). Thus, interpretive practices can
be sustained through radical changes in the nature of participation. This
is also shown when people describe computer programs that produce
animated pedagogical agents as having “personality traits” such as “a
very stern personality” (Johnson 2003:251). To interpret the mental
state of the other or to impute a mental state to another (see Astington, and
Pyers in this volume), a capacity discussed by a number of authors in this volume,
is considered a key skill that separates humans from other primates.

Conclusion
Computer tools provide new spaces and opportunities for human
sociality. New technological realities have resulted in rapid adaptations,
altering some important aspects of how we interact with others and
organize and share information. CMC includes not only text-based
synchronous and asynchronous symbolic possibilities, but a new kind
of face-to-face or face-to-machine interaction, with voice and images
of people in real time projected into machine-mediated spaces. The
computer environment significantly alters time and space relationships.
In the case of sign language, the alteration of space is even more
significant. New conventions for communication are emerging as well as
new social networks, relationships, new possibilities for representation
and reproduction (the ability to copy and disseminate), memory storage
and recall, information management, and other aspects of human
sociality. Interactants, in this case signers, alter the communication
system itself, including language features, participation frameworks,
and the notion of a social partner, as they manipulate and organize a
new interaction space. They learn to recognize properties of computer-
mediated space, to utilize a new metaperspective of their own signing, as
well as to negotiate new modes of sociality. When social actors incorporate
new technologies into their practices they are motivated not only by
problems that arise in managing new collaborative activities but also by
new enablements to transcend environmental constraints. Microlevel
actions are highly influenced by shared repertoires learned over a
lifetime; however, they show a surprisingly fast adaptability to new
resources and conditions.

Acknowledgments
Two research assistants contributed enormously to this study, Gene
Mirus and Chris Moreland. Many thanks to the other contributors to
this volume for stimulating comments and discussion. Thanks also to
three anonymous reviewers.

Notes
1. Science, Volume 283, Number 5410 Issue of 26 Mar 1999, pp. 2004–2005.

2. I use the conventional practice of capitalizing sign glosses.


3. PTA stands for Parent Teacher Association.

References
Aiello, L. C. 1995. Expensive-tissue hypothesis: The brain and digestive
system in human and primate evolution. Current Anthropology
36(2):199.
Alexander, J. C., and B. Giesen. 1987. From reduction to linkage: The
long view of the micro-macro debate. In The Micro-macro link, edited
by J. C. Alexander, B. Giesen, R. Münch, and N. J. Smelser, 1–42.
Berkeley: University of California Press.
Barnes, S. 2003. Computer-mediated communication. Boston: Allyn and
Bacon.
Bourdieu, P. 1977. Outline of a theory of practice. Cambridge: Cambridge
University Press.
Cicourel, A. 1981. Notes on the integration of micro and macro levels
of analysis. In Toward an integration of micro- and macro-sociologies,
edited by K. Knorr-Cetina and A. Cicourel, 51–80. London: Routledge
and Kegan Paul.
Collins, R. 1981. Micro-translation as a theory-building strategy. In
Advances in social theory and methodology: Toward an integration of
micro- and macro-sociologies, edited by K. Knorr-Cetina and A. Cicourel,
1–48. Boston: Routledge and Kegan Paul.
Condon, S. L., and C. G. Cech. 1996. Functional comparisons of face
to face and computer mediated decision making interactions. In
Computer-mediated communication: Linguistic, social and cross-cultural
perspectives, edited by S. Herring, 65–80. Philadelphia: Benjamins.
Debourgh, G. 1999. Technology is the tool, teaching is the task. Student
satisfaction in distance learning. College Teaching 47(2):70–73.
Duranti, A. 2003. Agency in Language. In Readings in Linguistic
Anthropology, edited by A. Duranti, 451–472. Oxford: Blackwell.
Duranti, A., and C. Goodwin. 1992. Rethinking context: Language as an
interactive phenomenon. Cambridge: Cambridge University Press.
Foucault, M. 1980. Power/knowledge. New York: Pantheon Books.
Frumkin, P., and G. Kaplan. 2002. Institutional theory and the micro-macro
link. Cambridge, MA: Harvard University Press.
Giddens, A. 1984. The constitution of society: Outline of the theory of
structuration. Cambridge: Polity Press.
Goodwin, C. 2000. Action and embodiment within situated human
interaction. Journal of Pragmatics 32:1489–1522.
Heath, C., and P. Luff. 1993. Disembodied conduct: Interactional
asymmetries in video-mediated communication. In Technology in working
order, edited by G. Button, 35–54. London: Routledge.
Johnson, W. L. 2003. Interaction tactics for socially intelligent
pedagogical agents. IUI ’03, January 12–15, 2003, Miami, Fl. USA ACM
1–58113–586–6/03/00001.
Keating, E. 2000. How culture and technology together shape new
communicative practices: Investigating interactions between deaf
and hearing callers with computer-mediated videotelephone. Texas
Linguistic Forum 43:99–116.
——. 2005. Homo prostheticus: Problematizing the notion of activity
and computer-mediated interaction. In Theories and models of language,
interaction, and culture (special issue of Discourse Studies), edited by A.
Duranti, 7(4–5):527–546.
Keating, E., and G. Mirus. 2003. American Sign Language in virtual
space: Interactions between deaf users of computer-mediated video
communication and the impact of technology on language practices.
Language in Society 32:693–714.
Keating, E., T. Edwards, and G. Mirus. n.d. Cybersign: Impacts of new
communication technologies on space and language.
Knorr-Cetina, K. 1981. Introduction: The micro-sociological challenge
of macro-sociology: Towards a reconstruction of social theory and
methodology. In Advances in social theory and methodology: Toward an
integration of micro- and macro-sociologies, edited by K. Knorr-Cetina
and A. Cicourel, 1–48. Boston: Routledge and Kegan Paul.
Levinson, S. C. 2003. Space in language and cognition: Explorations in
cognitive diversity. Cambridge: Cambridge University Press.
——. 2005. Living with Manny’s dangerous idea. In Theories and models
of language, interaction, and culture (special issue of Discourse Studies),
edited by A. Duranti, 7(4–5):431–454.
Lucas, C., and C. Valli. 1992. Language contact in the American deaf
community. San Diego, CA: Academic Press.
Mazur, J. 2000. Applying insights from film theory and cinematic
technique to create a sense of community and participation in a
distributed video environment. Journal of Computer-Mediated Communication
5(4). Internet publication, http://www.ascusc.org/jcmc/vol5/issue4/mazur.htm,
accessed November 9, 2005.
McKenna, K. Y. A., and J. A. Bargh. 1999. Causes and consequences of
social interaction on the Internet: A conceptual framework. Media
Psychology 1:249–269.
Richerson, P. J., and R. Boyd. 2004. Not by genes alone: How culture
transformed human evolution. Chicago: University of Chicago Press.
Schegloff, E. A. 1968. Sequencing in conversational openings. American
Anthropologist 70(6):1075–1095.
——. 1987. Between micro and macro: Contexts and other connections.
In The Micro-macro link, edited by J. C. Alexander, B. Giesen, R. Münch, and
N. J. Smelser, 207–234. Berkeley: University of California Press.
——. 2005. On integrity in inquiry. . . of the investigated, not the
investigator. In Theories and models of language, interaction, and culture
(special issue of Discourse Studies), edited by A. Duranti, 7(4–5):455–480.
Schillo, M., K. Fischer, and C. Klein. 2001. The micro-macro link in
DAI and sociology. In Multiagent-based simulation: Second International
Workshop on Multiagent-Based Simulation Boston MA, USA, July, 2000,
LNAI, vol. 1979, edited by S. Moss and P. Davidsson, 133–148. Berlin:
Springer-Verlag.
Simonson, M., S. Smaldino, M. Albright, and S. Zvacek. 2000. Teaching
and learning at a distance: Foundations of distance education. Upper
Saddle River, NJ: Merrill-Palmer.
Suchman, L. 1993. Technologies of accountability. Of lizards and
aeroplanes. In Technology in working order: Studies of work, interaction, and
technology, edited by G. Button, 113–126. New York: Routledge.
Volosinov, V. N. 1986. Marxism and the philosophy of language. Cambridge,
MA: Harvard University Press.
Vygotsky, L. 1978. Mind in society: The development of higher psychological
processes. Cambridge, MA: Harvard University Press.
Part 4

Cognition in Interaction
thirteen

Meeting Other Minds through Gesture: How Children Use their
Hands to Reinvent Language and Distribute Cognition
Susan Goldin-Meadow

The premise of this book and the conference that led to it is that our
mentally mediated and highly structured way of interacting with one
another is what makes us uniquely human (Enfield and Levinson this
volume). Over the course of generations, we have developed patterns of
social organization and values that set the stage for each new generation
of children to interact in human ways. Indeed, children inherit a world
of social organization that scaffolds their development and releases
them from reinventing with each new generation the patterns that make
us uniquely human—they can borrow the wheel from their elders.
One of the most pervasive aspects of social organization is human
language. Every human culture discovered thus far has developed a
linguistic system that is shared by all of its members and pervades the
way those members interact with one another. Even deaf cultures that
do not have access to the aural modality develop linguistic systems,
albeit in the manual modality. These signed languages provide the
medium of interaction for deaf individuals within a community and
define Deaf culture (Padden and Humphries 1988). When children,
be they deaf or hearing, acquire the language of their parents, they do
more than learn a conventional code—they take important steps toward
becoming functioning members of their society.
The question I address in this chapter is what happens when a child
does not have access to the shared conventional language of his or her
community? Would such a child be able to interact with members of
the community? And if so, would these interactions with other humans
serve as a scaffold, allowing the child to reinvent the linguistic structure
that has come to epitomize what is unique about humans?
We know that children do not invent a linguistic system to communicate
if they have been raised by animals (Skuse 1988) or by humans who
treat them inhumanely (Curtiss 1977). This is the first hint that human
interaction may be essential to language and not the other way around.
But children raised by animals not only lack human interaction, they
also lack access to conventional language. Would children who do not
have access to conventional language but do have access to humans
willing to interact with them be able to invent a linguistic system?
I begin this chapter by describing children in just such a situation—
children who have not been exposed to a conventional language model
but in all other respects are raised in a typically human environment.
The children are deaf with hearing losses so severe that they cannot learn
the spoken language that surrounds them. Moreover, they are born to
hearing parents who have not exposed them to a sign language. Despite
the lack of a usable model for conventional human language, these
children interact and communicate with other humans and use gesture
to do so. Even more striking, the children’s gestures exhibit many of
the structural and functional properties found in human language.
This phenomenon of language creation in deaf children tells us that
an individual child can reinvent the linguistic wheel, or at least its
rudimentary aspects—as long as the child can interact with humans
who behave humanely.

The Gestures Children Produce when they have no Language: Using Gesture to Reinvent Language
The Deaf Children's Gestures Exhibit Linguistic Structure
Deaf children born to deaf parents and exposed from birth to a
conventional sign language such as American Sign Language (ASL) progress
through stages in acquiring sign language as naturally as hearing children
acquiring a spoken language (Newport and Meier 1985). However,
90 percent of deaf children are not born to deaf parents who could
provide early exposure to a conventional sign language. Rather, these
deaf children are born to hearing parents who, quite naturally, expose
their children to speech (Hoffmeister and Wilbur 1980). Unfortunately,
it is extremely uncommon for deaf children with severe to profound
hearing losses to acquire the spoken language of their hearing parents
naturally, that is, without intensive and specialized instruction. Even
with instruction, deaf children’s acquisition of speech is markedly
delayed when compared either to the acquisition of speech by hearing
children of hearing parents, or to the acquisition of sign by deaf children
of deaf parents. By age five or six, and despite intensive early training
programs, the average profoundly deaf child has limited linguistic skills
in speech (Mayberry 1992). Moreover, although many hearing parents
of deaf children send their children to schools in which one of the
manually coded systems of English is taught, some hearing parents
send their deaf children to “oral” schools in which sign systems are
neither taught nor encouraged; thus, these deaf children are not likely
to receive input in a conventional sign system. Under such inopportune
circumstances, a child might be expected to fail to communicate, or
perhaps to communicate only in nonsymbolic ways. However, this
turns out not to be the case.
I, along with my colleagues, have studied ten American and four
Chinese deaf children who were unable to acquire spoken language and
were not exposed to sign language. All of the children used gestures, called
“home signs,” to communicate, and those gestures exhibited properties
that are fundamental to natural languages. The linguistic properties
that appear in the deaf children’s gesture systems can be considered
“resilient”—likely to crop up in a child’s communications whether or
not that child is exposed to a conventional language model. Table 13.1
lists the resilient properties we have identified thus far (Goldin-Meadow
2003b). There may be many others—just because we have not found a
particular property in a deaf child’s home-sign gesture system does not
mean it is not there. And there are likely to be linguistic properties that
the deaf children cannot invent, properties that can only be invented
by a community of gesture users (Goldin-Meadow 2005; see Pyers this
volume, for a description of what can happen when a group of home
signers come together and develop a shared sign system that is then
passed on to a new generation of signers). I begin by describing the
word- and sentence-level properties that the deaf children developed
in their gesture systems.

Table 13.1. The resilient properties of language

The resilient property     As instantiated in the deaf children's gesture systems

Words
  Stability                Gesture forms are stable and do not change capriciously with changing situations
  Paradigms                Gestures consist of smaller parts that can be recombined to produce new gestures with different meanings
  Categories               The parts of gestures are composed of a limited set of forms, each associated with a particular meaning
  Arbitrariness            Pairings between gesture forms and meanings can have arbitrary aspects, albeit within an iconic framework
  Grammatical Functions    Gestures are differentiated by the noun, verb, and adjective grammatical functions they serve

Sentences
  Underlying Frames        Predicate frames underlie gesture sentences
  Deletion                 Consistent production and deletion of gestures within a sentence mark particular thematic roles
  Word Order               Consistent orderings of gestures within a sentence mark particular thematic roles
  Inflections              Consistent inflections on gestures mark particular thematic roles
  Recursion                Complex gesture sentences are created by recursion
  Redundancy Reduction     Redundancy is systematically reduced in the surface of complex gesture sentences

Language Use
  Here-and-Now Talk        Gesturing is used to make requests, comments, and queries about the present
  Displaced Talk           Gesturing is used to communicate about the past, future, and hypothetical
  Narrative                Gesturing is used to tell stories about self and others
  Self-Talk                Gesturing is used to communicate with oneself
  Meta-Language            Gesturing is used to refer to one's own and others' gestures

Words
The deaf children’s gesture words have five properties that are found in
all natural languages. The gestures are stable in form, although they need
not be. It would be easy for the children to make up a new gesture to fit
every new situation (and that appears to be what hearing speakers do
when they gesture along with their speech; cf. McNeill 1992). But that
is not what the deaf children do. They develop a stable store of forms
that they use in a range of situations—they develop a lexicon that is an
essential component of all languages (Goldin-Meadow et al. 1994).
Moreover, the gestures the children develop are composed of parts
that form paradigms, or systems of contrasts. When the children invent
a gesture form, they do so with two goals in mind—the form must
not only capture the meaning they intend (a gesture–world relation),
but it must also contrast in a systematic way with other forms in their
repertoire (a gesture–gesture relation). In addition, the parts that form
these paradigms are categorical. For example, one child, David, used
a fist hand shape to represent grasping a balloon string, a drumstick,
and handlebars—grasping actions requiring considerable variety in
diameter in the real world. The child did not distinguish objects of
varying diameters within the fist category, but did use his hand shapes
to distinguish objects with small diameters as a set from objects with
large diameters (e.g., a cup, a guitar neck, or the length of a straw) that
were represented by a C-shaped hand. The manual modality can easily
support a system of analog representation, with hands and motions
reflecting precisely the positions and trajectories used to act on objects in
the real world. But the children do not choose this route. They develop
categories of meanings that, although essentially iconic, have hints of
arbitrariness about them—that is, the boundaries between categories are
not drawn in the same places in the children’s gesture systems (Goldin-Meadow
et al. 1995).
Finally, the gestures the children develop are differentiated by
grammatical function. Some serve as nouns, some as verbs, some as adjectives.
As in natural languages, when the same gesture is used for more than
one grammatical function, that gesture is marked (morphologically
and syntactically) according to the function it plays in the particular
sentence (Goldin-Meadow et al. 1994). For example, if a child were to use
a twisting gesture in a verb role, that gesture would likely be produced
near the jar to be twisted open (i.e., it would be inflected), it would
not be abbreviated, and it would be produced after a pointing gesture
at the jar. In contrast, if the child were to use the twisting gesture in
a noun role, the gesture would likely be produced in neutral position
near the chest (i.e., it would not be inflected), it would be abbreviated
(produced with one twist rather than several), and it would occur before
the pointing gesture at the jar.
Sentences
The deaf children’s gesture sentences have six properties found in all
natural languages. Underlying each sentence is a predicate frame that
determines how many arguments can appear along with the verb in the
surface structure of that sentence (Goldin-Meadow 1985). For example,
four slots underlie a gesture sentence about transferring an object, one
for the verb and three for the arguments (actor, patient, and recipient). In
contrast, three slots underlie a gesture sentence about eating an object,
one for the verb and two for the arguments (actor and patient).
Moreover, the arguments of each sentence are marked according to
the thematic role they play. There are three types of markings that are
resilient (Goldin-Meadow and Mylander 1984; Goldin-Meadow et al.
1994):
(1) Deletion—The children consistently produce and delete gestures
for arguments as a function of thematic role; for example, they are
more likely to delete a gesture for the object or person playing the role
of transitive actor (soldier in “soldier beats drum”) than they are to
delete a gesture for an object or person playing the role of intransitive
actor (soldier in “soldier marches to wall”) or patient (drum in “soldier
beats drum”).
(2) Word order—The children consistently order gestures for arguments
as a function of thematic role; for example, they place gestures for
intransitive actors and patients in the first position of their two-gesture
sentences (soldier–march; drum–beat).
(3) Inflection—The children mark with inflections gestures for
arguments as a function of thematic role; for example, they displace a verb
gesture in a sentence toward the object that is playing the patient role
in that sentence (the “beat” gesture would be articulated near, but not
on, a drum).
In addition, recursion, which gives natural languages their generative
capacity, is a resilient property of language. The children form complex
gesture sentences out of simple ones (Goldin-Meadow 1982). For
example, one child pointed at me, produced a “wave” gesture, pointed
again at me, and then produced a “close” gesture to comment on the
fact that I had waved before closing the door—a complex sentence
containing two propositions: “Susan waves” (proposition 1) and “Susan
closes door” (proposition 2). The children systematically combine the
predicate frames underlying each simple sentence, following principles
of sentential and phrasal conjunction. When there are semantic elements
that appear in both propositions of a complex sentence, the children
have a systematic way of reducing redundancy, as do all natural languages
(Goldin-Meadow 1982, 1987).

The Hearing Parents' Gestures do not Exhibit Linguistic Structure
Hearing parents gesture when they talk to young children (Bekken
1989; Iverson et al. 1999; Ozcaliskan and Goldin-Meadow 2005; Shatz
1982) and the hearing parents of our deaf children were no exception.
The deaf children’s parents were committed to teaching them to talk
and therefore talked to their children as often as they could. And when
these parents talked, they gestured. Perhaps parents’ gestures served as
a model for their children’s gestures.
To find out, my colleagues and I looked at the gestures that the hearing
mothers produced when talking to their deaf children. We looked at
them not as they were meant to be looked at (i.e., integrated with the
speech they accompanied), but as a deaf child might look at them. We
turned off the sound and analyzed the gestures using the same analytic
tools that we used to describe the deaf children’s gestures. We found
that the hearing mothers’ gestures do not resemble their children’s
and indeed do not have structure at all when looked at from a deaf
child’s point of view (Goldin-Meadow and Mylander 1983, 1984, 1998;
Goldin-Meadow et al. 1994, 1995).
The fact that the hearing parents’ gestures look so different from
their deaf children’s underscores two points. First, the languagelike
structure we see in the children’s gestures cannot be traced back to
the parents’ gestures. Even if the deaf children are using their parents’
gestures as a starting point for their gesture systems, they are clearly
going well beyond that starting point, transforming the gestures they
see into a system that looks very much like language. Second, the deaf
children are producers of a linguistic system that they never receive.
They see the cospeech gestures that their hearing parents produce,
but they produce their own languagelike gestures. This is a very odd
communicative situation, one in which parent and child do not share
a common language and do not have an obvious mechanism for
establishing common ground (Enfield this volume)—yet parent and
child do manage to communicate, perhaps because they can call on
the conversational mechanisms that Schegloff (this volume) considers
to be universal to all human interaction.
Parent and Child Communicate Nonetheless
Talk about the Present and Nonpresent
Like children learning conventional languages, the deaf children make
requests, but they do so using gesture, and their parents comply (or at
least they comply no less than parents of any young child). The parents
comply because the children’s gestures are relatively transparent when
interpreted in context. As an example of a request, one child pointed
at a nail and then produced a “hammer” gesture to ask his mother to
bang on a nail. In addition, the children make comments on the here
and now that are also relatively easy for the hearing parents to interpret.
For example, a child produced a “march” gesture and then pointed at
a mechanical toy soldier to comment on the fact that the soldier was,
at that very moment, marching.
Talking about the here and now is important, but what language does
particularly well is allow speakers to make reference to objects and events
that are not perceptible to either the speaker or the listener—displaced
reference (cf. Hockett 1960). Displacement allows us to describe a lost
hat, to complain about a friend’s slight, and to ask advice on college
applications. If we were to communicate only about what is immediately
in front of us, it is not at all clear that we would need as complex and
productive a system as language is.
The deaf children are able to use their gestures to talk about the nonpresent,
but these communications require a bit more interpretive work
on the part of the children’s hearing parents. In their earliest references
to the nonpresent, the children describe what they know about an
object or action and go beyond what is visible in doing so. One child
pointed at a football, pointed at a rubber ball, and then produced a
“kick” gesture to comment on the action that is characteristically done
on footballs and rubber balls but that was not taking place at the time.
Next, the children refer to events that take place prior to or after the
communicative act but still during the observation session, that is, they
refer to proximal events. After blowing a large bubble, one child pointed
at the bubble and produced an “expand” gesture. Finally, the children
refer to events in the past, events in the future, potential events, and
even fantasy events. As an example, one child produced the following
string of gesture sentences to indicate that, in preparation for setting
up the cardboard chimney for Christmas, the family was going to move
a chair downstairs.
1. Point at chair—“move away” gesture.
2. Point at chair—point downstairs.
3. “Chimney” gesture—“move away” gesture—“move here” gesture.
Gloss: We’re going to move the chair away. We’ll move it downstairs.
We’ll move the chair away and move the chimney here.

Despite the absence of a shared linguistic code, the deaf children
succeed in evoking nonpresent objects and events. They are able to do
so primarily because their hearing parents know a great deal about their
worlds and use that knowledge to interpret their gestures—the mother
knew the Christmas ritual and was able to provide the context within
which her child’s gestures made sense. This process is reminiscent of
interactions described by Goodwin (this volume) between a severely
aphasic man and his family members. The difference is that the man
with aphasia was at one time a fluent language user and, indeed, still
understands everything that is said to him. He and his communication
partners can draw on their shared linguistic knowledge to negotiate
meaning. In contrast, the deaf children do not know their parents’
spoken language or, for that matter, any conventional language at all.
The fact that they manage to communicate with one another is that
much more striking.

Telling Stories
Narrative is one of the most powerful tools that human beings possess
for organizing and interpreting experience. Not only is narrative found
universally across cultures (Miller and Moore 1989), but no other species
is endowed with this capacity. Moreover, narrative emerges remarkably
early in human development. Children from many sociocultural
backgrounds, both within and beyond the United States, begin to recount
their past experiences during the second and third years of life.
The deaf children told stories but used gesture to do so. They told
stories about events they or others experienced in the past, events
they hoped would occur in the future, and events that were flights of
imagination (Phillips et al. 2001). For example, one child produced the
following simple narrative in response to a picture of a car. His mother
confirmed the tale by telling it later in her own words.
1. “Break” gesture—“away” gesture [= narrative marker]—point at
dad—“car-goes-onto-truck” gesture (flat right hand glides onto back
of flat left hand)
2. “Crash” gesture—“away” gesture [= narrative marker]
Gloss: Dad’s car broke and went onto a tow truck. It crashed.

Note that, in addition to producing gestures to describe the event
itself, the child produced what we have called a narrative marker. The
child recognized that he was not describing an event that was taking
place in the here and now. Rather, he was describing a real event that
happened in another time and place. The child indicated this stance
with an “away” gesture—a palm or point hand extended or arced away
from the body (see Morford and Goldin-Meadow 1997). This gesture
was used exclusively in narratives and served to mark a piece of gestural
discourse as a narrative in the same way that “once upon a time” is
often used to signal a story in spoken discourse.

Talking to Oneself and Talking about Talk


In addition to using their gestures to communicate with others, the
deaf children used their gestures for a number of the other functions
that language serves. These functions are not particularly frequent
even in the communications of young children learning conventional
languages. Indeed, all of our examples of these functions come from
David, the child on whom we had the most observations. Although these
functions were found in only one child, it is impressive that a child could extend his
homemade gesture system to cover these rather sophisticated linguistic
functions.
We occasionally saw David using his gestures when he thought no
one was paying attention, as though “talking” to himself. Once when
David was trying to copy a configuration of blocks off of a model,
he made an “arced” gesture in the air to indicate the block that he
needed next to complete the design. When offered a block that fit this
description, David ignored the offer, making it clear that his gesture
was not intended for anyone but himself. It seems extremely unlikely
that a child would invent a language to talk to him- or herself. Genie,
who was left alone with no one to talk to for the first 13 years of her
life, did not, for example, invent a language to share thoughts with
herself (Curtiss 1977). However, it is striking that, once having invented
a language to communicate with others, children are able to use that
system to communicate with themselves.
Another important use of language is its metalinguistic function—
using language to talk about language. Language is unique in providing
a system that can be used to refer to itself. It requires a certain level of
competence for a child to say, “the dog smells.” It requires an entirely
different and more sophisticated level for that child to say, “I said ‘the
dog smells.’ ” The child must be aware of his or her own talk and be
able to report on that talk. David did, on occasion, use gesture to refer
to his own gestures. For example, to request a Donald Duck toy that
the experimenter held behind her back, David pursed his lips, referring
to the Donald Duck toy. He then pointed at his own pursed lips and
pointed toward the Donald Duck toy. When the experimenter offered
him a Mickey Mouse toy instead, David shook his head, pursed his lips
and pointed at his own pursed lips once again. The point at his own
lips is roughly comparable to the words “I said,” as in “I said ‘Donald
Duck.’ ” It therefore represents a communicative act in which gesture
is used to refer to a particular act of gesturing and, in this sense, is
reminiscent of young hearing children’s quoted speech.
David also used gesture to comment on the gestures of others. For
example, at one point we asked David and his hearing sister to respond,
in turn, to videotaped scenes of objects moving in space. David was using
his gesture system to describe the scenes, and his sister was inventing
gestures on the spot (see Singleton et al. 1993). David considered his
sister’s response to be inappropriate on a number of the items, and he
used his own gestures to correct her gestures. The sister extended her
index finger and thumb as though holding a small object to describe
a tree in a particular segment. Reacting to his sister’s choice of hand
shape, David teased her by reproducing the hand shape, pretending
to gesture with it, and finally ridiculing the hand shape by using it to
poke himself in the eyes. His sister then shrugged and said, “okay, so
what should I do?”—a reaction that both acknowledged the fact that
there was a system of which David was the keeper, and admitted her
ignorance of this system. David then indicated that a point hand shape
(which is an appropriate hand shape for straight thin objects in his
system, and therefore an appropriate hand shape for a tree) would be
a correct way to respond to this item. Thus, David not only produced
gestures that adhered to the standards of his system, but he used his
gestures to impose those standards on the gestures of others.
These examples are remarkable in that they indicate the distance
David has achieved from his gesture system. It is one thing for a child
to gesture to achieve a goal or make a comment, that is, to use gesture
for a specific communicative act. It is quite another for the child to
recognize that he is gesturing and to call attention to his gestures as
communicative acts. David was able to treat other people’s gestures
as objects to be reflected on and, at times, corrected. Moreover, he
was able to distance himself from his own gestures and treat them as
objects to be reflected on and referred to. He therefore exhibited in his
self-styled gesture system the very beginnings of the reflexive capacity
that is found in all languages and that underlies much of the power of
language (cf. Lucy 1993).

The Challenge of a Nonshared System


The deaf children could have used gesture only for the basics—to
get people to give them objects and perform actions. Indeed, when
chimpanzees are taught sign language, the only purpose to which they
seem to put those signs is to request objects and activities (Greenfield
and Savage-Rumbaugh 1991). Request gestures are the easiest for others
to interpret simply because context often makes it obvious what the
child wants. But the deaf children do much more with their gestures.
They use them to comment not only on the here and now but also on
the distant past, the future, and the hypothetical. They use them to tell
stories, to talk to themselves, and to talk about their own and others’
gestures. In other words, they use them for the functions to which all
natural languages are put. These functions are a challenge, not only for
the children but also for the children’s hearing parents.
The challenge for the children is to take the cospeech gestures that
they see their hearing parents use and transform those gestures into
a structured system that functions like language. In this regard, it is
noteworthy that language structure and language function seem to
go hand in hand in the deaf children’s gesture systems, although the
developmental relationship between the two is far from clear. For
example, the functions to which the deaf children put their gestures could
provide the impetus for building a languagelike structure. Conversely,
the structures that the deaf children develop in their gestures could
provide the means by which more sophisticated languagelike functions
can be fulfilled. More than likely, language structure and language
function complement one another, with small developments in one
domain furthering additional developments in the other.
The challenge for the deaf children’s hearing parents is to be able to
interpret the children’s gestures well enough that the two can communicate
with one another without the benefit of a shared linguistic code (cf.
Goodwin this volume). This challenge is made more serious by the fact
that the parents have placed their children in oral training and do not
particularly want them to be gesturing—they want their children to be
learning to talk. As a result, the parents pay little conscious attention
to their children’s gestures. Surprising as it may seem, gesture in the
deaf children’s homes is rarely acknowledged and, in this sense, is no
different from the gestures that hearing children (and indeed all hearing
speakers) produce along with their talk.
In the second part of this chapter, I explore the role that gesture plays
in human interaction for individuals who know a conventional spoken
language. Gesture may not be acknowledged, but it has an impact
on communication nevertheless. Gesture often conveys information
that is different from the information conveyed in speech and offers
a window onto thoughts that do not fit neatly into the categories
offered by conventional language (Goldin-Meadow and McNeill 1999).
Gesture externalizes a speaker’s unspoken thoughts and, as a result, is
an important part of the multimodal repertoire that humans rely on
when they interact with one another (see chapters by Hutchins and
Goodwin this volume).
To explore the role that cospeech gesture plays in human interaction,
I turn to the gestures that hearing children produce during a teaching
interaction. The gestures children use when they explain their solutions
to a problem often reflect an implicit understanding of the problem not
fleshed out in their speech. Importantly, teachers are sensitive to the
gestures children produce—they alter their instruction as a function of
those gestures, providing input that has the potential to help the child
develop a more articulated understanding of the problem. Gesture is
(unwittingly) shared by child and teacher, and indeed by all speakers
and listeners, and in this way extends the range of communication
beyond the bounds of conventional language.

The Gestures we Produce when we Talk: Using Gesture to Distribute Cognition

We can Learn a Great Deal about Children’s Knowledge from their Gestures
Gesture and speech encode meaning differently (Goldin-Meadow
2003a; McNeill 1992). Gesture conveys meaning globally, relying on
visual and mimetic imagery. Speech conveys meaning discretely, relying
on codified words and grammatical devices. Because gesture and speech
employ such different forms of representation, it is difficult for the two
modalities to contribute identical information to a message. Nonetheless,
the information conveyed in gesture and in speech can overlap a
great deal. For example, consider a child who utters the word “chair”
while pointing at the chair. The word labels and classifies the object.
The point indicates where the object is (see Liszkowski this volume,
for further discussion of the pointing gestures children use at the early
stages of language learning). Although word and gesture do not convey
identical information, they do work together in this example to more
richly specify the same object.
However, there are instances in which gesture conveys information that
overlaps very little with the information conveyed in the accompanying
speech. Consider a child who says “daddy” while pointing at a chair.
This child has produced a gesture for an object that is not mentioned
in speech. Here, word and gesture convey information that does not
overlap at all. Note, however, that taken together the two modalities
convey a simple notion—“daddy’s chair”—that is not conveyed in either
modality on its own.
I have posited a continuum based on the overlap of information
conveyed in gesture and speech (Goldin-Meadow 2003a). At one
end of the continuum, gesture elaborates on a topic that has already
been introduced in speech. At the other end, gesture introduces new
information that is not mentioned at all in speech. Although there are
times when it is not clear where to draw a line to divide the continuum
into two categories, it turns out that most cases are obvious and relatively
easy to identify. We have dubbed cases in which gesture and speech
convey overlapping information “gesture–speech matches,” and cases in
which gesture conveys more information than speech “gesture–speech
mismatches” (Church and Goldin-Meadow 1986).
As an example of a gesture–speech match in a school-aged child,
consider the response given by a child asked to explain his incorrect
solution to the mathematical equivalence problem, 7 + 6 + 4 = 7 +
__. The child indicated that he solved the problem by adding up the
numbers on the left side of the equation in both speech (“I added 7
plus 6 plus 4 and got 17”) and gesture (point at the left 7, the 6, the 4,
and the blank). As an example of a gesture–speech mismatch on this
same problem, another child indicated in speech that she also solved
the problem by adding up the numbers on the left side of the equation
(“I added 7 plus 6 plus 4 and got 17”). However, in gesture, this child
indicated all of the numbers in the problem (point at the left 7, the 6, the
4, the right 7), making it clear that she did, at some level, know that the
7 on the right side of the equation was there and might be important.
Note that this second child seems to have an understanding (however
implicit) of two pieces of information: (1) there are two distinct sides
to the equation (reflected in the add-to-equal-sign strategy the child
conveyed in the speech component of her mismatch); (2) there is an
additional addend on the right side of the equation (reflected in the
add-all-numbers strategy she conveyed in the gesture component of
her mismatch). These two pieces of information are not yet integrated
into a single framework but eventually will have to be if the child is to
solve the problem correctly.
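
The arithmetic behind these two strategies can be sketched in a few lines of Python. This is only an illustration: the function names and the choice to represent a problem as two lists of addends are assumptions made here, not part of the original studies. Only the strategy labels and the problem 7 + 6 + 4 = 7 + __ come from the text.

```python
# Illustrative sketch only: function names and data layout are hypothetical.
# Problem from the text: 7 + 6 + 4 = 7 + __

def add_to_equal_sign(left, right):
    """Incorrect strategy: add only the numbers left of the equal sign."""
    return sum(left)

def add_all_numbers(left, right):
    """Incorrect strategy: add every number in the problem, on both sides."""
    return sum(left) + sum(right)

def correct_answer(left, right):
    """Value for the blank that makes the two sides equal."""
    return sum(left) - sum(right)

# 'right' holds only the known addends on the right side of the equation.
left, right = [7, 6, 4], [7]
print(add_to_equal_sign(left, right))  # 17, the answer given in speech
print(add_all_numbers(left, right))    # 24, the strategy shown in gesture
print(correct_answer(left, right))     # 10
```

Seen this way, the child in the mismatch example holds two partial procedures, neither of which yet yields the correct value for the blank.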
Children who produce mismatches in their explanations of a task
have information relevant to solving the task at their fingertips and
could, as a result, be on the cusp of learning the task. If so, they may
be particularly susceptible to instruction. To explore this hypothesis,
we gave nine- to ten-year-old children instruction on problems of the
4 + 5 + 3 = __ + 3 variety. Prior to instruction, all of the children solved
the problems incorrectly and all of their spoken explanations were
incorrect. However, the children differed with respect to their gestures:
Some produced gestures that did not match their speech, whereas
others produced matching gestures. After the instruction period, we
gave the children a second test to see how much they learned. We
found that children who had produced mismatches prior to instruction
were more likely to profit from instruction than children who had
produced no mismatches (Perry et al. 1988; see also Alibali and Goldin-
Meadow 1993). To test the generality of this finding, we conducted a
comparable study with five- to eight-year-old children using a different
task (reasoning about quantity) and found once again that children
who produced mismatches prior to instruction were more likely to
profit from instruction than children who produced matches (Church
and Goldin-Meadow 1986)—they were ready to learn. Gesture–speech
mismatch can serve as an index of a child’s readiness to learn a particular
task. Moreover, because the gestures in a mismatch convey substantive
information that is not found in speech, mismatches provide insight
into children’s newest and not-yet-digested notions, notions that their
teachers might want to consider teaching next.

Teachers can take Advantage of the Information Conveyed in a Child’s Gestures
Gesture–speech mismatches are not limited to a particular age or task,
nor are they characteristic of particular individuals. Moreover, gesture–
speech mismatch is not a personality trait—the same child who produces
many mismatches on one task can produce none on another (Perry et
al. 1988). Gesture–speech mismatch indicates when a particular child is
ready to profit from instruction on a particular task. In this way, gesture
offers information that could prove useful to teachers when instructing
children. Can teachers take advantage of this offer?
To find out, we observed eight teachers instructing children individually
in the concept of mathematical equivalence (Goldin-Meadow and Singer
2003). As we now would expect, the children’s gestures often revealed
knowledge that they did not seem to know they had. Consider, for
example, a child explaining how he solved the math problem 4 + 5 +
3 = __ + 3. The child said, “I added 4 plus 5 plus 3 plus 3 and got 15,”
demonstrating no awareness that this is an equation bifurcated by an
equal sign. His gestures, however, offered a different picture: He swept
his left palm under the left side of the equation—paused—then swept
his right palm under the right side. His gestures clearly demonstrated
that, at some level, he knew that the equal sign breaks the string into
two parts. The question we asked was whether teachers offer a different
type of instruction to children who produce gesture–speech mismatches
than to children who do not.
The answer is “yes.” The teachers gave more variable instruction to
children who produced mismatches than to children who produced
no mismatches in two respects (Goldin-Meadow and Singer 2003). (1)
They presented more different types of strategies for solving the math
problem in their instructions to children who produced mismatches
than to children who did not. (2) They produced more of their own
mismatches (i.e., more instructions containing two different strategies,
one in speech and one in gesture) to children who produced mismatches
than to children who did not. Most of the teachers’ mismatches
contained correct strategies in both gesture and speech. For example,
on the problem 7 + 6 + 5 = __ + 5, one teacher expressed an equalizer
problem-solving strategy in speech (“we need to make this side equal
to this side”) while conveying a grouping strategy in gesture (point
at the 7, the 6, and the blank—the two numbers that give the correct
answer if grouped and added together). Both strategies lead to correct
solutions yet do so via different routes.
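
A short Python sketch can show how the two correct strategies reach the same answer by different routes. The function names and the list representation of the problem are hypothetical, chosen here for illustration. The strategy labels (equalizer, grouping) and the problem 7 + 6 + 5 = __ + 5 are taken from the text.

```python
# Illustrative sketch only: function names and data layout are hypothetical.
# Problem from the text: 7 + 6 + 5 = __ + 5

def equalizer(left, right):
    """Make the right side total the same as the left side."""
    return sum(left) - sum(right)

def grouping(left, right):
    """Group and add the left-side addends that are not repeated on the right."""
    remaining = list(left)
    for n in right:
        # Assumes each known right-side addend also appears on the left.
        remaining.remove(n)
    return sum(remaining)

# 'right' holds only the known addends on the right side of the equation.
left, right = [7, 6, 5], [5]
print(equalizer(left, right))  # 13
print(grouping(left, right))   # 13, from 7 + 6
```

The grouping route works only when the right-side addend is duplicated on the left, which is true of the problems described in this chapter.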
Teachers use their students’ gestures to discover the thoughts those
students are unable to express in words, and they then change their
instruction in response. The question I turn to next is whether the
instruction that teachers spontaneously give children who are on the
cusp of change actually promotes learning.

Teachers' Gestures can Promote Learning


The teachers in our math study increased the number of gesture–speech
mismatches they produced when teaching children who themselves
produced mismatches, and the mismatching children profited from
the instruction (Goldin-Meadow and Singer 2003). However, these
children were ready to learn the math task—any type of instruction
on the problem might have resulted in improvement in mismatching
children. To determine whether the teachers’ instruction per se had a
hand in learning, we needed to manipulate instruction.
We gave nine- and ten-year-old children who did not know how
to solve mathematical equivalence problems instruction that was
modeled after the instruction the teachers had spontaneously used
in our naturalistic study (Singer and Goldin-Meadow 2005). Children
were taught either one or two problem-solving strategies in speech
accompanied by no gesture, gesture conveying the same strategy, or
gesture conveying a different strategy. We found that children were
indeed likely to profit from instruction with gesture, but only when
the gesture conveyed a different strategy from speech. Moreover, two
strategies were effective in promoting learning only when the second
strategy was taught in gesture, not speech.
The teachers were right—instruction in which gesture and speech
convey different information is indeed good for learning. It is unlikely,
however, that the teachers in our study were consciously aware of how
they used their hands to promote learning, nor is it likely that the
children were consciously aware of using their hands to display their
knowledge. Gesture provides an undercurrent of conversation that takes
place alongside the acknowledged conversation in speech. Although not
explicitly recognized, this under-the-surface conversation is influential.
Why?
Gesture provides a second representational format for presenting
ideas, one that has a strong visual component. In this sense, gesture is
like a diagram, physical model, or map—artifacts of the society that can
also play a role in structuring communication and thinking (see chapters by
Goodwin and Hutchins this volume). Gesture is unique, however, in that
unlike a map or a diagram, it is transitory—disappearing in the air
just as quickly as speech. But gesture also has an advantage—it can be,
indeed must be, integrated temporally with the speech it accompanies.
And we know that it is important for visual information to be timed
appropriately with spoken information for it to be effective (Baggett
1984; Mayer and Anderson 1991). Thus, gesture used in conjunction
with speech may present a more naturally unified picture to the student
than a diagram used in conjunction with speech. And because the two
ideas presented in a mismatch (one in speech and the other in gesture)
are temporally unified, the contrast between them may be particularly
salient, and as a result, may catalyze change.
It is clear that gesture is part of the complex multimodal interaction
system that characterizes human interaction. But what role does it
play? Hutchins (this volume) suggests that we must choose between
gesture being (1) an external expression of an internal representation
or (2) part of the multimodal interaction system that is itself thinking.
My own view is that this is an “and” not an “or” situation. In the math
tutorial I described, learning how to solve the problem correctly
was a joint activity shared by child and teacher, one in which both
participants’ hands played a contributing role (akin to the navigation
example Hutchins describes in which the crew gestured imaginary lines
of position and then used those gestures as a framework for coming to
an agreement about the ship’s position). This is an instance of gesture
being part of what Hutchins calls the thinking process. However, the
child’s gestures, even before instruction, were not empty movements
waiting for meaning to be supplied by the teacher. Consider the child
who pointed at all four numbers (the add-all-numbers strategy) while
saying that he added the 7, 6, and 4 to solve the problem 7 + 6 + 4 = 7
+ __. Hutchins’s view would lead us to hypothesize that this child does
not have an internal representation of the add-all-numbers strategy,
as the strategy was expressed in gesture but not in speech. But we can
show that this child does indeed have at least an implicit awareness of
the add-all-numbers strategy—the child will judge 24, the solution one
gets if all of the numbers are added together, to be an acceptable answer
to the 7 + 6 + 4 = 7 + __ problem (Garber et al. 1998). Gesture is not a
meaningless activity for the gesturer. Indeed, I suggest that it is because
gesture reflects the speaker’s internal representations that it can serve as
part of the process that leads to change in those representations.
The ideas that gesture externalizes are often incompletely thought
out. These incomplete ideas, once externalized, can become more
complete as a result of being operated on by others, as the math tutorial
and navigation examples illustrate. But the nascent ideas that speakers
express in gesture can also become more complete as a result of being
operated on by the speakers themselves (e.g., gesture could serve as a
cognitive prop allowing the child to think through the math problem
with greater ease; cf. Goldin-Meadow and Wagner 2005). There are
several ways in which ideas can come into being as a function of being
expressed in gesture.

Meeting Other Minds through Gesture
Although rarely acknowledged in the course of conversation, gesture
is always “out there.” Gestures are concrete manifestations of ideas for
all the world, not only the world of education, to see. Speakers produce
gestures that reveal to their listeners thoughts that are not apparent
in their speech. And listeners produce gestures that, in turn, have an
impact on the message their partner takes from the conversation. Hands
play an important role in our conversations.
Gesture is unacknowledged in the two communication situations I
have considered here. It nevertheless has a clear impact on conversation
in both (indeed, for the deaf children, it is the conversation). When
called on to accompany speech, gesture functions along with speech
without assuming its linguistic form and contributes to the give and
take between speakers and listeners. When called on to substitute for
speech, gesture takes over both the forms and the functions of language
and, again, is responsible for the give and take between participants
(although the exchange is less symmetrical—the deaf children give out
linguistic gesture but take in cospeech gesture, and the hearing parents
do the reverse). Both phenomena underscore the fact that conventional
language does not dictate communication—that the urge and capacity
to interact and communicate does not depend on a shared system passed
down from generation to generation.
Indeed, even when raised without access to conventional language
but in the company of other people, human children spontaneously
use their hands to communicate. And the hand gestures they invent
are used not only to make requests but, more strikingly, to share their
thoughts with others. Although many animals have complex social
lives and intricate systems of communication, no other animal society
has a communication system that is used just to share ideas—to tell
stories, to talk to oneself, to talk about talk. The need to interact with
others in a symbolic way appears to be a basic human trait, one that is
difficult to inculcate in other animals and equally difficult to repress
in human children.

Acknowledgment
This research was supported by grants from the National Institute on
Deafness and Other Communication Disorders (R01 DC00491), the
National Institute of Child Health and Human Development (R01
HD47450), and the Spencer Foundation.

References
Alibali, M., and S. Goldin-Meadow. 1993. Gesture-speech mismatch
and mechanisms of learning: What the hands reveal about a child’s
state of mind. Cognitive Psychology 25:468–523.
Baggett, P. 1984. Role of temporal overlap of visual and auditory material
in forming dual media associations. Journal of Educational Psychology
76:408–417.
Bekken, K. 1989. Is there “Motherese” in gesture? Ph.D. dissertation,
Department of Psychology, University of Chicago.
Church, R. B., and S. Goldin-Meadow. 1986. The mismatch between
gesture and speech as an index of transitional knowledge. Cognition
23:43–71.
Curtiss, S. 1977. Genie: A psycholinguistic study of a modern-day “wild-child.”
New York: Academic Press.
Garber, P., M. W. Alibali, and S. Goldin-Meadow. 1998. Knowledge
conveyed in gesture is not tied to the hands. Child Development
69:75–84.
Goldin-Meadow, S. 1982. The resilience of recursion: A study of a
communication system developed without a conventional language
model. In Language acquisition: The state of the art, edited by E. Wanner
and L. R. Gleitman, 51–77. New York: Cambridge University Press.
——. 1985. Language development under atypical learning conditions:
Replication and implications of a study of deaf children of hearing
parents. In Children’s language, vol. 5, edited by K. Nelson, 197–245.
Hillsdale, NJ: Erlbaum.
——. 1987. Underlying redundancy and its reduction in a language
developed without a language model: The importance of conventional
linguistic input. In Studies in the acquisition of anaphora: Applying
the constraints, vol. 2, edited by B. Lust, 105–133. Boston: Reidel
Publishing Company.
——. 2003a. Hearing gesture: How our hands help us think. Cambridge,
MA: Harvard University Press.
——. 2003b. The resilience of language: What gesture creation in deaf children
can tell us about language-learning in general. New York: Psychology
Press.
——. 2005. What language creation in the manual modality tells us
about the foundations of language. Linguistic Review 22:199–225.
Goldin-Meadow, S., C. Butcher, C. Mylander, and M. Dodge. 1994.
Nouns and verbs in a self-styled gesture system: What’s in a name?
Cognitive Psychology 27:259–319.
Goldin-Meadow, S., and D. McNeill. 1999. The role of gesture and
mimetic representation in making language the province of speech.
In The descent of mind, edited by M. C. Corballis and S. Lea, 155–172.
Oxford: Oxford University Press.
Goldin-Meadow, S., and C. Mylander. 1983. Gestural communication
in deaf children: The non-effects of parental input on language
development. Science 221(4608):372–374.
——. 1984. Gestural communication in deaf children: The effects and noneffects
of parental input on early language development. Monographs
of the Society for Research in Child Development 49:1–121.
——. 1998. Spontaneous sign systems created by deaf children in two
cultures. Nature 391:279–281.
Goldin-Meadow, S., C. Mylander, and C. Butcher. 1995. The resilience of
combinatorial structure at the word level: Morphology in self-styled
gesture systems. Cognition 56:195–262.
Goldin-Meadow, S., and M. A. Singer. 2003. From children’s hands to
adults’ ears: Gesture’s role in teaching and learning. Developmental
Psychology 39:509–520.
Goldin-Meadow, S., and S. M. Wagner. 2005. How our hands help us
learn. Trends in Cognitive Sciences 9:234–241.
Greenfield, P. M., and E. S. Savage-Rumbaugh. 1991. Imitation,
grammatical development, and the invention of protogrammar by an
ape. In Biological and behavioral determinants of language development,
edited by N. A. Krasnegor, D. M. Rumbaugh, R. L. Schiefelbusch, and
M. Studdert-Kennedy, 235–258. Hillsdale, NJ: Erlbaum.
Hockett, C. F. 1960. The origin of speech. Scientific American 203(3):88–96.
Hoffmeister, R., and R. Wilbur. 1980. Developmental: The acquisition of
sign language. In Recent perspectives on American Sign Language, edited
by H. Lane and F. Grosjean, 61–78. Hillsdale, NJ: Erlbaum.
Iverson, J. M., O. Capirci, E. Longobardi, and M. Caselli. 1999. Gesturing
in mother-child interaction. Cognitive Development 14:57–75.
Lucy, J. A. 1993. Reflexive language and the human disciplines. In
Reflexive language: Reported speech and metapragmatics, edited by J.
Lucy, 9–32. New York: Cambridge University Press.
Mayberry, R. I. 1992. The cognitive development of deaf children: Recent
insights. In Child Neuropsychology, vol. 7: Handbook of Neuropsychology,
edited by S. Segalowitz and I. Rapin; series editors F. Boller and J.
Grafman, 51–68. Amsterdam: Elsevier.
Mayer, R. E., and R. B. Anderson. 1991. Animations need narrations: An
experimental test of a dual-coding hypothesis. Journal of Educational
Psychology 83:484–490.
McNeill, D. 1992. Hand and mind: What gestures reveal about thought.
Chicago: University of Chicago Press.
Miller, P. J., and B. B. Moore. 1989. Narrative conjunctions of caregiver
and child: A comparative perspective on socialization through stories.
Ethos 17:43–64.
Morford, J. P., and S. Goldin-Meadow. 1997. From here to there and
now to then: The development of displaced reference in homesign
and English. Child Development 68:420–435.
Newport, E. L., and R. P. Meier. 1985. The acquisition of American Sign
Language. In The cross-linguistic study of language acquisition, vol. 1,
edited by D. I. Slobin, 881–938. Hillsdale, NJ: Erlbaum.
Ozcaliskan, S., and S. Goldin-Meadow. 2005. Do mothers lead their
children by the hand? Journal of Child Language 32:481–505.
Padden, C., and T. Humphries. 1988. Deaf in America: Voices from a
culture. Cambridge, MA: Harvard University Press.
Perry, M., R. B. Church, and S. Goldin-Meadow. 1988. Transitional
knowledge in the acquisition of concepts. Cognitive Development
3:359–400.
Phillips, S. B., S. Goldin-Meadow, and P. J. Miller. 2001. Enacting stories,
seeing worlds: Similarities and differences in the cross-cultural
narrative development of linguistically isolated deaf children. Human
Development 44:311–336.
Shatz, M. 1982. On mechanisms of language acquisition: Can features
of the communicative environment account for development? In
Language acquisition: The state of the art, edited by E. Wanner and L.
R. Gleitman, 102–127. New York: Cambridge University Press.
Singer, M. A., and S. Goldin-Meadow. 2005. Children learn when their
teachers’ gestures differ from speech. Psychological Science 16:85–89.
Singleton, J. L., J. P. Morford, and S. Goldin-Meadow. 1993. Once is not
enough: Standards of well-formedness in manual communication
created over three different timespans. Language 69:683–715.
Skuse, D. H. 1988. Extreme deprivation in early childhood. In Language
development in exceptional circumstances, edited by D. Bishop and K.
Mogford, 29–46. New York: Churchill Livingstone.
fourteen

The Distributed Cognition Perspective on Human Interaction

Edwin Hutchins

What comes to mind when a social or cognitive scientist thinks about
“human interaction”? The answer surely depends on the scientist’s field
of study, and most of us learn images of interaction implicitly as part
of being socialized into a scientific community. In
some corners of artificial intelligence, the prototypical interaction
is a sequence of turns in which strings of characters or symbols are
exchanged. For some conversational analysts, the interactions of interest
are mostly verbal: telephone conversations, for example. Ethnographers
of speaking may focus on face-to-face interactions, and that formulation
draws our attention to facial expression in addition to verbal behavior. To
go further in this direction, the phrases that describe our default images
of interaction become awkward. Many of us speak about “multimodal
interaction,” but at the workshop leading to this volume, Emanuel
Schegloff reminded us that this phrase is redundant. So, shall we simply
say “interaction” and hope that others’ imaginations are as rich as our
own? My personal preference is to emphasize the way participants in
an interaction coinhabit a shared environment. No matter how they
are described, our default images of human interaction have powerful
consequences for the way we do science. Such images guide decisions
about where we look for evidence concerning the nature of human
interaction. They shape our understandings of what the observed
evidence means. (What is the nature of human interaction, and what
phenomena are our theories supposed to explain?) Finally, such images
affect how we choose to explain the origins of contemporary human
interaction.
Cognition in Interaction

This chapter has three substantive parts. The first part describes how
the distributed cognition perspective directs our attention to particular
classes of interactions. The second part uses the examination of an
example of real-world human interaction to construct a description of
the nature of interaction. This examination shows real-world interaction
to be deeply multimodal and composed of a complex network of
relationships among resources. It also shows that some cognitive
processes are properties of the system of interaction, distinct from the
cognitive properties of the individuals who participate in the system.
The last part explores the implications of this “naturalized” notion
of human interaction for our understanding of both the nature of
contemporary cognition and the kinds of processes that might have
given rise to contemporary cognition. What evolves is not the brain
alone, but the system of brains, bodies, and shared environments for
action in interaction. Cultural practices are as much a part of the story
of cognitive evolution as are changes in brain structure. This means
that important milestones in cognitive evolution could, in principle,
have been achieved without any particular genetic adaptation being
associated with them.

Distribution Means Interaction


The subfield of cognitive science called “distributed cognition” does
not study any particular kind of cognition; it is an approach to the
study of all cognition. It assumes that cognitive processes are always
distributed in some way. Rather than assuming a boundary for the
unit of analysis a priori, distributed cognition follows Bateson’s (1972)
advice and attempts to put boundaries on its unit of analysis in ways
that do not leave important things unexplained or unexplainable. This
means that a group of people working together is a distributed cognition
system. In such a case, cognition is distributed across brains, bodies, and
a culturally constituted world. Describing the cognitive properties of this
unit of analysis has been the most obvious contribution of distributed
cognition to cognitive science, and is certainly the most relevant aspect
of the approach for anthropologists. An individual working alone with
material tools is also a distributed cognitive system, as is an individual
working alone without material tools. So too is an individual brain situated
in the body, or the brain without consideration of the body because
cognition is distributed across areas of the brain. Even single areas of
the brain are studied now as systems in which cognitive function is
distributed across layers of neurons. And the same is true down to the
level of a network of neurons in the brain. The point is that distributed
cognition is not a kind of cognition at all, it is a perspective on cognition.
Its chief value is that it poses questions in new ways and leads to new
insights.
When applied to systems that are larger than an individual actor,
distributed cognition is an approach to cognition that is deliberately
framed in a way that keeps culture in mind. When units of analysis
that are larger than an individual are examined as cognitive systems,
acknowledging the involvement of culture with cognition is unavoidable.
Distributed cognition sees real-world cognition as a process that involves
the interaction of the consequences of past experience (for individual,
group, and material world) with the affordances of the present. In this
sense, culture is built into the distributed cognition perspective as at
least a context for cognition.
From a cultural point of view, cognition is distributed through time,
between person and a culturally constructed environment, and among
persons in socially organized settings.

Interaction with Social Others


Just as physical labor can be distributed among persons, cognitive labor
can also be distributed among persons. This distribution of cognitive
labor is always mediated by human interaction. It relies on human
sociality and forms the context for sociality and its development. This
was known by anthropologists long before the distributed cognition label
was coined. Durkheim and his students, especially Halbwachs (1925),
explored socially distributed memory. Douglas’s classic How Institutions
Think (1986) examined the ways that reasoning and rationality are
shaped by institutional organization and goals.
The distribution of cognitive labor can give rise to supraindividual
cognitive effects. That is, social groups can have cognitive properties
that are distinct from the cognitive properties of the individuals who
compose the group. Although similar arguments can be made for all
cognitive processes, it is perhaps easiest to illustrate these processes in
the domain of memory. Jack Roberts’s (1964) analysis of the memory
storage and retrieval properties of various Native American groups was
one of the first to read social organization as computational architecture
(although he did not use that language). Bartlett’s seminal 1932 studies
of and theorizing about the reconstructive nature of memory (Bartlett 1995)
led to more recent studies of collective remembering (Middleton and
Edwards 1990). Decision making and interpretation formation are
socially distributed in a number of institutional settings including juries,
intelligence agencies, military units, markets, and elections (Surowiecki
2004). Computer simulation studies (Hutchins 1995b; Henty 1999)
have shown that simply changing patterns of communication among
decision makers in distributed systems can change the likelihood
of various classes of decision outcomes. The notion that patterns of
information flow have cognitive consequences was explored in the
domain of commercial airline flight decks by Hutchins (2000) and
extended to implications for design of work systems in general in Hollan
et al. (2001).

Interaction with the Material Environment


A second kind of interaction that involves the distribution of cognition
is the interaction of persons with their material environment. A person
in interaction with cognitive artifacts can have cognitive properties
different from those of a person alone. Noticing the similarity between
artifacts that amplify our physical strength, and those that amplify
sensory processes, Bruner et al. (1966) proposed that some artifacts, such
as language and symbolic systems, could be conceived of as amplifiers of
cognitive capacities. In cognitive science the notion that the immediate
environment can be considered as external memory was noted by Newell
and Simon (1972). The notion of cognitive artifacts as a class of objects
was explored by Norman (1994) and Hutchins (1999).
Cognitive artifacts have their effects by reorganizing cognitive capacities
into functional constellations that provide the new capabilities.
Cole and Griffin (1980) refined the cognitive amplifier view by noting
that these artifacts do not actually amplify any existing cognitive capacity.
Rather, when a person performs a cognitive task (e.g., remembering)
in coordination with cognitive artifacts (e.g., using paper and pencil)
a different set of internal and external resources is assembled into a
dynamical functional system (Luria 1966) that does the job. In this
“functional-systems” view, cognitive artifacts are transformers of cognitive
systems rather than cognitive amplifiers. The focus here is on the
interaction between internal processes and structures and processes
in the environment. The functional-systems framing of distributed
cognition has been applied to flight deck cognition (Hutchins and
Klausen 1996) and to the question of how a flight deck remembers
speeds (Hutchins 1995b). The latter article showed why a complete
knowledge of the psychology of individual memory would be inadequate
to understand memory in an airline flight deck. Humans inhabit a
cognitive ecology that contains many sorts of cognitive resources. Some
of these are physical objects, some are cultural practices, and some are
mental models. Cognitive effects emerge from the interaction of persons
with the rich cultural content of the cognitive ecology.

Interaction of the Present with the Past


The development of cultural cognitive ecology is itself a cognitive process.
It is a kind of learning process. Culture is a process that, among
other things, accumulates partial solutions to frequently encountered
problems. Artifacts and practices have historically contingent cultural
developmental trajectories. As cultural creatures, we need not discover
the solutions to most of the problems we face. Both the framing of
problems and their solutions are already available for learning as part
of our cultural heritage. Hutchins and Hazlehurst (1991) provide a
computational demonstration of the fact that a community can learn
things that no individual could ever learn alone.
Evolutionary processes operating in the cognitive ecology can build the
structure of a task into the structure of artifacts (Hutchins 1995a). Stated
more accurately, cultural evolutionary processes build the structure of
task performance into the organization of the system of activity that
exists when the artifact is engaged through cultural practices.
Finally, all of these sorts of interaction and distribution take place
simultaneously in real-world activity. Many of the effects described in this
section emerge from the interactions of elements in a complex system.
There is an important methodological corollary to this observation. It has
been known in cognitive science for some time that behavior patterns
that can be economically described by rules or by goal seeking need not
be the products of processes that include the explicit representation of
either rules or goals. There are deep philosophical issues here concerning
the ascription of processes that include the representations of goals to
account for what appears to be goal-directed behavior. (See Clark 2001,
chs. 3–4.) The point is to keep the description of the patterned behavior
clearly separate from the description of the process that produces the
behavior. Many kinds of processes can produce behavior that appears
goal directed, but not all of them involve any representations at all. As
a simple example, consider Braitenberg’s (1984) vehicle number two.
This is a simple robot consisting of a body, two laterally mounted light
sensors, and two laterally mounted drive wheels. Each wheel is driven
at a speed that is a monotonically increasing function of the activity
of the attached light sensor. If the light sensors are connected to the
motors ipsilaterally, the robot turns away from light. If the light sensors
are connected to contralateral motors, then the robot turns toward
light. When observing the robot move in the vicinity of light sources
it is very tempting to say that it avoids light or that it seeks light. But
words such as “avoid” and “seek” invite the attribution of internal
mechanisms, representations of goals for example, that are clearly not
present in the robot. The same goes for many other kinds of cognitive
processes. That is, many so-called “cognitive processes” are identified
by patterns of observable behavior, whereas the nature of the processes
that produce those observable behaviors may be very different from
the patterns that are produced. This is true at the level of an individual
person as well as at the level of groups of persons. What this means for
our images of interaction is that once we commit to the notion of rich
interaction, even deciding what task is being accomplished may depend
on knowing something about how it is accomplished.
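The wiring argument can be made concrete with a short simulation. The following Python sketch is an illustration only and is not Braitenberg's own formulation: the distance-based sensor model, the linear (and therefore monotonically increasing) mapping from sensor activity to wheel speed, the body dimensions, the gain, and the names `sensor_reading` and `simulate` are all assumptions introduced here. Nothing in the code represents a goal; turning toward or away from the light follows only from which sensor drives which wheel.

```python
import math


def sensor_reading(sensor_xy, light_xy):
    """Light intensity at a sensor; falls off with squared distance (assumed model)."""
    d2 = (sensor_xy[0] - light_xy[0]) ** 2 + (sensor_xy[1] - light_xy[1]) ** 2
    return 1.0 / (1.0 + d2)


def simulate(wiring, steps=20, dt=0.1, light_xy=(0.0, 5.0),
             half_width=0.5, gain=10.0):
    """Drive a two-sensor, two-wheel robot and return its final (x, y, heading).

    wiring: "ipsilateral" (each sensor drives the wheel on its own side) or
            "contralateral" (each sensor drives the wheel on the opposite side).
    The robot starts at the origin facing along +x, so the light is on its left.
    Heading is in radians, counterclockwise positive.
    """
    if wiring not in ("ipsilateral", "contralateral"):
        raise ValueError("wiring must be 'ipsilateral' or 'contralateral'")
    x, y, heading = 0.0, 0.0, 0.0
    for _ in range(steps):
        # Unit vector pointing to the robot's left.
        left_dir = (-math.sin(heading), math.cos(heading))
        left_sensor = (x + half_width * left_dir[0], y + half_width * left_dir[1])
        right_sensor = (x - half_width * left_dir[0], y - half_width * left_dir[1])
        left_in = sensor_reading(left_sensor, light_xy)
        right_in = sensor_reading(right_sensor, light_xy)
        # Wheel speed is a monotonically increasing (here linear) function
        # of the activity of the sensor connected to that wheel.
        if wiring == "ipsilateral":
            left_wheel, right_wheel = gain * left_in, gain * right_in
        else:
            left_wheel, right_wheel = gain * right_in, gain * left_in
        # Differential-drive kinematics: no goals, no representations.
        forward = (left_wheel + right_wheel) / 2.0
        turn_rate = (right_wheel - left_wheel) / (2.0 * half_width)
        x += forward * math.cos(heading) * dt
        y += forward * math.sin(heading) * dt
        heading += turn_rate * dt
    return x, y, heading


if __name__ == "__main__":
    for wiring in ("ipsilateral", "contralateral"):
        x, y, heading = simulate(wiring)
        print(wiring, round(x, 3), round(y, 3), round(math.degrees(heading), 1))
```

With the light placed to the robot's left, the ipsilateral wiring produces a clockwise (negative) change of heading, away from the light, and the contralateral wiring produces a counterclockwise (positive) change, toward it. An observer could describe the two runs as "avoiding" and "seeking," yet the only difference in the code is the assignment of sensors to wheels.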

Real-world Interaction
All of the sorts of interaction and distribution described in the previous
section take place simultaneously in real-world activity. This section
presents an instance of rich, culturally grounded, real-world interaction.
In Cognition in the Wild (Hutchins 1995a), I used an extended study
of ship navigation to show how the cognitive science of real-world
activity could be accomplished. That book emphasized the distribution
of cognitive processes between persons and technology, among people,
and across time in the development of the social and material context
for thinking. My research group recently undertook a reanalysis of some
of the video data from the ship navigation study.

A Brief Case Study


I have selected for analysis about ten seconds of interaction in which
two navigators (a bearing recorder and a plotter on the bridge of a navy
ship) choose landmarks to use in the next position fix (see Figs 14.1 and
14.2). Position is determined by measuring the bearing of landmarks
and plotting these bearings on a chart. A plotted bearing defines a line
of position (LOP). Three lines of position define a position fix (Fig.
14.3). This is a clear case of distributed cognition. The individual and
institutional knowledge of ship’s position is produced by the activity of
a complex system involving interaction among persons and complex
culturally organized material media.
Figure 14.1. Navigation team on the bridge of a navy ship. The bearing
recorder is in the foreground. The plotter is to his left.

The navigators have projected the estimated position of the ship at
the time of the next position fix (the half-circle in Fig. 14.3). They
must now choose three landmarks, such that the LOPs that will be
observed at the next fix time will intersect at useful angles (Fig. 14.4).
This is an instance of a cognitive process: choice. It is useful to note
here that the appropriateness of a chosen LOP is not a property of the
LOP itself; it is a property of the relations of the LOP to the other
chosen LOPs. That is, although a position fix consists of three elements
(LOPs), none of the individual elements can be said to be good or bad
with respect to the choice criteria. The criteria refer to the relations
among elements, not to the elements themselves. This can be taken as
a model of a more profound phenomenon to be encountered below.
The meanings of elements of multimodal interactions are not properties
of the elements themselves, but are emergent properties of the system
of relations among the elements.
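The relational character of the choice criterion can be illustrated with a small calculation. The Python sketch below is a simplified illustration and not the navigators' procedure: the flat x-y chart coordinates, the landmark positions, and the function names are hypothetical, and no threshold for a "useful" angle is taken from the source. It computes the bearing from the estimated position (EP) to each landmark and reports the smallest angle at which any two of the resulting LOPs cross.

```python
import math
from itertools import combinations


def bearing_deg(ep, landmark):
    """Bearing from the EP to a landmark, in degrees clockwise from north (+y)."""
    dx = landmark[0] - ep[0]
    dy = landmark[1] - ep[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0


def crossing_angle_deg(bearing_a, bearing_b):
    """Angle (0 to 90 degrees) at which two lines with these bearings cross."""
    diff = abs(bearing_a - bearing_b) % 180.0
    return min(diff, 180.0 - diff)


def smallest_crossing_angle(ep, landmarks):
    """Smallest pairwise crossing angle among the LOPs through the EP."""
    bearings = [bearing_deg(ep, lm) for lm in landmarks]
    return min(crossing_angle_deg(a, b) for a, b in combinations(bearings, 2))


if __name__ == "__main__":
    ep = (0.0, 0.0)                      # hypothetical estimated position
    west = (-10.0, 0.0)                  # hypothetical landmarks
    north_east = (10.0, 10.0)
    south_east = (10.0, -10.0)
    nearly_west = (-10.0, 1.0)

    open_set = [west, north_east, south_east]
    narrow_set = [west, nearly_west, north_east]
    print("open set:", round(smallest_crossing_angle(ep, open_set), 2))
    print("narrow set:", round(smallest_crossing_angle(ep, narrow_set), 2))
```

The landmark `west` appears in both sets. In the first it contributes to a fix whose LOPs all cross at 45 degrees or more; in the second it is paired with a landmark on almost the same bearing, and the two LOPs cross at less than 6 degrees. The same element is acceptable or not depending only on its relations to the others.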
Figure 14.2. Enacting provisional lines of position. The bearing recorder is
completing his conversation turn while plotter positions his hand to take the
(gesture and talk) floor.

A transcript of the verbal part of the interaction among the navigators
looks like this:
1 BR: so: it’ll be that (1.9) n that (1)
2 P: Ballast Point (.7) Bravo (1)
3 BR: u:[h
4 P: [that’s good (.5)
5 BR: okay (1.2)
However, the verbal exchange is just one element of the interaction.
The next paragraph gives a richer ethnographically informed description
of the activity.
The bearing recorder first proposes two landmarks to use at the next
fix. He leans over the chart (saying “It’ll be. . .”) and uses his left index
finger to quickly trace a line from a landmark called Ballast Point to the
approximate location of the estimated future position of the ship (saying
“that”). His finger wavers for a moment making a loose clockwise loop
over the chart then he traces a line from the landmark called Bravo
Wharf (saying “ ‘n that”). The bearing recorder’s left hand remains in
the vicinity of the estimated position and he pauses for one second.
(This moment is shown in Fig. 14.2.) The plotter interrupts the bearing
recorder’s activity by moving his right hand, middle finger extended,
into the area over the chart where the estimated position has been
plotted. The bearing recorder withdraws his left hand from the area as
the plotter’s right hand comes in. Quickly tracing the imagined lines
of position from each landmark as each is named, the plotter revisits
the same landmarks just mentioned by the bearing recorder, “Ballast
Point, Bravo.” The bearing recorder tries to retake the floor by leaning
over the chart and reaching toward the plotting area with his left hand,
saying, “u:h,” but the plotter rebuffs him by making another gestured
LOP from the vicinity of the depiction of Light Victor to the EP
(half-circle) and saying, “that’s good.” Because Light Victor is located to the
east of the EP, this gesture both indicates a third LOP and effectively
blocks the entry of the bearing recorder’s hand to the plotting area. The
bearing recorder pulls his left hand back, rests it on the chart table in
front of him and says, “Okay.”
Figure 14.3. Three lines of position fix the position of the ship (represented
by the triangle). The anticipated course extends from the fix triangle to the
estimated position, EP (half-circle), where the ship is expected to be at the time
of the next fix. Ballast Point is at left center, Bravo Wharf is above to the right,
Light Victor is to the right.

Figure 14.4. The dashed lines indicate a poorly chosen pair of landmarks
for the next fix. The angles of intersection among the LOPs should be open.
Using either of the dashed lines with the two piers ahead would produce more
favorable angular relations among the LOPs.

As the navigators work, they use their fingers to trace lines from various
landmarks to the vicinity of the estimated position. The gestures
enact imaginary or provisional LOPs. These ephemeral structures are
the representations on which the choice process operates. The criteria
for evaluation are the angles of intersection among the prospective
LOPs. The creation and evaluation of the proposed LOPs is carried
out in a conversation between the two workers. The conversational
turns are multimodal in that they include environmentally coupled
gesture, cogesture speech, body orientation, facial expression, and tool
manipulation.

Environmentally Coupled Gesture


The gesture is complex. The hands of the participants move around
a lot over the chart (Fig. 14.5). Some parts of the gesture stream are
meaningful. Some are not. Some gestural strokes represent lines of
position, whereas other strokes reposition the hand to begin a meaningful
stroke. How do the participants distinguish the meaningful parts of the
gesture from the parts they should disregard? First, the participants
know that the objects of interest are virtual lines of position. These
lines should link landmarks with the ship’s estimated position. This
is part of the common ground shared by the navigators. As Enfield
(2005, this volume) and Clark (this volume) demonstrate, features of a
shared task world can contribute to the establishment and maintenance
of common ground. The bearing recorder says, “It’ll be that ‘n that.”
The seemingly unbound anaphora of “It” refers to the object of the
understood current project that is the triplet of landmarks to be used in
plotting the next position. An experienced navigator can see the chart
as landmarks and EP. The trajectory of the gesture superimposed on the
interpreted chart picks up some possible lines of position but seems to
have nothing to do with others (Fig. 14.6). The trajectory of the gesture
does not unambiguously pick out the potential lines of position that
are being proposed by the bearing recorder.
Figure 14.5. The trajectory of the bearing recorder’s gesture is complex.

Figure 14.6. The trajectory of the bearing recorder’s gesture as it was
performed over the chart. Tick marks on the gesture trajectory indicate the
location of the bearing recorder’s index finger in each frame of the video. The
filled arrows indicate the LOPs that are made salient by the combination of the
many cues produced by the bearing recorder. These are the LOPs he proposes
for consideration.

The bearing recorder’s gesture also has a velocity profile. That is, some
parts of the motion of the bearing recorder’s hand are fast, others are
slow, and others come nearly to a stop. Velocity is probably an indication
of many conceptual elements and of the affective states of actors as
well. Meaningful gestures often come in the form of strokes that are
demarcated by pauses before and after the meaningful stroke. These
are called pre- and poststroke holds (McNeill 1992). A frame-by-frame
analysis makes it possible to indicate the location of the hand in each
frame of the video clip. Fig. 14.6 indicates the location of the hand in
each frame by a tick mark on the gesture trajectory. The density of tick
marks is a measure of velocity. Sparse tick marks indicate rapid motion,
whereas dense areas of tick marks indicate slow motion. The velocity
profile indicates pre- and poststroke holds for two gestural strokes: one
on the ESE-ward (East–Southeast) stroke from Ballast Point through the
EP, and the other on the SSW (South–Southwest) stroke from Bravo
Wharf through the EP. These holds give special salience to these sections
of the gestural trajectory.
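The reading of velocity from tick-mark density can be sketched computationally. The Python fragment below is a minimal illustration, not the analysis actually performed on the video: the frame rate of 30 frames per second, the speed threshold, the minimum hold length, the sample coordinates, and the function names are all assumptions introduced for the example. Given one fingertip position per video frame, it computes the speed between successive frames and reports runs of near-zero speed, which correspond to the dense clusters of tick marks that mark pre- and poststroke holds.

```python
import math


def frame_speeds(points, fps=30.0):
    """Speed between successive per-frame positions (distance units per second)."""
    return [math.dist(p, q) * fps for p, q in zip(points, points[1:])]


def find_holds(speeds, threshold, min_frames=2):
    """Return (first, last) index pairs of runs where speed stays below threshold.

    Indices refer to inter-frame intervals; a run counts as a hold only if it
    lasts at least min_frames intervals.
    """
    holds = []
    start = None
    for i, speed in enumerate(speeds):
        if speed < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                holds.append((start, i - 1))
            start = None
    if start is not None and len(speeds) - start >= min_frames:
        holds.append((start, len(speeds) - 1))
    return holds


if __name__ == "__main__":
    # Hypothetical fingertip positions (centimeters), one per video frame:
    # a pause, a rapid straight stroke, then another pause.
    track = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0),
             (2.0, 0.0), (4.0, 0.0), (6.0, 0.0),
             (6.0, 0.0), (6.0, 0.0)]
    speeds = frame_speeds(track)
    print("speeds:", speeds)
    print("holds:", find_holds(speeds, threshold=5.0))
```

In this toy track the stroke occupies the intervals between the two detected holds. A real trajectory would be noisier, so the threshold and minimum hold length would have to be chosen with the data in view.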
Another useful cue is the shape of the gestural trajectory. Because
lines of position are, by definition, straight, gesture segments that are
curved are unlikely to be meaningful representations of virtual lines of
position. Correspondences between potential lines of position and the
linear segments of the gesture add plausibility to some potential lines
of position and not others. Again, this cue is not, by itself, sufficient to
pick out the lines of position being proposed by the bearing recorder.
The straightest section of the gesture trajectory does not correspond
to any possible LOP.
Some parts of the gesture are performed many centimeters above
the surface of the chart, whereas others are performed with the tip of
the finger in contact with the chart. Real lines of position are drawn
by putting a pencil in contact with the surface of the chart. Making
contact with the chart seems to add perceptual salience to these parts
of the gesture. The two strokes that correspond to the intended lines of
position are made with the tip of the finger in contact with the surface
of the chart.
Finally, one can add the cogesture speech to the representation. The
bearing recorder says, “It’ll be that ‘n that.” The two occurrences of the
indexical “that” are produced in synchrony with the two meaningful
strokes and add to their perceptual salience. These words mediate the
allocation of attention, of speaker and listener, to the gestural performance.
These words imply a structure of relationship among the elements
of the multimodal system (something will be composed of two parts),
but the identity of the elements and the nature of their relationship is
not in the words alone; it is in the interpretation of the environmentally
coupled gesture.
The combined contributions of these cues unambiguously pick out
two gestural strokes as representations of proposed lines of position.
These are not the straightest strokes in the gestural trajectory, nor are
they exactly aligned with landmark or estimated position, but the
combination of cues marks them as unambiguously meaningful. The
meanings of the motions that constitute the gesture are established by
their relations to the other elements of this complex act of meaning
making.
No one knows in what order or how these cues are perceived, processed,
or combined. This is precisely the problem indicated by Levinson
under the heading of the “binding problem” (Levinson this volume).
Multimodal signal streams require the linking of elements that belong
to one another across time and modality. Although all of the cues have
discernable physical properties, it is not the signals themselves that
make the cues relevant. It is the meaning that the overall performance
has as part of the understood project at hand.

Thinking with Brain, Body, and Culturally Constructed World


The cultural practice of gesturing in meaningfully interpreted space
brings the objects of interest, potential lines of position, into existence.
This is an example in which high-level cognition is enacted in the
motion of the body in shared culturally meaningful space. It is also
likely that this cultural practice takes advantage of some very general
properties of brain organization.
The distributed cognition perspective makes the boundary around the
person permeable and leads to a natural curiosity about the relations of
the other cognitive structures to activity in the brain. Unfortunately,
little is known about how the brain accomplishes high-level cognition.
In recent years, a large number of brain imaging studies have hinted
at the promise of being able to localize some kinds of processes in
regions of space (fMRI) or time (ERP), but the actual mechanisms remain
obscure. For example, retinotopic maps in the visual system exist in
low-level visual areas, but at higher levels the patterns of activation turn
into something other than topological variants of retinal representation,
and the significance of spatial patterns of activity becomes increasingly
difficult to interpret. However, even without knowing the details of the
processes, it is possible to say some things confidently about brain and
cognition from a distributed cognition perspective.
The simple acts of seeing the landmarks and the ship’s estimated
position on the chart bring visual processes into coordination with
structure in the chart and with memories for the depiction of the
landmarks. This is already a complex process because the memory may
be recall of specific depictions of known landmarks and/or recognition of
landmarks through the interpretation of the graphical conventions used
in cartography. In either case, these marks on the chart are recognized as
depictions of landmarks and the previously plotted estimated position.
The visual and somatosensory systems produce many representations
of the location of the points of interest and the spatial relations among
them. There are certainly retina-centered representations, but probably
also head-centered and body-centered representations (personal
communication, M. Sereno, October 5, 2003). Each representational
system may encode multiple features such as location, direction of
motion, and velocity. High-level conceptual and visual constraints on
what a LOP can look like and where it can occur on a chart support the
imagination of possible lines of position. This may be coordinated with
eye movements tracing the LOP or saccading between the depiction
of the estimated position and the depictions of the landmarks. Thus,
simply seeing the chart as a meaningful space is already a complex
cognitive activity.
So why gesture? By superimposing gesture on the meaningfully
interpreted chart surface a navigator adds representations of motion
to the visual system and representations of the trajectories of motion
of the hand and fingers to the somatosensory system. At present,
no one knows exactly how these representations work, but imaging
studies show that there are from ten to fifteen parietal areas carrying
coordinated representations of space and motion in space (personal
communication, M. Sereno, October 5, 2003). The hands, guided by
conceptually meaningful visual and motor representations, act in the
world, thereby producing new, richer, more complex, and more integrated
brain representations. By acting and monitoring one’s own action at
the same time, one uses brain processes to guide activities that entrain
more brain processes. This is a self-organizing process that is located
in the brain–body–world system.
Reasoning about the angles of intersection of the LOPs requires stable
representations of the LOPs. The robustness of the high-level cognition
depends on the way this activity coordinates a large number of related
representations, some in the environment of action, some in the body,
and some in the brain. The cultural practices take advantage of the way
the brain works to bring into existence multiple representations that
together are more stable than any single representation alone.
The practice the navigators engage in is located in a complex cognitive
ecology. The practice of gesturing to imagine lines of position brings into
coordination many elements in a rich web of constraints that includes
the technological tools of the job, the social relationships and division
of labor among the people, the functional organization of the brain, and
the culturally shaped ways of using the body. The high-level cognitive
accomplishment, choosing appropriate landmarks, depends on all of
these things. Each element of the system makes sense in the context of
its relations to the other elements. This tight web of interrelationships
is typical of real-world cognitive ecologies. In such systems the correct
unit of analysis is not one brain or even one semiotic modality, such as
speech or gesture taken in isolation, but the entire system. The meaning
of a complex emerges from the interactions among the modalities that
include the body as well as material objects present in the environment.
The effects of these interactions are generally not simply additive. Such
a meaning complex may be built up incrementally or produced more
or less whole, depending on the nature of the components and the
relations among them (see Alač and Hutchins 2004; Goodwin 1994,
this volume; Hutchins and Palen 1997).
Navigation is a special domain of activity, and this sometimes gives
rise to concerns about the generalization of the findings made here.
Fortunately, navigation does not involve cognitive processes that are alien
to everyday life. Rather, what is special about this setting is how well it
supports enacted reasoning. The generalization of results must be tied to
the distribution of the mode of thinking, not to the characteristics of the
setting. We now have ample evidence that enacted reasoning is surely
a very widespread phenomenon. Goodwin (this volume) highlights
the importance of the meaningfully interpreted material world when
he says that environmentally coupled gesture is pervasive. Others in
this volume who describe the central involvement of meaningfully
construed material environment include Byrne, Clark, Enfield, Gaskins,
Goldin-Meadow, Hanks, Keating, Levinson, Liszkowski, and Tomasello.
The observations reported by these authors span species, cultures, and
levels of development. It is therefore likely that embodied reasoning is
a very old and widespread cognitive process.

Implications: Being Human


Although researchers are increasingly attending to interactions in which
the physical and social environments for action play an important role,
that role is still not clearly conceptualized. In the example presented
above, we saw that the social distribution of cognitive labor increases
the variability in the choice space. The conversational practice of taking
turns suggesting and evaluating options creates a cognitive system that
is likely to explore a wider range of alternatives than would be explored
by any navigator alone. Our folk theories assume that thought precedes
action. I have tried to show that in some activity settings, acting in the
world is thinking (see also Alač and Hutchins 2004). Finally, processes
of cultural evolution can produce activity settings in which simple
courses of action can produce powerful cognitive processes.
With these observations, I offer a sketch of an image of interaction
as a complex dynamic system. Typical human–human interactions are
composed of many elements, the meanings of which emerge from
the network of relations among the elements. For example, the
representations of the provisional imagined LOPs are emergent properties
of the complex activity system. They cannot be partialed out as being x
percent in the brain, y percent in the body, and z percent in the world.
Like the components of a position fix, the parts of a meaningful human
interaction only mean what they mean by virtue of their roles in the
whole culturally understood activity.

Implications: Becoming Human


All serious cognitive scientists acknowledge the importance of symbolic
processes in human cognition. But where, when, and how are symbols
involved in human cognition? As noted in the discussion of the nature
of interaction above, much more work needs to be done to document
the distribution of cognitive strategies across space, culture, and context.
Although internal symbol processes must be inferred from observable
behavior, the use of external symbols is quite apparent. And this provides
the basis for some speculations about symbolic processes.
In a seminal work, Rumelhart et al. (1986) argue that individual
humans are good at three sorts of activity: (1) recognizing patterns, (2)
manipulating the physical world, and (3) imagining simple dynamical
processes. They describe how these processes could be invoked by a
person doing place-value multiplication with paper and pencil.
Each cycle of this operation involves first creating a representation
through manipulation of the environment, then a processing of the
(actual physical) representation by means of our well-tuned perceptual
apparatus leading to further modification of this representation. By
doing this we reduce a very abstract conceptual problem to a series of
operations that are very concrete and at which we can become very
good. . . . This is real symbol processing and, we are beginning to think,
the primary symbol processing that we are able to do. [Rumelhart et al.
1986:46]

In the example of the navigators enacting lines of position, we see
that manipulating the world and imagining the dynamics of simple
worlds happen together. Environmentally coupled gestures allow the
navigators to use the motion of their bodies to imagine prospective
lines of position. In doing this, they are reasoning about properties of
the relations among the enacted LOPs. The representations of interest
here do not exist until they are enacted in the world of action. They
come into being as external representations created in the complex
interactions of the navigators with each other and the technology of
the job. Once these representations have been created in ephemeral
external form, the consequent multiple coordinated internal images
of them have the persistence needed to support reasoning about the
angular relations among them.
Once LOPs have been experienced as external representations, they
can be imagined. A navigator could even, perhaps, imagine gesturing,
thereby creating imagined enacted prospective LOPs, although one
suspects that the results of such imagining would not be as stable or
persistent as the results of actually making the gestures. As Rumelhart et
al. note, “Not only can we manipulate the physical environment and then
process it, we can also learn to internalize the representations we create,
‘imagine’ them, and then process these imagined representations—just
as if they were external” (1986:47). This story does not explain how
external representations arise, but it does claim that once external
representations arise, there is a possibility of those representations being
imagined by a person, and the person imagining transformation of
those representations.
The argument above assumes that symbols could arise in interaction
before they arise internally. Is such a thing possible? When cognition takes
place in the interaction of the mind with the surrounding environment,
there is a new place to look for the origins of cognitive processes and
structure. This is important because so many theories of the origins
of human cognitive capacities go wrong by positing special processes,
modules, or evolutionary miracles that seem necessary to construct a
plausible story for the development of cognitive capabilities entirely
inside isolated individual brains. The origins of features of language are
good examples of this. But computer simulation studies have shown
that communities of agents in interaction can develop shared lexicons
(Hutchins and Hazlehurst 1995) and shared propositional structure
(Hazlehurst and Hutchins 1998; Hutchins and Hazlehurst 2002) without
any change to processes inside the agents.
In thinking about the roots of sociality and cognition, it is a common
practice to project an image of activity into the past and imagine what
functional properties evolution could select for to produce a more
advantageous activity. When social interaction is our target, what sort
of image of interaction shall we project into the past? Projecting the
image of complex, multimodal, environmentally coupled interaction
into the past illuminates new possibilities for development. Change can
take place anywhere in the complex interaction system. This means
that one need not imagine that all mechanisms of change are lodged
inside individual organisms. Just as the image of complex multimodal
environmentally coupled interaction gives us a new place to look for the
sources of organization of ongoing behavior, it also gives a new place
to look for the developmental changes across phylogenetic time.
A very similar argument is made in contemporary evolutionary biology.
Oyama (2000) argues that the system that evolves is not the genome,
but the phenotype in context (see also Turner 2000). The central dogma
of evolutionary biology is that all important change resides in the
genome. But the system that evolves is a wider system of organism and
environment in interaction. Similarly, when thinking about cognition,
it is a mistake to focus narrowly on hypothesized functional adaptations
of the brain. It is commonly assumed that genetic adaptations must
produce a brain that is capable of the hypothesized new functional
abilities. What evolves, however, is not the brain alone, but the system
of brains, bodies, and shared environments in interaction. Cultural
practices are as much a part of the story of cognitive evolution as are
changes in brain structure. This means that important milestones in
cognitive evolution could, in principle, have been achieved without any
particular genetic adaptation being associated with them. A change in
physical environment, for example, could lead to changes in interactive
processes that could give rise to a new cognitive ability in the interaction
system. This goes even for critical milestones such as the advent of
symbolic representations. Once a new functional capacity arises in
the interaction system, it creates new opportunities for change in the
genome. This argument does not deny the role of genetic change; it
only points out that the genome is but one of many elements of a
complex adaptive system.
This is not to say that thinking and imagining never happen in the
absence of a material world, for clearly they do. But it does say that such
processes are different in nature than thinking with the world, that they
are derived from (transformations of) processes that do involve action
with the world, and they generally appear later developmentally (both
ontogenetically and phylogenetically) than thinking with the world.
The last point is a key component of Vygotsky’s (1986) theory of the
social origins of mind. The kind of thinking that has been the focus of
cognitive science and psychology is likely a relatively recent add-on to
a more fundamental, but, as yet, poorly understood mode of thinking
with the world. No one knows the relative frequencies of thinking in
these different modes or thinking across behavioral settings. And we
know even less about the distribution of the ways of thinking that lie
along the continuum between completely mental activity and thinking
that is inextricably bound up with action in the world.

Understanding Interaction
Human minds did not evolve in isolation, each wrapped tightly in a
thick skull and thereby insulated from the complexities of the body and
the world. We know that the brain takes advantage of minute details
of the body and the body’s interaction with the physical environment
(Clark 2001; Quartz and Sejnowski 2002). Similarly, mind will have
evolved, not in isolation from the material and social world, but in
ways that weave its activity inextricably into the details of those worlds
(Tomasello 2001).
If distributed cognition presents us with a world in which everything
is seemingly connected to everything else, does not studying cognition
become impossible? I think it certainly becomes more difficult.
Understanding complex real-world interactions is more difficult than
understanding systems of simple linear relations. However, in some ways,
more complex problems can be easier to solve than what seem to be
simpler problems. If the nature of the problem is to constrain behavior,
a system of multiple interacting subsystems can provide a solution more
easily than trying to get all of the constraints out of a single subsystem.
For example, it is easier to account for the organization of the visual
system if one recognizes that it develops in concert with the auditory
system than it is to account for the organization of either system in
isolation (de Sa and Ballard 1998). Such findings are part of a wider
shift in the cognitive sciences toward an increasing appreciation for
rich interactions among systems at all levels of organization. People
in normal interaction are in the business of creating and interpreting
rich multimodal meaning complexes. Here again, sometimes solving
what looks like a more complex problem is easier than solving what
looks like a simpler problem. It is easier to work out the significance of
complex multiply constrained acts of meaning than it is to determine
the meanings of the individual components as isolated systems. It is
easier to establish a meaning for words embedded with gestures that
are performed in coordination with a meaningful shared world than it
is to establish meanings for words as isolated symbols.
Thus, when we approach the more complex objects of scientific
scrutiny demanded by distributed cognition theory, it is not the case that
explanations will necessarily be more difficult to create. They may be
somewhat more complex than easy linear and modular stories, but in
some cases, the explanations come naturally as side effects or by-products
of general principles. One example is the development of a shared
lexicon mentioned above.

Conclusion
By softening the traditional disciplinary boundaries, the distributed
cognition perspective focuses on a new unit of analysis that encloses
a complex set of interactions among brain, body, and culturally
constructed world. Careful attention to the microstructure of interaction
from the distributed cognition perspective leads to a reconceptualization
of the individual–environment relationship and suggests that this newly
conceived relation has important implications for the way we confront
many sorts of cognitive and anthropological problems. In particular,
it provides a new place to look for mechanisms that shape both the
ontogenetic and the phylogenetic development of sociality.

Acknowledgment
This work was funded by a grant from the Santa Fe Institute’s program
on robustness in natural and social systems, which is supported by the
McDonnell Foundation. Alisa Durán transcribed the data and suggested
many elements of the analysis presented here.

References
Alač, M., and E. Hutchins. 2004. I see what you are saying: Action as
cognition in fMRI brain mapping practice. Journal of Cognition and
Culture 4(3–4):629–661.
Bartlett, F. 1995[1932]. Remembering: A study in experimental and social
psychology, 2nd edition. Cambridge: Cambridge University Press.
Bateson, G. 1972. Steps to an ecology of mind. New York: Ballantine
Books.
Braitenberg, V. 1984. Vehicles: Experiments in synthetic psychology.
Cambridge, MA: MIT Press.
Bruner, J., R. Olver, P. Greenfield, et al. 1966. Studies in cognitive growth:
A collaboration at the center for cognitive studies. New York: John Wiley
and Sons.
Clark, A. 2001. Mindware: An introduction to the philosophy of cognitive
science. Oxford: Oxford University Press.
Cole, M., and M. Griffin. 1980. Cultural Amplifiers Reconsidered. In
The social foundations of language and thought. Essays in honor of Jerome
Bruner, edited by D. Olson, 343–364. New York: Norton.
de Sa, V., and D. H. Ballard. 1998. Category learning through
multimodality sensing. Neural Computation 10(5):1097–1117.
Douglas, M. 1986. How institutions think. Syracuse, NY: Syracuse
University Press.
Enfield, N. J. 2005. The body as cognitive artifact in kinship
representations: Hand gesture diagrams by speakers of Lao. Current
Anthropology 46(1):51–73.
Goodwin, C. 1994. Professional vision. American Anthropologist 96(3):
606–633.
Halbwachs, M. 1925. Les cadres sociaux de la mémoire. Paris: Albin
Michel.
Hazlehurst, B., and E. Hutchins. 1998. The emergence of propositions
from the coordination of talk and action in a shared world. Language
and Cognitive Processes 13(3):373–424.
Henty, S. 1999. A computational simulation of jury decision making.
Honors thesis, Department of Cognitive Science, University of
California, San Diego.
Hollan, J. D., E. Hutchins, and D. Kirsh. 2001. Distributed cognition:
A new foundation for human-computer interaction research. ACM
Transactions on Computer-Human Interaction: Special Issue on Human-
Computer Interaction in the New Millennium 7(2):174–196.
Hutchins, E. 1995a. Cognition in the Wild. Cambridge, MA: MIT Press.
——. 1995b. How a cockpit remembers its speeds. Cognitive Science
19:265–288.
——. 1999. Cognitive artifacts. In The MIT Encyclopedia of the Cognitive
Sciences, edited by R. A. Wilson and F. C. Keil, 126–128. Cambridge,
MA: MIT Press.
——. 2000. The cognitive consequences of patterns of information flow.
Intellectica 1(30):53–74.
Hutchins, E., and B. Hazlehurst. 1991. Learning in the cultural process.
In Artificial life 2, Santa Fe Institute studies in the sciences of complexity
series, edited by C. Langton, C. Taylor, J. D. Farmer, and S. Rasmussen,
689–706. Santa Fe: Santa Fe Institute.
——. 1995. How to invent a lexicon: The emergence of shared form-
meaning mappings in interaction. In Social intelligence and interaction,
edited by E. Goody, 53–67. Cambridge: Cambridge University
Press.
——. 2002. Auto-organization and emergence of shared language
structure. In Simulating the evolution of language, edited by A. Cangelosi
and D. Parisi, 279–306. London: Springer Verlag.
Hutchins, E., and T. Klausen. 1996. Distributed cognition in an airline
cockpit. In Cognition and communication at work, edited by Y. Engeström
and D. Middleton, 15–34. New York: Cambridge University Press.
Hutchins, E., and L. Palen. 1997. Constructing meaning from space,
gesture, and speech. In Discourse, tools, and reasoning: Essays on situated
cognition, edited by L. B. Resnick, R. Saljo, C. Pontecorvo, and B.
Burge, 23–40. Heidelberg: Springer Verlag.
Luria, A. R. 1966. Higher cortical functions in man. New York: Basic
Books.
McNeill, D. 1992. Hand and mind: What gestures reveal about thought.
Chicago: University of Chicago Press.
Middleton, D., and D. Edwards (eds.). 1990. Collective remembering.
London: Sage.
Newell, A., and H. Simon. 1972. Human problem solving. Englewood
Cliffs, NJ: Prentice-Hall.
Norman, D. 1994. Things that make us smart: Defending human attributes
in the age of the machine. Boston, MA: Addison-Wesley.
Oyama, S. 2000. Evolution’s eye: A systems view of the biology-culture divide.
Durham, NC: Duke University Press.
Quartz, S., and T. Sejnowski. 2002. Liars, lovers, and heroes: What the
new brain science reveals about how we become who we are. New York:
Morrow.
Roberts, J. M. 1964. The self-management of cultures. In Explorations
in cultural anthropology, edited by W. H. Goodenough, 433–454. New
York: McGraw-Hill.
Rumelhart, D., P. Smolensky, J. McClelland, and G. Hinton. 1986.
Schemata and sequential processes in PDP models. In Parallel
distributed processing, vol. 2, edited by J. McClelland and D. Rumelhart,
7–57. Cambridge, MA: MIT Press.
Surowiecki, J. 2004. The wisdom of crowds. New York: Doubleday.
Tomasello, M. 2001. The cultural origins of human cognition. Cambridge,
MA: Harvard University Press.
Turner, J. 2000. The extended organism. Cambridge, MA: Harvard
University Press.
Vygotsky, L. S. 1986. Thought and language. Cambridge, MA: MIT
Press.
fifteen

Social Consequences of Common Ground
N. J. Enfield

The pursuit and exploitation of mutual knowledge, shared expectations,
and other types of common ground (Clark 1996; Lewis 1969;
Smith 1980)1 not only serves the mutual management of referential
information, but has important consequences in the realm of social,
interpersonal affiliation. The informational and social-affiliational
functions of common ground are closely interlinked. I argue in this
chapter that the management of information in communication is
never without social consequence and that many of the details of
communicative practice are therefore dedicated to the management
of social affiliation in human relationships.
Common ground constitutes the open stockpile of shared presumption
that fuels amplicative inference in communication (Grice 1989),
driven by intention attribution and other defining components of the
interaction engine (Levinson 1995, 2000, this volume). Any occasion of
“grounding” (i.e., any increment of common ground) has consequences
for future interaction of the individuals involved, thanks to two
perpetually active imperatives for individuals in social interaction. The
informational imperative compels individuals to cooperate with their
interactional partners in maintaining a common referential understanding,
mutually calibrated at each step of an interaction’s progression. Here,
common ground affords economy of expression. The greater our
common ground, the less effort we have to expend to satisfy the
informational imperative. Second (but not secondary), the affiliational
imperative compels interlocutors to maintain a common degree of
interpersonal affiliation (trust, commitment, intimacy), proper to the
status of the relationship, and again mutually calibrated at each step of
an interaction’s progression. In this second dimension, the economy
of expression enabled by common ground affords a public display of
intimacy, a reliable indicator of how much is personally shared by a
given pair (trio, n-tuple) of interactants. In these two ways, serving the
ends of informational economy and affiliational intimacy, to increment
common ground is to invest in a resource that will be drawn on later,
with interest.

Sources of Common Ground


A canonical source of common ground is joint attention, a unique
human practice that fuses perception and inferential cognition (Moore
and Dunham 1995; Tomasello 1999, this volume). In joint attention,
two or more people simultaneously attend to a single external stimulus,
together, each conscious that the experience is shared. Figure 15.1
illustrates a typical, everyday joint attentional scene.

Figure 15.1. Joint attention on washing machine console.

In this example, the fact that a washing machine is standing in front
of these women is incontrovertibly in common ground thanks both to
its physical position in the perceptual field of both interactants and to its
operating panel being the target of joint attentional hand gestures (Kita
2003; Liszkowski this volume). But common ground is also there when
it is not being signaled or otherwise manifest directly. At a personal
level, the shared experiences of interactants are in common ground as
long as the interactants know (and remember!) they were shared. At
a cultural level, common ground may be indexed by signs of ethnic
identity, and the common cultural background such signs may entail.
One such marker is native dialect (as signaled, e.g., by accent), a readily
detectable and reliable indicator of long years of common social and
cultural experience (Nettle and Dunbar 1997; Nettle 1999). Suppose I
begin a conversational exchange with a stranger of similar age to myself,
who, like me, is a native speaker of Australian English. We will each
immediately recognize this common native origin from each other’s
speech, and then I can be pretty sure that my new interlocutor and I
will share vast cultural common ground from at least the core years
of our linguistic and cultural socialization (i.e., our childhoods, when
our dialects were acquired). We will mutually assume, for instance,
recognition of expressions like fair dinkum, names like Barry Crocker,
and possibly even sporting institutions like the Dapto Dogs.

Common Ground as Fuel for Gricean Amplicative Inference
Common ground is a resource that speakers exploit in inviting and
deriving pragmatic inference, as a way to cut costs of speech production
by leaving much to be inferred by the listener. As Levinson (2000)
points out, the rate of transfer of coded information in speech is slow,
thanks to our articulatory apparatus. Psychological processes run much
faster. This bottleneck problem is solved by the amplicative properties
of pragmatic inference (Levinson 2000; cf. Grice 1975). Interpretative
amplification of coded messages feeds directly on the stock of common
ground, in which we may include a language’s semantically coded
linguistic categories (lexicon, morphosyntax), a community’s set of
cultural practices and norms (Levinson 1995:240; Enfield 2002:234–236),
and shared personal experience. (This implies different categories
of social relationship, defined in part by amount and type of common
ground: e.g., speakers of our language, people of our culture, and
personal associates of various types; see below.) The more common
ground we share, the less constrained we are in communication. Hanks
(1989:118) captures this notion in his “principle of relative symmetry”
(see also Kockelman 2005:289 and passim on “symmetry of attitudes”),
by which greater common ground licenses a greater range of semiotic
possibilities for referential differentiation.
This logic of communicative economy—intention attribution via
inference fed by common ground—is complemented by the use of
convention to simplify problems of social coordination (Clark 1996;
Lewis 1969; Schelling 1960). Although we have access at all times to
the powerful higher-order reasoning that makes common ground and
intention attribution possible, we keep cognition frugal by assuming
defaults where possible (Gigerenzer et al. 1999; Sperber and Wilson 1995;
cf. Barr and Keysar 2004). So, if tomorrow is our weekly appointment
(midday, Joe’s) we do not have to discuss where and when to meet.
The hypothesis that we will meet at Joe’s at midday has been tested
before,2 and confirmed. And we further entrench the convention by
behaving in accordance with it (i.e., by turning up at Joe’s at midday
and finding each other there).
Consider a simple example from everyday interaction in rural Laos,
which illustrates common ground from both natural and cultural
sources playing a role in inference making. Figure 15.2 is from a video
recording of conversation among speakers of Lao in a lowland village
near Vientiane, Laos. (The corners of the image are obscured by a
lens hood.) The image shows a woman (foreground, right; hereafter,
Foreground Woman [FW]) who has just finished a complex series of
preparations to chew betel nut, involving various ingredients and tools
kept in the basket visible in the lower foreground. In this frame, FW
is shifting back, mouth full with a betel nut package, having finished
with the basket and placed it aside, to her left.
Immediately after this, the woman in the background, at far right
(Background Woman [BW]), moves forward, to reach in the direction
of the basket, as shown in Figs 15.3a, b.
BW’s forward-reaching action gives rise to an inference by FW that
BW wants the basket.3 We can tell FW has made this inference from
the fact that she grasps the basket and passes it to BW in Fig. 15.4. And
we can tell, in addition, from what she says next, in line 1 of (1), that
she infers BW wants to chew betel nut (the numbers at the end of each
Lao word mark lexical tone distinctions).
(1) 1 FW  caw4  khiaw4  vaa3
          2sg   chew    q
          "You chew?"

    2 BW  mm5
          mm
          "Mm." (i.e. "Yep.")


Figure 15.2. Conversation among Lao speakers, lowland Laos. Foreground
Woman shifts back having finished preparations to chew betel nut (in
basket, lower foreground).

Figure 15.3. Background Woman moves forward to reach in direction
of basket (lower foreground).

Figure 15.4. Foreground Woman passes basket to Background Woman,
inferring the goal of her reaching forward.

FW infers more than one thing from the forward-reaching action of
BW shown in Fig. 15.3. It would seem hardly culture specific that BW
is taken to be wanting the basket. (But an inference or projection is
nevertheless being made; after all, she may have wanted to rub a spot
of dirt off the floor where the basket was sitting.) More specific to the
common ground that comes with this cultural setting, BW’s reaching
for the basket is basis for an inference that she wants to chew betel
nut (and not, for instance, that she wants to reorganize the contents
of the basket, or tip it out, or put it away, or spit into it). The inference
that BW wants to chew betel nut is made explicit in the proposition in
line 1 “you chew.” The added sentence-final “evidential interrogative”
particle vaa3 (Enfield n.d.) makes explicit, in addition, that it is an
inference. The particle vaa3 encodes the notion that an inference has
been made, and seeks confirmation that this inference is correct: that
is, in a sequence X vaa3, the meaning of vaa3 can be paraphrased along
the lines “Something makes me think X is the case, you should say
something now to confirm this.” BW responds appropriately with a
minimal spoken confirmation in line 2.
The two inferences made in this example—one, that BW’s forward
movement indicates that she wants to take hold of something in front
of her, and two, that she wants to have the basket to chew betel nut—are
launched from different types of categorical knowledge (although they
are both based on the attribution of intention through recognition of
an agent’s “attitude”; Mead 1934, see also Kockelman 2005). The first
is a general stock of typifications determined naturally, essentially by
biology: naive physics, parsing of motor abilities (Byrne this volume),
frames of interpretation of experience arising through terrestrial fate
(Levinson 1997:28). A second basis for inference is the set of categories
learned in culture—here, from the fate of being born in a Lao-speaking
community, and acquiring the frames, scripts, and scenarios (Schank
and Abelson 1977) of betel-nut chewing among older ladies in rural
Laos (e.g., that betel paraphernalia is “free goods” that any middle-
aged or older woman may reach for in such a setting—had a man or a
child made the same reaching action here, they would not have been
taken to be embarking on a betel-chewing session). Both these types
of knowledge are in the common ground of these interlocutors, in the
strict sense of being information openly shared.

Grounding for Inferring: The Informationally Strategic Pursuit of Common Ground
Links between joint attention, common ground, and pragmatic inference
suggest a process of grounding for inferring, by which the requirements
of human sociality direct us to tend—while socializing—to dimensions
of common ground that may be exploited in later socializing.4 This
formulation highlights the temporality of the connection between
grounding (i.e., securing common ground) and inferring. Grounding
is an online process (enabled by joint attention). Later inferring based
on common ground presupposes or indexes the earlier establishment
of that common ground (or indexes a presumption of that common
ground, based on some cue, such as a person’s individual identity, or
some badge of cultural or subcultural identity).
Grounding for inferring takes place at different levels of temporal
grain—that is, with different time lags between the point of grounding
and the point of drawing some inference based on that grounding. At a
very local level, it is observable in the structure of reference management
through discourse (Fox 1987). Canonically, a referent’s first mention is
done with a full noun phrase (e.g., a name or a descriptive reference), with
subsequent mentions using a radically reduced form (such as a pronoun;
recorded example from Fox 1987:20, transcription simplified):
(2) A: Did they get rid of Kuhleznik yet?
B: No in fact I know somebody who has her now.

Forms like her do not identify or describe their referent. Their reference
must be retrieved by inference or other indexical means. This is
straightforward when a full form for the antecedent is immediately prior,
as in (2). But if you miss the initial reference, lacking the common ground
required for inferring what her must be referring to, you might be lost.
Without the benefit of informative hand gestures or other contextual
cues, you are likely to have to disrupt the flow of talk by asking for
grounding, to be able to make the required referential inferences.
At a step up in temporal distance between grounding and payoff
are forward-looking “setups” in conversational interaction (Jefferson
1978; Sacks 1974; cf. Goodwin’s “prospective indexicals”; Goodwin
1996:384), which, for instance, alert listeners to the direction in which a
speaker’s narrative is heading. When I say Her brother is so strange, let me
tell you what he did last week, you as listener will then need to monitor
my narrative for something that is sufficiently strange to count as the
promised key illustration of her brother’s strangeness, and thus the
punch line. What constitutes “her brother’s strangeness” is “not yet
available to recipients but is instead something that has to be discovered
subsequently as the interaction proceeds” (Goodwin 1996:384). When
you hear what you think is this punch line, you will likely surmise that
the story is at completion. Your response will be shaped by a second
function of the prospective expression, namely, as a forewarning of the
appropriate type of appraisal that the story seeks as a response or receipt.
So, He’s so strange, let me tell you. . . will rightly later elicit an appreciation
that is fitted to the projected assessment; for example, Wow, how strange.
Setup expressions of this kind are one type of grounding for inferring,
with both structural-informational functions (putting in the open the
fact that the speaker is engaged in a sustained and directed activity
of telling—e.g., “how strange her brother is”), and social-affiliational
functions (putting in the open the speaker’s stance toward the narrated
situation, which facilitates the production of an affiliative, or at least fitted,
response). Both these functions help constrain a listener’s subsequent
interpretation as appropriate to the interaction, at a discourse level.
All the way at the other end of the scale in temporal distance between
grounding and its payoff are those acts of building common ground
that look ahead into the interactional future of the people involved.
At a personal level, our efforts to maintain and build common ground
have significant consequences for the type of relationship we succeed in
ongoingly maintaining, that is, whether we are socially close or distant
(see below). At a cultural level, in children’s socialization we spend a lot
of time explaining and acting out for children “what people do,” “what
people say,” and “how things are.” This builds the cultural common
ground that will soon streamline an individual’s passage through the
moment-by-moment course of their social life.

Semiotics: Cognition and Perception, Structure and Emergence
A matter of some contention in the discussions documented in this
volume is the degree of involvement of higher-order cognition in social
interactional processes. Despite currency of the term “mind reading”
and its variants in literature on social intelligence (Baron-Cohen 1995;
Carruthers and Smith 1996; inter alia; cf. Astington this volume), we
cannot read each other’s minds. Miller wrote, “One of the psychologist’s
great methodological difficulties is how he can make the events he
wishes to study publicly observable, countable, measurable” (1951:3).
This problem for the psychologist is a problem for the layperson too.
In interaction, normal people need, at some level, to be able to model
each other’s (evolving and contingent) goals, based solely on perceptible
information, by attending to one another’s communicative actions and
displays (Mead 1934). A no-telepathy assumption means that there is
“no influencing other minds without mediating artifactual structure”
(Hutchins and Hazlehurst 1995). As a result, semiosis—the interplay of
perception and cognition, rooted in ethology and blossoming in the
modern human mind—is a cornerstone of human sociality (Kockelman
2005; Peirce 1965). Humans augment the ethologically broad base of
iconic and indexical meaning with symbolic structures and higher-order
processes of intention attribution.
So if action and perception are the glue in human interaction, higher-
order cognition is the catalyst. I see this stance as a complement, not
an alternative, to radically interactionist views of cognition (cf. Molder
and Potter 2005). Authors like Norman (1991), Hutchins (1995), and
Goodwin (1994, 1996) are right to insist that the natural exercising
of cognition is in distributed interaction with external artifacts. And
we must add to these artifacts our bodies (Enfield 2005; Goodwin
2000; Hutchins and Palen 1993) and our social associates (Goodwin
this volume). Similarly, the temporal-logical structures of our social
interactions are necessarily collaborative in their achievement (Clark
1996; Schegloff 1982), as may be our very thought processes (Goody 1995,
inter alia; Mead 1934; Rogoff 1994; Vygotsky 1962). But as individuals,
we each physically embody and transport with us the wherewithal to
move from scene to scene and still make the right contributions. We
store cognitive representations (whether propositional or embodied)
of the conventional signs and structures of language, of the cultural
stock of conventional typifications that allow us to recognize what
is happening in our social world (Schutz 1970), and of more specific
knowledge associated with our personal contacts. And we have the
cognitive capacity to model other participants’ states of mind as given
interactions unfold (Mead 1934).
Accordingly, here is my rephrasing of Miller (with a debt to Schutz
1970 and Sacks 1992): One of the man in the street’s great methodological
difficulties is how he can understand (and make himself understood to) his
social associates solely on the basis of what is publicly observable. Any model
of multiparty interaction will have to show how the combination of a
physical environment and a set of mobile agents will result in emergence
of the structures of interactional organization that we observe. It will
also have to include descriptions of the individual agents, their internal
structure and local goals. General capacities of social intelligence
and specific values of common ground will have to be represented
somewhere in those individual minds. Then, in real contexts, what is
emergent can emerge.
So, human social interaction not only involves cognition, it involves
high-grade social intelligence (Goody 1995; allowing that it need not
always involve it—Barr and Keysar 2004). And in line with a number of
other contributors to this volume who resist the overuse or even abuse
of mentalistic talk in the analysis of social interaction, it is clear that
intention attribution is entirely dependent on perception in a shared
environment (see esp. Byrne this volume, for his “heretical thought”;
Danziger this volume; Goodwin 2000, this volume; Hutchins 1995:
ch. 9, this volume; Schegloff 1982:73). Both components—individual
cognition and emergent organization—are absolutely necessary (see
the introduction to this volume). Human social interaction would not
exist as we know it without the cocktail of individual, higher-order
cognition and situated, emergent, distributed organization. A mentalist
stance need therefore not be at the expense of the critically important
emergence of organization from collaborative action in shared physical
context, above and beyond any individual’s internally coded goals. To
be sure, there remain major questions as to the relative contribution
of individual cognition and situated collaborative action in causing
the observed organization of interaction. But however you look at it,
we need both.
Audience Design
Equipped with higher-order inferential cognition, an interlocutor (plus
all the other aspects of one’s interactional context), and a stock of
common ground, a speaker should design his or her utterances for
that interlocutor (Clark 1996; Sacks 1992; Sacks and Schegloff 1979;
Schegloff 1997). If we are to optimize the possibility of having our
communicative intentions correctly recognized, any attempt to make the
right inferences obvious to a hearer will have to take into account the
common ground defined by the current speaker–hearer combination. In
ordinary conversation, there is no generic, addressee-general, mode of
message formulation. To get our communicative intentions recognized,
we ought to do what we can to make them the most salient solutions
to the interpretive problems we foist on our hearers. The right ways to
achieve this will be determined in large part by what is in the common
ground, and this is by definition a function of who is being addressed
given who it is they are being addressed by. Because Gricean implicature
is fundamentally audience driven (whereby formulation of an utterance
is tailored by how one expects an addressee will receive it), to do audience
design is to operate at a yet higher level than mere intention attribution.
It entails advance modeling of another’s intention attribution. 5
Consider an example that turns on highly local common ground. Fig.
15.5 shows two men sitting inside a Lao village house, waiting while
lunch is prepared in an outside kitchen.
At the moment shown in Fig. 15.5, a woman’s voice can be heard
(coming from the outside kitchen verandah, behind the camera, left
of screen) as follows:
(3) mòòt4 nam4 haj5 nèè1
extinguish water benefactive please
“Please turn off the water for (me).”

In making this request, the speaker does not explicitly select an
addressee. Anyone in earshot is a potential addressee. Within a second
or two, the man on the left of frame gets up and walks to an inside wall
of the house, where he flicks an electric switch (Fig. 15.6).

Figure 15.5. Two men waiting for lunch to be served, lowland Laos. Woman in
kitchen (out of frame) is calling out “Please turn off the water!”

Consider the mechanism by which the utterance in (3) brought
about this man’s compliance. Although the woman’s call in (3) was
not explicitly addressed to a particular individual, it was at the very
least for someone who was in hearing range and knew what compliance
with the request in (3) entailed. Although relative social rank of hearers
may work to narrow down who is to carry out the request, it remains
that the utterance in (3) could not be intended for someone who lacks
the common ground, that is, who does not know what “turning off
the water” involves. The switch that controls an outside water pump
is situated at the only power outlet in the house, inside, far from the
kitchen verandah. To respond appropriately to the utterance in (3), an
addressee would need this inside knowledge of what “turning off the
water” entails. Without it, one might not even realize that the addressee
of (3) is someone (anyone) inside the house. But it is in the common
ground for the people involved in this exchange. They are neighbors of
this household, daily visitors to the house. The woman outside on the
verandah knows that the people inside the house know (and know that
they are known to know!) the routine of flicking that inside switch to
turn the outside water pump on and off. This enables the success of the
very lean communicative exchange consisting of the spoken utterance
in (3) and the response in Fig. 15.6.
Much is inferred by the actor in Fig. 15.6 beyond what is encoded in
the spoken message in (3), in the amplicative sense outlined above. In
addition, this example illustrates a defining feature of common ground
information, namely that people cannot deny possessing it.6 The man
on our left in Fig. 15.5—who is situated nearest the switch—might not
feel like getting up, but he could not use as an excuse for inaction a
claim that he does not know what the speaker in (3) wants (despite the
fact that nothing in her utterance makes this explicit).

Figure 15.6. Man gets up to turn off switch of electric water pump.

The principle of audience design dovetails with common ground,
because both are defined by a particular social relationship between
particular interlocutors. As prefigured above, the general imperative
of audience design is served by two, more specific imperatives of
conversation. I described one of these—the informational imperative—as
the cooperative struggle to maintain common referential understanding,
mutually calibrated at each step of an interaction’s trajectory (Clark
1996; Schegloff 1992). This will be satisfied by various means including
choice of language spoken, choice of words, grammatical constructions,
gestures, and the various devices for meeting “system requirements”
for online alignment in interaction (mechanisms for turn organization,
signals of ongoing recipiency, correction of errors and other problems,
etc.; Goffman 1981:14; Schegloff this volume). Less well understood
are the “ritual” requirements of remedial face work, and the need to
deal with “implications regarding the character of the actor and his
evaluation of his listeners, as well as reflecting on the relationships
between him and them” (Goffman 1981:21; cf. Goffman 1967, 1971).
We turn now to those.

The Affiliational Imperative in Social Interaction


Any time one is engaged in social interaction, one’s actions are of real
consequence to the social relationship currently being exercised. If
you are acting too distant, or too intimate, you are most likely going
to be held accountable for it. Heritage and Atkinson (1984:6) write that
there is “no escape or time out” from the consequences of interaction’s
sequential, contextual nature. Similarly, there is no escape or time out
from the social–relational consequences of interaction. Just as each little
choice we make in communicative interaction can be assessed for its
optimality for information exchange, it can equally be assessed for its
optimality for maintaining (or forging) the current social relationship
at an appropriate level of intensity or intimacy. The management of
common ground is directly implicated in our perpetual attendance to
managing personal relationships within our social networks. Next, I
elaborate some mechanisms by which this is achieved, but first I want
to flesh out what is meant by degrees of intimacy or intensity in social
relationships.
One of the key tasks of navigating social life is maintaining positions
in social networks, where relationships between individuals are carried
through time, often for years on end. There are logical constraints on
the nature of an individual’s network of relationships thanks to an
inverse relationship between time spent interacting with any individual,
and number of individuals with whom one interacts. We have only so
much time in the day, and sustained relationships cannot be multiplied
beyond a certain threshold (cf. grooming among primates; Dunbar
1993, 1996). Spending more time interacting with certain individuals
means more opportunities to increment common ground with those
individuals, by virtue of the greater opportunity to engage in joint
attentional activity such as conversation. This results in greater access
to amplicative inference in communication. A corollary is having less
time to interact with others, and thus less chance to increment common
ground through personal contact with those others, and, in turn, less
potential to exploit amplicative inference in communication with
them.
Such considerations of the logical dynamics of time and social group
size have been taken to suggest inherent biases in the organization of
social network structure (Dunbar 1998; Dunbar and Spoors 1995; Hill
and Dunbar 2003). Hill and Dunbar suggest that social networks are
“hierarchically differentiated, with larger numbers of progressively less
intense relationships maintained at higher levels” (2003:67; cf. Dunbar
1998). They propose a model with inclusive levels (Hill and Dunbar
2003:68; note that they also discuss groupings at higher levels than
this):
(4) Level of relationship intensity    Approximate size of group
    support clique                     7
    sympathy group                     21
    band                               35
    social group                       150

What defines membership in one or other of these levels? As with
physical grooming among primates, those who I spend more time with
in committed engagement will tend to be those who I can later rely on
in times of trouble (and, similarly, to whom I will be obliged to offer
help if needed). In some societies this will be somewhat preordained
(e.g., by kin or equivalently fixed social relations), whereas in other types
of societies people may be more freely selective (as in many modern
urban settings). For humans, unlike in primitive physical grooming,
such rounds of engagement are intertwined with the deployment of
delicate and sophisticated symbolic structure (language), and so it is
not (just?) a matter of how long we spend interacting with whom, but of
what kind of information is traded and thereby invested in common ground.
This is why in one type of society I might have a more intensive, closer
relationship with my best friend, even though I see very much less of
him than my day-to-day professional colleagues.
Cultures will differ with respect to the determination of relationship
intensity (quantitatively and qualitatively defined), and the practices by
which such intensity is maintained and signaled. Hill and Dunbar suggest
that a hierarchical structure of social relatedness like (4), above, will be
maintained in more or less any cultural setting, but the qualitative basis
for distinction between these levels in any given culture will be “wholly
open to negotiation” (i.e., by the traditions of that culture; 2003:69).
They cite various types of social practice that may locally define the
relevant level of relationship: those from whom we get our “hair care”
(among the !Kung San; Sugawara 1984), those “whose death would be
personally devastating” (Buys and Larson 1979), those “from whom
one would seek advice, support, or help in times of severe emotional or
financial stress” (Dunbar and Spoors 1995), those to whom we would
send Christmas cards (Hill and Dunbar 2003; the other citations in
this sentence are also from Hill and Dunbar 2003:67). An important
empirical project is the investigation of commonality and difference in
how people of different cultures mark these social distinctions through
interactional practice (regardless of whether membership in different
levels of relationship intensity in a given setting is socioculturally
predetermined, or selected by individuals’ preference).
Practices concerned with the management of common ground for
strategic interactional purposes provide, I suggest, an important kind
of data for assessing Hill and Dunbar’s proposal. Given the “no time
out” nature of everyday interaction, we may better look to practices
that are very much more mundane and constant in the lives of regularly
interacting individuals than, say, annual gestures like the Anglo Christmas
Card. To this end, I want to draw a key link, so far entirely unseen in the
literature, it seems, between the line of thinking exemplified by Hill and
Dunbar (2003), and a strand of work arising from research within corners
of sociology on conversation and other types of interaction, rooted in
the work of Sacks and associates on “social membership categorization”
(cf. Sacks 1992; see also Garfinkel and Sacks 1970; Schegloff in press b).
In a review of this work, Pomerantz and Mandelbaum (2005) outline
four types of practice in U.S. English conversation by which people
“maintain incumbency in complementary relationship categories, such
as friend–friend, intimate–intimate, father–son, by engaging in conduct
regarded as appropriate for incumbents of the relationship category and
by ratifying appropriate conduct when performed by the cointeractant”
(Pomerantz and Mandelbaum 2005:160):
(5) Four sets of practices for maintaining incumbency in more intensive/
intimate types of social relationships (derived from Pomerantz and
Mandelbaum 2005):
◻ “Inquiring about tracked events and providing more details on one’s
own activities”: reporting and updating on events and activities
mentioned in previous conversations; eliciting detailed accounts,
demonstrating special interest in the details; attending to each
other’s schedules and plans; and so forth (Drew and Chilton 2000;
Morrison 1997).
◻ “Discussing one’s own problems and displaying interest in the
other’s problems”: claiming the right to (and being obliged to) ask
and display interest in each other’s personal problems; showing
receptivity to such discussion; and so forth (Cohen 1999; Jefferson
and Lee 1980).
◻ “Making oblique references to shared experiences and forwarding
the talk about shared experiences”: one party makes minimal
reference to past shared experience (e.g., John says Remember Mary’s
brother?), and the other displays their recognition of it, takes it up
and forwards it in the conversation (Fred responds Oh God, he’s so
strange, what about when he. . .), thereby demonstrating the common
ground (Lerner 1992; Mandelbaum 1987; Maynard and Zimmerman
1984; cf. Enfield 2003).
◻ “Using improprieties and taking up the other’s improprieties by
using additional improprieties and/or laughter”: cussing and other
obscenities; laughter in response to such improprieties; shared
suspension of constraints usually imposed by politeness (Jefferson
1974).
At least the first three of these cases are squarely concerned with the
strategic manipulation of information—the incrementing, maintaining,
or presupposing of common ground—with consequences for the
relationship and for its maintenance. 7 These are important candidates
for local, culturally variant practices for maintaining social membership
in one or another level (the examples in (5) being all definitive of “closer”
relationships). Whether these are universal is an empirical question. It
requires close analysis of social interaction based on naturally occurring,
informal conversation across cultures and in different types of social–
cultural systems.
I now want to elaborate with further examples of social practices
from specific cultural settings that show particular attention to the
maintenance of social relationships at various levels. In line with the
theme of the chapter, they concentrate on the management of, or
presupposition of, common ground, with both informational payoffs
and social-affiliational payoffs.
A first example, from Schegloff (in press a), is a practice that arises in
the cultural context of Anglo-American telephone calls (at least before
the era of caller ID displays). It hinges on the presumption that people
in close social relationships should be able to recognize each other by
a minimal voice sample alone. Here is an example:
(6) 1 ((ring))
2 Clara: Hello
3 Agnes: Hi
4 Clara: Oh hi, how are you Agnes
This typical case displays an exquisite minimality and efficiency,
which puts on mutual display to the interlocutors the intimacy of their
relationship, thanks to the mutual presumption of person recognition
based on minimal information. In line 1, Clara hears the phone ring.
When she picks up, in line 2, she does not identify herself by saying
who she is. She gives a voice sample carried by the generic formula
hello. If the caller is socially close enough to the callee, he or she will
recognize her by her voice (biased by expectation, given that one usually
knows who one is calling). On hearing this, Schegloff explains, by
supplying the minimal greeting response Hi in line 3, the caller “claims
to have recognized the answerer as the person they meant to reach.”
(Otherwise—i.e., if the caller did not recognize the answerer—he or she
would have to ask, or at least ask for confirmation; e.g. Clara?.) At the
same time, the caller in line 3 is reversing the direction of this minimal-
identifying mechanism, providing “a voice sample to the answerer from
which callers, in effect, propose and require that the answerer recognize
them.” In this seamless and lightning-fast exchange, these interactants
each challenge the other to recognize them given the barest minimum of
information, and through the course of the exchange each of them
claims to have achieved that recognition. (Clara not only claims but
demonstrates recognition by producing Agnes’s name in line 4.) Were
they not to recognize who was calling on the basis of a small sample of
speech like hi—which, after all, was produced on the presumption that
the quality of the voice should be sufficient for a close social associate
to identify the person—they would pay a social price of disaffiliation
via a betrayal of distance and lack of intimacy (What? You don’t recognize
me?!; cf. Schegloff in press b).
Consider a second example, another practice by which social
interactants identify persons. In English, when referring to a nonpresent
person in an informal conversation, a speaker may choose whether to
use bare first name (John) as opposed to some fuller name (John Smith) or
description (my attorney, Bill’s brother, that guy there; Sacks and Schegloff
1979, Enfield and Stivers in press). The choice depends on whether it is
in speaker and addressee’s common ground who “John” is and whether
he is openly known to this speaker–addressee pair as John. The choices
we make will, in general, reflect the level of intimacy and intensity of
social relations among speaker, addressee, and referent, and this more
directly concerns the common ground of speaker and addressee. In
my example (Fig. 15.7), Kou (left) has just arrived at his village home,
having been driven from the city (30 or so km away) in a pickup. He
has brought with him a load of passengers, mostly children, who have
now scattered and are playing in the grounds of his compound. Saj
(right), a neighbor of Kou, has just arrived on the scene.

Figure 15.7. Kou (light shirt) has just arrived at his village home in a pickup
truck loaded with passengers, mostly children. Saj (dark shirt), a neighbor, comes
by to investigate.

Saj asks Kou how many people were in the group that has just arrived
with Kou’s vehicle, following this up immediately by offering a candidate
set of people: “Duang’s lot” (line 1). The named referent—Duang—is
Kou’s third daughter.8 Kou responds with a list of those who have arrived
with him, beginning by listing four of his own daughters by name (lines
2–3), then mentioning two further children (line 4):

(7) 1 S maa2 cak2 khon2 niø – sum1 qii1+duang3 kaø maa2
        come how_many person tpc_pcl group f.non_resp+D foc_pcl come
        “How many people have come?—Duang’s lot have come?”
    2 K qii1+duang3 – qii1+daa3, qii1+phòòn2
        f.non_resp+D f.non_resp+D f.non_resp+P
        “Duang – Daa, Phòòn.
    3   maa2 bet2 lèq5, qii1+khòòn2van3
        come all pfv f.non_resp+K
        All have come, Khòònvan.
    4   dêk2+nòòj4 maa2 tèè1 paak5_san2 phunø qiik5 sòòng3 khon2
        child come from P dem_far_dist more two person
        Kids from Paksan, another two.”

It is in the common ground that Kou’s own four children are known
to both Kou and Saj by their first names. Kou is therefore able to use
the four children’s personal names in lines 2–3 to achieve recognition.
In line 4, Kou continues his list, with two further children who have
arrived with him. These two are not his own, are not from this village,
and are presumed not to be known by name to Saj. They are children of
Kou’s brother and sister, respectively, who both live in Kou’s mother’s
village Paksan, some distance away. Kou refers to them as “kids from
Paksan.” The reason he does not refer to these two children by name
is that he figures his addressee will not recognize them by name—their
names, as ways of uniquely referring to them, are not in the common
ground. But although Saj certainly will not recognize the children by
name, he will recognize their village of origin by name (and further,
will recognize that village to be Kou’s village of origin, and the home
of Kou’s siblings). So Kou’s solution to the problem of formulating
reference to these two children—in line 4—is to tie them to one sure
piece of common ground: the name of the village where a host of Kou’s
relatives are (openly, mutually) known to live.
However, it appears that Kou’s solution in line 4 is taken—by Saj—to
suppose too little common ground. Although Saj would not know the
names of these Paksan children, he does know the names of some of Kou’s
siblings from Paksan. This is common knowledge, which could form
the basis of a finer characterization of these children’s identities than
that offered in line 4. What immediately follows Kou’s vague reference
to the two children by place of origin in line 4 is Saj’s candidate offer of
a more specific reference to the children. Saj’s candidate reformulation
(line 5 in [8], below) links the children explicitly to one of Kou’s siblings,
referring to him by name. This guess, which turns out to be not entirely
correct, succeeds in eliciting from Kou a finer characterization of the
children’s identities (line 6). This new characterization presupposes
greater common ground than Kou’s first attempt did in line 4, yet it
remains a step away in implied social proximity from that implied
by Kou’s first-name formulations to his own children in lines 2–3,
above:
(8) (Follows directly from (7).)
5 S luuk4 qajø+saaj3
child eB+S
“Children of Saaj?”
6 K luuk4 bak2+saaj3 phuu5 nùng1, luuk4 - qii1+vaat4sa=naa3 phuu5 nùng1
child m.non_resp+S person one child f.non_resp+V person one
“Child of Saaj, one, child of – Vatsana, one.”

The contrasts between the three ways to formulate reference to a
person—by first name in lines 1–3, via place of origin in line 4, via
parent’s name in lines 5–6—represent appeals to common ground
of different kinds, and different degrees. They are indicative of, and
constitutive of, different levels of social familiarity and proximity. This
example shows how such expression of these levels of familiarity can be
explicitly negotiated within the very business of social interaction. Kou’s
reference to the two children from Paksan in line 4 was constructed
differently to the references to his own children in lines 2–3, but
Saj effectively requested, and elicited, a revision of the first-attempt
formulation in line 5, thereby securing a display of greater common
ground than had a moment before been presupposed. 9
A third example involves two men in a somewhat more distant
relationship. This is from an exchange between the two men pictured
on the left of Figs 15.2–15.4. (I call them Foreground Man [FM] and
Background Man [BM].) The men hardly know each other, but are
of a similar age. The younger sister of BM’s younger brother’s wife is
married to the son of FM. The two men seldom meet. Their kinship ties
are distant. Their home territories—the areas about which they should
naturally be expected to have good knowledge—overlap partially. They
originate in villages that are a day’s travel apart. This is far enough to
make it likely that they have spent little time in each other’s territory,
but it is not so far that they would be expected not to have ever done so.
The common ground at stake, then, concerns knowledge of the land.
The conversation takes place in the village of FM. This is therefore
an occasion in which BM is gathering firsthand experience beyond his
home territory. It may be inferred from the segment we are about to
examine that FM wants to display his familiarity with BM’s territory.
The point of interest in this conversation is a series of references to
a geographical location close to BM’s home village, but which FM
apparently knows well. During a discussion of medicinal herbs,
BM mentions an area in which certain herbs can be found. His first
mention of the place is by name: Vang Phêêng.10 As with reference to
persons (see previous example), the use of the bare name in first mention
presupposes recognizability or identifiability (Schegloff 1972). This
identifiability is immediately confirmed by FM’s reply of “Yeah, there’s
no shortage (of that herb) there.” There is then over a minute’s further
discussion of the medicine, before the following sequence begins:11

(9)
1 BM haak4 phang2 khii5 ka0 bò0 qùt2 juu1 [thèèw3-
root PK foc neg lacking at area
“Hak phang khii (a type of medicinal root) is plentiful, at the area of-”
2 FM                                      [qee5
yeah
“Yeah,
3 ka0 cang1 vaa1 faaj3 vang2 phêêng2 faaj3 nang3 qooj4
foc so say weir VP weir what intj
Like I said, Vang Phêêng Weir, whatever weir, oh.”
4 BM m5
mm
“Mm.”
5 FM bò0 qùt2 lèq5, faaj3 qan0 nan0 na0
neg lacking pfv weir clf that pcl
“It’s not lacking (medicinal roots and herbs), that weir.
6 tè0+kii4 haak5 vang2 phêêng2 nan4 tè0+kii4 khaw3 paj3 tèq2-tòòng4
before pcl VP that before 3pl go touch
Before, Vang Phêêng, before for them to go and touch it
7 bò0 daj4, paa1_dong3 man2 lèw0 dêj2
neg can forest 3sg pcl pcl
was impossible, the forest of it (non-respect) you know.”

In line 1, BM mentions a type of herbal medicine, saying that it is
plentiful. He is about to mention the location in which it is plentiful,
as projected by the use of the locational marker glossed in line 1 as “at.”
Not only does FM anticipate this, but he also anticipates which location
it is that BM is about to mention (in a form of anticipation directly
related to that in the simpler example shown in Figs 15.2–15.4),
namely Vang Phêêng Weir (line 3; cf. Lerner 1996 on collaborative
turn completion). This is confirmed by BM’s acknowledgement marker
mm in line 4. Again, we see a dance of display of common ground, by
anticipation of what the current speaker is going to say. FM goes on to
comment in lines 6–7 that in the old days it was impossible to collect
medicinal herbs from the area.
The element of special interest here is the pronoun man2 “it” in bold
face in line 7. There is no local antecedent for this pronoun. The speaker
is using a locally subsequent form in a locally initial position (Fox
1987; Schegloff 1996), with a subsequent major risk of not succeeding
in getting recognition. How do his addressees know what he is talking
about? (We get evidence that BM at least claims to follow him, as we see
BM in the video doing an acknowledging “head toss”—something like
a nod—directed to FM just as the latter utters line 7.) A couple of lines
ensue (omitted here to save space), which finish with FM repeating that
in the old days it was impossible to get medicinal herbs out of there.
Then, Foreground Woman (FW) contributes:
(10)
8 FW khuam2 phen1 haaj4 ni0 na0
reason 3sg.hon angry pcl pcl
“Owing to its (respect) being angry?”
9 FM qee5 – bòò1 mèèn2 lin5 lin5 dêj2, phii3 vang2 phêêng2 ni0
yeah neg be play play pcl spirit VP tpc.pcl
“Yeah – It’s not playing around you know, the spirit of Vang Phêêng.”

Line 8, uttered by FW (BM’s wife), partly reveals her analysis of what
FM is saying, and specifically of what he was referring to by the 3rd
person singular pronoun man2 in line 7. She, too, uses a 3rd-person
singular pronoun, but her choice is the honorific phen1. She suggests
that the previous difficulty in extracting herbs was because of “the anger
of it.” Someone who lacks the relevant cultural common ground will
have no way of knowing that the referent of “it” is the spirit owner
of Vang Phêêng. This is not made explicit until it seems obvious that
everyone already knows what the speaker has been talking about—that
is, as a demonstrative afterthought in line 9.
This exchange reveals to the analyst the extent to which recognition
of quite specific references can be elicited using very minimal forms
for reference when those involved in the social interaction share a
good deal of common ground (cf. [3] and Figs 15.5 and 15.6). It also
makes important indications to the participants themselves. They
display to each other, in a way hardly possible to bluff, that they
share specific common ground. In line 3 of (9), FM anticipates what
BM is going to say, and says it for him. In line 7, FM uses a nearly
contentless pronoun to refer to a new entity in the discourse, relying
entirely on shared knowledge and expectation to achieve successful
recognition.12 In line 8, FW displays her successful recognition of the
referent introduced by FM in line 7, by making explicit something
about the referent that up to this point has been merely implied. By
the economy and brevity of these exchanges, these individuals display
to each other—and to us as onlookers—that they share a great deal of
common knowledge, including common knowledge of the local area
(and the local biographical commitment this indexes), and membership
in the local culture. This may be of immense value for negotiating the
vaguely defined level of interpersonal relationship pertaining between
the two men, whose only reason for interacting is their affinal kinship.
In conversing, they test for, and display common ground, and through
the interplay of their contributions to the progressing trajectory of talk
demonstrate an unbluffable ability to know what is being talked about
before it is even mentioned.

Conclusion
This chapter has proposed that the practices by which we manage
and exploit common ground in interaction demonstrate a personal
commitment to particular relationships and particular communities,
and a studied attention to the practical and strategic requirements
of human sociality. I have argued that the manipulation of common
ground serves both interactional efficacy and social affiliation. The logic
can be summarized as follows. Common ground—knowledge openly
shared by specified pairs, trios, and so forth—is by definition socially
relational, and relationship defining. In an informational dimension,
common ground guides the design of signals by particular speakers
for particular recipients, as well as the proper interpretation by particular
recipients, of signals from particular speakers. Richer common ground
means greater communicative economy, because it enables greater
ampliative inferences on the basis of leaner coded signals. In a
social-affiliational dimension, the resulting streamlined, elliptical interaction
has a property that is recognized and exploited in the ground-level
management of social relations: these indices of common ground are
a means of publicly displaying, to interactants and onlookers alike,
that the requisite common ground is shared, and that the relationship
constituted by that degree or kind of common ground is in evidence.
In sum, common ground is as much a social-affiliational resource as
it is an informational one. In its home disciplines of linguistics and
psychology, the defining properties of common ground concern its
consequences in the realm of reference and discourse coherence. But
sharedness, or not, of information, is essentially social. Why else would
it be that if I were to get the promotion, I had better tell my wife as
soon as I see her (or better, call her and let her be the first to know),
whereas others can be told in due course (my snooker buddies), and
yet others need never know (my dentist)? The critical point, axiomatic
in research on talk in interaction yet alien to linguistics and cognitive
science, is that there is no time out from the social consequences of
communicative action.

Acknowledgments
I would like to acknowledge a special debt to Bill Hanks, Steve Levinson,
Paul Kockelman, Tanya Stivers, Herb Clark, Chuck Goodwin, John
Heritage, and Manny Schegloff, along with the rest of my colleagues
in the Multimodal Interaction Project (MPI Nijmegen)—Penny Brown,
Federico Rossano, JP de Ruiter, and Gunter Senft—for helping me develop
my thinking on the topics raised here. I received helpful commentary on
draft versions from Steve Levinson, Tanya Stivers, as well as Jack Sidnell
and two other anonymous reviewers. None are responsible for errors
and infelicities. I gratefully acknowledge the support of the Max Planck
Society. I also thank Michel Lorrillard for providing me with a place
to work at the Vientiane centre of l’École française d’Extrême-Orient,
where final revisions to this chapter were made. Finally, I thank the
entire cast of contributors to the Roots of Human Sociality symposium at
the village of Duck, on North Carolina’s Outer Banks, October 2004.

Notes
1. See also Schiffer (1972), Sperber and Wilson (1995), D’Andrade (1987:113),
Searle (1995:23–26), Schegloff (1996:459), Barr and Keysar (2004). Although
analysts agree that humans can construct and consult common ground in
interaction, there is considerable disagreement as to how pervasive it is (see
discussion in Barr and Keysar 2004).
2. By “hypothesis,” I do not mean that we need consciously or explicitly
entertain candidate accounts for questions like whether our colleagues will
wear clothes to work tomorrow, or whether the sun will come up, or whether
we will stop feeling thirsty after we have had a drink (saying “Aha, just as I
suspected” when verified). But we nevertheless have models of how things are,
which, most importantly, are always accessible, and become visible precisely
when things go against our expectations (Whorf 1956). For this to work,
we need some kind of stored representation, whether mental or otherwise
embodied, which accounts for our expectations.
3. Steve Levinson points out the relevance of the great spatial distance
between BW and the basket. Her reach has a long way to go when FW acts on
the inference derived from observing her action. It may be that BW’s stylized
reach was overtly communicative, designed to induce recognition of intention,
and the perlocutionary effect of causing FW to pass the basket (functioning,
effectively, as a request).
4. The phrasing appropriates Slobin’s thinking-for-speaking idea: that “language
directs us to attend—while speaking—to the dimensions of experience
that are enshrined in grammatical categories” (Slobin 1996:71).
5. There is some controversy as to the extent to which we do audience
design and assume its having been done. By a frugal cognition view, audience
design is heavily minimized, but all analytical positions acknowledge that
high-powered inference must at the very least be available when required
(Barr and Keysar 2004; cf. Goodwin, Hutchins, and Danziger in this volume).
6. This is the corollary of the impossibility of pretending to possess common
ground when you do not: witness the implausibility of fictional stories in
which characters assume other characters’ identities and impersonate them,
living their lives without their closest friends and kin detecting that they are
imposters (e.g., the reciprocal face transplant performed on arch enemies
Castor Troy and Sean Archer in Face/Off, Paramount Pictures, 1997).
7. More work is needed to understand how the use of profanities works
to display and constitute “close” social relations. Presumably, the mechanism
is that “we can’t talk like that with everybody.” So, it is not a question of the
propositional content of the information being exchanged, but its register,
its format. Compare this with more sophisticated ways of displaying social
affiliation in the animal world, such as the synchronized swimming and
diving that closely affiliated porpoises employ as a display of alliance (Connor
et al. 2000:104). It is not just that these individuals are swimming together,
but, in addition, how they are doing it.
8. Like the others in this list of names, Duang is socially “lower” than
both the participants, and accordingly, her name is prefixed with the female
nonrespect prefix qii1-; cf. Enfield (in press).
9. I gratefully acknowledge the contribution of Manny Schegloff and
Tanya Stivers to my understanding of this example.
10. The Lao word vang refers to a river pool, a section of river in which the
water is deep and not perceptibly flowing, usually with thick forest towering
over it, producing a slightly spooky atmosphere, of the kind associated with
spirit owners (i.e., ghosts or spirits that “own” a place, and must be appeased
when traveling through). The same place is also called Faaj Vang Phêêng (faaj
means “weir”; the deep still water of Vang Phêêng is a weir reservoir).
11. Vertically aligned square brackets indicate overlap in speech.
12. This is comparable with the use of him in the opening words of Paul
Bremer’s announcement at a Baghdad news conference in December 2003 of
the highly anticipated capture of Saddam Hussein: “Ladies and gentlemen, we
got him.”

References
Baron-Cohen, S. 1995. Mindblindness: An essay on autism and Theory of
Mind. Cambridge, MA: MIT Press.
Barr, D. J., and B. Keysar. 2004. Making sense of how we make sense:
The paradox of egocentrism in language use. In Figurative language
processing: Social and cultural influences, edited by H. Colston and A.
Katz, 21–41. Mahwah, NJ: Erlbaum.
Buys, C. J., and K. L. Larson. 1979. Human sympathy groups. Psychological
Reports 45:547–553.
Carruthers, P., and P. K. Smith (eds.). 1996. Theories of Theories of Mind.
Cambridge: Cambridge University Press.
Clark, H. H. 1996. Using language. Cambridge: Cambridge University
Press.
Cohen, D. 1999. Adding insult to injury: Practices of empathy in an infertility
support group. Ph.D. dissertation, School of Communication, Rutgers
University.
Connor, R. C., R. S. Wells, J. Mann, and A. J. Read. 2000. The Bottlenose
Dolphin: Social relationships in a fission-fusion society. In Cetacean
societies: Field studies of dolphins and whales, edited by J. Mann, R. C.
Connor, P. L. Tyack, and H. Whitehead, 91–126. Chicago: University of
Chicago Press.
D’Andrade, R. D. 1987. A Folk Model of the Mind. In Cultural models in
language and thought, edited by D. Holland and N. Quinn, 112–148.
Cambridge: Cambridge University Press.
Drew, P., and K. Chilton. 2000. Calling just to keep in touch: Regular and
habitualised telephone calls as an environment for small talk. In Small
Talk, edited by J. Coupland, 137–162. Harlow: Pearson Education.
Dunbar, R. I. M. 1993. Coevolution of neocortical size, group size, and
language in humans. Behavioral and Brain Sciences 16:681–735.
——. 1996. Grooming, gossip and the evolution of language. London: Faber
and Faber.
——. 1998. The social brain hypothesis. Evolutionary Anthropology
6:178–190.
Dunbar, R. I. M., and M. Spoors. 1995. Social networks, support cliques,
and kinship. Human Nature 6:273–290.
Enfield, N. J. 2002. Cultural logic and syntactic productivity: Associated
posture constructions in Lao. In Ethnosyntax: Explorations in culture
and grammar, edited by N. J. Enfield, 231–258. Oxford: Oxford
University Press.
——. 2003. The definition of what-d’you-call-it: Semantics and pragmatics
of recognitional deixis. Journal of Pragmatics 35:101–117.
——. 2005. The body as a cognitive artifact in kinship representations.
Hand gesture diagrams by speakers of Lao. Current Anthropology
46(1):51–81.
——. in press. Meanings of the unmarked: Why default references do
more than just refer. In Person reference in interaction, edited by N. J.
Enfield and T. Stivers. Cambridge: Cambridge University Press.
——. n.d. Evidential interrogative particles in Lao. Language and Cognition
Group, MPI Nijmegen, March 2006. [Typescript]
Enfield, N. J., and T. Stivers (eds.). in press. Person reference in interaction.
Cambridge: Cambridge University Press.
Fox, B. A. 1987. Discourse structure and anaphora: Written and conversational
English. Cambridge: Cambridge University Press.
Garfinkel, H., and H. Sacks. 1970. On formal structures of practical
actions. In Theoretical sociology: Perspectives and developments, edited by
J. C. McKinney and E. A. Tiryakian, 337–366. New York: Meredith.
Gigerenzer, G., P. M. Todd, and The ABC Research Group. 1999. Simple
heuristics that make us smart. Oxford: Oxford University Press.
Goffman, E. 1967. Interaction ritual. New York: Anchor Books.
——. 1971. Relations in public. New York: Harper & Row.
——. 1981. Forms of talk. Philadelphia: University of Pennsylvania
Press.
Goodwin, C. 1994. Professional vision. American Anthropologist 96(3):
606–633.
——. 1996. Transparent vision. In Interaction and grammar, edited by
E. Ochs, E. A. Schegloff, and S. A. Thompson, 370–404. Cambridge:
Cambridge University Press.
——. 2000. Action and embodiment within situated human interaction.
Journal of Pragmatics 32:1489–1522.
Goody, E. N. (ed.). 1995. Social intelligence and interaction: Expressions
and implications of the social bias in human intelligence. Cambridge:
Cambridge University Press.
Grice, H. P. 1975. Logic and conversation. In Speech acts, edited by P.
Cole and J. L. Morgan, 41–58. New York: Academic Press.
——. 1989. Studies in the way of words. Cambridge, MA: Harvard
University Press.
Hanks, W. F. 1989. The indexical ground of deictic reference.
Chicago Linguistics Society 25(2):104–122.
Heritage, J., and J. M. Atkinson. 1984. Introduction. In Structures of social
action: Studies in conversation analysis, edited by J. M. Atkinson and J.
Heritage, 1–15. Cambridge: Cambridge University Press.
Hill, R. A., and R. I. M. Dunbar. 2003. Social network size in humans.
Human Nature 14:53–72.
Hutchins, E. 1995. Cognition in the wild. Cambridge, MA: MIT Press.
Hutchins, E., and B. Hazlehurst. 1995. How to invent a shared lexicon:
The emergence of shared form-meaning mappings in interaction. In
Social intelligence and interaction: Expressions and implications of the
social bias in human intelligence, edited by E. Goody, 53–67. Cambridge:
Cambridge University Press.
Hutchins, E., and L. Palen. 1993. Constructing meaning from space,
gesture, and speech. In Discourse, tools, and reasoning: Essays on situated
cognition, edited by L. B. Resnick, R. Säljö, C. Pontecorvo, and B.
Burge, 23–40. Berlin: Springer.
Jefferson, G. 1974. Error correction as an interactional resource. Language
in Society 2:181–199.
——. 1978. Sequential aspects of storytelling in conversation. In Studies
in the organization of conversational interaction, edited by J. Schenkein,
219–248. New York: Academic Press.
Jefferson, G., and J. R. E. Lee. 1980. End of Grant Report to the British
SSRC on the analysis of conversations in which “troubles” and
“anxieties” are expressed. Ref. Hr 4802. Manchester: University of
Manchester.
Kita, S. (ed.). 2003. Pointing: Where language, cognition, and culture meet.
Mahwah, NJ: Erlbaum.
Kockelman, P. 2005. The semiotic stance. Semiotica 157:233–304.
Lerner, G. H. 1992. Assisted storytelling: Deploying shared knowledge
as a practical matter. Qualitative Sociology 15:24–77.
——. 1996. On the “semi-permeable” character of grammatical units
in conversation: Conditional entry into the turn space of another
speaker. In Interaction and grammar, edited by E. Ochs, E. A. Schegloff,
and S. A. Thompson, 238–276. Cambridge: Cambridge University
Press.
Levinson, S. C. 1995. Interactional biases in human thinking. In Social
intelligence and interaction: Expressions and implications of the social
bias in human intelligence, edited by E. Goody, 221–260. Cambridge:
Cambridge University Press.
——. 1997. From outer to inner space: Linguistic categories and
nonlinguistic thinking. In Language and conceptualization, edited by J.
Nuyts and E. Pederson, 13–45. Cambridge: Cambridge University
Press.
——. 2000. Presumptive meanings: The theory of generalized conversational
implicature. Cambridge, MA: MIT Press.
Lewis, D. K. 1969. Convention: A philosophical study. Cambridge, MA:
Harvard University Press.
Mandelbaum, J. 1987. Couples sharing stories. Communication Quarterly
35(2):144–170.
Maynard, D. W., and D. Zimmerman. 1984. Topical talk, ritual, and
the social organization of relationships. Social Psychology Quarterly
47:301–316.
Mead, G. H. 1934. Mind, self, and society from the standpoint of a social
behaviorist, edited by C. W. Morris. Chicago: University of Chicago
Press.
Miller, G. A. 1951. Language and communication. New York: McGraw-
Hill.
Molder, H. te, and J. Potter (eds.). 2005. Conversation and Cognition.
Cambridge: Cambridge University Press.
Moore, C., and P. Dunham (eds.). 1995. Joint attention: Its origins and
role in development. Hillsdale, NJ: Erlbaum.
Morrison, J. 1997. Enacting involvement: Some conversational practices for
being in relationship. Ph.D. dissertation, School of Communications,
Temple University.
Nettle, D. 1999. Language variation and the evolution of societies.
In The evolution of culture: An interdisciplinary view, edited by R. I.
M. Dunbar, C. Knight, and C. Power, 214–227. New Brunswick, NJ:
Rutgers University Press.
Nettle, D., and R. I. M. Dunbar. 1997. Social markers and the evolution
of reciprocal exchange. Current Anthropology 38(1):93–99.
Norman, D. A. 1991. Cognitive artifacts. In Designing interaction:
Psychology at the human-computer interface, edited by J. M. Carroll,
17–38. Cambridge: Cambridge University Press.
Peirce, C. S. 1965[1932]. Collected papers of Charles Sanders Peirce, vol.
2: Elements of Logic, edited by Charles Hartshorne and Paul Weiss.
Cambridge, MA: Belknap Press of Harvard University Press.
Pomerantz, A., and J. Mandelbaum. 2005. Conversation analytic
approaches to the relevance and uses of relationship categories in
interaction. In Handbook of language and social interaction, edited by
K. L. Fitch and R. E. Sanders, 149–171. Mahwah, NJ: Erlbaum.
Rogoff, B. 1994. Apprenticeship in thinking: Cognitive development in social
context. New York: Oxford University Press.
Sacks, H. 1974. An analysis of the course of a joke’s telling in conversation.
In Explorations in the ethnography of speaking, edited by R. Bauman and
J. Sherzer, 337–353. Cambridge: Cambridge University Press.
——. 1992. Lectures on conversation. London: Blackwell.
Sacks, H., and E. A. Schegloff. 1979. Two preferences in the organization
of reference to persons in conversation and their interaction. In
Everyday language: Studies in ethnomethodology, edited by G. Psathas,
15–21. New York: Irvington.
Schank, R. C., and R. P. Abelson. 1977. Scripts, plans, goals, and
understanding: An inquiry into human knowledge structures. Hillsdale, NJ:
Erlbaum.
Schegloff, E. A. 1972. Notes on a conversational practice: Formulating
place. In Studies in social interaction, edited by D. Sudnow, 75–119.
New York: The Free Press.
——. 1982. Discourse as an interactional achievement: Some uses of “Uh
Huh” and other things that come between sentences. In Georgetown
University Roundtable on Languages and Linguistics 1981; Analyzing
Discourse: Text and Talk, edited by D. Tannen, 71–93. Washington,
DC: Georgetown University Press.
——. 1992. Repair after next turn: The last structurally provided defense
of intersubjectivity in conversation. American Journal of Sociology
97(5):1295–1345.
——. 1996. Some practices for referring to persons in talk-in-interaction:
A partial sketch of a systematics. In Studies in anaphora, edited by B.
Fox, 437–485. Amsterdam: Benjamins.
——. 1997. Third turn repair. In Towards a social science of language, vol.
2: Social interaction and discourse structures, edited by G. R. Guy, C.
Feagin, D. Schiffrin, and J. Baugh, 31–40. Amsterdam: Benjamins.
——. in press a. Conveying who you are: The presentation of self, strictly
speaking. In Person reference in interaction, edited by N. J. Enfield and
T. Stivers. Cambridge: Cambridge University Press.
——. in press b. Tutorial on membership categorization devices. In
Special Issue of Journal of Pragmatics, edited by M. F. Nielsen and J.
Wagner.
Schelling, T. C. 1960. The strategy of conflict. Cambridge, MA: Harvard
University Press.
Schiffer, S. R. 1972. Meaning. Oxford: Clarendon Press.
Schutz, A. 1970. On phenomenology and social relations. Chicago:
University of Chicago Press.
Searle, J. R. 1995. The construction of social reality. New York: Free
Press.
Slobin, D. 1996. From “thought and language” to “thinking for speaking.”
In Rethinking linguistic relativity, edited by J. J. Gumperz and S. C.
Levinson, 70–96. Cambridge: Cambridge University Press.
Smith, N. V. (ed.). 1980. Mutual knowledge. London: Academic Press.
Sperber, D., and D. Wilson. 1995. Relevance: Communication and cognition,
2nd edition. Oxford: Blackwell.
Sugawara, K. 1984. Spatial proximity and bodily contact among the
Central Kalahari San. African Study Monograph (supp.) 3:1–43.
Tomasello, M. 1999. The cultural origins of human cognition. Cambridge,
MA: Harvard University Press.
Vygotsky, L. S. 1962[1934]. Thought and language. Cambridge, MA: MIT
Press.
Whorf, B. L. 1956. Language, thought, and reality. Cambridge, MA: MIT
Press.
Why a Deep Understanding of Cultural Evolution Is Incompatible with Shallow Psychology

Dan Sperber

Human cognition, interaction, and culture are thoroughly intertwined.
Without cognition and interaction, there would be no
culture. Without culture, cognition and interaction would be very
different affairs, as they are among other social species. The effect of
culture on mental life has always been a main concern of the social
sciences and, after a long period of almost total neglect, it is more and
more taken into consideration in cognitive psychology. The effect of
cognition, and in particular of the ability to attribute mental states to
others, on interaction has become an important topic of investigation
in developmental and social psychology. Little attention, however, is
paid to the effect of cognition on culture. It is hardly controversial
that the human mind is what makes human culture possible. Quite
generally, however, the mind is seen as a mere enabler of culture, a
pure opportunity with no constraints attached, nothing that might
contribute to shaping, or at least to biasing cultural contents.
Against this neglect of cognition, I argue that understanding the mind
is doubly important to the study of culture. Psychological considerations
are crucial both to a proper characterization of what is cultural and to a
proper explanation of cultural phenomena.
Most anthropologists may be too savvy to still talk of the mind as a
“blank slate,” but they too often meet any discussion of psychological
factors in culture with a blank stare. Most evolutionary theorists who
see cultural evolution as an analogue of biological evolution by natural
selection postulate rather than investigate the few particulars of the
human mind that may be relevant to them. They are too often content
to assume that human capacities for imitation and communication are
reliable enough to justify treating cultural items as replicators.
Some evolutionary theorists interested in gene–culture coevolution
have however recognized the relevance of psychological factors
involved in decision or preference of a kind typically studied by social
psychologists. Boyd and Richerson and their collaborators in particular
have investigated biases in cultural transmission based, for instance, on
preferences for more frequent traits, or for traits possessed by successful
individuals (Boyd and Richerson 1985, this volume; Richerson and Boyd
2005). Other researchers (e.g., Atran 1990, 2002; Boyer 1994, 2001;
Hirschfeld 1996; Sperber 1996) have focused on the role of domain-specific
cognitive mechanisms in cultural evolution (see also Tooby
and Cosmides 1992). The difference between Boyd and Richerson’s
approach and ours is, I believe, one of emphasis rather than a general
disagreement. Still, cognitive factors—and hence deeper psychological
levels than those involved in preferences—do deserve emphasizing,
because they are crucial to the kind of fine-grained explanation of
cultural facts that anthropologists rightly seek to provide.

Representations and Cognition


Cognitive psychology and cultural anthropology deal with representations
(whether or not the word is used) in the broad sense where a
representation is whatever carries meaning or content. Can this broad
notion of representation be made precise enough to help bridge the gap
between the cognitive and the social sciences? Here is a way to go.
Something (a brain state or an artifact for instance) is a representation
if it is produced by some information processing device (a mental
mechanism, an individual, an organization, or a robot for instance) so as
to contain information about some event or state of affairs, information
to be used by some other cognitive device (or the same device at a
later time). The notion so understood applies equally well to mental
representations inside people (or other animals, or intelligent robots)
and to public representations, that is, artifacts such as utterances or
pictures produced to communicate among people. This notion does
not presuppose that a representation must have internal structure, let
alone languagelike articulation. It does not impose any condition on
the spatial and temporal location of representations, continuous or
fragmented, inside or outside brains (other than what follows from the
fact that representations are produced and used by cognitive devices
and therefore must be within their reach—they do not just hover in
social space).
Here are a couple of examples of representations so understood. It has
recently been discovered (Rizzolatti et al. 1996) that specific neurons
(so-called “mirror neurons”) in the F5 area in the premotor cortex of
macaques discharge when and only when either the monkey grasps
a piece of food or observes the same action performed by another
individual. This activation represents (in the sense intended here) the
action of grasping a piece of food: it is produced by cognitive mechanisms
of perception or motor control; it contains information about the
occurrence of a certain kind of event; it has the function of making this
information available to other cognitive mechanisms in the individual
that may be involved in prediction, coordination, or competition. When
Joan, asked where is the nearest bus stop, points in a certain direction,
her pointing represents the direction of the nearest bus stop: it is produced
by a chain of cognitive mechanisms linking her understanding of the
question, her knowledge of the location of the nearest bus stop, and her
ability to produce an interpretable gesture; it contains information about
the direction of the nearest bus stop, and it has the function of making
this information available to the person who asked for it.
This broad notion of representation dovetails with a broad notion
of cognition. Cognition broadly understood refers to a set of processes
that have as a function to secure specific “content,” or “semantic,”
relationships between their inputs and their outputs. Some semantic
relationships (e.g., entailment, contradiction, synonymy, similarity of
content) obtain among representations. Other semantic relationships
(e.g., truth, fulfillment) obtain between representations and what they
are about, that is, events or states of affairs. By this definition, cognitive
processes must have representations as input and/or as output.
Perception has the function of realizing a true-of relationship between
a distal stimulus and the mental representation that identifies this
stimulus. For instance, Ann’s perceiving that the doorbell is ringing is
both caused by the ringing of the doorbell and is true of that event.
Inference has the function of realizing a follows-from relationship
between two sets of representations, its input premises and its output
conclusion. Ann’s inferring from the ringing of the doorbell that
someone wants the door open is a causal process that takes as input the
general representation that what normally causes a doorbell to ring is the
action of people wanting the door open and the specific representation
that the doorbell is ringing. Remembering has the function of realizing
a near-identity-of-content relationship between two representations,
distant in time, of the same event or state of affairs. Ann’s remembering
that she had ordered a pizza has among its causes her having mentally
represented her action of ordering a pizza at the time at which she
did so, and produces a representation similar in content to that earlier
representation. Motor control has the function of realizing a fulfillment
relationship between a mentally represented goal and the effect of an
action. Ann’s opening the door is both caused by her goal of opening
the door to the person presumably delivering the pizza, and fulfills
that goal. This matching of causal and semantic relationships is what
characterizes cognition (and explaining step by step how this matching
is realized can be seen as the central goal of the cognitive sciences).

Cognitive Causal Chains (CCCs) and Social Cognitive Causal Chains (SCCCs)
Because a representation, to be a representation, has to be produced
by a cognitive process and used by another one, there cannot be such
a thing as an isolated atomic cognitive process. Every cognitive process
has, as its input and/or output, the output and/or input of one or several
other cognitive processes.
Cognitive processes are linked to one another in causal chains (that
can branch in complex ways). I call such chains Cognitive Causal
Chains, or CCCs for short. The chain of events that goes from the
perception of the doorbell ringing and the memory of having ordered
a pizza to the action of opening the door to the person delivering the
pizza is an example of a CCC. Cognitive psychologists typically do
not look at full individual CCCs but work on only one type of link
in such chains: perception, memory, inference, or motor control for
instance. Still, their work helps us understand CCCs that are going
from the stimuli of individual perception to the immediate outcome
of individual actions. Cognitive processes are not, however, limited to
individual processes, and neither are CCCs.
What makes a process cognitive is that it has as its function to produce
an output that stands in a specific content relationship to its input. This
includes not only processes in the brain but also interactions between
the brain and the rest of the body, between organisms or artifacts and
their environment, or among organisms and artifacts. Much of this
and more has been convincingly argued and illustrated by defenders
of the “distributed cognition” approach (in particular Hutchins 1995,
this volume).
Chains of cognitive processes can extend across individuals and have
a social character. In the simplest cases, the behavioral output of some
individual’s CCC may serve as a perceptual input for other individuals’
CCCs and link them in a single Social CCC or SCCC for short. Here are
two rudimentary examples.
Mary, Peter, and Paul are walking in a single file. Mary is leading
the way, followed by Peter, followed by Paul. Mary’s steps, then, have
not only the function of carrying her in a certain direction but also
that of indicating to Peter where to tread, and Peter’s steps have the
same function vis-à-vis Paul. The CCC that controls Mary’s steps extends
through Peter’s perception and decisions to Peter’s steps and similarly
from Peter’s to Paul’s. In other terms, the coordinated action of Mary,
Peter, and Paul is controlled by a SCCC with most of the information
flowing from Mary to Peter to Paul (there might also be some typically
acoustic-channel feedback information from Paul to Peter to Mary).
Ann calls the local pizza store to order a pizza. John takes the order,
passes it to Mary, who prepares the pizza, and to Bill, who takes the pizza
to Ann’s door. Bill rings the bell, Ann hears the doorbell ringing,
remembers she has ordered a pizza, and so on. The individual CCC
described earlier of Ann’s hearing the doorbell and opening the door
could only exist as a fragment of a longer SCCC such as this one.
To satisfy her desire to have a pizza delivered, Ann has to recruit the
attention, cognitive processes, and work of others. At the same time,
she herself is recruited by Bill to open the door, and, more importantly,
by the pizza store as a source of income.
As such examples illustrate, each SCCC is characterized by a specific
flow of information across people, behaviors, and artifacts. It is not just
information, however, but also people and objects that are being moved
and altered by SCCCs. There can be no flow of information without
physical processes that carry it out. These bodily and environmental
changes are as essential to explaining the flow of information as the
flow of information is essential to explaining them. A process that
has a cognitive function may have other functions as well. It can be
simultaneously cognitive and emotional, metabolic, motor, social,
cultural, or economic, and the cognitive function need not be the
most important one.
Whatever their main function, all SCCCs involve processes inside
individual organisms, processes in the environment of organisms, and
processes at the organism-environment interface.
Inside organisms, we find the causal chains that construct, elaborate,
maintain, and use mental representations, and that go, in particular,
from stimulation of afferent nerve endings to impulses from efferent
nerve endings. The links in these causal chains include first and foremost
brain and other nervous system processes, but also a variety of other
bodily processes.
At the organism-environment interface we find the production
by organisms of a variety of items including events such as bodily
movements, and traces or products of these events, in particular artifacts,
and the impact of these items on other organisms. To describe all these
items (events and objects), I talk of “public productions” (they are
“public” in the sense that, unlike mental states and events, they can be
perceived by other organisms, not in the narrower sense that they can
be perceived by everybody: a whisper in somebody’s ear is public in
the sense here intended).
Public productions may undergo changes in the environment that are
not, or not wholly, controlled by their producers and users. They may
undergo a more or less rapid decay (quasi-immediate for speech, much
slower for writings, for instance); they may undergo a development of
their own, as do biological artifacts such as domestic plants and animals;
they may change in ways related to other environmental changes, as
do traps that close on their prey, or measuring instruments that indicate
past or present environmental conditions. Most of these environmental
processes do not have a cognitive function (with some exceptions, e.g.,
changes in measuring instruments) and therefore are not in themselves
links in SCCCs, but they are aspects of such links and help explain their
effectiveness, their fragility, their penetrability by outside information
or by “noise.”

Cultural Cognitive Causal Chains (CCCCs)


Most SCCCs are relatively short. They involve few people coordinating
in the cooperative or competitive pursuit of some goal. In such coordination,
information changes as it flows along the chain, according to the
stage reached in the pursuit of the goal. More generally, information
processed along an individual or a social CCC may be different at every
juncture even though, by the very definition of a CCC, it is both causally
and semantically connected with information processed at previous
junctures. This is no mystery: only some cognitive processes have as
a function to preserve in their output the content of their input—
remembering, for instance, is a preservative process but hypothesis
formation or decision are not.
Some SCCCs do have the function of preserving across individuals the
content of mental representations or the form of public productions.
A typical example of a SCCC that preserves the content of mental
representations is provided by communication. Communication between
two people involves two complementary cognitive processes, one
of expression and one of interpretation. The communicator expresses
a mental representation by producing some public representation.
This representation is then interpreted by the receiver, yielding, if
all goes well, a mental representation similar enough to the one that
had been expressed by the communicator. A typical example of a
SCCC that preserves the form of public productions is provided by
imitation. An individual observes the behavior of another individual
and mentally represents it in a manner such that she can then exploit
this representation to produce a similar behavior herself.
Preservation of mental content and preservation of behavioral form
from one individual to another are often linked. Typically, the receiver
in a process of communication not only constructs a mental representation
similar to that of the communicator; she also acquires the ability to
produce a new version of the public representation she has interpreted.
Typically, the imitator, to imitate, constructs a mental representation
of the behavior to be imitated similar to that of the individual he is
imitating. (I hedge with “typically” because a receiver’s interpretive
abilities may not be matched by her expressive abilities, for instance,
when communication takes place in a second language well understood
but poorly spoken by the receiver; similarly, an imitator may use a
mental representation of the behavior quite different from that of the
individual he is imitating, as in the case of a parrot imitating human
speech.) Thus, imitation and communication can overlap, as in Fig. 16.1.
SCCCs that have the function of preserving mental content, behavioral
form, or both, may extend across many individuals and through a social
group, distributing throughout this group similar mental representations
or public productions. Such representations and productions are cultural
and the SCCCs that distribute them are cultural CCCs, or CCCCs for
short. CCCCs are quite diverse. Fig. 16.2 presents a simplified fragment
of a CCCC distributing a folktale via oral transmission. Some involve
large numbers of successive face-to-face interactions, some involve a few
one-to-many communications, some involve staged public events such
as rituals, some involve fabrication of specific artifacts, and so forth.
CCCCs vary in the degree to which they rely on permanent artifactual
features of the environment, from books to churches, from musical
instruments to the Internet. They vary in the extent to which they are
mutually supportive: some extend indefinitely almost on their own,
such as the CCCC that distributes the “God bless you!” response to a
sneeze. Others, such as those that distribute elements of an ideology
(e.g., the dogma of the Trinity) or of a discipline (e.g., Cantor’s proof),
flourish only in the middle of related CCCCs.

Figure 16.1. Two kinds of link in SCCCs.

Figure 16.2. Simplified fragment of the CCCC of a folktale: The content of
several public narratives heard over time is remembered as a single mental story
and may be retold as a public narrative, contributing to the cultural distribution
of the tale.

Although the idea that thoughts or practices can be contagious is an
old and common one, SCCCs and CCCCs (with the possible exception
of rumors) are not objects recognized by common sense, nor are they
part of the social sciences’ toolkit. Notwithstanding, I have argued
(Sperber 1999) that they are what social life is made of, and that things
(objects, events, mental states, etc.) are social to the extent that they
owe their properties to their being embedded in SCCCs and that they
are cultural to the extent that they are shaped and stabilized by CCCCs.
I have suggested moreover that a properly naturalistic approach to
social and cultural phenomena centrally involves identifying the causal
factors and mechanisms that shape these causal chains and explaining
the macroregularities and changes of social and cultural life in terms
of these microprocesses. Even though this particular idiom is mine
(and, I hope to show, is useful), it is a variant of a more general type
of approach to society and culture that I call “epidemiological” and
that is found for instance, even if not under that name, in the work of
Cavalli-Sforza and Feldman (1981), Dawkins (1976) and memeticists
inspired by his ideas (Aunger 2002; Blackmore 1999), Durham (1991),
or Boyd and Richerson (1985).
All epidemiological approaches to cultures consider cultural phenomena
as a population of mental or artifactual items distributed in
a biological population (in particular a human population) and its
habitat, and seek to explain the evolving distribution of these cultural
items. Epidemiological approaches themselves are forms of “population
thinking” applied to cultural phenomena (and discussed under this
label in Richerson and Boyd 2005). Although all these epidemiological
approaches share some basic presuppositions that put them at odds with
more standard holistic and antinaturalistic approaches that are common
in the social sciences, they differ in the way they explain the distribution
and evolution of cultural items and in particular in the role they give to
fine-grained psychological factors in their explanations. I am arguing
that a cultural epidemiology that does not interface with psychology
makes as little sense as would a medical epidemiology that would not
interface with pathology. In particular, postulating that preservative
processes in the human mind are reliable enough to explain cultural
stability rather than investigating whether they really are is as shallow
as it would be to postulate without further investigation that all diseases
are infectious and carried by only one type of pathogenic agent.

Fidelity and Stability
How stable are cultural things? Less than is commonly assumed,
especially in the case of “traditional societies,” too often taken to change
very little over generations. Cultures are in constant flux, and this is
true at all levels, from microinteractions to societal institutions. Still,
nothing is cultural without a modicum of stability over social time
and space. What makes some item a token of a cultural type is that it
is similar enough to other tokens of the type to be identified as such.
Members of a cultural group do recognize different word tokens as being
the same word, narratives as being of the same tale, food on their plate
as being the same dish, individual haircuts as exemplifying the same
hairstyle, performances as being the same ritual, individual attitudes
as expressing the same values, and so on. All the tokens recognized as
being of the same type need not be identical, but their resemblance to
one another in relevant respects—even if mere “family resemblance” à
la Wittgenstein, even if exaggerated—must be sufficient for this quasi-
unanimous recognition to be possible. This relative resemblance of
tokens of a type across social space and time gives a measure of the
stability of cultural types and of the stabilizing effect of their CCCCs.
Some types, for example proverbs, are extremely stable; others, for
example dress fashions in modern societies, much less so (from which
we can infer that their respective CCCCs work differently).
How can CCCCs stabilize cultural contents and forms? The answer
may seem obvious: CCCCs are concatenations of preservative processes
of memory, imitation and communication, and, so the explanation goes,
these processes must have sufficient fidelity at the micro scale to bring
about the stability we observe at the macro scale. Cultural stability is
then seen as the proof of the reliability of human memory, imitation and
communication. At this point, students of cultural processes may feel
that the inferred high fidelity achieved by these preservative processes
is all that is relevant to them. Moreover, if all this is correct, cultural
items are indeed replicators (even if, unlike genes, they do not directly
generate their replica). Given that these replicators exhibit great variety,
and given that the waxing of some (linguistic devices, religious practices
and ideas, techniques, fashions, etc.) is at the expense of the waning
of others, the three conditions for Darwinian selection (heritability,
variability, and competition) are met. As suggested by Dawkins and
embraced by Dennett, Aunger, Blackmore (see Aunger 2000) and so many
others, culture can be described as a process of “memetic” evolution
comparable with genetic evolution, with, in both cases, selection as the
main driving force. The study of the precise mechanisms that make
such fidelity possible can be left to other scholars, now or when they
will be up to it, just as population geneticists may leave the details of
chromosome replication to molecular biologists.
This attitude is well illustrated in the recent review by Mesoudi et al.
(2004) of arguments in favor of a selectionist approach to culture. They
suggest that “our current understanding of culture is comparable to that
attained by biology in 1859” and that, just as Darwin’s own ignorance
of the mechanisms of biological inheritance did not stop him from
successfully developing and applying his theory, what they take to be
our comparable ignorance of the mechanisms of cultural inheritance
should not inhibit us from applying evolutionary models. They express
the hope that some future “cultural ‘Watson and Crick’ ” (Mesoudi et
al. 2004:9) will discover the cultural counterpart of DNA. I believe we
know enough to know that there is no cultural DNA to be found.
Cultural transmission is achieved not by a single mechanism of
replication but by a variety of mental and social mechanisms. These
mechanisms are intensely studied and in good part understood. Consider
the work done on “imitation” in the past 15 years (Hurley and Chater
2005; Tomasello and Carpenter 2005). Among the several processes that
result in the re-production of a behavior,1 we now must distinguish (at
least) stimulus enhancement, emulation, and, within imitation proper,
imitation of behavior and imitation of goal (if the latter can properly
be described as “imitation” at all). Work on verbal communication in
linguistics, pragmatics, psycholinguistics, and sociolinguistics reveals a
great variety of submechanisms interacting in complex ways; nonverbal
communication involves yet other mental and interactional mechanisms
such as joint attention (see Astington, Clark, Enfield, Gaskins, Gergely
and Csibra, Goldin-Meadow, Goodwin, Keating, Levinson, Liszkowski,
Pyers, Schegloff, and Tomasello in this volume; see also Sperber and
Wilson 1995).
If, instead of postulating that they must be faithful enough to
explain cultural macro stability, one looks closely at the micro processes
involved, what is immediately striking (and abundantly confirmed
by experimental work) is that outputs of memory, imitation and
communication are quite generally transformations of the inputs, so
much so that the rare cases where the output is identical to the input
are best seen as limiting cases of “zero transformation.” Many of these
transformations are in the direction of entropy: information is lost in
the process of transmission. Some of these transformations are biased
so as better to fit the current mental or motor schema and goals of
the user. This is hardly surprising. It is not just that imperfection is
to be expected. It is, more importantly, that the finality of individual
memory, imitation and communication processes is not to preserve
information per se (and even less to preserve it so as to secure cultural
stability). Rather, a relative degree of preservation of information is a
means toward a variety of ends.
When Jill tries to remember what happened at the last council meeting,
it is to better prepare the coming one, and, for this, all she needs is
the parts of the gist of what was said on issues that are likely to come up
again. When you read this chapter, you do so not to store in your mind
a copy of its contents but to extract from it what may be of relevance to
you. When Peter tries to copy the way in which he saw Henrietta prepare
a soufflé, he does so not to duplicate her movements but to produce
a soufflé to his liking. Only when the goal of preservation is best served
by strict replication, as when forging a banknote or dancing in a chorus
line, is an effort made to avoid any departure from the model.
To generalize: in preservative processes, information is transformed
in two directions—entropy and relevance. Part of this transformation
results from the imperfection of these processes, part of it results
from their finality. Incidentally, it would be a mistake to assume
that transformation toward entropy is always and entirely an aspect
of the imperfection of preservative processes: eliminating irrelevant
information is a contribution to overall relevance.2
Most cognitive processes are constructive. They do not just reencode
input information; rather, they construct new mental representations
by drawing jointly on new inputs and on memorized information
and by typically going beyond a mere addition of the two. Even
preservation of information is to a large extent achieved by processes
that reconstruct rather than merely replicate the information to be
preserved. Reconstruction is often more efficacious than replication
because it can better handle fragmentary or degraded inputs. It is
also more parsimonious because it makes it unnecessary to register
information in full to make it available when needed.
Preservative and constructive processes, far from being mutually
exclusive, typically overlap. Preservation, and in particular the
reproduction of cultural information, can be more or less replicative and
more or less reconstructive. Why does it matter? Because replication and
reconstruction provide different explanations of stability in chains of
re-production, and therefore in culture. With replicative processes, an
error of replication at some juncture is preserved in further replications:
it becomes the model until the next copying error. If such copying
errors are very frequent (as they are in human memory, imitation and
communication except that describing them as “copying errors” is
misleading, because these are not copying processes), this compromises
both heritability and stability.
However, a reconstructive process of transmission can combine transformations
at every micro step with macro stability. Why? Because
reconstruction, unlike replication, just as it can easily depart from the
model, can also easily return to the model even if it had been modified
in earlier re-productions. This occurs for instance when in so-called
“imitation of goal,” the imitator produces an action that succeeds in
achieving a goal that the model missed. By not copying the model’s
actual actions, the so-called imitator of a goal may reconstruct a cultural
skill and become better at it than the model. More generally, constructive
processes in members of the same population may draw on the same
inputs and converge on the same outcome, that is, they may result in
the re-production of some cultural representation or practice whether
or not they were intended to achieve such re-production.
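A toy simulation can make this contrast concrete. It is only a sketch under strong simplifying assumptions that are not part of the argument above: a cultural item is reduced to a single number, every act of transmission adds Gaussian noise, and reconstruction is modeled as a pull toward a point shared by all members of the population. The names `replicate`, `reconstruct`, `attractor`, and `pull`, and all parameter values, are hypothetical.

```python
import random


def replicate(previous, noise, rng):
    # Replication: copy the previous version; any error becomes the new model.
    return previous + rng.gauss(0.0, noise)


def reconstruct(previous, noise, rng, attractor, pull):
    # Reconstruction: perceive the previous version imperfectly, then rebuild it
    # using shared background, modeled here as a pull toward `attractor`.
    observed = previous + rng.gauss(0.0, noise)
    return observed + pull * (attractor - observed)


def final_deviation(step, start, n_links):
    # Run one transmission chain and report how far the last version
    # has ended up from the starting version.
    version = start
    for _ in range(n_links):
        version = step(version)
    return abs(version - start)


def simulate(n_chains=200, n_links=200, noise=1.0, pull=0.5, attractor=0.0, seed=0):
    # Average the final deviation over many independent chains of each kind.
    # Every chain starts at the attractor, so deviation is measured from it.
    rng = random.Random(seed)
    rep_total = 0.0
    rec_total = 0.0
    for _ in range(n_chains):
        rep_total += final_deviation(
            lambda v: replicate(v, noise, rng), attractor, n_links
        )
        rec_total += final_deviation(
            lambda v: reconstruct(v, noise, rng, attractor, pull), attractor, n_links
        )
    return rep_total / n_chains, rec_total / n_chains


if __name__ == "__main__":
    rep_dev, rec_dev = simulate()
    print(f"replication:    mean final deviation = {rep_dev:.2f}")
    print(f"reconstruction: mean final deviation = {rec_dev:.2f}")
```

Under these assumptions both kinds of chain alter the item at every link, yet the replication chains drift ever further from the starting version while the reconstruction chains stay close to it. The sketch says nothing about where such convergence comes from; that is the question of cognitive and environmental factors taken up in the text.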
If I am right, a good part of the explanatory weight in the explanation
of cultural stability and evolution should move from mechanisms of
inheritance and selection to the mechanisms of construction and
reconstruction and to the cognitive and environmental factors that
cause these mechanisms to have converging outputs.

Environmental and Psychological Factors of Stability


Let me start with environmental factors: they are less controversial and
will help us put psychological factors in perspective. The effectiveness
of public productions as links in CCCCs depends on their respecting,
or, even better, taking advantage of environmental constraints. Some
of these constraints are very general and help explain cross-culturally
recurring aspects of public productions. Buildings that do not properly
respect earthly gravity tend to fall down and are unlikely to stay up
long enough to be imitated. Hence all culturally stable architectural
forms obey basic physical principles. The domestication of animals is
constrained by their biology, and so is their artificial selection. Hence,
the practice of animal husbandry exhibits strong commonalities across
cultures. Other environmental factors are more local. The slope of roofs
is influenced by local weather conditions. Llama husbandry differs from
camel husbandry not just because the cultures where they are practiced
are otherwise different but also because llamas differ from camels; and
so on.
Much of the environment that contributes to shaping human
culture is itself cultural. The process described by biologists as “niche
construction” (see Odling-Smee et al. 2003) is indeed extraordinarily
developed among humans. The evolution of cooking is made possible
by that of agriculture, the evolution of some forms of hunting by the
domestication of dogs, the evolution of furniture by that of housing, the
spread of spam by the evolution of the Net, each time novel environments
bringing together new opportunities and new constraints.
Here are a couple of simplified illustrations of the role of environmental
factors in CCCCs. When moving around and away from their settlements,
humans have typically followed footpaths. The presence of paths
in the environment was an obvious factor in stabilizing patterns of
movements and mental representations of space, but the converse is
also true. For how are footpaths themselves stabilized? By the erosion
caused by people walking the path. Walkers, knowing that a path goes
where they want to go, look for the path and follow it and thereby
contribute to others following it in turn. Imagine a footpath going first
through a sand dune, and then through a narrow pass between rocks.
In the sand dune, the path is quite unstable, and many people do not
bother to follow it at all, thereby adding to its instability. In the pass,
on the contrary, the path, borrowing a natural passage, remains stable.
Walkers’ behavior on the sand dune exhibits greater variety, but then
they converge toward the pass because of this environmental difference.
The environmental factors that affect cultural representations and
productions are themselves affected to a greater or lesser extent by
these human productions.
Consider, as a second brief illustration, the case of a standard artifact:
a pair of scissors. Different users produce, with greater or lesser dexterity
and efficiency, movements that are all tokens of the same type. The
cultural practice of using scissors is more informed by the physics of
the scissors than by imitation. Some people may have had only clumsy
people to imitate, and have become scissors virtuosos, and conversely.
Again, many of our cultural practices are stabilized by the affordances
and constraints of cultural productions.
What is true of environmental factors of cultural stability and evolution
is true also of psychological factors: they are important, they are
diverse, some are more constraining than others, and they themselves
are modified by the CCCCs in which they are involved.
Psychological factors involved in cultural evolution are partly innate,3
partly the result of cognitive development. Human beings are innately
equipped with psychological dispositions and abilities that cause them,
from birth on or when biological maturation permits, to allocate
greater cognitive resources to specific stimuli and to approach them
in different ways. For instance, infants pay more attention to speech
sounds than to other noises, and they try to extract from them certain
kinds of regularities that are different from the kind of regularities
they extract from, say, outdoor noises. For the most part, these innate
endowments are learning mechanisms. They have, that is, the function
of allowing the acquisition of information about the environment and
of further abilities and dispositions that may enrich, complement,
or even displace and overturn innate ones. At any time in cognitive
development, individuals are processing new inputs with what their
abilities and dispositions have become at that time, and not—need one
say this?—with just their innate capacities.
Specialized learning mechanisms are factors of cultural stabilization.
I have illustrated this claim in other writings (Bloch and Sperber 2002;
Sperber 1996; Sperber and Hirschfeld 2004) and so have Atran (1990,
2002), Boyer (1994, 2001), and Hirschfeld (1996) (see also Hirschfeld
and Gelman 1994). Here, I give just one illustration drawing on the
recent work of Shaun Nichols.
Nichols (2004) extends the epidemiological approach to culture by
looking at the role affect may play in stabilizing norms. Not all social
norms elicit affective reactions, but those that do are likely to be regarded,
ceteris paribus, as more important, and also to be better remembered (the
relationship between negative affect in particular and remembering
being well established). Nichols proposes that “normative prohibitions
against action X will be more likely to survive if action X elicits (or is
easily led to elicit) negative affects.” To provide empirical evidence
for this claim, Nichols looked at Erasmus’s extremely influential On
Good Manners for Boys (first published in Latin in 1530) to see which
of the hundreds of norms it contained have survived, and which have
become obsolete. There is independent evidence that the human mind
is equipped with a disposition to regard certain kinds of substances, in
particular body products such as feces or vomit (Rozin et al. 2000), as
disgusting. Nichols’s hypothesis is, more specifically, that the prohibition
of actions eliciting such “core disgust” is more likely to survive than
other manner norms (Nichols 2004).
Nichols’s hypothesis is not in contradiction with the common
anthropological observation that humans differ in what they consider
disgusting and that cultural norms in such matters do vary (e.g., belching
is seen as disgusting in some cultures and acceptable or even required in
others). The idea, rather, is that variation is less likely (or, equivalently,
stability more likely) when it comes to core disgust. Indeed, it would
take even a strong relativist a certain dose of bad faith to claim to be
equally surprised or unsurprised by the fact that one of Erasmus’s rules,
“If given a napkin, put it either over the left shoulder or the left forearm,”
is obsolete (even though napkins are not), and by the fact that another
of his rules, “Withdraw when you are about to vomit,” is still very much
in force. Less anecdotally, among the actions prohibited by Erasmus,
almost all those that elicit core disgust are still prohibited, whereas only
about a third of the actions that do not elicit core disgust still are.
Nichols’s work can be naturally expanded by distinguishing more
than innate core disgusts on the one hand and relatively variable
cultural norms concerning disgusting things on the other hand. Core
disgusts may themselves exhibit some degree of individual and cultural
flexibility, and the cultural forms of these emotions may stabilize in a
more profound manner than explicit norms of behavior. For instance,
attitudes to bodily odors, found quite disgusting in some cultures and
not in others, may get fixated before adulthood and therefore evolve
more slowly than norms about proper ways of attenuating these bodily
odors (with deodorants, perfumes, mouth sprays, etc.), regarding which
people can change attitude in their lifetime. The degree to which
an innate disposition can be modified by culture and therefore can
itself be a source of cultural variability depends on its role in cognitive
and affective development. In other terms, a richly anthropological
perspective has to be associated with a developmental perspective.

Conclusion
Holism in the social sciences starts from the correct coarse observation
that everything is connected to everything else, but alas arrives
nowhere. Methodological individualism and interactionism have in
various ways looked at social life with a magnifier, revealing details and
providing novel insights into the bigger picture. I am also advocating
using a microscope. Social life is a web of causal chains that are better
described not as individual or as supraindividual but as both infra-
and transindividual. Individual- and societal-level observable effects
are caused by the aggregation of microprocesses few of which are open
to easy observation or introspection. Half of these microprocesses are
mental. Thanks to the development of the cognitive sciences, our
understanding of infraindividual (or “subpersonal,” see Dennett 1969)
mental processes is rapidly changing and growing. In particular, it is
becoming clear, or so I argue, that, to an important extent, cognition
enables culture through domain-specific constructive mechanisms.
Mechanisms of imitation and communication, however remarkable
and important in humans, do not yield the kind of heritability that by
itself would explain cultural stability. This is why a deep understanding
of culture and its evolution is incompatible with shallow psychology.

Notes
1. I write throughout re-production rather than reproduction because I am
talking of the new production of the token of a type, whether or not it is
achieved by means of “reproduction” in the usual sense of copying.
2. A striking example of this is provided by the experiments of Van der
Henst et al. (2002). They found that 57 percent of people with digital watches
asked for the time by a stranger in the street, rather than just reading aloud
what their watch indicates (a purely preservative process) make the effort
of rounding to the nearest multiple of five the time precise to the minute
they read on their watch, thus providing a less informative but more relevant
answer.
3. Cognitive dispositions, like all phenotypic traits, are determined by the
interaction, during their development, of genetic and environmental factors.
Cognitive dispositions are “innate” to the extent that the environmental
factors needed for their development are not themselves cognitive inputs. So,
knowledge of English is certainly not innate, but the ability to learn English
or any other human language may be, even if this ability might fail to develop
because, for instance, of severe nutritional deficits. “Innate” so understood
(see Samuels 2002) does not mean determined solely by the genes—nothing
is—and does not either mean present at birth (whether development ends
intra or extra utero is irrelevant).

References
Atran, S. 1990. Cognitive foundations of natural history: Towards an
anthropology of science. Cambridge: Cambridge University Press.
——. 2002. In gods we trust: The evolutionary landscape of religion. Oxford:
Oxford University Press.
Aunger, R. (ed.). 2000. Darwinizing culture: The status of memetics as a
science. Oxford: Oxford University Press.
——. 2002. The electric meme. New York: Free Press.
Blackmore, S. J. 1999. The meme machine. Oxford: Oxford University
Press.
Bloch, M., and D. Sperber. 2002. Kinship and evolved psychological
dispositions: The Mother’s Brother controversy reconsidered. Current
Anthropology 43(4):723–748.
Boyd, R., and P. J. Richerson. 1985. Culture and the evolutionary process.
Chicago: University of Chicago Press.
Boyer, P. 1994. The naturalness of religious ideas: A cognitive theory of
religion. Berkeley: University of California Press.
——. 2001. Religion explained: The evolutionary origins of religious thought.
New York: Basic Books.
Cavalli-Sforza, L. L., and M. W. Feldman. 1981. Cultural transmission and
evolution: A quantitative approach. Monographs in Population Biology.
Princeton: Princeton University Press.
Dawkins, R. 1976. The selfish gene. Oxford: Oxford University Press.
Dennett, D. 1969. Content and consciousness. London: Routledge and
Kegan Paul.
Durham, W. H. 1991. Coevolution: Genes, culture, and human diversity.
Stanford: Stanford University Press.
Erasmus, D. 1530. On Good Manners for Boys. In Collected Works of
Erasmus, vol. 25, edited by J. Sowards; translated by B. McGregor,
269–289. Toronto: University of Toronto Press.
Hirschfeld, L. A. 1996. Race in the making: Cognition, culture, and the
child’s construction of human kinds. Cambridge, MA: MIT Press.
Hirschfeld, L. A., and S. A. Gelman (eds.). 1994. Mapping the mind:
Domain specificity in cognition and culture. Cambridge: Cambridge
University Press.
Hurley, S., and N. Chater (eds.). 2005. Perspectives on imitation. Cambridge,
MA: Bradford Books.
Hutchins, E. 1995. Cognition in the wild. Cambridge, MA: MIT Press.
Mesoudi, A., A. Whiten, and K. N. Laland. 2004. Is human cultural
evolution Darwinian? Evidence reviewed from the perspective of
“The Origin of Species.” Evolution 58(1):1–11.
Nichols, S. 2004. Sentimental rules: On the natural foundations of moral
judgement. Oxford: Oxford University Press.
Odling-Smee, F. J., K. N. Laland, and M. W. Feldman. 2003. Niche
construction: The neglected process in evolution. Monographs in
Population Biology 37. Princeton: Princeton University Press.
Richerson, P. J., and R. Boyd. 2005. Not by genes alone: How culture
transformed human evolution. Chicago: University of Chicago Press.
Rizzolatti, G., L. Fadiga, V. Gallese, and L. Fogassi. 1996. Premotor
cortex and the recognition of motor actions. Cognitive Brain Research
3:131–141.
Rozin, P., J. Haidt, and C. McCauley. 2000. Disgust. In Handbook of
emotions, 2nd edition, edited by M. Lewis and J. Haviland-Jones,
637–653. New York: Guilford Press.
Samuels, R. 2002. Nativism in cognitive science. Mind & Language
17(3):233–265.
Shatz, M., D. Behrend, S. A. Gelman, and K. S. Ebeling. 1996. Colour
term knowledge in two-year-olds: Evidence for early competence.
Journal of Child Language 23:177–199.
Sperber, D. 1996. Explaining culture: A naturalistic approach. Oxford:
Blackwell.
——. 1999. Conceptual tools for a natural science of society and culture
(Radcliffe-Brown Lecture in Social Anthropology 1999). Proceedings
of the British Academy (2001) 111:297–317.
Sperber, D., and L. Hirschfeld. 2004. The cognitive foundations of cultural
stability and diversity. Trends in Cognitive Sciences 8(1):40–46.
Sperber, D., and D. Wilson. 1995. Relevance: Communication and cognition,
2nd edition. Oxford: Blackwell.
Tomasello, M., and M. Carpenter. 2005. Intention reading and imitative
learning. In Perspectives on imitation: From neuroscience to social science:
Vol. 2. Imitation, human development, and culture, edited by S. Hurley
and N. Chater, 133–148. Cambridge, MA: MIT Press.
Tooby, J., and L. Cosmides. 1992. The psychological foundations of
culture. In The adapted mind: Evolutionary psychology and the generation
of culture, edited by J. Barkow, L. Cosmides, and J. Tooby, 19–136. New
York: Oxford University Press.
Van der Henst, J.-B., L. Carles, and D. Sperber. 2002. Truthfulness and
relevance in telling the time. Mind & Language 17:457–466.
Part 5

Evolutionary Perspectives
seventeen

Culture and the Evolution of the Human Social Instincts
R. Boyd and P. J. Richerson

Human societies are extraordinarily cooperative compared with those
of most other animals. In the vast majority of species, individuals
live solitary lives, meeting only to mate and, sometimes, raise their
young. In social species, cooperation is limited to relatives and (maybe)
small groups of reciprocators. After a brief period of maternal support,
individuals acquire virtually all of the food that they eat. There is little
division of labor, no trade, and no large-scale conflict. Communication
is limited to a small repertoire of self-verifying signals. No one cares for
the sick, or feeds the hungry or disabled. The strong take from the weak
without fear of sanctions by third parties. Amend Hobbes to account
for nepotism, and his picture of the state of nature is not so far off for
most other animals. In contrast, people in even the simplest human
societies regularly cooperate with many unrelated individuals. Human
language allows low-cost honest communication of virtually unlimited
complexity. The sick are cared for, and sharing leads to substantial flows
of food from the middle aged to the young and old. Division of labor
and trade are prominent features of every historically known human
society, and archaeology indicates that they have a long history. Violent
conflict among sizable groups is common. In every human society,
social life is regulated by commonly held moral systems that specify
the rights and duties of individuals enforced, albeit imperfectly, by
third party sanctions.
Thus, we have an evolutionary puzzle. Doubtless, the societies
of our Plio-Pleistocene hominin ancestors were much like those of
other primates, small, without much division of labor or cooperation.
Sometime over the last five million years, important changes occurred
in human psychology that gave rise to larger, more cooperative societies.
Given the magnitude and complexity of the changes, they were probably
the product of natural selection. However, the standard theory of the
evolution of social behavior is consistent with Hobbes, not observed
human behavior. Apes fit the bill, not humans.
Something makes our species different, and in this chapter we argue
that something is cultural adaptation. Over the last million years or
so, humans evolved the ability to learn from other humans, creating
the possibility of cumulative, nongenetic evolution. These capacities
were strongly beneficial in the chaotic climates of the Pleistocene,
allowing humans to culturally evolve highly refined adaptations to
rapidly varying environments. However, cultural adaptation also vastly
increased heritable variation among groups, and this gave rise to the
evolution of group beneficial cultural norms and values. Then, in such
culturally evolved cooperative social environments, genetic evolution
created new, more prosocial motives.
We begin by reviewing the evolutionary theory of social behavior,
explaining why natural selection does not normally favor large-scale
cooperation. Then, we argue that cumulative cultural adaptation
generates between-group variation, which potentiates the evolution
of cooperation. Next, we suggest that such changes would lead to the
evolution of genetically transmitted social instincts favoring tribal scale
cooperation, and summarize some of the evidence consistent with this
hypothesis. Finally, we briefly discuss how these ideas relate to the
theme of this volume, the nature of everyday human interactions.

Cooperation is Defined as Costly, Group-beneficial Behavior
In this chapter, we use the word cooperation to mean costly behavior
performed by one individual that increases the payoff of others. This
usage is typical in game theory, and common, but by no means universal,
in evolutionary biology. It contrasts with ordinary usage in which
cooperation refers to any coordinated, mutually beneficial behavior. It
is important to distinguish between cooperation in the narrow, technical
sense used here and other forms of cooperation because they have very
different evolutionary properties.
To see why, consider a game called the “stag hunt” (see Fig. 17.1), so
named because it is thought to capture the state of nature as described
by Rousseau in his Discourse on Inequality. Assume there is a population in
which pairs of individuals have two options: They can hunt for “a stag”
or for “hare.” Hunting hare is a solitary activity and an individual who
chooses to hunt hare gets a small payoff, h, no matter what the other
individual does. Stag hunting, however, requires coordinated action.
If both players hunt for the stag, they usually succeed and each gets a
large payoff, s. However, a single individual hunting stag always fails
and gets a payoff of 0 (see Table 17.1).

Figure 17.1. Suppose that there were a population of people who were paired
at random and played the stag hunt. The average payoff of each strategy is plotted
as a function of the fraction of players who choose to hunt stag. Assuming
that strategies with higher payoffs increase in frequency, there are two stable
equilibria: everybody chooses stag or everybody chooses hare. The average
payoff of the whole population is maximized at the all stag equilibrium. However,
unless stag hunting has a much larger payoff than hunting hares (2h < s), the
basin of attraction of the stag equilibrium is smaller than that of the lower payoff
hare equilibrium.

Table 17.1. The Stag Hunt. In Rousseau’s parable, hunters can either
hunt stag or hare. Hunting together does not affect the success of hare
hunters; they always get a small payoff, h. If they hunt stag together
they are likely to succeed and achieve a high payoff s, but a single stag
hunter fails and receives a payoff of zero.

                Right
                Stag    Hare
Left    Stag    s, s    0, h
        Hare    h, 0    h, h

The best thing for the population is if everybody hunts stags, so
stag hunting is “cooperative” in the sense of a mutually beneficial
activity. However, it is not cooperative in the technical sense because
individuals do not experience a cost to provide a benefit. When most
of the population hunts stag, switching to hunting hare lowers an
individual’s payoff, and therefore once it is common, stag hunting
is not costly; it is individually beneficial. Assuming that strategies
with a higher payoff spread (because of natural selection if they are
genetically transmitted, or because successful behaviors are imitated
if they are culturally transmitted), it follows that both behaviors
are evolutionarily stable, meaning that once common they can resist
rare invaders. In the jargon of game theory, the stag hunt is a game of
coordination because players do better if they coordinate their behavior
with the behavior of others.
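The stag hunt payoffs described above can be sketched numerically. The following is a minimal illustration, not the authors' own model: it assumes that two stag hunters always succeed (the text says "usually"), and the function names, the update rule in `evolve`, and the values s = 3, h = 1 are hypothetical choices made only for this example.

```python
def stag_hunt_payoffs(p_stag, s, h):
    """Average payoffs when a fraction p_stag of the population hunts stag."""
    stag = p_stag * s   # a stag hunter earns s only when paired with another stag hunter
    hare = h            # a hare hunter earns h regardless of the partner
    return stag, hare


def stag_threshold(s, h):
    """Fraction of stag hunters above which hunting stag pays better than hare."""
    return h / s


def evolve(p_stag, s, h, steps=1000, rate=0.1):
    """Iterate a simple payoff-driven update: the higher-payoff strategy spreads.

    The update rule is an assumption chosen for illustration only.
    """
    for _ in range(steps):
        stag, hare = stag_hunt_payoffs(p_stag, s, h)
        p_stag += rate * p_stag * (1 - p_stag) * (stag - hare)
    return p_stag


if __name__ == "__main__":
    s, h = 3.0, 1.0
    print(stag_threshold(s, h))   # boundary between the two basins of attraction
    print(evolve(0.6, s, h))      # starts above the boundary, moves toward all stag
    print(evolve(0.2, s, h))      # starts below the boundary, moves toward all hare
```

Under these assumptions the boundary between the two basins of attraction sits at h/s, so the stag basin is the larger one only when s exceeds 2h.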
Now contrast the stag hunt with its more famous cousin, the prisoner’s
dilemma (see Table 17.2). Once again consider a population of players
who interact in pairs. Each individual has an opportunity to help his
partner. If he does, the partner’s payoff is increased by an amount b but the
helper’s payoff is decreased by an amount c—this is clearly cooperative in
the narrow sense. As long as helping provides more benefit than it costs
(b > c), everybody is better off if everybody helps. However, unlike the
stag hunt, the group beneficial behavior is not evolutionarily stable. As
shown in Fig. 17.2, nonhelpers (conventionally labeled “defectors”) have
a higher payoff no matter what the frequency of helpers (conventionally
called cooperators). This means that defectors always increase, and,
even though everyone is better off if everyone cooperates, cooperation
cannot evolve.
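The argument can be checked with a few lines of arithmetic. This is a minimal sketch assuming random pairing; the function names are hypothetical, and the values b = 2 and c = 1 are taken from the caption of Table 17.2.

```python
def pd_payoffs(p_coop, b, c):
    """Average payoffs when a fraction p_coop of the population helps."""
    cooperator = p_coop * b - c   # receives b from a helping partner, always pays c
    defector = p_coop * b         # receives b from a helping partner, pays nothing
    return cooperator, defector


def population_average(p_coop, b, c):
    """Average payoff over the whole population."""
    cooperator, defector = pd_payoffs(p_coop, b, c)
    return p_coop * cooperator + (1 - p_coop) * defector


if __name__ == "__main__":
    b, c = 2.0, 1.0
    for p in (0.0, 0.25, 0.5, 1.0):
        coop, defect = pd_payoffs(p, b, c)
        # The defector's advantage equals c at every frequency of cooperators
        print(p, coop, defect, population_average(p, b, c))
```

Defectors earn exactly c more than cooperators at every frequency, while the population average, p(b - c), rises with the fraction of cooperators. That gap is the source of the dilemma.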

Table 17.2. The Prisoner’s Dilemma. Each individual has the opportunity
to cooperate by helping the other individual. Helping increases
the payoff of the receiver by 2 units and costs the helper 1 unit.

                    Right
                    Cooperate   Defect
Left    Cooperate   1, 1        -c, b
        Defect      b, -c       0, 0

Figure 17.2. Suppose that there were a population of people who were paired
at random and played the prisoner’s dilemma. The average payoff of each strategy
is plotted as a function of the fraction of players who choose to cooperate.
Now there is only one stable equilibrium, everybody defects, at which the average
payoff of the whole population is minimized. The payoff maximizing equilibrium,
everybody cooperates, is unstable because defectors have a higher payoff than
cooperators.

The Potential for Cooperation is Everywhere in Nature
Opportunities for cooperation are omnipresent in social life. Exchange
and division of labor increase the efficiency of productive processes for
all the reasons given by Adam Smith in The Wealth of Nations. However,
participating in exchange typically requires cooperation. In all but the
simplest transactions, individuals experience a cost now in return for
a benefit later and are vulnerable to defectors who take the benefit but
do not produce the return. Exchange and division of labor also are
typically characterized by imperfect monitoring of effort and quality
that give rise to opportunities for free riding. The potential for conflict
over land, food, and other resources is everywhere. In such conflicts
larger more cooperative groups defeat smaller less cooperative groups.
However, each warrior’s sacrifice benefits everyone in the group whether
or not they too went to war and thus defectors reap the fruits of victory
without risking their skins. Honest, low-cost communication provides
many benefits—coordination is greatly facilitated, resources can be
used more efficiently, hazards avoided; the list is long. However, once
individuals come to rely on the signals of others, the door is open for
liars, flim-flam artists, and all the rest. Widely held stable moral systems
enforced by stern sanctions can solve most of these problems; cheats,
cowards, and liars can be punished. The problem is that punishment is
typically costly, and defectors can reap the benefits of the moral order
without paying the costs of punishment.
However, aside from humans, only a few other taxa, most notably
social insects, cooperate very much. Interestingly, those that do are,
like humans, spectacular evolutionary successes. It has been estimated,
for example, that termites account for half of the animal biomass in
the tropics. So, if cooperation produces such spectacular benefits, why
is it so rare?

The Genetic Evolution of Cooperation Requires Assortment
The answer is simple: cooperation benefits groups (sometimes large,
sometimes small), and as we have seen, group benefits are (usually)
irrelevant to the course of organic evolution. Selection usually favors traits
that increase the reproductive success of individuals, or sometimes
individual genes, and when there is a conflict between what is good for
the individual and what is good for the group, selection usually leads
to the evolution of the trait that benefits the individual.
Selection favors costly group beneficial behavior only if the benefits
flow disproportionately to individuals who are genetically similar to the
actor who performs the behavior. To see why, suppose that groups are
formed at random. Then each prosocial act has the same average effect
on the fitness of helpers and egoists. This means that prosocial behavior
has no effect on the relative fitness of helpers and selfish types, so there
will be no change in the frequency of these two types in the population.
The group benefits of the trait are irrelevant to its evolution. At the
same time, it is important to see that the costs of performing prosocial
behavior solely fall on helpers, and thus decrease their fitness relative
to egoists. Thus, the group beneficial behaviors do not evolve. Now
suppose instead that groups are made up of close relatives. Selection can
favor the genes that give rise to prosocial behavior because the benefits
of prosocial acts are nonrandomly directed toward others who carry
the same genes. Thus, the benefits of the act raise the average fitness
of the genes leading to the prosocial behavior, and if this effect is big
enough to compensate for the cost, selection will lead to the evolution
of the behavior.
This simple example illustrates a fundamental evolutionary principle:
costly group beneficial behavior cannot evolve unless the benefits
of group beneficial behavior flow nonrandomly to individuals who
carry the genes that give rise to the behavior. Altruism toward kin
can be favored by selection because kin are similar genetically. W. D.
Hamilton (1964) worked out the basic calculus of kin selection in 1964
and deduced many of its most important effects on social evolution.
Full siblings can count on sharing half of their genes through common
descent, and can therefore afford to help a sibling reproduce so long as
the fitness payoffs are more than twice the costs. More distant relatives require a
higher benefit cost ratio.1 This principle, often called Hamilton’s rule,
successfully explains a vast range of behavior and morphology in a
very wide range of organisms (e.g., Keller and Chapuisat 1999; Queller
1989; Queller and Strassmann 1998).
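The role of assortment can be made concrete with a small sketch. It rests on a simplifying assumption that is not in the text: with probability r the recipient of help has the same type as the actor, and otherwise the recipient is drawn at random from a population in which a fraction p are helpers. All function names are hypothetical. The relatedness of 0.5 for full siblings comes from the passage above; 0.125 for first cousins is the standard textbook value and is used here only as an illustration.

```python
def helper_advantage(p, r, b, c):
    """Helper fitness minus egoist fitness under a simple assortment model.

    Assumption: with probability r the partner shares the actor's type,
    otherwise the partner is a random draw from the population.
    """
    p_helped_if_helper = r + (1 - r) * p
    p_helped_if_egoist = (1 - r) * p
    helper = b * p_helped_if_helper - c   # helpers also pay the cost c
    egoist = b * p_helped_if_egoist
    return helper - egoist                # simplifies to r * b - c


def hamilton_favors(r, b, c):
    """Hamilton's rule: helping is favored when r * b exceeds c."""
    return r * b > c


def min_benefit_cost_ratio(r):
    """Benefit-cost ratio that must be exceeded for helping to be favored."""
    return 1 / r


if __name__ == "__main__":
    print(helper_advantage(p=0.3, r=0.0, b=3.0, c=1.0))   # random groups: helpers lose by c
    print(helper_advantage(p=0.3, r=0.5, b=3.0, c=1.0))   # sibling-level assortment: helpers gain
    print(min_benefit_cost_ratio(0.5), min_benefit_cost_ratio(0.125))
```

With random grouping (r = 0) the benefit term cancels and helpers are worse off by exactly c, which matches the verbal argument; with assortment the advantage becomes rb - c.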

Selection can Favor Cooperation Among Small Groups of Reciprocators
When animals interact repeatedly, past behavior also provides a cue
that allows nonrandom social interaction. To see why, suppose that
animals live in social groups and the same pair of individuals interacts
repeatedly. During each interaction one member of the pair has the
opportunity to help the other, at some cost to itself. Suppose that there
are two types: defectors who do not help and reciprocators who use
the strategy: help on the first interaction. After that, help your partners
as long as they keep helping you, but if they do not help, do not help
them any more. Initially, partners are chosen at random so that during
the first interaction reciprocators are no more likely to be helped than
defectors. However, after the first interaction, only reciprocators receive
any help, and if interactions continue long enough, the high fitness
of reciprocators in such pairings will be enough to cause the average
fitness of reciprocators to exceed that of defectors.
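The reciprocator strategy just described can be played out explicitly. The sketch below makes assumptions beyond the text: both partners get a chance to help in every interaction (the text gives the opportunity to one member at a time), partners are paired at random with reciprocators at frequency q, and the payoff values are arbitrary. The function names are hypothetical.

```python
def play_pair(type1, type2, rounds, benefit, cost):
    """Total payoffs for two partners over repeated interactions."""
    types = (type1, type2)
    payoff = [0.0, 0.0]
    always_helped = [True, True]   # before the first interaction nobody has refused
    for _ in range(rounds):
        # A reciprocator helps only while its partner has never refused to help
        helps = [types[i] == "reciprocator" and always_helped[1 - i] for i in range(2)]
        for i in range(2):
            if helps[i]:
                payoff[i] -= cost
                payoff[1 - i] += benefit
        always_helped = [always_helped[i] and helps[i] for i in range(2)]
    return tuple(payoff)


def mean_fitness(q, rounds, benefit, cost):
    """Average payoffs of reciprocators and defectors under random pairing."""
    rr, _ = play_pair("reciprocator", "reciprocator", rounds, benefit, cost)
    rd, dr = play_pair("reciprocator", "defector", rounds, benefit, cost)
    w_reciprocator = q * rr + (1 - q) * rd
    w_defector = q * dr            # two defectors never help each other
    return w_reciprocator, w_defector


if __name__ == "__main__":
    print(mean_fitness(q=0.5, rounds=1, benefit=2.0, cost=1.0))    # one-shot: defectors do better
    print(mean_fitness(q=0.5, rounds=10, benefit=2.0, cost=1.0))   # repeated: reciprocators do better
```

In this sketch a reciprocator is exploited only once by a defector, so its loss is bounded while the gains from reciprocator pairs grow with the number of interactions.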
Beyond this basic story, there is little agreement among scientists
about how reciprocity works. The contrast with kin selection theory is
instructive. The simple principle embodied by Hamilton’s rule allows
biologists to explain a wide range of phenomena. Despite much work,
evolutionary theorists have not managed to derive any widely applicable
general principles describing the evolution of reciprocity. Worse, there
is little evidence that reciprocity is important in nature. There are only
a handful of studies that provide any evidence for reciprocity, and none
of them are definitive (Hammerstein 2003).

Reciprocity in Large Groups is Unlikely to Evolve


Despite its many problems, theoretical work does make one fairly clear
prediction that is relevant here: reciprocity can support cooperation in
small groups, but not in larger ones (Axelrod and Dion 1988; Boyd and
Richerson 1988; Nowak and Sigmund 1998). Instead of assuming that
individuals interact in pairs, suppose that individuals live in groups,
and each helping act benefits all group members. For example, the
helping behavior could be an alarm cry that warns group members of
an approaching predator, but makes the callers conspicuous and thereby
increases their risk of being eaten. Suppose there is a defector in the
group who never calls. If reciprocators use the rule, only cooperate if
all others cooperate, this defector induces other reciprocators to stop
cooperating. These defections induce still more defections. Innocent
cooperators suffer as much as guilty defectors when the only recourse
to defection is to stop cooperating. However, if reciprocators tolerate
defectors, then defectors can benefit in the long run.
Some authors have emphasized that punishment takes other forms—
noncooperators are punished by reduced status, fewer friends, and fewer
mating opportunities (e.g., Binmore 1994). Following Trivers (1971) we
will call this “moralistic punishment.” Although moralistic punishment
and reciprocity are often lumped together, they have very different
evolutionary properties. Moralistic punishment is more effective in
supporting large-scale cooperation than reciprocity for two reasons.
First, punishment can be targeted so that only defectors are affected.
This means that defectors can be penalized without generating the
cascade of defection that follows when reciprocators refuse to cooperate
with defectors. Second, with reciprocity, the severity of the sanction is
limited by the effect of a single individual’s cooperation on each other
group member, an effect that becomes small as group size increases.
Moralistic sanctions can be much more costly to defectors, making it
possible for cooperators to induce others to cooperate in large groups
even when they are rare. Cowards, deserters, and cheaters may be
attacked by their erstwhile compatriots and shunned by their society,
made the targets of gossip, or denied access to territories or mates. Thus,
moralistic punishment provides a much more plausible mechanism for
the maintenance of large-scale cooperation than reciprocity.
There are two problems with moralistic punishment that remain
to be explained: First, why should individuals punish? If punishing is
costly and the benefits of cooperation flow to the group as a whole,
administering punishment is a costly group beneficial act, and therefore,
selfish individuals will cooperate but not punish. Second, moralistic
punishment can stabilize any arbitrary behavior—wearing a tie, being
kind to animals, or eating the brains of dead relatives. It does not matter
whether or not the behavior produces group benefits. All that matters is
that, when moralistic punishers are common, being punished is more
costly than performing the sanctioned behavior, whatever it might be.
When any behavior can persist at a stable equilibrium, then the fact
that cooperation is a stable equilibrium does not tell us whether it is a
likely outcome or not (Boyd and Richerson 1992).
Although much of the debate about moralistic punishment has
focused on the first problem, we think the second presents a much bigger
obstacle to the evolution of cooperation in large groups. Explaining the
persistence of moralistic punishment is much easier than explaining
why moralistic punishment would be used to maintain cooperation
rather than some other form of behavior. If moralistic punishment
is common, and punishments sufficiently severe, then cooperating
will pay. As a result, most people may go through life without having
to punish very much. This in turn means that on average having a
predisposition to punish may be cheap compared with a disposition to
cooperate (in the absence of punishment). This means that relatively
weak evolutionary forces can maintain a moralistic predisposition, and
then punishment can maintain group beneficial behavior. However,
getting around the second problem is more difficult. If evolutionary
change is driven only by individual costs and benefits, then moralistic
punishment can stabilize cooperation, but it can stabilize anything
else too. Because cooperative behaviors are a tiny subset of all possible
behaviors, punishment does not explain why large-scale cooperation
is so widely observed. In other words, moralistic punishment may be
necessary to sustain large-scale cooperation, but it is not sufficient to
explain why large-scale cooperation evolves in the first place.

Selection Among Large, Partially Isolated Groups is not Effective
Group selection may be the number one hot button topic among
evolutionary biologists, and as with many heated controversies it is
more about how to use words than about what the world is like. The
controversy began in the early 1960s when V. C. Wynne-Edwards,
a British bird biologist, published a book that explained a number
of interesting bird behaviors in terms of the benefit to the group (Wynne-Edwards
1962). Although this kind of explanation was common in
those days, Wynne-Edwards was much clearer than his contemporaries
about the process that gave rise to such group level adaptations. Groups
that had the display survived and prospered, whereas those that did
not overexploited their food supply and perished. The book generated
a storm of controversy, with biological luminaries such as George
Williams (1966) and John Maynard Smith (1964) penning critiques
explaining why this mechanism, then called group selection, could
not work. At the same time Hamilton’s newly minted theory of kin
selection provided an alternative explanation for cooperation. The result
was the beginning of an ongoing and highly successful revolution in
our understanding of the evolution of animal behavior, a revolution
that is rooted in carefully thinking about the individual and nepotistic
function of behaviors.
In the early 1970s, a retired engineer named George Price (1970,
1972) published two articles that presented a new way to think about
evolution. Up until that time, most evolutionary theory kept track
of the average fitness of alternative genes (just as we did above in
explaining kin selection and reciprocity). Price argued that it was also
fruitful to think about selection going on in a series of nested levels:
among genes within an individual, among individuals within groups,
and among groups, and he discovered a very powerful mathematical
formalism for describing these processes. Using Price’s method, kin
selection is conceptualized as occurring at two levels: selection within
family groups favors defectors because defectors always do better than
other individuals within their own group. Selection among family
groups favors groups with more helpers because each helper increases
the average fitness of the group. The outcome depends on the relative
amount of variation within and between groups. If group members are
closely related, most of the variation will occur between groups. Price’s
multilevel selection approach and the older gene-centered approaches
are mathematically equivalent, and if you do your sums properly, you
will come up with the same answer either way.2
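The two-level bookkeeping can be illustrated numerically. The sketch below is an assumption-laden toy, not Price's derivation: groups have equal size, each individual is a helper (1) or a defector (0), every helper produces a benefit shared equally by the whole group including itself, and the parameter values are arbitrary. It splits the covariance between fitness and helping, the quantity that drives selection in Price's formalism, into a between-group part and an average within-group part. All names are hypothetical.

```python
from statistics import fmean


def cov(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def fitnesses(group, base, benefit, cost):
    """Fitness of each member; helpers (1) pay cost, everyone shares the benefit."""
    share = benefit * sum(group) / len(group)
    return [base + share - cost * z for z in group]


def price_partition(groups, base=1.0, benefit=2.0, cost=1.0):
    """Return (total, between-group, within-group) covariance of fitness and helping."""
    w = [fitnesses(g, base, benefit, cost) for g in groups]
    all_z = [z for g in groups for z in g]
    all_w = [x for ws in w for x in ws]
    total = cov(all_w, all_z)
    between = cov([fmean(ws) for ws in w], [fmean(g) for g in groups])
    within = fmean([cov(ws, g) for ws, g in zip(w, groups)])
    return total, between, within


if __name__ == "__main__":
    print(price_partition([[1, 1, 1, 1], [0, 0, 0, 0]]))   # all variation between groups
    print(price_partition([[1, 0, 1, 0], [1, 0, 1, 0]]))   # all variation within groups
    print(price_partition([[1, 1, 1, 0], [1, 0, 0, 0]]))   # a mixture of both
```

In each case the total equals the sum of the two parts. The within-group term is never positive (defectors do better inside any group) and the between-group term is never negative (groups with more helpers do better), so the sign of the total depends on how the variation is distributed.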
The multilevel selection approach has led to a renaissance in group
selection in recent years, and this has led to new wrangling between
those who thought that they had killed group selection, and those who,
thinking in multilevel terms, see nothing wrong with it (e.g., Sober and
Wilson 1998). This argument is mainly about what kinds of evolutionary
processes should be called “group selection.” Some people use group
selection to mean the process that Wynne-Edwards envisioned—selection
between large groups made up of mostly genetically unrelated
individuals, whereas others use group selection to refer to selection
involving any kind of group in a multilevel selection analysis, including
groups made up of close kin.
The real scientific question is what kinds of population structure
can produce enough variation between groups so that selection at
that level can have an important effect? The answer to this question
is fairly straightforward. Selection between large groups of unrelated
individuals is not usually an important force in organic evolution. Even
very small amounts of migration are sufficient to reduce the genetic
variation between groups to such a low level that group selection is not
important (Aoki 1982; Rogers 1990). However, as we will see below, the
same conclusion does not hold for cultural variation.
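
A standard textbook approximation, Wright's infinite-island model for neutral genetic variants, gives a feel for how quickly migration erodes genetic differences between groups. It is offered here only as an illustration: it is not the model analyzed by Aoki (1982) or Rogers (1990), and the function name and example numbers are our assumptions. In that model the equilibrium share of genetic variation that lies between groups is roughly 1/(1 + 4Nm), where N is group size and m is the fraction of each group replaced by migrants every generation.

```python
def between_group_share(group_size, migration_rate):
    """Approximate equilibrium share of variation between groups (F_ST).

    Uses Wright's island-model approximation 1 / (1 + 4 * N * m) for
    neutral variants in diploid populations; an illustrative assumption,
    not the analysis cited in the text.
    """
    return 1.0 / (1.0 + 4.0 * group_size * migration_rate)


# Hypothetical example: 1 percent migration per generation, groups of varying size.
for n in (10, 100, 1000):
    print(n, round(between_group_share(n, 0.01), 3))
```

Under this approximation, groups of 1,000 exchanging 1 percent of their members per generation retain only about 2 percent of the variation between groups, which is why selection among large groups has so little genetic variation to act on.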

Among Primates, Cooperation is Limited to Small Groups
The punch line is that evolutionary theory predicts that cooperation
in primates and other species that have small families will be limited
to small groups. Kin selection results in large-scale social systems only
when there are large numbers of closely related individuals. The social
insects, where a few females produce a mass of sterile workers, and
multicellular invertebrates are examples of such exceptions. Primate
societies are nepotistic, but cooperation is mainly restricted to relatively
small kin groups. Theory suggests that reciprocity can be effective in
small groups, but not in larger ones. Reciprocity may play some role
in nature (although many experts are unconvinced), but there is no
evidence that reciprocity has played a role in the evolution of large-scale
sociality. All would be well if humans did not exist, because human
societies, even those of hunter-gatherers, are based on groups of people
linked together into much larger highly cooperative social systems.
Rapid Cultural Adaptation Potentiates Group
Selection
So why are human societies not very small in scale, like those of other
primates? For us, the most likely explanation is that rapid cultural
adaptation led to a huge increase in the amount of behavioral variation
among groups. In other primate species, there is little heritable variation
among groups because natural selection is weak compared with migration.
This is why group selection at the level of whole primate groups is not
an important evolutionary force. In contrast, there is a great deal of
behavioral variation among human groups. Such variation is the reason
why we have culture—to allow different groups to accumulate different
adaptations to a wide range of environments.
In the Origin of Species, Darwin famously argued that three conditions
are necessary for adaptation by natural selection: First, there must be a
“struggle for existence” so that not all individuals survive and reproduce.
Second, there must be variation so that some types are more likely to
survive and reproduce than others, and finally, variation must be heritable
so that the offspring of survivors resemble their parents. Although
Darwin usually focused on individuals,3 the same three postulates apply
to any reproducing entity—molecules, genes, and cultural groups. Only
the first two conditions are satisfied by most other kinds of animal
groups. For example, vervet monkey groups compete with one another,
and groups vary in their ability to survive and grow, but, and this is
the big but, the causes of group-level variation in competitive ability
are not heritable, so there is no cumulative adaptation. Once rapid
cultural adaptation in human societies gave rise to stable, between-group
differences, the stage was set for a variety of selective processes
to generate adaptations at the group level.
The simplest mechanism is intergroup competition. The spread of the
Nuer at the expense of the Dinka in 19th-century Sudan provides a
good example. During the 19th century each consisted of a number of
politically independent groups. Cultural differences in norms between
the two groups meant that the Nuer were able to cooperate in larger
groups than the Dinka. The Nuer, who were driven by the desire for more
grazing land, attacked and defeated their Dinka neighbors, occupied
their territories, and assimilated tens of thousands of Dinka into their
communities. This example illustrates the requirements for cultural
group selection by intergroup competition. Contrary to some critics
(e.g., Palmer et al. 1997), there is no need for groups to be strongly
bounded, individual-like entities. The only requirement is that there
are persistent cultural differences between groups, and these differences
must affect the group’s competitive ability. Losing groups must be
replaced by the winning groups. Interestingly, the losers do not have
to be killed. The members of losing groups just have to disperse or to
be assimilated into the victorious group. Losers will be socialized by
conformity or punishment, so even very high rates of physical migration
need not result in the erosion of cultural differences. This kind of group
selection can be a potent force even if groups are usually very large.
Group competition is common in small-scale societies. The best
data come from New Guinea, which provides the only large sample
of simple societies studied by professional anthropologists before
they experienced major changes because of contact with Europeans.
Joseph Soltis assembled data from the reports of early ethnographers
in New Guinea (Soltis et al. 1995). Many studies report appreciable
intergroup conflict and about half mention cases of social extinction
of local groups. Five studies contained enough information to estimate
the rates of extinction of neighboring groups (see Table 17.3). The
typical pattern is for groups to be weakened over a period of time
by conflict with neighbors and finally to suffer a sharp defeat. When
enough members become convinced of the group’s vulnerability to
further attack, members take shelter with friends and relatives in other
groups, and the group becomes socially extinct. At these rates of group
extinction, it would take between twenty and forty generations, or 500
to 1,000 years, for an innovation to spread from one group to most of
the other local groups by cultural group selection.
These results imply that cultural group selection is a relatively slow
process. But then, so are the actual rates of increase in political and social
sophistication we observe in the historical and archaeological records.

Table 17.3. Extinction rates for cultural groups from five regions in
New Guinea, from Soltis et al. 1995

Region          Number of   Number of social   Number of   % groups extinct
                groups      extinctions        years       every 25 years
Mae Enga        14          5                  50          17.9%
Maring          13          1                  25          7.7%
Mendi           9           3                  50          16.6%
Fore/Usurufa    8-24        1                  10          31.2%-10.4%
Tor             26          4                  40          9.6%

New Guinea societies were no doubt actively evolving systems (Wiessner
and Tumu 1998), yet the net increase in their social complexity over
those of their Pleistocene ancestors was modest. Change in the cultural
traditions that eventually led to large-scale social systems like the ones
that we live in proceeded at a modest rate. The relatively slow rate of
evolution by cultural group selection may explain the 5,000-year lag
between the beginnings of agriculture and the first primitive city-states,
and the five millennia that transpired between the origins of simple
states and modern complex societies.
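
The last column of Table 17.3 is simple arithmetic: the fraction of groups that went extinct, rescaled from the observation period to a 25-year generation. The sketch below recomputes it from the other columns. The data are the table's; the function and variable names are ours, the rescaling assumes a constant extinction rate over the observation period, and the two bounds on the number of Fore/Usurufa groups are treated as separate rows.

```python
# (region, number of groups, social extinctions, years observed), from Table 17.3
TABLE_17_3 = [
    ("Mae Enga", 14, 5, 50),
    ("Maring", 13, 1, 25),
    ("Mendi", 9, 3, 50),
    ("Fore/Usurufa (8 groups)", 8, 1, 10),
    ("Fore/Usurufa (24 groups)", 24, 1, 10),
    ("Tor", 26, 4, 40),
]


def percent_extinct_per_generation(groups, extinctions, years, generation=25):
    """Percentage of groups going extinct per generation, assuming a constant rate."""
    return 100.0 * (extinctions / groups) * (generation / years)


rates = {
    region: percent_extinct_per_generation(groups, extinctions, years)
    for region, groups, extinctions, years in TABLE_17_3
}
for region, rate in rates.items():
    print(f"{region}: {rate:.1f}%")
```

The recomputed values agree with the published column to within rounding; Mendi comes out at 16.7 rather than 16.6 percent.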
A propensity to imitate the successful can also lead to the spread
of group beneficial variants. People often know about the norms that
regulate behavior in neighboring groups. They know that we can marry
our cousins here, but over there they cannot; or anyone is free to pick
fruit here, although individuals own fruit trees there. Suppose different
norms are common in neighboring groups, and that one set of norms
causes people to be more successful. Both theory and empirical evidence
suggest that people have a strong tendency to imitate the successful
(Henrich and Gil-White 2001). Consequently, behaviors can spread
from groups at high payoff equilibria to neighboring groups at lower
payoff equilibria because people imitate their more successful neighbors.
A mathematical model suggests that this process will lead to the spread
of group beneficial beliefs over a wide range of conditions (Boyd and
Richerson 2002). The model also suggests that such spread can be rapid.
Roughly speaking, it takes about twice as long for a group beneficial
trait to spread from one group to another as it does for an individually
beneficial trait to spread within a group.
The rapid spread of Christianity in the Roman Empire may provide
an example of this process. Between the death of Christ and the rule
of Constantine, a period of about 260 years, the number of Christians
increased from only a handful to somewhere between 6 and 30 million
people (depending on whose estimate you accept). This sounds like a
huge increase, but it turns out that it is equivalent to a 3-4 percent
annual rate of increase, about the growth rate of the Mormon Church
over the last century. According to the sociologist Rodney Stark, many
Romans converted to Christianity because they were attracted to what
they saw as a better quality of life in the early Christian community.
Pagan society had weak traditions of mutual aid, and the poor and
sick often went without any help at all. In contrast, in the Christian
community norms of charity and mutual aid created “a miniature
welfare state in an empire which for the most part lacked social
services” (Johnson 1976:75, quoted in Stark 1997). Such mutual aid
was particularly important during the several severe epidemics that
struck the Roman Empire during the late Imperial period. Unafflicted
pagan Romans refused to help the sick or bury the dead. As a result,
some cities devolved into anarchy. In Christian communities, strong
norms of mutual aid produced solicitous care of the sick, and reduced
mortality. Both Christian and pagan commentators attribute many
conversions to the appeal of such aid. For example, the emperor Julian
(who detested Christians) wrote in a letter to one of his priests that
pagans needed to emulate the virtuous example of the Christians if
they wanted to compete for their souls, citing “their moral character
even if pretended” and “their benevolence toward strangers” (Stark
1997:83-84). Middle-class women were particularly likely to convert
to Christianity, probably because they had higher status and greater
marital security within the Christian community. Roman norms allowed
polygyny, and married men had great freedom to have extramarital
affairs. In contrast, Christian norms required faithful monogamy. Pagan
widows were required to remarry, and when they did they lost control of
all of their property. Christian widows could retain property, or, if poor,
would be sustained by the church community. Demographic factors
were also important in the growth of Christianity. Mutual aid led to
substantially lower mortality rates during epidemics, and a norm against
infanticide led to substantially higher fertility among Christians.
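
The annual growth figure quoted above for early Christianity is a compound-growth calculation. The text gives only "a handful" as the starting point, so the sketch below assumes a starting community of 1,000 people purely for illustration; that number, like the function name, is our assumption and not a figure from the sources cited.

```python
def annual_growth_rate(start, end, years):
    """Constant yearly growth rate that turns `start` into `end` after `years`."""
    return (end / start) ** (1.0 / years) - 1.0


ASSUMED_START = 1_000  # hypothetical starting number of Christians (not from the text)

low = annual_growth_rate(ASSUMED_START, 6_000_000, 260)
high = annual_growth_rate(ASSUMED_START, 30_000_000, 260)
print(f"{low:.1%} to {high:.1%} per year")
```

Under that assumption the implied rate is roughly 3.4 to 4.0 percent per year, in line with the 3-4 percent quoted. A smaller starting community would push the estimate up, a larger one would pull it down.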

The Credulity Required for the Cultural Evolution of Novel Forms
of Cooperation is Consistent with an Evolved, Genetically Adaptive
Psychology
The claim that cultural evolution can give rise to forms of novel
cooperation is vulnerable to two related objections: First, there is what
might be called the “bootstrap problem”: Cultural evolution can lead
to the spread of cooperation in large, weakly related groups only if
computational and motivational systems existed in the human brain
that allowed people to acquire and perform the requisite behaviors.
Given that such behaviors were not favored by natural selection, why
should these systems exist? Second, even if they were accidentally present
at the outset, why did natural selection not modify our psychology so
that we did not acquire such deleterious behaviors? Why do we not
have a “cultural immune system” that protects us from bad ideas abroad
in our environment?
Like living primates, our ancestors were large-brained mammals capable
of flexibly responding to a range of biotic and social environments.
Natural selection cannot equip such organisms with fixed action
patterns; instead it endows them with a complex psychology that causes
them to modify their behavior adaptively in response to environmental
variation (Tooby and Cosmides 1992). Cultural evolution can generate
novel behaviors by generating the cues that activate these modules
in novel combinations. For example, cooperation among relatives
requires (among other things) a means of assessing costs and benefits,
and of identifying relatives and assessing their degree of relatedness.
Such systems can be manipulated by culturally transmitted input.
Individuals have to learn the costs and benefits of different behaviors
in their particular environment. Thus, people who learn that sinners
suffer an eternity of punishment may be more likely to behave morally
than those who only fear the reprisals of their victims. Individuals
have to learn who their relatives are in different environments. So the
individual who learns that members of his patriclan are brothers may
behave quite differently than one who learns that he owes loyalty to
the band of brothers in his platoon. Once activated, such computational
systems provide input to existing motivational systems that in turn
generate behavior.
This account raises an obvious question: If cultural inputs regularly
lead to what is, from the genes’ point of view, maladaptive behavior,
why has selection not modified our psychology so that it is immune
to such maladaptive inputs? This is a crucial question, and we have
dealt with it at length elsewhere (Richerson and Boyd 2005:ch. 5). In
brief, we believe that cumulative cultural evolution creates a novel
evolutionary tradeoff. Social learning allows human populations to
accumulate adaptive information over many generations, leading to
the cultural evolution of highly adaptive behaviors and technology.
Because this process is much faster than genetic evolution, human
populations can evolve cultural adaptations to local environments, an
especially valuable adaptation to the chaotic, rapidly changing world
of the Pleistocene. However, the same psychological mechanisms that
create this benefit necessarily come with a built-in cost. To get the benefits
of social learning, humans have to be credulous,4 for the most part
accepting the ways that they observe in their society as sensible and
proper, and such credulity opens up human minds to the spread of
maladaptive beliefs. This cost can be shaved by tinkering with human
psychology, but it cannot be eliminated without also losing the adaptive
benefits of cumulative cultural evolution.
Natural Selection in Culturally Evolved Social
Environments may have Favored New, Genetically
Transmitted Prosocial Social Instincts
We hypothesize that this new social world, created by rapid cultural
adaptation, drove the genetic evolution of new, derived social instincts
in our lineage. Cultural evolution created cooperative groups. Such
environments favored the evolution of a suite of new social instincts
suited to life in such groups including a psychology that “expects”
life to be structured by moral norms, and that is designed to learn
and internalize such norms. New emotions evolved, like shame and
guilt, which increase the chance the norms are followed. Individuals
lacking the new social instincts more often violated prevailing norms
and experienced adverse selection. They might have suffered ostracism,
been denied the benefits of public goods, or lost points in the mating
game. Cooperation and group identification in intergroup conflict set
up an arms race that drove social evolution to ever-greater extremes of
in-group cooperation. Eventually, human populations came to resemble
the hunter-gathering societies of the ethnographic record. We think that
the evidence suggests that after about 100,000 years ago most people
lived in tribal scale societies (Richerson and Boyd 1998, 2001). These
societies are based on in-group cooperation where in-groups of a few
hundred to a few thousand people are symbolically marked by language,
ritual practices, dress, and the like. These societies are egalitarian, and
political power is diffuse. People are quite ready to punish others for
transgressions of social norms, even when personal interests are not
directly at stake.
These new tribal social instincts were superimposed onto human
psychology without eliminating ancient ones favoring self, kin, and
friends. The tribal instincts that support identification and cooperation
in large groups are often at odds with selfishness, nepotism, and face-to-face
reciprocity. People feel deep loyalty to their kin and friends,
but they are also moved by larger loyalties to clan, tribe, class, caste,
and nation. Inevitably, conflicts arise. Families are torn apart by civil
war. Parents send their children to war (or not) with painfully mixed
emotions. Criminal cabals arise to prey on the public goods produced
by larger scale institutions. Elites take advantage of key locations in the
fabric of society to extract disproportionate private rewards for their
work. The list is endless.
Some of our friends in evolutionary psychology have complained to us
that this story is too complicated. Would it not be simpler to assume that
culture is shaped by a psychology adapted to small groups of relatives?
Well, maybe. But the same people almost universally believe an equally
complex coevolutionary story about the evolution of an innate language
acquisition device (e.g., Pinker 1994:111-112). Such innate language
instincts must have coevolved with culturally transmitted languages
in much the same way that we hypothesize that the social instincts
coevolved with culturally transmitted social norms. Initially, languages
must have been acquired using mechanisms not specifically adapted for
language learning. This combination created a new and useful form of
communication. Those individuals innately prepared to learn a little
more protolanguage, or learn it a little faster, would have a richer and
more useful communication system than others not so well endowed.
Then selection could favor still more specialized language instincts,
which allowed still richer and more useful communication, and so
on. We think that human social instincts constrain and bias the kind
of societies that we construct, but the details are filled in by the local
cultural input. When cultural parameters are set, the combination of
instincts and culture produces operational social institutions.

Experiments Indicate People have Prosocial Instincts


Lots of circumstantial evidence suggests that people are motivated by
altruistic feelings toward others, feelings that motivate them to help
unrelated people even in the absence of rewards and punishments (e.g.,
Mansbridge 1990). People give to charity, often anonymously. People
risk their own lives to save others in peril. Suicide bombers give their
lives to further their cause. People vote. The list of examples is long.
Long, but not long enough to convince many who are skeptical about
human motives. The skeptics think that all examples of altruism are
really self-interest in disguise. Charity is never anonymous; the right
people know who gave what. Heroes get on Letterman. Resources are
lavished on the families of suicide bombers. They even give you those
little pins when you vote. Or, in the words of the evolutionary biologist
Michael Ghiselin, “Scratch an altruist and watch a hypocrite bleed”
(Ghiselin 1974:247). The possibility of covert selfish motives can never
be excluded in these kinds of real world examples.
In recent years, however, experimental work by psychologists and
economists has made it a lot tougher to hang on to dark suspicions about
the motives behind good deeds. In these experiments, the possibility
of selfish reward is carefully excluded. Nonetheless, people still behave
altruistically, sometimes risking several months’ salary. They also engage
in costly punishment of nonaltruists, even when there is no possibility
of reward or enhanced reputation. Moreover, experiments have been
conducted in a number of small-scale non-Western societies, and
although there is much cultural variation, nowhere are people purely
selfish (Henrich et al. 2004). The news could not be much worse for
the view that people have purely selfish motives.

Human Interaction may Depend on Prosocial Instincts


Several of the chapters in this volume suggest that everyday human
interactions depend on cooperative psychological mechanisms. For
example, at the most micro level, Schegloff (this volume) shows that
even seemingly mundane everyday conversations are actually made
possible by rules that regulate who speaks when and for how long. At a
broader comparative level, Levinson (this volume) argues that face-to-face
human interaction entails complex embedded sequences of speech
and gesture that can succeed only if actors are cooperative.
Complex cooperative signaling is rare in nature. Signaling systems in
most other animals are limited to a small repertoire of signals, referential
signals are rare, and there is scant evidence for anything resembling a
two-way conversation. This state of affairs is generally consistent with
evolutionary theory that suggests that honest, low-cost communication
is a form of cooperation, and cooperation should be limited to kin and
reciprocating partners. The various forms of communication that make
social insect colonies going concerns, such as the famous waggle dance
of honeybees, are examples of such cooperation among close kin.
Thus, the psychological mechanisms that enable human interaction
may depend on the same prosocial instincts that regulate other forms
of human cooperation. If so, studying the way that cooperation fails
in human interaction may provide insight into the selective forces that
shaped these instincts. If, as some have argued, our prosocial instincts
evolved in small groups of kin, conversations should fail differently
among kin than nonkin. If reciprocity was the key, then failures of
conversation among friends should differ from those among strangers.
Finally, if the cultural evolution account given here is correct, ethnic
and other group boundaries should be crucial.
Easy communication in simple human societies usually ends at the
boundaries of the group that routinely cooperates. Only a few hundred
to a few thousand people spoke the same language or at least the same
dialect. Modern human groups cooperate on a large scale and have a
common language. Sociolinguists have taught us that linguistic variation
arises rapidly to reflect social cleavages within a language (Labov
2001; Lodge 1993). Typically, the bonds of patriotism rest on a speech
community. The development of mass literacy, mass communication,
and the replacement of local dialects by a national language, are the
foundations on which the modern style of nationalism and nation-state
rest (Anderson 1991). Nations are much larger systems than the ancient
tribes in which our social instincts evolved yet a nation can contrive to
feel like a tribe if members share a common language and have access
to a common set of ideas and concepts born from reading a common
set of newspapers and magazines. In Benedict Anderson’s memorable
phrase, modern nations are “imagined communities.” At the same time,
minority languages and class, caste, and regional dialects commonly
mark patterns of conflict and cooperation within nations.

"Nothing in Biology makes Sense except in the Light
of Evolution" (Dobzhansky 1973)
Evolutionary biologists are a tiny minority in their discipline, vastly
outnumbered by molecular biologists, physiologists, developmental
biologists, ecologists, and all the rest. Nonetheless, evolution plays a
central role in biology because it provides answers to “why” questions.
Why do humans have big brains? Why do female spotted hyenas
dominate males? Why do horses walk on the tips of their toes? The
answers to these questions draw on all parts of biology. To explain
why horses walk on their toes we need to connect the ecology of
Miocene grasslands, the developmental biology of the vertebrate limb,
the genetics of quantitative characters, the molecular biology and
biophysics of keratin, and much more. Because evolution provides the
ultimate explanation for why organisms are the way they are, it serves
to link all the other areas of biology into a single, satisfying explanatory
framework. As Dobzhansky (1973) put it, without the light of evolution,
biology “. . .becomes a pile of sundry facts some of them interesting or
curious but making no meaningful picture as a whole.”
We think that evolution can play the same role in the explanation
of human culture. The ultimate explanation for cultural phenomena
lies in understanding genetic and cultural evolutionary processes that
generate cultural phenomena. Genetic evolution is important because
culture is deeply intertwined with other parts of human biology. The
ways we think, the ways we learn, and the ways we feel shape culture,
affecting which cultural variants are learned, remembered, and taught,
and which variants persist and spread. Parents love their own children
more than those of siblings or friends, and this must be part of the
explanation for why some marriage systems persist. But why do people
value their own children more than others? Obviously an important
part of the answer is that such feelings were favored by natural selection
in our evolutionary past. Cultural evolution is also important. Because
culture is transmitted, it is subject to natural selection. Some cultural
variants persist and spread because they cause their bearers to be more
likely to survive and be imitated. The answer to why mothers and
fathers send their sons off to war may be that social groups with norms
that encourage such behavior outcompete groups that do not have
such norms. Finally, genetic and cultural evolution interact in complex
ways. Social psychologists and experimental economists, working from
very different research traditions, have produced compelling evidence
that people have prosocial predispositions. But why do we have such
predispositions in the first place? Evolutionary theory and the lack of
large-scale cooperation in other primates suggest that selection directly
on genes is unlikely to produce such predispositions. So, why did they
evolve? We think cultural evolutionary processes constructed a social
environment that caused ordinary natural selection acting on genes
to favor empathetic altruism, and a tendency to direct that altruism
preferentially to fellow members of symbolically marked groups.
These social instincts evolved in the late Pleistocene but the radically
new social institutions that have evolved in the Holocene were (and
continue to be) both enabled and constrained by them. Our specific
explanation may be in error; you seldom get it straight on the first try.
The important point is that evolving culture, certainly in theory and
probably in practice, has a fundamentally important role in making
humans what we are.

Notes
1. The great population geneticist J. B. S. Haldane gave what is perhaps the
pithiest summary of this principle. When asked by a reporter whether the
study of evolution had made it more likely that he would give up his life for a
brother, Haldane is supposed to have answered, “No, but I would give up my
life to save two brothers or eight cousins.”
2. The Price approach has been very fruitful, generating a much clearer
understanding of many evolutionary problems. Examples include Alan Grafen’s
(1984) work on kin selection and Steven Frank’s work on the evolution of
the immune system, multicellularity, and related issues (Frank 2002). This
approach can also be used to study cultural evolution. See Henrich (2004) and
Henrich and Boyd (2002).
3. Darwin (1874), in the Descent of Man, did invoke group selection to
explain human cooperation.
It must not be forgotten that although a high standard of morality gives
but a slight or no advantage to each individual man and his children
over other men of the same tribe, yet that an increase in the number
of well-endowed men and an advancement in the standard of morality
will certainly give an immense advantage to one tribe over another. A
tribe including many members who, from possessing in a high degree
the spirit of patriotism, fidelity, obedience, courage, and sympathy,
were always ready to aid one another, and to sacrifice themselves for
the common good, would be victorious over most other tribes; and this
would be natural selection. [pp. 178-179]
4. Simon (1990) made the same argument, apparently independently. He
used the term docility because he believed that we are especially prone to
accept group beneficial beliefs. We think his account is unsatisfactory because
it does not explain why such beliefs spread.

References
Anderson, B. R. O’G. 1991. Imagined communities: Reflections on the origin
and spread of nationalism, revised and extended edition. London:
Verso.
Aoki, K. 1982. A condition for group selection to prevail over
counteracting individual selection. Evolution 36:832-842.
Axelrod, R., and D. Dion. 1988. The further evolution of cooperation.
Science 242:1385-1390.
Binmore, K. G. 1994. Game theory and the social contract. Cambridge,
MA: MIT Press.
Boyd, R., and P. J. Richerson. 1988. The evolution of reciprocity in
sizable groups. Journal of Theoretical Biology 132:337-356.
Boyd, R., and P. J. Richerson. 1992. Punishment allows the evolution
of cooperation (or anything else) in sizable groups. Ethology and
Sociobiology 13:171-195.
Boyd, R., and P. J. Richerson. 2002. Group beneficial norms spread rapidly
in a structured population. Journal of Theoretical Biology 215:287-296.
Darwin, C. 1874. The descent of man and selection in relation to sex, 2nd
edition, 2 vols. New York: American Home Library.
Dobzhansky, T. 1973. Nothing in biology makes sense except in the
light of evolution. American Biology Teacher 35:25-29.
Frank, S. A. 2002. Immunology and evolution of infectious disease. Princeton:
Princeton University Press.
Ghiselin, M. T. 1974. The economy of nature and the evolution of sex.
Berkeley: University of California Press.
Grafen, A. 1984. A geometric view of relatedness. Oxford Surveys in
Evolutionary Biology 2:28-89.
Hamilton, W. D. 1964. Genetic evolution of social behavior I, II. Journal
of Theoretical Biology 7:1-52.
Hammerstein, P. 2003. Why is reciprocity so rare in animals? A Protestant
appeal. In Genetic and cultural evolution of cooperation, edited by P.
Hammerstein, 83-94. Cambridge, MA: MIT Press.
Henrich, J. 2004. Cultural group selection, coevolutionary processes and
large-scale cooperation. Journal of Economic Behavior and Organization
53:3-35.
Henrich, J., and R. Boyd. 2002. On modeling cognition and culture:
Why replicators are not necessary for cultural evolution. Culture and
Cognition 2:67-112.
Henrich, J., and F. J. Gil-White. 2001. The evolution of prestige—Freely
conferred deference as a mechanism for enhancing the benefits of
cultural transmission. Evolution and Human Behavior 22:165-196.
Henrich, J., R. Boyd, S. Bowles, C. Camerer, E. Fehr, and H. Gintis. 2004.
The foundations of human sociality: Economic experiments and ethnographic
evidence from fifteen small-scale societies. New York: Oxford University
Press.
Johnson, P. 1976. A history of Christianity. London: Weidenfeld &
Nicolson.
Keller, L., and M. Chapuisat. 1999. Cooperation among selfish individuals
in insect societies. Bioscience 49:899-909.
Labov, W. 2001. Principles of Linguistic Change, vol. 2: Social Factors.
Oxford: Blackwell.
Lodge, R. A. 1993. French: From dialect to standard. London:
Routledge.
Mansbridge, J. J. 1990. Beyond self-interest. Chicago: University of
Chicago Press.
Maynard Smith, J. 1964. Group selection and kin selection. Nature
201:1145-1146.
Nowak, M., and K. Sigmund. 1998. Evolution of indirect reciprocity by
image scoring: The dynamics of indirect reciprocity. Nature 393(June
11):573-577.
Palmer, C. T., B. E. Fredrickson, and C. F. Tilley. 1997. Categories and
gatherings: Group selection and the mythology of cultural
anthropology. Evolution and Human Behavior 18:291-308.
Pinker, S. 1994. The language instinct, 1st edition. New York: W.
Morrow.
Price, G. R. 1970. Selection and covariance. Nature 227(August 1):520-521.
Price, G. R. 1972. Extensions of covariance selection mathematics.
Annals of Human Genetics 35:485-490.
Queller, D. C. 1989. Inclusive fitness in a nutshell. Oxford Surveys in
Evolutionary Biology 6:73-109.
Queller, D. C., and J. E. Strassmann. 1998. Kin selection and social
insects: Social insects provide the most surprising predictions and
satisfying tests of kin selection. Bioscience 48:165-175.
Richerson, P. J., and R. Boyd. 1998. The evolution of human ultrasociality.
In Indoctrinability, ideology, and warfare: Evolutionary perspectives, edited
by I. Eibl-Eibesfeldt and F. K. Salter, 71-95. New York: Berghahn
Books.
Richerson, P. J., and R. Boyd. 2001. The evolution of subjective
commitment to groups: A tribal instincts hypothesis. In Evolution
and the capacity for commitment, edited by R. M. Nesse, 186-220. New
York: Russell Sage Foundation.
Richerson, P. J., and R. Boyd. 2005. Not by genes alone: How culture
transformed human evolution. Chicago: University of Chicago Press.
Rogers, A. R. 1990. Group selection by selective emigration: The effects
of migration and kin structure. American Naturalist 135:398-413.
Simon, H. A. 1990. A mechanism for social selection and successful
altruism. Science 250(4988):1665-1668.
Sober, E., and D. S. Wilson. 1998. Unto others: The evolution and psychology
of unselfish behavior. Cambridge, MA: Harvard University Press.
Soltis, J., R. Boyd, and P. J. Richerson. 1995. Can group-functional
behaviors evolve by cultural group selection? An empirical test.
Current Anthropology 36:437-494.
Stark, R. 1997. The rise of Christianity: How the obscure, marginal Jesus
movement became the dominant religious force in the Western world in a
few centuries. San Francisco: HarperCollins.
Tooby, J., and L. Cosmides. 1992. The psychological foundations of
culture. In The adapted mind: Evolutionary psychology and the generation
of culture, edited by J. Barkow, L. Cosmides, and J. Tooby, 19-136.
New York: Oxford University Press.
Trivers, R. L. 1971. The evolution of reciprocal altruism. Quarterly Review
of Biology 46:35-57.
Wiessner, P., and A. Tumu. 1998. Historical vines: Enga networks of
exchange, ritual, and warfare in Papua New Guinea. Smithsonian Series
in Ethnographic Inquiry. Washington, DC: Smithsonian Institution
Press.
Williams, G. C. 1966. Adaptation and natural selection: A critique of some
current evolutionary thought. Princeton: Princeton University Press.
Wynne-Edwards, V. C. 1962. Animal dispersion in relation to social behavior.
Edinburgh: Oliver and Boyd.
eighteen

Parsing Behavior: A Mundane Origin for an Extraordinary Ability?
Richard W. Byrne

When we notice someone engaged in activity, we see not only
how their body moves and what effects those movements are
having on other things, but we also see what it means. The meaning of
action includes what is likely to happen next, as a consequence of what
has been done already; and why it is being done, that is, what overall
result is to be expected from the activity. This description applies to the
simplest of organized, purposeful actions but also to what is arguably
our most sophisticated cognitive ability, the ability to talk. When we
hear someone talking our language, we do not merely register a series of
sounds, phonemes, words, phrases, and meanings; almost immediately,
we have some understanding of what in the speaker’s mind has led up
to their speaking, whereabouts (metaphorically) their speech is going,
and what pragmatic effects the speaker might be trying to achieve by
it. These observations are so familiar and commonplace that normally
we pay them no heed: rather, we notice when people do things that
make no sense to us or say things that seem irrational. Yet our ability
to perceive the everyday world of social action as a world of meanings,
purposes, intentions, and reasons is an extraordinary one.
At the heart of the ability to read meaning in perceived action is
parsing. A characteristic of skilled action is that, in physical terms, its
organization is invisible. Whether driving a car, uttering a sentence, or
baking a cake, all that is physically present to be perceived is smooth,
fluid movement. The absence of “real gaps” between many of the separate
words in a spoken sentence is part of every entry-level linguistics course;
and just the same is true of manual actions. Once a skilled sequence
of actions has been assembled, practicing will result in smoother and
smoother performance, to the point when underlying structure is not
signaled by any detectable interruptions in the sequence. That is the first
part of the parsing problem: the fact that we watch a linear sequence
of fluid behavior, but perceive it as segmented into discrete units that
correspond to real entities for the actor who is observed.
The second part of the parsing problem concerns the fact that
organized, complex behavior is hierarchical in structure. This means
that elements lying together in sequence may be closely related logically,
because they form part of a module or subroutine or phrase, depending
on what sort of behavior is under discussion; or much less closely
related, only lying together by virtue of the organization of some higher-
order unit of organization (see Levinson this volume). To understand
action and detect the meaning in it, it is crucial to parse its hierarchical
structure accurately. The output of the parsing process must go beyond a
sequence of discrete units, to get at the underlying relationships that we
conventionally represent in terms of a bracketed string, a tree diagram or
a phrase-structure grammar. Without that, there would be no systematic
way to connect observed behavior to the purposes that underlie it in
the mind of the actor—and, thus, to go on to understand the actor’s
intentions and the cause and effect of how that particular behavior is
efficient for achieving their purposes.
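The distinction between a flat string of units and the bracketed structure behind it can be made concrete with a small sketch. The Python below is an illustration only, not a model proposed in this chapter: the action names and the two example trees are invented, and nested lists stand in for the tree diagram. It shows that two different hierarchical organizations can yield exactly the same observable sequence, which is why segmentation alone does not recover structure.

```python
from typing import List, Union

# A behavior is either a single element (str) or an ordered group of sub-behaviors.
Tree = Union[str, List["Tree"]]


def flatten(tree: Tree) -> List[str]:
    """Return the linear sequence of elements an observer would actually see."""
    if isinstance(tree, str):
        return [tree]
    sequence: List[str] = []
    for child in tree:
        sequence.extend(flatten(child))
    return sequence


def bracketed(tree: Tree) -> str:
    """Render the hierarchy as a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "[" + " ".join(bracketed(child) for child in tree) + "]"


# Hypothetical routine: two subroutines followed by a final element.
routine: Tree = [["reach", "grasp"], ["strip", "drop"], "eat"]
# A different (also hypothetical) grouping of the very same elements.
alternative: Tree = [["reach"], ["grasp", "strip"], ["drop", "eat"]]

print(bracketed(routine))      # [[reach grasp] [strip drop] eat]
print(bracketed(alternative))  # [[reach] [grasp strip] [drop eat]]
print(flatten(routine) == flatten(alternative))  # True: same surface string
```

Both trees flatten to the same five elements in the same order, so an observer who only segments the stream cannot tell them apart.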
It is the thesis of this chapter that parsing has its evolutionary origins
in an unexpected place. Rather than deriving from a selective pressure
for more sophisticated vocal communication, the domain in which
we see the full flowering of parsing ability in modern humans, I argue
that parsing was originally part of a feeding adaptation, and that these
derived abilities for efficient feeding were themselves based on earlier
evolution of abilities in social behavior reading.
After briefly considering primate vocal communication, I first sketch
the evidence that a segmentation system, one that can parse a smooth
behavioral performance into separate but meaningful units of action,
is present in monkeys—and perhaps in many other species even
more distant from ourselves on an evolutionary time scale. The main
biological function of action segmentation in those species is most likely
the estimation of current behavioral dispositions in conspecifics and
the prediction of their immediate future actions. Among the primates,
it seems, only in great apes did rather special abilities of hierarchical
parsing (and planning) develop, and I will suggest that these special
capacities were parasitic on that earlier segmentation system—but were
not dependent on prior ability to understand intentions or causality.
In nonhuman great apes hierarchical parsing seems to be found only
within the manual skill domain, where it functions in the wild
by allowing more efficient feeding; and there are plausible ecological
reasons why enhanced feeding abilities should have evolved specifically
in the great apes. Under the artificial conditions of human rearing,
hierarchical parsing and planning give rise to a wide range of richly
complex behaviors, and can be deliberately co-opted into human-
derived communication systems such as American Sign Language (ASL).
Given such abilities in living apes, it is only a small step to speculate
that in one of our own early ancestors these hierarchical parsing skills
allowed a natural system of manual communication to develop. The
long-debated idea of a gestural basis for language meshes well with
these conjectures. If linguistic syntax is seen as evolutionarily derived
from hierarchical behavior parsing, and if parsing originated in manual
action, then extension first to manual communication makes sense.
Viewing the evolutionary origins of human language as a two-step
process—first gestural then spoken—sits well with evidence from the
archaeological record that there were two main periods of cognitive
advance in hominin evolution.
Further implications may be drawn out. As emphasized, behavior
parsing is not dependent on first having causal-intentional understanding.
But it could have been a crucial step on the way to achieving this level
of mental representation: an essential precursor to human cognition,
and a necessary part of the process of representing phenomena as
causal-intentional structures. Moreover, the fact that so much can be
achieved without involving that level of mental representation—parsing
of behavioral structure, social learning of complex skills by program-
level imitation, and so on—opens the door to a heretical thought. Could
it be that the prevalence of causal-intentional understanding of our
social world is illusory, a consequence of retrospective contemplation?
Certainly, when we choose to ponder causation and attribution or when
we are asked to justify our actions by others, as adult humans we are
well able to construct causal-intentional theories that make sense (Good
1995). But perhaps the cut and thrust of everyday social action and
interaction does not need this mentalizing, or would indeed be slowed
or disrupted by it (Bargh and Chartrand 1999), and we should look
elsewhere for the evolutionary functions of theory of mind (ToM) and
causal reasoning.
Primate Vocal Communication—Primitive Speech?
Extensive study for many years has focused on primate vocalizations,
driven partly by theoretical interest in language origins and partly by the
availability of sound-manipulation technology. We now know that the
potential for flexibility in production of primate calls is very limited.
No primate can copy another’s sounds, in the way that many birds
and some cetaceans can do (Janik and Slater 1997). Even vocal dialects
are nearly unknown in primates, except in cases where human influence
may have unintentionally conditioned a local variation (Green 1975;
Mitani et al. 1992). “Nearly,” because there is recent evidence that zoo
communities of chimpanzees develop characteristic group dialects (Auser
and Wrangham 1987), and adjacent communities in the wild have been
found to differ more in their vocalizations than do more distant ones
(Crockford et al. 2004)—just as a dialect in human communities can
serve to identify group membership and label an out group (Dunbar and
Nettle 1997). Even in these cases, the modifications are small ones, to
calls that are biologically fixed in form. Young primates of many species
have often been reared out of any auditory contact with conspecifics:
nevertheless, they all develop a normal repertoire of vocalizations.
Learning does play a role in the normal development of calling, but
this is contextual learning not production learning (Janik and Slater
1997): primates learn the appropriate circumstances in which to call,
rather than learning the calls themselves. The famous case of predator-
specific alarm calls in vervet monkeys shows this process in action
(Seyfarth and Cheney 1986). The referential specificity of these calls
is to a limited extent innate, but whereas a young vervet will initially
make an “eagle alarm” to a wide range of flying things (even a large,
falling leaf on occasion), as it matures calling is restricted to large broad-
winged birds, then specifically to raptorial species, and finally the call
is given almost exclusively to the martial eagle Polemaetus bellicosus, a
vervet’s main aerial predator.
Most nonhuman primates have a vocal repertoire of more-or-less
discrete calls, but also show some graded variation, most extensively in
the chimpanzee and gorilla (Marler and Tenaza 1977). Animals perceive
human speech categorically (Kuhl 1982), and primate calls that sound
like a smoothly varying continuum to the human ear have been shown
to be composed of several circumstance-specific and function-specific
calls (Gouzoules et al. 1984). However, nothing remotely like the
multiple levels of patterning and syntactic structuring found in human
speech has been detected in any primate vocal system. The closest to
hierarchical organization is the recent discovery that one call can modify
another and so qualify its degree of definiteness, as if adding “maybe”
to its meaning (Zuberbühler 2002). This is a far cry from the generative,
productive nature of everyday human speech, and theories that try to
make direct connection between primate vocal communication and
language have a large gap to fill—with pure speculation.

Segmentation of the Action Stream


When we approach a range of problems, from car maintenance to public
speaking, we do so already prepared with a rich repertoire of preexisting
motor routines, some innate, some learnt during childhood. Consider
what is required to acquire new motor routines by imitation of more
expert practitioners.
The skilled action that we observe does not come with ready-made gaps
that correspond to logically distinct elements. This has been classically
noted to apply to speech, in which a sound gap is more likely to be part
of a plosive consonant than to signal a new word, but the point applies
to all skilled behavior. The same applies to motor action: the physical
stimulus that confronts us is smooth and fluid, not segmented. How
are we nevertheless able to pick out functional elements in the smooth
and apparently unbroken flow of action?
I propose that people are able to “see” (pick out) any element that is
already present as a pattern in their personal repertoire, and that this
underlies our ability to copy novel routines: a new behavioral routine is
always built up out of several, already familiar, simpler elements. What
constitutes an “element”? There is no universal answer: for different
observers, even at different times in the life of a single observer,
one particular movement of a single finger or an elaborate sequence of
bimanual movements might both function as single elements. When
we watch a relatively unfamiliar process being performed, the level at
which we see elements will be low, perhaps that of finger movements;
whereas when we watch a slight variant of an already familiar activity,
the basic elements that we notice might themselves be high-level,
complex processes. Most commonly perhaps, the level at which observed
behavior matches parts of our existing repertoire would be neither of
these, but rather consist of simple and highly practiced movements that
produce visible effects on environmental objects: that is, simple, goal-
directed movements. Such elements may be particularly easy to pick
out because they are marked by a characteristic pattern of acceleration
and deceleration, even in fluid, highly practiced movement. Consistent
with this idea, people are able to pick out the boundaries between
elements of action, even when the stimulus is experimentally reduced
to fluorescent spots on the joints (Baldwin et al. n.d.).
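One way to picture this proposal is as matching the incoming stream against a stored repertoire, taking the largest familiar pattern available at each point. The following Python is a minimal sketch under that assumption; the greedy longest-match rule, the movement labels, and both repertoires are hypothetical choices made for illustration, not claims about the mechanism.

```python
from typing import Dict, List, Tuple

# Maps a known pattern of low-level movements to the name of the element it forms.
Repertoire = Dict[Tuple[str, ...], str]


def segment(stream: List[str], repertoire: Repertoire) -> List[str]:
    """Greedy longest-match segmentation of a movement stream.

    Movements that match no pattern in the repertoire are passed through
    with a '?' prefix, marking them as unrecognized.
    """
    longest = max((len(pattern) for pattern in repertoire), default=0)
    elements: List[str] = []
    i = 0
    while i < len(stream):
        # Try the longest candidate chunk first, then shorter ones.
        for size in range(min(longest, len(stream) - i), 0, -1):
            chunk = tuple(stream[i:i + size])
            if chunk in repertoire:
                elements.append(repertoire[chunk])
                i += size
                break
        else:
            elements.append("?" + stream[i])
            i += 1
    return elements


stream = ["extend_arm", "close_fingers", "pull", "twist", "release"]

# An observer unfamiliar with the task recognizes only small movements.
novice: Repertoire = {
    ("extend_arm",): "extend_arm",
    ("close_fingers",): "close_fingers",
    ("pull",): "pull",
}
# A practiced observer already owns larger, goal-directed units.
expert: Repertoire = {
    ("extend_arm", "close_fingers"): "grasp",
    ("pull", "twist"): "detach",
    ("release",): "release",
}

print(segment(stream, novice))
print(segment(stream, expert))
```

The same stream is seen as five items by the first observer, two of them unrecognized, and as three higher-level elements by the second, which mirrors the point that what counts as an element depends on the observer's existing repertoire.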
Is it plausible that this means of segmentation is a primitive part of the
human cognitive system? A digression into recent neuropsychological
studies of monkeys suggests that it is. Nonhuman primates have been
shown able to pick out, in the behavior of others they observe, actions
that are already in their own repertoire. A system of single neurons
has been identified in the premotor cortex of rhesus monkeys Macaca
mulatta (Gallese et al. 1996; Rizzolatti et al. 1996, 2002), each of which
responds to a simple manual action, and responds equally whether
the monkey makes the action or sees another do it. The cardinal
properties of these “mirror neurons” are (1) they detect goal-directed
movements that are in the observing monkey’s own repertoire, and
(2) they generalize over whether the movement is performed by the
monkey itself or by another agent. It is unlikely that mirror neurons
have any role in imitation for monkeys, simply because monkeys have
repeatedly failed to show evidence of imitative capacity (Visalberghi
and Fragaszy 1990). Rather, Rizzolatti and colleagues suggest that the
system functions in revealing the demeanor and likely future actions
of conspecifics, by reference to actions the observing monkey might
itself have done (Rizzolatti et al. 2002).
These units have sometimes been described as “monkey see, monkey
do” cells, and in a very restricted sense this is accurate. Much of what is
described as imitation in experimental studies of nonhuman primates
involves provoking a subject to repeat an action that is in its repertoire
on seeing another perform the same action (e.g., Bugnyar and Huber
1997; Custance et al. 1999; Whiten et al. 1996; see Byrne 2002 for
discussion). However, nothing new is being learned: this sort of imitation
has been argued to be better described as response facilitation (Byrne
1994; Rizzolatti et al. 2002). In response facilitation, as opposed to the
more general sense of imitation, a preexisting response is made more
available by seeing it done, and this causes a higher probability of the
response occurring subsequently (Byrne and Russon 1998). Response
facilitation is closely related to stimulus enhancement (Galef 1988;
Spence 1937), and they may indeed be two manifestations of the same
phenomenon: priming of neural correlates (Byrne 1994, 2005). On this
view, priming a neuron that corresponds to some aspect of the social
situation or the environment results in stimulus enhancement; whereas
priming one that corresponds to an action pattern within the current
repertoire results in response facilitation. The mirror neuron system
provides a possible neural instantiation for imitation in the sense of
response facilitation, but not for imitative learning of new skills.
However, a segmentation system, based on elements of action that
the observer can already perform, would be a very useful starting
point for more elaborate forms of imitation—and that is what I have
proposed underlies great ape imitation (Byrne 2003). By responding to
precisely those movement patterns that correspond to potential actions,
segmentation has the power in principle to convert a continuous flow
of observed movements into a string of recognized, familiar actions.
If seeing a string of familiar actions also allows construction of links
between them, then “action-level” imitation occurs (Byrne 2002; Byrne
and Russon 1998). In action-level imitation, a linear sequence of actions
is copied without recognition of any higher-order organization that may
be present: the organization is flat. Chimpanzees have been reported
to copy the order of actions, even though the sequence was entirely
arbitrary and unrelated to success (Whiten 1998), and a detailed learning
model has been developed to describe action-level imitation in animals
(Heyes and Ray 2000). To copy arbitrary, random actions or behavior
that is genuinely linear in structure (e.g., the “fixed action patterns”
described by early ethologists, e.g., Lorenz 1950), action-level imitation
would be ideal. However, most human action, and arguably also much
of the behavior of nonhuman great apes, is planned with a hierarchical,
not linear organization. The question is, can this planning also be “seen”
in the behavior of another? That is, can a bottom-up, mechanistic
analysis go beyond action-level imitation to explain how behavioral
organization can also be parsed and thereby copied, that is, program-
level imitation (Byrne and Russon 1998)? If so, then the evolution of
behavioral parsing has implications far beyond imitation itself.

Parsing Hierarchical Structures of Behavior


It is no coincidence that the theory of behavior parsing (Byrne 2003)
should have been developed to explain great ape manual behavior
(see Fig. 18.1). Most animals simply do not learn sufficiently complex
patterns of behavior for imitative learning to be detectable by observing
them, nor would they have much need for the ability to learn by
imitation (Byrne 2002). Great apes are very different. The five-fingered
primate hand (Napier 1961) is highly effective as a manipulator and
shows some opposability in many species, but in great apes the hand
shows a considerably augmented range of aptitudes compared even
with those of monkeys. For example, in the mountain gorilla (Byrne
et al. 2001b), everyday food preparation typically involves using the
two hands in different but complementary roles (i.e., manual role
differentiation: Elliott and Connolly 1984; see Fig. 18.2). The resulting
“asymmetric bimanual co-ordination” is augmented by the gorilla’s
ability to control individual digits of the hand independently (i.e., digit
role differentiation: Byrne et al. 2001b). This allows items to be held in
part of the hand while other digits can carry out other activities; for
instance, part-processed food can be accumulated in the hand, while
part of the food-processing routine is iteratively repeated to build up a
larger handful of food. Mountain gorillas’ remarkable dexterity allows
them to deal with plants that are physically defended by an array of
spines, stings, and hard casings (Byrne 2001). In the process, they display
a huge repertoire of functionally distinct elements of action (i.e., single
actions that produce clear changes to the plant substrate; for instance,
thistle processing alone requires 72 such elements). With manual skills
of such complexity, it would certainly pay the apes to be able to learn
by imitation of others.
The evidence that great apes do indeed learn skills by imitation
comes from observational data rather than experiment, because no
good experimental test of program-level imitation has yet been devised.
Although the evidence is therefore oblique, cumulatively it is fairly
impressive (Byrne 2002, 2005).
First, there is the very fact that young great apes learn complex,
hierarchically structured routines of manual behavior (some of them essential
to survival in adulthood) in just a few years before their weaning, in
contrast to monkeys where there is no evidence of anything comparable.
Evidence of complexity is strongest for the mountain gorilla, where five-
stage sequential processes have been described (Byrne 1999c; Byrne and
Byrne 1993; Byrne et al. 2001a), but also clear in chimpanzees, both in
tool-using tasks (Boesch and Boesch 1990; Goodall 1986; Matsuzawa
2001; Matsuzawa and Yamakoshi 1996) and in dealing with complicated
plant foods (Corp and Byrne 2002a, 2002b; Stokes and Byrne 2001).
The fact that orangutans sometimes also use tools to deal with complex
plant defenses (Fox et al. 1999) suggests that they have similar abilities,
and this is confirmed by studies of young orangutans’ efforts to deal
with the vicious spines of certain palm trees (Russon 1998). Far more
studies have been carried out on the foraging behavior of monkeys than
that of apes, yet no similar evidence has come to light.
Second, in a detailed analysis of variation in the skills of adult
mountain gorillas, it was striking that minor details (grip type, exact
fingers employed, hand preference, extent of movement) varied
idiosyncratically between individuals, even between mother and offspring,
whereas the overall “program-level” organization of each technique
was remarkably standardized in the local population (Byrne and Byrne
1993). If idiosyncrasy is characteristic of trial and error learning, such
standardization of techniques needs explaining. There are two possibilities:
either the affordances of the gorilla’s hands, combined with the
physical form of the plant defenses, define a clear gradient of optimization
and, thus, with practice every gorilla will inevitably acquire the same
method; or, observational learning is involved and critical aspects of
the skills are passed on culturally. Which is more likely?
The third line of evidence is specifically relevant to that question:
it involves the study of animals disabled by crippling snare wounds.
Snares are not set to catch gorillas, but young individuals may suffer
injury because of their explorative behavior (Stokes et al. 1999). If
the standardized pattern of an adult’s food-processing technique is a
product of affordances, then in an animal with severely maimed hands
a quite different technique should result from the same trial and error
experience. Yet in both chimpanzees and gorillas, disabled individuals
acquire the same organization of behavior as the able bodied, and
instead work around their difficulties by modifying the low-level details
of implementation (Byrne and Stokes 2002; Stokes and Byrne 2001).
This strongly favors the hypothesis that the standard technique is a
culturally transmitted pattern.
Figure 18.1. (a) The sprawling, umbelliferous plant Peucedanum linderi presents
a challenge to eat: the pith is edible, but it is encased in hard, woody outer
stems, and the stems themselves are unwieldy to handle as they are rigid and
often several metres long. The mountain gorilla technique for dealing with
Peucedanum begins when a convenient, 0.5 meter segment of a stem is bitten
off. (b) The segment is held horizontally in both hands, then a piece of outer
case is bitten by the incisors and torn off with a backwards movement of the
head. Note that the hands are not employed symmetrically, but each applies a
different grip: in this example, the right hand holds with a power grip, whereas
the left hand pinches the end of the segment. This allows the segment to be
rotated conveniently, ready for the next piece of outer case to be removed,
rather in the manner in which one might handle corn-on-the-cob. (c) Once
the pith is partly exposed, it may be eaten directly from the section of stem or
picked out with the index finger of one hand. Notice the one-day-old baby that
is resting on the chest of its mother, apparently asleep: even from the first day of
life, a gorilla has abundant opportunities to watch skilled food processing tasks,
performed at close range, and to explore discarded debris. In contrast, young
great apes seldom watch with any evident close attention when their mothers
or other individuals are processing food (unless the food is likely to be shared, as
with nuts cracked using tools). (d) More typically, mother and juvenile will feed
together but independently: but remember that, by the time the juvenile is able
to tackle Peucedanum independently, it will have spent many hundreds of hours
watching processing in a more casual way.
Finally, one anecdotal observation supports the case that great apes
need to learn aspects of their complex feeding skills by observation.
When processing stinging nettles (see Fig. 18.2), an important food
plant in the study area, one single adult in the study population differed
in technique—the female Picasso did not fold bundles of leaves, so
was presumably often stung on her lips (Byrne 1999a). Picasso had
transferred into the study area from lower altitude, where nettles do
not grow. Because adult gorillas feed alone and out of sight of others in
dense herbage, mountain gorillas’ only opportunity for observational
learning of plant processing comes in infancy. It seems most likely
that a lack of opportunity to observe accounts for Picasso’s incomplete
technique, and intriguingly her juvenile was the only other gorilla in
the study population to lack that particular element of the skill.

Imitation without Intentionality


In the face of this evidence, I developed a theory of how great apes
could learn the program-level structure of behavior by imitation, one
that avoided any assumption that the animals had prior understanding
of purpose or intention (Byrne 1999b, 2003). This “behavior parsing”
model is based instead on the statistical regularities present within the
variability of multiple performances of the same skilled sequence of
action.
Every execution of a motor act, however familiar and well-practiced
it is, will differ slightly from others. Nevertheless, the variation is
constrained—because if certain characteristics are missing or stray too
far from their canonical form the act will fail to achieve its purpose.
Watching a single performance will not betray these underlying
constraints, but the statistical regularities of a repeated, goal-directed
action can serve to reveal the organizational structure that lies behind
it. Unweaned great apes spend most of each day within a few feet of
their mothers (see Fig. 18.1), and because their main nutrition still
comes from milk they have almost full-time leisure to watch any nearby
activities, as well as learn about the structure of the local environment
by their own exploration. For instance, by the time a young gorilla first
begins to handle nettles, at the rather late age of about two years because
the stinging hairs discourage earlier attempts, it will have watched many
hundreds of nettle plants being expertly processed by its mother.
Consider how a young gorilla might learn from statistical regularities
of observed behavior how to process stinging nettles (Fig. 18.2). Its
mother’s behavior will be perceived as a string of discrete elements,
where each of these actions is a familiar one that it can already perform.
At this time, the young ape’s repertoire of familiar elements of action
derives from its innate manual capacities; from many hours of playing
with environmental objects, such as plants and discarded debris of
the mother’s feeding; and from its own experience of feeding on other
plants, perhaps ones simpler to process than nettles. Because motor
behavior is intrinsically variable, and plants in particular each differ
somewhat, the string of elements that it sees when watching its mother
eat nettles will differ each time. However, her starting point will always
be a growing, intact nettle stem, and—because she is expert at this
task—her final stage will always be the same, popping a neatly folded
package of nettle leaves into the mouth (see Fig. 18.2). In between these
points, variation will be particularly associated with noncritical parts of
the performance, and certain aspects must necessarily be the same—or
else, the result simply will not be success. With repeated watching, and
a mind that tends automatically to extract regularities in behavior that
varies over time, a pattern will gradually begin to become apparent. The
mother always makes a sweeping movement of one hand, held around
a nettle stem that is sometimes held in the other hand even though
the plant is still attached to the ground, and this leaves a leafless stem
protruding from the ground; she always makes a twisting movement of
the hands against each other, and immediately drops a number of leaf
petioles (which she does not eat) onto the ground; she always uses one
hand to fold a bundle of leaf blades protruding from the other hand,
and holds down this folded bundle with her thumb. Moreover, these
stages always occur in exactly the same order.
Statistical regularities thereby separate the minimal set of essential
actions from the many others that occur during plant eating but that are
not crucial to success, and reveal the correct order in which they must
be arranged. (The ability of human babies as young as eight months
to detect statistical regularities in spoken strings of nonsense words
shows that just such sensitivity to repeated orderings is active early in
human development: Saffran et al. 1996.) The usefulness of detecting
regularities applies not only to the linear sequence of movements of
each hand, but also to the hands’ operation together: stages that crucially
depend on the hands’ close temporal and spatial coordination while
doing different jobs will recur in every string, whereas other coincidental
conjunctions will not.
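The extraction of essential elements from variable strings can be illustrated with a minimal sketch. It is not a model taken from the chapter's sources: the action labels, the function names, and the simplification that each action is a discrete token are all assumptions made for illustration. The sketch keeps only the elements present in every observed string and then checks whether they always occur in the same order.

```python
def essential_elements(observations):
    """Return the elements that occur in every observed string."""
    common = set(observations[0])
    for obs in observations[1:]:
        common &= set(obs)
    return common


def consistent_order(observations, elements):
    """Return the shared order of the elements, or None if the order varies."""
    def ordered(obs):
        # Sort the essential elements by their first position in this string.
        return sorted(elements, key=obs.index)

    reference = ordered(observations[0])
    for obs in observations[1:]:
        if ordered(obs) != reference:
            return None
    return reference


# Hypothetical observations of the same skilled task (labels are invented).
observations = [
    ["reach", "strip", "scratch", "twist", "fold", "eat"],
    ["reach", "look_around", "strip", "twist", "pick_debris", "fold", "eat"],
    ["shift_posture", "reach", "strip", "twist", "fold", "scratch", "eat"],
]

essential = essential_elements(observations)
order = consistent_order(observations, essential)
```

Here the incidental actions (scratching, looking around, shifting posture) drop out because they are absent from at least one string, while the five recurring actions survive and are recovered in a single fixed order.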
Figure 18.2. Flow chart for a typical adult gorilla processing nettle Laportea
alatipes leaves. The process starts at the top, with selection of a growing nettle
to eat, and works downwards. Processes are shown in rectangles; those that
are optional, depending on the state of the plant itself, are shown in brackets.
As with conventional flow charts, diamonds represent choice points, with the
alternative options shown by the directed links leading from each diamond.
Unlike the single linear process of most flow charts, the diagram represents the
actions of both left and right hands: actions that are significantly lateralized to
the left hand are shown on the left of the figure, and vice versa for the right
hand. Some of these actions are nevertheless coordinated together, although the
two actions are different: these cases of asymmetric bimanual co-ordination are
shown with broken lines connecting the separate processes.
Other statistical regularities derive from modular structure and
hierarchical organization. Whenever the operation of removing debris is
performed (by opening the hand that holds nettle leaf blades, and
delicately picking out debris with the other hand), it occurs at the
same place in the sequence. Also, on some occasions but not others,
a section of the program sequence may be repeated twice or several
times. For instance, the process of <pulling a nettle plant into range,
stripping leaves from its stem in a bimanually coordinated movement,
then detaching and dropping the leaf petioles>, may be repeated several
times before the mother continues to remove debris and fold the leaf
blades before eating. Subsections of the string of actions that are marked
out in this way may be single elements, or as in this example a string of
several elements. Both omission and repetition signal that some parts
of the string are more tightly bound together than others, that is, that
they function as modules. Optional stages, like cleaning debris, occur
between but not within modules. Moreover, repetition of a substring
gives evidence of a module used hierarchically as a subroutine, for
example, iteration to accumulate a larger handful.
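The use of repetition as evidence for a module can be sketched in the same spirit. The function below is a hypothetical illustration, not a published algorithm: it scans one observed string for a substring that is immediately repeated and reports it as a candidate module. The action labels are invented.

```python
def repeated_modules(sequence, min_len=2):
    """Find substrings that repeat back-to-back within one observed string."""
    found = []
    n = len(sequence)
    # Try longer candidate units first, down to min_len elements.
    for length in range(n // 2, min_len - 1, -1):
        for start in range(n - 2 * length + 1):
            unit = sequence[start:start + length]
            following = sequence[start + length:start + 2 * length]
            if following == unit and unit not in found:
                found.append(unit)
    return found


# Hypothetical string in which the leaf-gathering subroutine is run twice.
observed = [
    "pull", "strip", "detach",
    "pull", "strip", "detach",
    "clean", "fold", "eat",
]

modules = repeated_modules(observed)
```

This is a deliberately crude sketch. With three or more repetitions it would also report rotated variants of the same unit, so a fuller treatment would need further cues, such as the position of pauses or optional stages, to settle where a module begins and ends.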
Further clues to modular structure are likely to be given by the
distribution of pauses (occurring between but not within modules), and the
possibility of smooth recovery from interruptions that occur between
modules. Gorillas often pause for several seconds during the processing
of a handful of plant material to monitor the movements and actions of
other individuals. Finally, a different module entirely may be substituted
for part of the usual sequence, for example if one hand is required for
postural support, then a normally bimanual process may need to be
performed unimanually, and if this module is recognized as an already
familiar sequence its substitution again reveals structure (see Goodwin
this volume, for a similar point concerning language repair). Eventually,
it may be that a taxonomy of substitutable methods is built up.
All these statistical regularities are precisely what enabled us, the
researchers, to discover the hierarchical nature of nettle processing by
adult gorillas (Byrne and Byrne 1993; Byrne and Russon 1998). The
behavior parsing model proposes that the same information can be
extracted and used by the apes themselves, and that this ability is what
enables a young ape to perceive and copy the sequential, bimanually
coordinated, hierarchical organization of complex skills from repeated
watching of another.
Behavior parsing enables the underlying hierarchical organization of
planned behavior to be picked out—but only under certain circumstances.
The first caveat, from what we know of living apes in the wild, is that it
is entirely possible that nonhuman apes’ capacity to parse behavior is
limited to the visible domain of manual and bodily actions, and thus
not available in the auditory domain. The bonobo Kanzi’s apparent
ability to parse human speech, when he responds correctly to words
whose referent depends on the syntactical organization of a relative
clause within a sentence (Savage-Rumbaugh et al. 1993), may cause
this qualification to be relaxed, at least for extensively human-reared
apes. For the moment, however, I will assume that living apes under
natural conditions, and our own earliest ancestors, had no such ability.
The great ape forte is evidently the manual domain, as convincingly
demonstrated in the hundreds of ASL signs acquired by participants in
“ape language” experiments (see chapters in Gardner et al. 1989). In
contrast, modern humans are routinely able to parse vocal material.
The second limitation, from the way the model works, is that “multiple
independent looks” are necessary. A single view of skilled behavior
that is unfamiliar in its organization will not result in a useful parsing,
so seeing multiple samples of efficient behavior is required. The samples
must be independent, so that there is information about the variance
within the strings of perceived elements; that is because only by a
sensitivity to the relative variability of elements can behavior parsing
locate the key (unvarying) elements. Thus, repeatedly viewing a film
clip of the same segment of skilled behavior would not serve to allow
unfamiliar behavior to be parsed. Note that, although we may well
substantially overrate our everyday abilities (Bargh and Chartrand 1999),
modern humans do not seem to be subject to this limitation. Gergely
(this volume) shows that babies over 16 months old are able to pick
out for imitation, using simple rationality criteria, the key elements of
behavior demonstrated only once; behavior parsing alone could not
explain these data. Before that critical age, I predict that babies are still
able to show program-level imitation, but will not select out specifically
rational features of the process to copy.
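The requirement for multiple independent looks can be made concrete with a small sketch, again under the assumption that behavior is observed as strings of discrete, labeled actions (the labels and the function name are hypothetical). The function scores each element by the fraction of observations in which it appears. Replaying one film clip gives every element the same score, so nothing is singled out, whereas independent samples separate the unvarying elements from the incidental ones.

```python
from collections import Counter


def element_consistency(observations):
    """Fraction of observations in which each element appears at least once."""
    counts = Counter()
    for obs in observations:
        counts.update(set(obs))
    return {element: n / len(observations) for element, n in counts.items()}


clip = ["reach", "scratch", "strip", "twist", "fold", "eat"]

# The same film clip viewed three times: no information about variance.
replayed_scores = element_consistency([clip, clip, clip])

# Three independent performances of the same task.
independent_scores = element_consistency([
    clip,
    ["reach", "strip", "twist", "pick_debris", "fold", "eat"],
    ["reach", "strip", "look_around", "twist", "fold", "eat"],
])
```

In the replayed case the incidental scratch is indistinguishable from the essential stripping movement; only in the independent case does its lower score mark it as noncritical.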

Why Great Apes?


Why should it have been only this one taxon of primate that developed
the rather special ability to parse a segmented stream of action
into a hierarchically organized structure—and thereby acquire novel,
complex skills by imitative learning? At present, the social brain or
Machiavellian intelligence hypothesis is widely accepted as the most
plausible explanation for the origin of primate intelligence (Brothers
1990; Byrne and Whiten 1988; Dunbar 1998; Humphrey 1976; Jolly
1966; Whiten and Byrne 1997). However, when it comes to
accounting for cognitive differences between monkeys and apes, it will not do.
According to the social brain hypothesis, the root cause of intellectual
advance is social complexity. Because the ancestors of modern haplorhine
primates (monkeys and apes) needed to live in increasingly
large social groups, yet individuals of each species were thereby put in
direct competition for resources with other group members, a selection
pressure resulted that favored increased social intelligence and a
concomitant enlargement in neocortex volume (Byrne 1996). Thus, today,
we find that primates living in larger groups have larger brains (Barton
and Dunbar 1997; Dunbar 1992), and are more likely to employ subtle
means of social manipulation such as deception (Byrne and Corp 2004).
Although this fits nicely with the differences among living species of
varied brain sizes, and gives a good account of the evolutionary
origins of the large-brained haplorhines, it does not distinguish between
monkeys and apes. There is no systematic difference in the causal
variable: the great apes simply do not live in larger social groups than do
many monkey species that have much smaller brains and show little sign of
the sophisticated cognition of apes.
This means that serious attention must be paid to alternative, ecological
selection pressures that might have promoted intelligence, at least for this
special case (Byrne 1997): for instance, is there an ecological challenge
that affects great apes more than monkeys? Because of the anatomical
differences between monkeys and apes, the answer is yes. Great apes
are systematically larger than monkeys, and because they are adapted
to brachiation (hanging below branches on long, powerful arms) costs
of long-distance travel are much greater for them than for monkeys.
However, apes are all specialists in easy-to-digest plant material—fruit
or soft leaves—which is ephemerally available and patchily distributed,
so they must regularly travel to find their food. Almost everywhere they
live, great apes share the forest with Old World monkeys—which are not
only smaller and more efficient in long-range travel but also happen to
have gut adaptations enabling them to eat fruit when slightly less ripe,
or leaves when slightly tougher, than can apes. Monkeys, in short, are
in direct niche competition with great apes and possess all the aces:
how have the living apes survived at all? The explanation becomes clear
when the details of their diet are examined: chimpanzees make tools
to extract social insects from their nests, and to break open hard nuts;
gorillas, and to a lesser extent chimpanzees, use elaborate, multistage
routines to deal with plant defenses; orangutans use complex, indirect
routes to reach defended arboreal food, and sometimes make tools to
gain access to bees’ nests or defended plant food. In each case, “clever”
methods of food extraction are used to gain access to foods that monkeys
would be unable to reach (see note 1). Thus, it becomes plausible that the Miocene
ancestors of the living great apes (and of ourselves) may have adapted
cognitively, in ways that would enable a broader range of food types to
be exploited. I propose that learning new skills by behavior parsing was
just this adaptation (other ecological theories of great apes’ advanced
cognition are reviewed and compared in Byrne 1997).

Parsing to “See” Intentions: The Origin of Mime and Gestural Language?

If the behavior parsing model is correct, human language and speech
evolved in a species that was already able to parse hierarchically organized
behavior—which might be no coincidence. Moreover, the ape ability to
“see below the surface” of behavior, and detect the logical organization
that produced it, has implications for other cognitive activities. Indeed,
the ability to learn new skills by imitation may be seen as just part
of a fundamental process of interpreting or understanding complex
behavior.
In the behavior-parsing model, processing starts from observed
behavior and no prior understanding is required of the physical cause and
effect of the actions on objects in the world, nor the intentions or other
mental states of the demonstrator. However, we know from common
experience that these more abstract representations form part of how
adult humans understand and discuss the world: so their evolutionary
origin must be explained. Behavior parsing might be a necessary step
on the road to seeing the world in an intentional-causal way.
Consider causation. Because a perceived parsing of complex action
will (in many cases) be applied to actions on objects in the world,
changes in the physical world will become linked to the sequence of
action—statistically. There is more to cause than correlation, but it
can be questioned whether that matters for everyday purposes, or for
evolution. Reliable correlation might be described as a “Pretty Good
Cause,” and only physicists dealing with the fundamentals of matter
may need to go much beyond it. The fact is that most things are seen
as likely to happen, to the extent that they or things very like them
have happened before under the same circumstances. The sun will rise
tomorrow morning because it has been doing so for a long time at rather
regular and statistically predictable intervals; not flawless logic, but
good enough. Any parent who has tried to answer a series of “Why?”
questions from a young child will know how soon one gets out of one’s
depth with causation: OK, so day and night are caused by the Earth
spinning on its axis, but why does it do that? Probing deeper into the
physics of most everyday situations helps little with everyday social
living. Knowing that actions are all ultimately a matter of exchange
of a whole range of exotically named particles, none of which are ever
going to be detected according to the normal meaning of that word,
does not provide a very satisfying advance on cause as correlation—and
behavior parsing picks out the correlational structure of the changing
environment quite well.
How could behavior parsing help us with intentionality? The perceived
organization of behavior that results from the parsing process will
inevitably be set in a real-world context of achievement of valuable ends,
just because individuals observed while engaged in skilful action will
only be doing so for biologically sufficient reasons. Often, demonstrators
will be close associates or relatives of the observers, confronting much
the same problems as them. Thus, associating a particular organizational
structure with the typical result of its performance is in many cases a
relatively trivial task: the point of achieving that particular result is
something the observer probably already understands. Intended purpose
is indicated by the usual result of successful performance. (“Unsuccessful”
is also identified statistically, here, on the basis of visible behavior. It
corresponds to those occasions when the action needs to be redone,
rather than moving on to another action.) This means that, in principle,
behavior parsing makes it possible to compute the prior intention of
the other individual: by recognizing a behavior pattern that would, if
the observing self performed it, achieve a comprehensible goal for the
self. Any animal capable of program-level imitation should therefore
also be able to detect from others’ behavior at least some intentions as
results, in cases where they have been able to gain the necessary prior
experience of that behavior. As in the case of causation, this is a weak
sense of the term: rather than an imagined mental state, intentions of
these kinds need be no more than proper results of the normal behavior
sequence. But similarly this sense of intention may be good enough for
most everyday purposes: animals sensitive to intentions as results will
not be able to conceive of false belief and deliberate trickery, but they
will be able to pick out the purposes of many everyday social actions.
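The weak sense of intention described above, the usual result of a recognized behavior pattern, amounts to a simple tally. The following sketch assumes that an observer stores each watched episode as an action pattern paired with its visible outcome. The episode data, the labels, and the function name are hypothetical.

```python
from collections import Counter


def usual_result(episodes, pattern):
    """Most frequent outcome observed after a given action pattern."""
    outcomes = Counter(
        result for actions, result in episodes if tuple(actions) == tuple(pattern)
    )
    if not outcomes:
        return None  # the pattern has never been observed
    return outcomes.most_common(1)[0][0]


# Hypothetical record of watched episodes: (action pattern, visible outcome).
episodes = [
    (("strip", "twist", "fold"), "eats_leaves"),
    (("strip", "twist", "fold"), "eats_leaves"),
    (("strip", "twist", "fold"), "redoes_action"),
    (("strip", "twist", "fold"), "eats_leaves"),
    (("dig", "pull"), "eats_root"),
]

intention = usual_result(episodes, ("strip", "twist", "fold"))
```

The occasional episode that ends in the action being redone is outvoted by the usual result, which is then treated as the purpose of the pattern. No representation of the demonstrator's mental state is involved.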
Animals with behavior parsing abilities, as indexed by their ability to
imitate at program level, might still be rather limited in understanding—
with causation reduced to correlation, and intentions reduced to
expected results. However, combined with the delicate and sophisticated
manual control of action that we find in all the living apes, even this
limited kind of understanding should be sufficient for communication
by means of gesture. Natural gestural communication in nonhuman
apes is a rather neglected topic, but current evidence shows that in
captivity both chimpanzees and gorillas develop gestures not seen in
the wild, and use them intentionally in dyadic communication (Pika
et al. 2003; Tanner 1998; Tanner and Byrne 1996, 1999; Tomasello et
al. 1985, 1989). Moreover, one of these studies found “iconic” gestures
that physically resembled action patterns, ones that the signaler
apparently desired the recipient to perform (Tanner and Byrne 1996;
note that Pika et al. 2003 did not find any such gestures in their study,
perhaps because the animals were still immature). Iconic gestures are
potentially a source of mimetic communication (Donald 1991), and
extensive use of mime by individuals who were able to compose and
parse hierarchically structured sequences could readily develop into a
simple gestural language, as proposed for early hominin communication
by several theorists (e.g., Corballis 1991; Hewes 1973). Moreover, the
ability of living great apes to extend their gestural repertoires when
helped by humans has been amply demonstrated in the various
“ape sign language” projects: whatever is believed of their linguistic
sophistication, there is no doubt that those chimpanzees, gorillas and
orangutans have learned many new manual gestures.
No such culture of mime and gestural communication exists in
any living, nonhuman ape (to our knowledge), but it does not seem a
farfetched speculation that this should have arisen in the evolutionary past
in a close relative, our own direct ancestor, who would certainly have had
the necessary behavior parsing capacities (see Fig. 18.3). In comparison
with the yawning gulf between primate vocal communication and
human speech, the gap is eminently bridgeable between the manual
interpretation and manual learning abilities of living apes and an ancestral
hominin that used gestural means to communicate an arbitrary range
of intentions. Research on the gestural capacities of modern humans
(e.g., deaf people) has the potential to illuminate fundamental aspects of
the foundations of language, and their relationship with more ancient,
primate abilities (Fig. 18.3) (Goldin-Meadow 1999; Senghas et al. 2004;
see also Pyers and Goldin-Meadow this volume).
On this hypothesis, development of spoken language was secondary,
and required release from the domain-specificity of behavior parsing
that we observe in nonhuman apes. This could have been a later
development: the much greater freedom that speaking allows, with
the hands released to engage in concurrent activity, might reasonably
be argued to correlate with a second advance in hominin sophistication
(Corballis 1991). Identification of these two points of cognitive advance
in the archaeological record is problematic, but note (1) the rapid increase
in brain sizes in the closely related species of hominin that populated
W and S Africa between 4.4 and 1.8 Mya (from the chimpanzee-size
brains of Australopithecus and Ardipithecus, to the large-brained Homo
ergaster) might be identified with animals that increasingly relied on
gestural language for their success, and (2) the apparent discontinuity
in cultural sophistication of Cro-Magnon from Neanderthal (Europe),
or Middle from Lower Stone Age (Africa), might reflect the advances
possible when language no longer tied up the hands.

Figure 18.3. A proposed evolutionary path of cognition: monkey to human.
The starting point is the most recent ancestor species in common to the Old
World monkeys and the surviving great apes including humans. From the
characteristics common to all its living descendants, this animal was relatively
large brained and showed social sophistication that was based on rapid learning
in social contexts. As described in the text, the primary adaptation in the lineage
leading to the great apes, diverging at about 12 Mya, was an ability to parse
hierarchically organized behavior, and thus acquire new feeding skills with
complex organization by observation. This enables behavior to be “understood”
in a simple way, in which causality is interpreted purely as correlation, and
intentions as normal results. The uniquely human adaptations, speculatively, are
separated into (1) the advent of gesture-based language in early hominins, and
(2) the more recent development of spoken language, and concomitantly the full
flowering of various “theory of mind” abilities.

It is perhaps worth stressing that this theoretical sequence does
not imply that early hominin gestural language had the full range of
expression of modern spoken language, whereas the sign languages of
modern deaf communities are known to do so. Certainly, derived from
the hierarchical parsing system of great ape manual skills, early hominin
communicative gesture would be expected to show some syntax (at
least as complex as that of phrase-structure grammars), and thus to
be generative and technically of “unlimited” productivity. However,
there is no reason to attribute to these animals the full modern range
of semantic distinctions; and although living great apes do show some
aspects of ToM (Byrne 1995, 1998, 2000; Tomasello et al. 2003), it seems
likely that these fall short of the full mentalizing abilities of five-year-
old children (Astington et al. 1988; Perner and Wimmer 1985; Wellman
1990). More work on the semantic roles understood by living great
apes has the potential to clarify the evolutionary origins of semantic
differentiation (Byrne et al. 2004).

Tailpiece: A Heretical Thought


Those who conduct behavioral experiments or analyze observational
data from the field to discover whether any animal has the ability to
represent the mental states of others become acutely aware that their
task is a difficult one because simpler mechanisms can in principle
generate richly complex behavior. In particular, this chapter has argued
that an understanding of planned behavior, in terms of hierarchically
organized structure that can be copied, with causality approximated by
correlation and purpose by normal results, can result from a mechanistic
process of behavioral analysis that need not involve any “mentalizing”
about the actual mental states of the observed party. Thus, great apes
show program-level imitation, but might still not possess ToM and
causal understanding. But what about humans?
Humans can and do represent causes and intentions: we explain
(away) our actions, on grounds of our beliefs, false or otherwise; we
teach our children by explaining that one thing causes another or that
some people have different beliefs to ourselves; and so on. But do these
retrospective, verbal accounts actually correspond to causal mental states
that generate our behavior when we are not explaining anything? We
are always reluctant to accept how much of our behavior is an automatic
and fast product of mental processes of which we are unaware (Bargh
and Chartrand 1999), but I think this should be seriously considered
for the case of ToM.
There are two possibilities. On the one hand, it may be that calculations
about others’ mental states are causal, and that the normal process
of automatization with practice simply renders them faster and more
efficient, to the point when they can only be made conscious by “offline”
deliberation. On the other hand, the heretical alternative is that
other processes, mechanistic but unconscious—analogous to those that
allow us to parse behavior—actually cause most of our everyday social
behavior and interactions with the world of objects, and mentalizing
is a secondary process (also see Good 1995; Danziger this volume). On
this view, mentalizing has different functions: these include teaching,
when we explain processes or people to a child, and deceit, when we
retrospectively construe our behavior in a way very different to what
we know to be accurate. Any such process of verbal (mis)construal
is certainly a function of language ability, and so must be recent
in human evolution. In contrast, the behavioral capacities that we
attribute to “ToM” may be much more ancient, and shared with the
living nonhuman great apes today, although they cannot explain and
discuss their actions as we can.

Note
1. There is a certain irony that Humphrey (1976) used the apparent lack
of any such environmental challenges to gorilla cognition as a justification
for advocating a hypothesis that advanced primate cognition developed in
response to the social challenge of living in a long-lasting group.

References
Astington, J. W., P. Harris, and D. R. Olson. 1988. Developing theories of
mind. Cambridge: Cambridge University Press.
Hauser, M. D., and R. W. Wrangham. 1987. Manipulation of food calls
in captive chimpanzees. Folia Primatologica 48:207–210.
Baldwin, D., E. Neuhaus, G. Guha, and A. Craven. n.d. Extracting
structure from dynamic human action. Unpublished MS, Department of
Psychology, University of Oregon.
Bargh, J. A., and T. L. Chartrand. 1999. The unbearable automaticity of
being. American Psychologist 54:462–479.
Barton, R. A., and R. I. M. Dunbar. 1997. Evolution of the social brain.
In Machiavellian intelligence, vol. 2: Extensions and evaluations, edited
by A. Whiten and R. W. Byrne, 240–263. Cambridge: Cambridge
University Press.
Boesch, C., and H. Boesch. 1990. Tool use and tool making in wild
chimpanzees. Folia Primatologica 54:86–99.
Brothers, L. 1990. The social brain: A project for integrating primate
behavior and neurophysiology in a new domain. Concepts in
Neuroscience 1:27–51.
Bugnyar, T., and L. Huber. 1997. Push or pull: An experimental study
of imitation in marmosets. Animal Behaviour 54:817–831.
Byrne, R. W. 1994. The evolution of intelligence. In Behaviour and
evolution, edited by P. J. B. Slater and T. R. Halliday, 223–265.
Cambridge: Cambridge University Press.
——. 1995. The thinking ape: Evolutionary origins of intelligence. Oxford:
Oxford University Press.
——. 1996. Machiavellian intelligence. Evolutionary Anthropology 5:172–180.
——. 1997. The technical intelligence hypothesis: An additional
evolutionary stimulus to intelligence? In Machiavellian intelligence,
vol. 2: Extensions and evaluations, edited by A. Whiten and R. W. Byrne,
289–311. Cambridge: Cambridge University Press.
——. 1998. Cognition in great apes. In Brain and cognition in monkeys,
apes and man, edited by A. D. Milner, 228–244. Oxford: Oxford
University Press.
——. 1999a. Cognition in great ape ecology. Skill-learning ability opens
up foraging opportunities. Symposia of the Zoological Society of London
72:333–350.
——. 1999b. Imitation without intentionality. Using string parsing to
copy the organization of behaviour. Animal Cognition 2:63–72.
——. 1999c. Object manipulation and skill organization in the complex
food preparation of mountain gorillas. In The mentality of gorillas and
orangutans, edited by S. T. Parker, R. W. Mitchell, and H. L. Miles,
147–159. Cambridge: Cambridge University Press.
——. 2000. The evolution of primate cognition. Cognitive Science 24:543–570.
——. 2001. Clever hands: The food processing skills of mountain gorillas.
In Mountain gorillas: Three decades of research at Karisoke, edited by
M. M. Robbins, P. Sicotte, and K. J. Stewart, 293–313. Cambridge:
Cambridge University Press.
——. 2002. Imitation of complex novel actions: What does the evidence
from animals mean? Advances in the Study of Behavior 31:77–105.
——. 2003. Imitation as behaviour parsing. Philosophical Transactions
of the Royal Society of London (B) 358:529–536.
——. 2005. Detecting, understanding, and explaining animal imitation.
In Perspectives on imitation: From mirror neurons to memes, edited by S.
Hurley and N. Chater, 255–282. Cambridge, MA: MIT Press.
Byrne, R. W., P. J. Barnard, I. Davidson, V. M. Janik, W. C. McGrew, A.
Miklósi, and P. Wiessner. 2004. Understanding culture across species.
Trends in Cognitive Sciences 8:341–346.
Byrne, R. W., and J. M. E. Byrne. 1993. Complex leaf-gathering skills of
mountain gorillas (Gorilla g. beringei): Variability and standardization.
American Journal of Primatology 31:241–261.
Byrne, R. W., and N. Corp. 2004. Neocortex size predicts deception
rate in primates. Proceedings of the Royal Society of London: Biology
271:1693–1699.
Byrne, R. W., N. Corp, and J. M. E. Byrne. 2001a. Estimating the
complexity of animal behavior: How mountain gorillas eat thistles.
Behaviour 138:525–557.
——. 2001b. Manual dexterity in the gorilla: Bimanual and digit role
differentiation in a natural task. Animal Cognition 4:347–361.
Byrne, R. W., and A. E. Russon. 1998. Learning by imitation: A hierarchical
approach. Behavioral and Brain Sciences 21:667–721.
Byrne, R. W., and E. J. Stokes. 2002. Effects of manual disability on
feeding skills in gorillas and chimpanzees: a cognitive analysis.
International Journal of Primatology 23:539–554.
Byrne, R. W., and A. Whiten. 1988. Machiavellian Intelligence: Social
expertise and the evolution of intellect in monkeys, apes and humans.
Oxford: Clarendon Press.
Corballis, M. C. 1991. The lopsided ape. Oxford: Oxford University
Press.
Corp, N., and R. W. Byrne. 2002a. Leaf processing of wild chimpanzees:
Physically defended leaves reveal complex manual skills. Ethology
108:1–24.
——. 2002b. The ontogeny of manual skill in wild chimpanzees: Evidence
from feeding on the fruit of Saba florida. Behaviour 139:137–168.
Crockford, C., I. Herbinger, L. Vigilant, and C. Boesch. 2004. Wild
chimpanzees produce group-specific calls: A case for vocal learning?
Ethology 110:221–243.
Custance, D., A. Whiten, and T. Fredman. 1999. Social learning of an
artificial fruit task in capuchin monkeys (Cebus apella). Journal of
Comparative Psychology 113(1):13–23.
Donald, M. 1991. Origins of the human mind: Three stages in the evolution
of culture and cognition. Cambridge, MA: Harvard University Press.
Dunbar, R. I. M. 1992. Neocortex size as a constraint on group size in
primates. Journal of Human Evolution 20:469–493.
——. 1998. The social brain hypothesis. Evolutionary Anthropology
6:178–190.
Dunbar, R. I. M., and D. Nettle. 1997. Social markers and the evolution
of reciprocal exchange. Current Anthropology 38:93–99.
Elliott, J. M., and K. J. Connolly. 1984. A classification of manipulative
hand movements. Developmental Medicine and Child Neurology 26:283–
296.
Fox, F., A. Sitompul, and C. P. Van Schaik. 1999. Intelligent tool use in
wild Sumatran orangutans. In The mentality of gorillas and orangutans,
edited by S. T. Parker, H. L. Miles, and R. W. Mitchell, 99–116.
Cambridge: Cambridge University Press.
Galef, B. G. 1988. Imitation in animals: History, definitions, and
interpretation of data from the psychological laboratory. In Social
learning: Psychological and biological perspectives, edited by T. Zentall
and B. G. Galef Jr., 3–28. Hillsdale, NJ: Erlbaum.
Gallese, V., L. Fadiga, L. Fogassi, and G. Rizzolatti. 1996. Action
recognition in the premotor cortex. Brain 119:593–609.
Gardner, R. A., B. T. Gardner, and T. E. Van Cantfort. 1989. Teaching
sign language to chimpanzees. New York: State University of New York
Press.
Goldin-Meadow, S. 1999. The role of gesture in communication and
thinking. Trends in Cognitive Sciences 3(11):419–429.
Good, D. 1995. When does foresight end and hindsight begin? In Social
intelligence and interaction: Expressions and implications of the social bias
in human intelligence, edited by E. N. Goody, 139–149. Cambridge:
Cambridge University Press.
Goodall, J. 1986. The chimpanzees of Gombe: Patterns of behavior.
Cambridge, MA: Harvard University Press.
Gouzoules, S., H. Gouzoules, and P. Marler. 1984. Rhesus monkey (Macaca
mulatta) screams: Representational signalling in the recruitment of
agonistic aid. Animal Behaviour 32(1):182–193.
Green, S. 1975. Dialects in Japanese monkeys: vocal learning and
cultural transmission of locale-specific vocal behaviour? Zeitschrift
für Tierpsychologie 38:304–314.
Hewes, G. W. 1973. Primate communication and the gestural origins
of language. Current Anthropology 14:5–24.
Heyes, C. M., and E. D. Ray. 2000. What is the significance of imitation
in animals? Advances in the Study of Behavior 29:215–245.
Humphrey, N. K. 1976. The social function of intellect. In Growing
Points in Ethology, edited by P. P. G. Bateson and R. A. Hinde, 303–317.
Cambridge: Cambridge University Press.
Janik, V. M., and P. J. B. Slater. 1997. Vocal learning in mammals.
Advances in the Study of Behavior 26:59–99.
Jolly, A. 1966. Lemur social behaviour and primate intelligence. Science
153(3735):501–506.
Kuhl, P. K. 1982. Discrimination of speech by non-human animals—basic
auditory sensitivities conducive to the perception of speech-sound
categories. Journal of the Acoustical Society of America 70:340–349.
Lorenz, K. 1950. The comparative method in studying innate behaviour
patterns. Symposia of the Society for Experimental Biology 4:221–268.
Marler, P., and R. Tenaza. 1977. Signalling behaviour of apes with special
reference to vocalization. In How animals communicate, edited by T.
Sebeok, 965–1033. Bloomington: Indiana University Press.
Matsuzawa, T. 2001. Primate foundations of human intelligence: A view
of tool use in nonhuman primates and fossil hominids. In Primate
origins of human cognition and behavior, edited by T. Matsuzawa, 3–25.
Tokyo: Springer-Verlag.
Matsuzawa, T., and G. Yamakoshi (eds.). 1996. Comparisons of chimpanzee
material culture between Bossou and Nimba, West Africa. Cambridge:
Cambridge University Press.
Mitani, J. C., T. Hasegawa, J. Gros-Louis, P. Marler, and R. W. Byrne.
1992. Dialects in wild chimpanzees? American Journal of Primatology
27:233–243.
Napier, J. R. 1961. Prehensility and opposability in the hands of primates.
Symposia of the Zoological Society of London 5:115–132.
Perner, J., and H. Wimmer. 1985. “John thinks that Mary thinks that”
attribution of second-order beliefs by 5- to 10-year old children.
Journal of Experimental Child Psychology 39:437–471.
Pika, S., K. Liebal, and M. Tomasello. 2003. Gestural communication
in young gorillas (Gorilla gorilla): Gestural repertoire, learning, and
use. American Journal of Primatology 60:95–111.
Rizzolatti, G., L. Fadiga, L. Fogassi, and V. Gallese. 1996. Premotor cortex
and the recognition of motor actions. Brain Research 3:131–141.
——. 2002. From mirror neurons to imitation: Facts and speculations. In
The imitative mind: Development, evolution, and brain bases, edited by
A. Meltzoff and W. Prinz, 247–266. Cambridge: Cambridge University
Press.
Russon, A. E. 1998. The nature and evolution of intelligence in orangutans
(Pongo pygmaeus). Primates 39(4):485–503.
Saffran, J. R., R. N. Aslin, and E. L. Newport. 1996. Statistical learning
by 8-month-old infants. Science 274(5294):1926–1928.
Savage-Rumbaugh, E. S., J. Murphy, R. A. Sevcik, K. E. Brakke, S. L.
Williams, and D. M. Rumbaugh. 1993. Language comprehension
in ape and child. Monographs of the Society for Research in Child
Development 58:1–252.
Senghas, A., S. Kita, and A. Ozyurek. 2004. Children creating core
properties of language: evidence from an emerging sign language in
Nicaragua. Science 305(5691):1779–1782.
Seyfarth, R. M., and D. L. Cheney. 1986. Vocal development in vervet
monkeys. Animal Behaviour 34:1640–1658.
Spence, K. W. 1937. Experimental studies of learning and higher mental
processes in infra-human primates. Psychological Bulletin 34:806–850.
Stokes, E. J., and R. W. Byrne. 2001. Cognitive capacities for behavioural
flexibility in wild chimpanzees (Pan troglodytes): The effect of snare
injury on complex manual food processing. Animal Cognition 4:11–28.
Stokes, E. J., D. Quiatt, and V. Reynolds. 1999. Snare injuries to
chimpanzees (Pan troglodytes) at 10 study sites in East and West Africa.
American Journal of Primatology 49:104–105.
Tanner, J. E. 1998. Gestural communication in a group of zoo-living
lowland gorillas. Ph.D. dissertation, Department of Psychology, St
Andrews.
Tanner, J. E., and R. W. Byrne. 1996. Representation of action through
iconic gesture in a captive lowland gorilla. Current Anthropology
37:162–173.
——. 1999. The development of spontaneous gestural communication
in a group of zoo-living lowland gorillas. In The mentalities of gorillas
and orangutans: Comparative perspectives, edited by S. T. Parker, R. W.
Mitchell, and H. L. Miles, 211–239. Cambridge: Cambridge University
Press.
Tomasello, M., J. Call, and B. Hare. 2003. Chimpanzees understand
psychological states—the question is which ones and to what extent.
Trends in Cognitive Sciences 7:153–156.
Tomasello, M., B. George, A. Kruger, J. Farrar, and E. Evans. 1985. The
development of gestural communication in young chimpanzees.
Journal of Human Evolution 14:175–186.
Tomasello, M., D. Gust, and T. A. Frost. 1989. A longitudinal investigation
of gestural communication in young chimpanzees. Primates 30:35–50.
Visalberghi, E., and D. M. Fragaszy. 1990. Do monkeys ape? In “Language”
and Intelligence in Monkeys and Apes, edited by S. T. Parker and K. A.
Gibson, 247–273. Cambridge: Cambridge University Press.
Wellman, H. M. 1990. Children’s theories of mind. Cambridge, MA:
Bradford–MIT Press.
Whiten, A. 1998. Imitation of the sequential structure of actions by
chimpanzees (Pan troglodytes). Journal of Comparative Psychology
112:270–281.
Whiten, A., and R. W. Byrne (eds.). 1997. Machiavellian intelligence, vol. 2:
Extensions and evaluations. Cambridge: Cambridge University Press.
Whiten, A., D. M. Custance, J.-C. Gomez, P. Teixidor, and K. A. Bard.
1996. Imitative learning of artificial fruit processing in children (Homo
sapiens) and chimpanzees (Pan troglodytes). Journal of Comparative
Psychology 110:3–14.
Zuberbühler, K. 2002. A syntactic rule in forest monkey communication.
Animal Behaviour 63:293–299.
nineteen

Why Don't Apes Point?


Michael Tomasello

Chimpanzees gesture to one another regularly. Although some of
their gestures are relatively inflexible displays invariably elicited
by particular environmental events, an important subset are learned by
individuals and used flexibly—such things as “arm raise” to elicit play or
“touch side” to request nursing. We know that such gestures are learned
because in many cases only some individuals use them, and indeed
several observers have noted the existence of idiosyncratic gestures
used by only single individuals (Goodall 1986). And their flexible use
has been repeatedly documented in the sense of a single gesture being
used for multiple communicative ends, and the same communicative
end being served by multiple gestures (Tomasello et al. 1985, 1989).
Flexible use is also evident in the fact that apes only use their visually
based gestures such as “arm raise” when the recipient is already visually
oriented toward them—so-called audience effects (Kaminski et al. 2004;
Tomasello et al. 1994, 1997).
Chimpanzees and other great apes also know quite a bit about what
other individuals do and do not see. They follow the gaze direction of
conspecifics to relatively distal locations (Tomasello et al. 1998), and
they even follow another’s gaze direction around and behind barriers to
locate specific targets (Tomasello et al. 1999). This gaze following is not an
inflexible response to a stimulus, as from a certain age chimpanzees look
where another individual is looking and, if they find nothing interesting
on that line of sight, check back a second time and try again (Call et
al. 2000). Indeed if a chimpanzee follows another’s line of sight and
repeatedly finds nothing there, they will quit following that individual’s
gaze altogether (Tomasello et al. 2001). And some experiments have
even demonstrated that chimpanzees know the content of what another
sees, as individuals act differently if a competitor does or does not see
a potentially contestable food item (Hare et al. 2000, 2001).
And so the question arises: If chimpanzees have the ability to gesture
flexibly and they also know something about what others do and do
not see—and there are certainly occasions in their lives when making
someone see something would be useful—why do they not sometimes
attempt to direct another’s attention to something it does not see by
means of a pointing gesture or something equivalent? Some might
object that they do do this on occasion in some experimental settings,
but this only deepens the mystery. The observation is that captive
chimpanzees will often “point” (whole arm with open hand) to food so
that humans will give it to them (Leavens and Hopkins 1998) or also,
in the case of human-raised apes, to currently inaccessible locations
they want access to (Savage-Rumbaugh 1990). This means that apes
can, in unnatural circumstances with members of the human species,
learn to do something in some ways equivalent to pointing (in one of
its functions). And yet there is not a single reliable observation, by any
scientist anywhere, of one ape pointing for another.1
But maybe we should look at this question from the other direction,
that is, from the direction of humans. The fact is that chimpanzees and
the other great apes are doing the typical thing, by not pointing; it is
human beings who are doing this strange thing called pointing. What
are humans doing when they do this, and why are they doing it? As
an advocate of the comparative method in psychological research,
I believe that these two questions—why apes do not point and why
humans do—are best answered together. I will attempt that here, using
for comparison human infants (to avoid the dizzying complexities of
language) and our nearest primate relatives, the great apes, especially
chimpanzees (for whom there is the greatest amount of empirical
work).

The Comprehension of Pointing


In an experiment with apes and human children, Tomasello et al. (1997)
had one person, called the “hider,” hide food or a toy from the subject
in one of three distinctive containers. Later, a second person, called the
“helper,” showed the subject where it was by tilting the appropriate
container toward them, so that they could see the prize, just before their
attempt to find it. After this warm-up period in which he defined his
role, the helper began helping not by showing the food or toy but by
giving signs, one of which was pointing (with gaze alternation between
subject and bucket as an additional cue to his intentions). The apes as
a group were very poor (at chance) in comprehending the meaning of
the pointing gesture, even though they were attentive and motivated
on virtually every trial. (Itakura et al. 1999, used a trained chimpanzee
conspecific to give a similar cue but still found negative results.) Human
two-year-old children, in contrast, performed very skillfully in this so-
called object choice task. Subsequent studies have shown that apes are
also generally unable to use other kinds of communicative cues (see
Call and Tomasello in press, for a review), and that even prelinguistic
human infants of fourteen months of age can comprehend the meaning
of the pointing gesture in this situation (Behne et al. in press).
It is important to recall that apes are very good at following gaze
direction in general (including of humans), and so their struggles in
the object choice task do not emanate from an inability to follow the
directionality of the pointing–gazing cue. Rather, it seems that they
do not understand the meaning of this cue—they do not understand
either that the human is directing their attention in this direction
intentionally or why she is doing so. As evidence for this interpretation,
Hare and Tomasello (2004) compared this pointing gesture with a
similar but different cue. Specifically, in one condition they had the
experimenter first establish a competitive relationship with the ape, and
then subsequently reach unsuccessfully in the direction of the baited
bucket (because the hole through which he reached would not enable
his arm to go far enough). In this situation, with an extended arm that
resembled in many ways a pointing gesture (but with thwarted effort
and without gaze alternation), apes suddenly became successful. One
interpretation is that in this situation apes understood the human’s
simple goal or intention to get into the bucket, and from this inferred
the presence of food there (and other research has shown their strong
skills for making inferences of this type; Call in press).
But understanding goals or intentions is not the same thing as
understanding communicative intentions. Nor is following gaze the same
thing as understanding communicative intentions. In simple behavior
reading or gaze following, the individual just gathers information from
another individual in whatever way it can—by observing behavior and
other happenings in the immediate surroundings and making inferences
from them. The object choice task, however, is a communicative
situation in which the subject must understand the experimenter’s
communicative intentions, that is, she must understand that the
looking or pointing behavior of the human is done “for me” and so is
relevant in some way for the foraging task I am facing. Said another
way, to use the cue effectively, the subject should understand that the
experimenter intends for her gaze or point to be taken as informative.
Instead, chimpanzees seem to see the task as simply another case of
problem solving in which all things in the context should be taken as
potential sources of information—with the gaze direction or pointing
of the interactant as just another information source. Human infants,
on the other hand, understand in this situation that the adult has made
this gesture for them, in an attempt to direct their attention to one of
the buckets, and so this gesture should be relevant for their current
goal to find the toy (see Sperber and Wilson 1986, on relevance). That
is to say, they understand the adult’s communicative intention—her
intention to inform me of something—which is an intention toward
my intentional states (an embedded intention).
An important aspect of this process is the joint attentional frame, or
common communicative ground, which gives the pointing gesture its
meaning in specific contexts (Clark 1996; Enfield this volume). Thus,
if you encounter me on the street and I simply point to the side of a
building, the appropriate response would be “Huh?” But if we both
know together that you are searching for your new dentist’s office, then
the point is immediately meaningful. In the object choice task, human
infants seem to establish with the experimenter a joint attentional
frame—perhaps mutual knowledge—that “what we are doing” is playing
a game in which I search for the toy (and you help me)—so the point
is now taken as informing me where the toy is located. The infant asks
herself, so to speak, why is the adult directing my attention to that
bucket, why is it relevant to this game?
It is very likely that apes do not create such joint
attentional frames, or common communicative ground, with either
conspecifics or humans. Tomasello et al. (in press) argue and present
evidence that, more generally, apes do not form with others joint
intentions to do things collaboratively (an analysis that also applies
to their so-called cooperative hunting; see note 2 below), and without
some kind of joint goals or intentions there are few opportunities for
joint attention. In a direct cross-species comparison, Warneken et al. (in
press) found that human one- and two-year-olds already engage with
others collaboratively in various ways (even encouraging the other in
his role when he is recalcitrant), whereas young chimpanzees engage
with others in a much less collaborative fashion (with no encouraging
of the other to play her role; see Povinelli and O’Neill 2000, for a similar
finding). And Tomasello and Carpenter (in press), in another direct
comparison and using identical operational criteria, found basically
no joint attentional engagement in young chimpanzees interacting
with humans. It is also relevant that from their earliest attempts at
communication, human infants engage in a kind of conversation or
“negotiation of meaning” in which they adjust their communicative
attempts in the light of the listener’s signs of comprehension or
noncomprehension (Golinkoff 1993)—a style of communication that
is essentially collaborative, and that other primate species do not, as
far as we know, employ (there are no observations of one ape asking
another for clarification or repairing a communicative formulation in
anticipation of its being misunderstood).
And so my answer to the question of why apes do not seem to
comprehend the pointing gesture is that: (1) they do not understand the
embedded structure of informing or communicative intentions (she
intends to change my intentional states, i.e., by informing me of
something); and (2) they do not participate with others in the kinds of
collaborative joint attentional engagements that create the common
communicative ground necessary for pointing and other deictic gestures
to be meaningful in particular contexts.

The Production of Pointing


Classically, human infants are thought to point for two main reasons:
(1) they point imperatively when they want the adult to do something
for them (e.g., give them something, “Juice!”); and (2) they point
declaratively when they want the adult to share attention with them
to some interesting event or object (“Look!”; Bates et al. 1975). Although
some apes, especially those with extensive human contact, sometimes
point imperatively for humans (see above), no apes ever point
declaratively. Indeed, when Tomasello and Carpenter (in press) repeatedly used
procedures that reliably elicit declarative pointing from young human
infants, they were unable to induce any declarative pointing from any of
three young chimpanzees. Typically developing human infants, on the
other hand, spontaneously begin pointing declaratively at around the
first birthday—the same age at which they first point imperatively. The
difference between these two types of pointing is clearly not motoric or
cognitive in any simple and straightforward sense. The main difference
is motivational (with perhaps a cognitive dimension to this in the sense
that infants may be motivated to do things that apes cannot even
conceive). So why do human infants simply point to things when they
do not want to obtain them?
In a recent study, Liszkowski et al. (2004, this volume) addressed this
question by having an adult react to the declarative points of 12-month-olds
systematically in one of four different ways—and then observing
their reaction. In one condition, the adult reacted as “she wants me
to look at the object” by simply looking at the object. In a second
condition the adult reacted as “she wants me to get excited” by simply
emoting positively toward the child. In a third control condition the
adult showed no reaction. In all three of these cases infants reacted in
ways that showed they were not satisfied with the adult’s response—this
was not their goal—by doing such things as pointing again. In contrast,
in a fourth condition the adult responded by looking back and forth
from the object to the infant and commenting positively. Infants were
satisfied with this response—they pointed just once, for a long time—implying
that this response was indeed what they wanted. One interpretation
of this adult response is that it represents a sharing of interest and
attention to some external entity, and this by itself is rewarding for
infants—apparently in a way it is not for any other species on the planet.
This interpretation is supported by the fact that infants at this age also
regularly hold up objects to show them to others, seemingly wanting
nothing from the adult but a sharing of experience (and emotion),
and again apes simply never hold things up to show them to others
(Tomasello and Camaioni 1997).
An important clarification. In the case of imperative pointing, which
some apes sometimes do for humans, it is important to recognize that an
individual may point imperatively in different ways, with different kinds
of underlying understanding. One might point imperatively simply as a
procedure for making things happen, based on past experience in which
this behavior induced others to do such things as fetch objects. But it
is also possible that one might point imperatively in full knowledge
that what is happening is that one is making one’s desire manifest, and
the other person understands this and chooses, deliberately, to help
obtain it. Thus, Shwe and Markman (1997) had an adult respond to
the requests of two-year-olds by, among other things, refusing them or
misunderstanding them. When the child’s request was refused she was
not happy and displayed this in various ways. But when her request
was misunderstood—even in cases in which the adult actually gave her
what she wanted unintentionally (“You want this (wrong object)? You
can’t have it but you can have this one (right object) instead.”)—the
child was not fully satisfied and often repeated her request. Under this
interpretation, infants from a certain age are pointing imperatively not
as a blind procedure for making things happen, but as a request that
the adult know her goal and decide to help her attain it. We cannot
be certain, but it may be that apes with humans are doing one kind of
imperative pointing and human infants are doing another.
In addition to these two main motives for infants’ pointing, Liszkowski
et al. (in press; Liszkowski this volume) identified a third major motive.
An adult engaged in one of several activities in front of the child.
This was an adult activity, such as stapling papers, and the adult did
not attempt to engage the child in it in any way. The adult was then
distracted for a moment, during which time the key object, for example,
the stapler, was displaced (in one of several ways). The adult then
returned, picked up her papers, and looked around searchingly (palms
up, quizzical expression—no language). Preverbal infants as young as
12 months of age quite often pointed to the stapler for the adult (and
not to a distractor object that had been displaced at the same time). In
our interpretation, the infant in this situation is simply informing the
adult of something she does not know, that is to say, helping her by
providing her with information she does not have. This interpretation
is not far-fetched, as a similar helping motive is also evident in 18-
month-old infants’ behavior in noncommunicative situations, when
they do such things as help adults reach out-of-reach objects, open
doors for them when their hands are full, and so forth—whereas in this
same paradigm human-raised apes showed few signs of such helping
(Warneken and Tomasello n.d.).
As hinted at above, these motives may imply some unique
understanding of others. For example, the declarative motive assumes a partner with
the psychological states of interest and attention, which one can then
attempt to share. But perhaps most strikingly, the informative motive
implies an understanding of the distinction between knowledge and
ignorance in the partner. I inform you of things because, presumably,
I think that you do not know them and you would like to have the
information. It is widely believed that young infants do not have an
understanding of knowledge vs. ignorance, but recent research has
demonstrated that they do. Tomasello and Haberl (2003) had an adult
say to 12- and 18-month-old infants “Oh, wow! That’s so cool! Can you
give it to me?” while gesturing ambiguously in the direction of three
objects. Two of these objects were “old” for the adult—he and the child
had played together with them previously—and one was “new” to him
(although not to the child, who had played with it also previously).
Infants gave the adult the object that was new for him. Infants knew
which objects the adult had experienced, and which he had not.
In a recent similar study, Moll et al. (in press) found that when an
adult looked at an object she and the child had just finished playing
with together and said excitedly “Oh, wow! That’s so cool!,” 14- and
18-month-old infants assumed she was not talking about the object—
they knew she could not be excited about the object that they had
just played with together—and so they looked for some other target of
her excitement. When the object was new to the adult—they had not
previously played with it together—infants simply assumed that the
adult was excited about the object. There is no systematic research on
apes’ skills of determining what is new or old for another person. But
when the Moll et al. paradigm was used with three young chimpanzees,
they did not differentiate between the cases in which the object was
old and new for the human (Tomasello and Carpenter in press). It
is also relevant that in a systematic review of ape vocal and gestural
communication, Tomasello (2003b) considers their ability to adjust for
different audiences and notes that the audience effects that exist are
based on whether others are present or not in the immediate context,
or whether they are oriented toward them bodily. There is no evidence
that primates take account of others’ intentional or mental states to
adjust their communicative formulations.
In general, in the current analysis, the underlying motives for infants’
pointing, and responding to adult points, may be decomposed into
two basic underlying motives: helping and sharing. With imperative
points they are requesting help, and when they respond to these from
adults they are helping. With declarative points, and in responding to
these, they are sharing. With informative points they are helping others
by sharing information (and as they learn language they begin to ask
questions as a way of requesting that others share information with
them). Apparently, other ape species do not have these same motivations
to help and share with others. And so my answer to the question of
why chimpanzees and other apes do not produce points—for sure not
declarative and informative points, no matter how they are brought
up—is that: (1) they do not have the motives to share experience with
others or to help them by informing; and (2) they do not really know
what is informationally new for others, and so what is worthy of their
communicative efforts.

Learning to Point
No one knows how human infants come to point for others. But given
cross-cultural differences in infants’ gestural behavior (although these
have not been documented as specifically as one might like), it would
seem clear that the major process is one of learning. There are two
main candidates.
First is some form of ritualization. For example, a very young infant
might reach for a distant object, at which point her mother might discern
the intention and obtain the object for her—leading to a ritualized form
of reaching that resembles pointing (Vygotsky 1978). We can also extend
this hypothetical scenario to the case that, by most accounts, seems
more likely, when infants use arm and index finger extension to orient
their own attention to things. If an adult were to respond to this by
attending to the same thing and then share excitement with the infant
by smiling and talking to her, then this kind of pointing might also
become ritualized—that is, a learned procedure for producing a desired
social effect. In this scenario it would be possible for an infant to point
for others while still not understanding the pointing gesture of others,
and indeed a number of empirical studies find just such dissociations
in many young infants (Franco and Butterworth 1996). Infants who
learn to point via ritualization, therefore, may understand their gesture
from the “inside” only, as a procedure for getting something done,
not as an invitation to share attention using a mutually understood
communicative convention.
The alternative is that the infant observes an adult point for her
and comprehends that the adult is attempting to induce her to share
attention to something, and then imitatively learns that when she
has the same goal she can use the same means, thus creating an
intersubjective symbolic act for sharing attention. It is crucial that in
this learning process—one form of what Tomasello et al. (1993) called
cultural learning—the infant is not just mimicking adults sticking out
their fingers; she is truly understanding and attempting to reproduce the
adult’s intentionally communicative act, including both means and end.
It is crucial because a bidirectional symbol can only be created when the
child first understands the intentions behind the adult’s communicative
act, and then identifies with those intentions herself as she produces
the “same” means for the “same” end.
Empirically we do not know whether infants learn to point via
ritualization or imitative learning or whether, as I suspect, some infants learn
in one way (esp. prior to their first birthdays) and some learn in the
other. And it may even happen that an infant who learns to point via
ritualization at some later point comes to comprehend adult pointing in
a new way, and so comes to a new understanding of her own pointing
and its equivalence to the adult version. Thus, Franco and Butterworth
(1996) found that when many infants first begin to point they do not
seem to monitor the adult’s reaction at all. Some months later they look
to the adult after they have pointed to observe her reaction, and some
months after that they look to the adult first, to secure her attention on
themselves, before they engage in the pointing act—perhaps evidencing
a new understanding of the adult’s comprehension.
Virtually all of chimpanzees’ flexibly produced gestures are intention
movements that have been ritualized in interaction with others. For
example, an infant chimpanzee who wants to climb on its mother’s
back may first actually pull down physically on her rear end to make
the back accessible, after which the mother learns to anticipate on first
touch, which the infant then notices and exploits in the future. The
general form of this type of learning is thus:
1. Individual A performs behavior X (noncommunicative).
2. Individual B reacts consistently with behavior Y.
3. Subsequently B anticipates A’s performance of X, on the basis of its
initial step, by performing Y.
4. Subsequently, A anticipates B’s anticipation and produces the initial
step in a ritualized form (waiting for a response) to elicit Y.

The main point is that a behavior that was not at first a communicative
signal becomes one by virtue of the anticipations of the interactants
over time. There is very good evidence from a series of longitudinal
and experimental studies that chimpanzees do not learn their gestures
by imitating one another but, rather, by ritualizing them with one
another in this way (see Tomasello and Call 1997, for a review). This
means that chimpanzees use and understand their gestures as one-way
procedures for getting things done, not as intersubjectively shared,
bidirectional coordination devices or symbols. At least some support
for this hypothesis is also provided by the fact that young chimpanzees,
unlike human infants, do not spontaneously reverse roles when someone
acts on them and invites a reciprocal action in return; that is, they do
not engage in role reversal imitation of instrumental acts (Tomasello
and Carpenter in press).
In general, two decades of experimental research have demonstrated
conclusively that, among primates, human beings are by far the most
skilled and motivated imitators (see Tomasello 1996, for a review). More
controversially, I would claim that some types of imitative learning are
uniquely human, specifically those that require the learner to understand
the intentions of the actor, that is, not only the actor’s goal but also
his plan of action or means of execution for reaching that goal. When
the intentions are actually communicative intentions—involving the
embedding of one intention within another or the reversing of roles
within a communicative act—apes are simply, in my view, not capable
of either understanding or reproducing these. This means that their
communicative devices are not in any sense shared in the manner of
human communicative conventions such as pointing and language.

Shared Intentionality
So why don’t apes point? I have given here more or less five fundamental
reasons:
1. they do not understand communicative intentions
2. they do not participate in joint attentional engagement as
common communicative ground within which deictic gestures are
meaningful
3. they do not have the motives to help and to share
4. they are not motivated to inform others of things because they
cannot determine what is old and new information for them (i.e.,
they do not really understand informing, per se)
5. they cannot imitatively learn communicative conventions as
inherently bidirectional coordination devices with reversible roles
And so the obvious question is: are these really five different reasons, or
are they all part of one or a few more fundamental reason(s)?
My proposal here is that all of these reasons are basically reflections of
the more fundamental fact that only humans engage with one another
in acts of what some philosophers of action call shared intentionality,
or sometimes “we” intentionality, in which participants have a shared
goal and coordinated action roles for pursuing that shared goal (Bratman
1992; Clark 1996; Gilbert 1989; Searle 1995; Tuomela 1995). The activity
itself may be complex (e.g., building a building, playing a symphony)
or simple (e.g., taking a walk together, engaging in conversation), so
long as the interactants are engaged with one another in a particular
way. In all cases the goals and intentions of each interactant must
include as content something of the goals and intentions of the other.
When individuals in complex social groups share intentions with
one another repeatedly in particular interactive contexts, the result is
habitual social practices and beliefs that sometimes create what Searle
(1995) calls social or institutional facts: such things as marriage, money,
and government, which only exist because of the shared practices and
beliefs of a group.
In my previous approach to these problems (e.g., Tomasello 1999),
I hypothesized that only human beings understand one another as
intentional agents—with goals and perceptions of their own—and
this is what accounts for many uniquely human social cognitive skills,
including those of cultural learning and conventional communication,
that would seem to involve one or another form of shared intentionality.
We now have data, however, that have convinced me that at least some
great apes do understand that others have goals and perceptions (not, by
the way, thoughts and beliefs), as summarized by Tomasello et al. (2003).
The details of these data do not concern us here, but the immediate
theoretical problem is how we should account for uniquely human
cultural cognition, as we sometimes call it, if not by humans’ exclusive
ability to understand others intentionally.
Tomasello et al. (in press) present a new proposal that identifies the
uniquely human social cognitive skills not as involving the understanding
of intentionality simpliciter, but as involving the ability
to create with others in collaborative interactions joint intentions
and joint attention (which in the old theory basically came for free
once one understood others as intentional agents). These basic skills
of shared intentionality involve both a new motivation for sharing
psychological states, such as goals and experiences, with conspecifics,
and perhaps as well new forms of cognitive representation (what we
call dialogic cognitive representations) for doing so. Evolutionarily (see
also Boyd this volume), the proposal is that individual humans who
were especially skilled at collaborative interactions with others were
adaptively favored, and the requisite social–cognitive skills that they
possessed were such that, at some point, the collaborative interactions
in which they engaged became qualitatively new—they became
collaborative interactions in which individuals were able to form a
shared goal to which they jointly committed themselves. Following
Bratman (1992), such shared intentional activities, as he calls them,
also involve understanding others’ plans for pursuing those joint goals
(meshing subplans), and even helping the other in his role if this is
needed. There is basically no evidence from any nonhuman animal
species of collaborative interactions in which different individuals play
different roles that are planned and coordinated, with assistance from
the other as needed.2
Tomasello et al. (in press) take a very close look at human infants
from this point of view and find that whereas infants of nine months
of age can coordinate with adults in some interesting ways that might
reflect an initial ability to form joint goals—such things as rolling a ball
back and forth or putting away toys together—it is at around 12 to 14
months of age that full-fledged shared intentionality seems to emerge. It
is at this age that infants for the first time seem truly motivated to share
experience with others through declarative and informative pointing,
that they encourage others to play their role when a collaborative
interaction breaks down, that they can reverse roles in collaborative
interactions, and that they start to acquire linguistic conventions.
So the specific proposal here—with regard to the question of why
human infants point but other apes do not—is that only humans have
the skills and motivations to engage with others collaboratively, to
form with others joint intentions and joint attention in acts of shared
intentionality. The constitutive motivations are mainly helping and
sharing, which obviously (and as argued above) are an important part
of indicating acts such as pointing. Understanding and coordinating
with others’ plans toward goals is in general a necessary part of human
communication, understood as joint action (Clark 1996). Reversing
roles is a very important part of these collaborative interactions, and it is
likely that the understanding of perspectives is simply the
perceptual–attentional side of such role reversal (Barresi and Moore 1996). And so,
although we certainly do not have all the details worked out at the
moment, it would seem a plausible suggestion that uniquely human forms
of communication, including both nonlinguistic and linguistic
conventions, rest fundamentally on a foundation of uniquely human
forms of collaborative engagement involving shared intentionality.

And How about Language?


I would like to conclude my discussion of pointing by making the
argument—in a very cursory fashion—that many of the aspects
of language that make it such a uniquely powerful form of human
cognition and communication are already present in the humble act
of pointing. And so in searching for the phylogenetic roots of human
linguistic competence, we might profitably begin with the pointing
gesture, which is at least a bit less complicated.
First of all, as stressed by Clark (1996) and as argued above, both
pointing and language are collaborative communicative acts. In both
cases, recipients either signal comprehension or noncomprehension,
and communicators adjust accordingly, sometimes repairing their
communicative acts to help the other understand. This collaborative
communicative structure derives from a human adaptation for collaborative
activity more generally, involving the ability and inclination to
form with others joint intentions and joint attention. As part of this
collaborative structure, humans have developed various conventionalized
devices for coordinating their social interactions, including both pointing
and linguistic symbols. Both of these are bidirectional communicative
signs, learned by imitation and enabling the reversal of roles in the
communicative act; they are both therefore socially shared. Because they
are “arbitrary,” and not purely indexical, humans may use linguistic
symbols to indicate explicitly a virtual infinity of different conceptual
perspectives on things—but still the collaborative structure of pointing
and linguistic symbols is fundamentally the same.
Second and relatedly, to be effective both pointing and linguistic
communication must take into account the perspective of the recipient
(“recipient design” à la Schegloff this volume; see also Enfield and
Levinson in this volume). In many cases, pointing presupposes the joint
attentional common ground as “topic” (old or shared information),
and the pointing act is actually a predication, or focus, informing the
recipient of something new, worthy of her attention. In other cases,
pointing serves to establish a new topic, about which further things may
then be communicated. Both of these are functions served by whole
utterances in linguistic communication (see Lambrecht’s 1994 predicate
focus and argument focus constructions). When human infants first
begin talking, many of their earliest utterances are combinations of
gestures (mostly pointing) with words, which divide up in various
ways the topic and focus functions (Tomasello 2003a). Language goes
beyond deictic gestures in the ease with which linguistic symbols
may be grammaticalized into constructions with complex topic-focus
configurations, but again the building blocks are the same.
Third, the motivations for pointing and communicating linguistically
are basically the same (with the possible exception of some performatives
that can only be formulated linguistically). In both cases, the most
fundamental motivations are helping and sharing, including informing
as a special case. Interestingly, Dunbar (1996) has argued and presented
evidence that in the evolution of human language the motive to
gossip—to simply share information for no immediate reason—is the
key motive. His argument is that modern languages are much more
complicated than they need to be to simply coordinate human actions
in the here and now; their complicated structure reveals in various ways
the need to communicate about things displaced from the here and
now in complicated ways. The important point for current purposes is
simply that the basic motives of such narrative discourse are sharing and
informing, the same basic motives underlying the pointing gesture.
Fourth and finally, in most analyses acts of linguistic communication
comprise two fundamental components: proposition and propositional
attitude (locution and illocution). But pointing also includes these two
components. The locutionary aspect is the spatial indication of the
intended referent, for example, a toy. But then it is something else
again, the illocutionary aspect, that determines whether the pointing
gesture is taken to be an imperative (I want you to bring me that
toy) or declarative (I want you to share my interest in that toy) or
informative (I want you to find the toy you are seeking). It hardly
needs emphasizing how much further language takes us in formulating
complex linguistic constructions embodying complex propositions and
propositional attitudes. But again it is important that the roots of this
complexity are already present in the much simpler communicative
act of pointing.

Conclusion
To explain human cognitive uniqueness, many theorists invoke language.
This contains an element of truth, because only humans use
language and it is clearly important to, indeed constitutive of, uniquely
human cognition in many ways. However, as I have noted before, asking
why only humans use language is like asking why only humans build
skyscrapers, when the fact is that only humans, among primates, build
freestanding shelters at all. And so for my money, at our current level of
understanding, asking why apes do not have language may not be our
most productive question. A much more productive question, and one
that can currently lead us to much more interesting lines of empirical
research, is why apes do not even point.

Notes
1. There is actually one reported incident of a bonobo pointing for conspecifics
in the wild (Veà and Sabater-Pi 1998). This has never been repeated
by any other observers of bonobos or other ape species. There have also been
suggestions in the past that apes point with their whole body (Menzel 1971),
or just with their eyes (de Waal 2001), but these have never been substantiated
as anything more than personal impressions.
2. The most complex cooperative activity of chimpanzees is group hunting,
in which two or more males seem to play different roles in corralling
a monkey (Boesch and Boesch 1989). But in analyses of the sequential
unfolding of participant behavior over time in these hunts, many observers
have characterized this activity as essentially identical to the group hunting
of other social mammals such as lions and wolves (Cheney and Seyfarth
1990; Tomasello and Call 1997). Although it is a complex social activity, as
it develops over time each individual simply assesses the state of the chase
at each moment and decides what is best for it to do. There is nothing that
would be called collaboration in the narrow sense of joint intentions and
attention based on coordinated plans. In experimental studies (e.g., Crawford
1937; Chalmeau 1994), the most complex behavior observed is something like
two chimpanzees pulling a heavy object in parallel, and during this activity
almost no communication among partners is observed (Povinelli and O’Neill
2000). There are no published experimental studies—and several unpublished
negative results (two of them ours)—in which chimpanzees collaborate by
playing different and complementary roles in an activity.

References
Barresi, J., and C. Moore. 1996. Intentional relations and social
understanding. Behavioral and Brain Sciences 19:107–154.
Bates, E., L. Camaioni, and V. Volterra. 1975. The acquisition of performatives
prior to speech. Merrill-Palmer Quarterly 21:205–224.
Behne, T., M. Carpenter, and M. Tomasello. in press. One-year-olds
comprehend the communicative intentions behind gestures in a hiding
game. Developmental Science.
Boesch, C., and H. Boesch. 1989. Hunting behavior of wild chimpanzees
in the Tai Forest National Park. American Journal of Physical Anthropology
78:547–573.
Bratman, M. E. 1992. Shared cooperative activity. Philosophical Review
101(2):327–341.
Call, J. in press. Inferences about the location of food in the great apes.
Journal of Comparative Psychology.
Call, J., B. Agnetta, and M. Tomasello. 2000. Social cues that chimpanzees
do and do not use to find hidden objects. Animal Cognition 3:23–34.
Call, J., and M. Tomasello. in press. What do chimpanzees know about
seeing revisited: An explanation of the third kind. In Issues in joint
attention, edited by N. Eilan, C. Hoerl, T. McCormack, and J. Roessler.
Oxford: Oxford University Press.
Chalmeau, R. 1994. Do chimpanzees cooperate in a learning task?
Primates 35:385–392.
Cheney, D. L., and R. M. Seyfarth. 1990. How monkeys see the world.
Chicago: University of Chicago Press.
Clark, H. 1996. Using language. Cambridge: Cambridge University Press.
Crawford, M. P. 1937. The cooperative solving of problems by young
chimpanzees. Comparative Psychology Monographs 14:1–88.
De Waal, F. 2001. Pointing primates: Sharing knowledge without
language. Chronicle of Higher Education, B7–B9.
Dunbar, R. 1996. Grooming, gossip and the evolution of language. London:
Faber and Faber.
Franco, F., and G. Butterworth. 1996. Pointing and social awareness:
Declaring and requesting in the second year. Journal of Child Language
23:307–336.
Gilbert, M. 1989. On social facts. Princeton: Princeton University Press.
Golinkoff, R. 1993. When is communication a meeting of the minds?
Journal of Child Language 20:199–208.
Goodall, J. 1986. The chimpanzees of Gombe: Patterns of behavior.
Cambridge, MA: Harvard University Press.
Hare, B., J. Call, B. Agnetta, and M. Tomasello. 2000. Chimpanzees know
what conspecifics do and do not see. Animal Behaviour 59:771–785.
Hare, B., J. Call, and M. Tomasello. 2001. Do chimpanzees know what
conspecifics know? Animal Behaviour 61:139–151.
Hare, B., and M. Tomasello. 2004. Chimpanzees are more skillful in
competitive than in cooperative cognitive tasks. Animal Behaviour
68:571–581.
Itakura, S., B. Agnetta, B. Hare, and M. Tomasello. 1999. Chimpanzees
use human and conspecific social cues to locate hidden food.
Developmental Science 2:448–456.
Kaminski, J., J. Call, and M. Tomasello. 2004. Body orientation and face
orientation: Two factors controlling apes’ begging behavior from
humans. Animal Cognition 7:216–223.
Lambrecht, K. 1994. Information structure and sentence form. Cambridge:
Cambridge University Press.
Leavens, D. A., and W. D. Hopkins. 1998. Intentional communication
by chimpanzees (Pan troglodytes): A cross-sectional study of the use
of referential gestures. Developmental Psychology 34:813–822.
Liszkowski, U., M. Carpenter, A. Henning, T. Striano, and M. Tomasello.
2004. 12-month-olds point to share attention and interest. Developmental
Science 7:297–307.
Liszkowski, U., M. Carpenter, and M. Tomasello. in press. 12-month-olds
point to inform others. Journal of Cognition and Development.
Menzel, E. W., Jr. 1971. Communication about the environment in a
group of young chimpanzees. Folia Primatologica 15:220–232.
Moll, H., C. Coring, M. Carpenter, and M. Tomasello. in press. Infants
follow attention to aspects of objects. Journal of Cognition and
Development.
Povinelli, D. J., and D. K. O’Neill. 2000. Do chimpanzees use their gestures
to instruct each other? In Understanding other minds: Perspectives from
autism, 2nd edition, edited by S. Baron-Cohen, H. Tager-Flusberg, and
D. J. Cohen, 459–487. Oxford: Oxford University Press.
Savage-Rumbaugh, S. 1990. Language as a cause-effect communication
system. Philosophical Psychology 3:55–76.
Schwe, H., and E. Markman. 1997. Young children’s appreciation of
the mental impact of their communicative signals. Developmental
Psychology 33:630–635.
Searle, J. 1995. The construction of social reality. New York: Free Press.
Sperber, D., and D. Wilson. 1986. Relevance: Communication and cognition.
Cambridge, MA: Harvard University Press.
Tomasello, M. 1996. Do apes ape? In Social learning in animals: The
roots of culture, edited by J. Galef and C. Heyes, 319–346. New York:
Academic Press.
Tomasello, M. 1999. The cultural origins of human cognition. Cambridge,
MA: Harvard University Press.
Tomasello, M. 2003a. Constructing a Language: A Usage-Based Theory of
Language Acquisition. Cambridge, MA: Harvard University Press.
Tomasello, M. 2003b. The pragmatics of primate communication.
In Handbook of Pragmatics, edited by J. Verschueren. Amsterdam:
Benjamins.
Tomasello, M., and J. Call. 1997. Primate Cognition. Oxford: Oxford
University Press.
Tomasello, M., J. Call, and A. Gluckman. 1997. The comprehension
of novel communicative signs by apes and human children. Child
Development 68:1067–1081.
Tomasello, M., J. Call, and B. Hare. 1998. Five primate species follow
the visual gaze of conspecifics. Animal Behaviour 55:1063–1069.
Tomasello, M., J. Call, and B. Hare. 2003. Chimpanzees understand
psychological states: The question is which ones and to what extent.
Trends in Cognitive Science 7:153–156.
Tomasello, M., J. Call, K. Nagell, R. Olguin, and M. Carpenter. 1994.
The learning and use of gestural signals by young chimpanzees: A
trans-generational study. Primates 37:137–154.
Tomasello, M., J. Call, J. Warren, T. Frost, M. Carpenter, and K. Nagell.
1997. The ontogeny of chimpanzee gestural signals: A comparison
across groups and generations. Evolution of Communication 1:223–253.

Tomasello, M., and L. Camaioni. 1997. A comparison of the gestural
communication of apes and human infants. Human Development
40:7–24.
Tomasello, M., and M. Carpenter. in press. The emergence of social
cognition in three young chimpanzees. Monographs of the Society for
Research in Child Development.
Tomasello, M., M. Carpenter, J. Call, T. Behne, and H. Moll. in press.
Understanding and sharing intentions: The origins of cultural
cognition. Behavioral and Brain Sciences.
Tomasello, M., B. George, A. Kruger, J. Farrar, and A. Evans. 1985. The
development of gestural communication in young chimpanzees.
Journal of Human Evolution 14:175–186.
Tomasello, M., D. Gust, and T. Frost. 1989. A longitudinal investigation
of gestural communication in young chimpanzees. Primates 30:35–50.
Tomasello, M., and K. Haberl. 2003. Understanding attention: 12- and
18-month-olds know what’s new for other persons. Developmental
Psychology 39:906–912.
Tomasello, M., B. Hare, and B. Agnetta. 1999. Chimpanzees follow gaze
direction geometrically. Animal Behaviour 58:769–777.
Tomasello, M., B. Hare, and T. Fogleman. 2001. The ontogeny of gaze
following in chimpanzees and rhesus macaques. Animal Behaviour
61:335–343.
Tomasello, M., A. Kruger, and H. Ratner. 1993. Cultural learning. Target
article for Behavioral and Brain Sciences 16:495–552.
Tuomela, R. 1995. The importance of us: A philosophical study of
basic social notions. Stanford Series in Philosophy. Stanford: Stanford
University Press.
Veà, J., and J. Sabater-Pi. 1998. Spontaneous pointing behaviour in
the Wild Pygmy Chimpanzee (Pan paniscus). Folia Primatologica
69:289–290.
Vygotsky, L. 1978. Mind in society: The development of higher psychological
processes. Cambridge, MA: Harvard University Press.
Warneken, F., and M. Tomasello. n.d. Helping in one-year-olds and young
chimpanzees. Manuscript submitted for publication.
Warneken, F., F. Chen, and M. Tomasello. in press. Collaboration in
one-year-olds and young chimpanzees. Child Development.
Index

action chains biological anthropology, 7–8


interaction, 45–6, 47 biological evolution, 472–3
segmentation, 482–4 bootstrap problem, 467
actions, parsing of, 478–80, 488–92 brain, thinking processes, 388–90
affiliational imperative, interaction, British Sign Language (BSL), 217–18
399–400, 412–22
African Americans, Trackton, infant caregivers, interaction with infants, 19–20,
interaction, 283, 285, 287, 292 279–95
airline flight decks, distributed cognition, caste system, interaction, 55
378–9 causal-intentional theories, 480, 494–6
altruism, 459, 470, 473 CCCs, see Cognitive Causal Chains
American Sign Language (ASL), 20, 342–4, CCCCs, see Cultural Cognitive Causal
354, 480 Chains
animal societies, 453 children
anthropology, developments in, 7–9 communication paradox, 196–8
apes conversation development, 211
behavior parsing, 479–80, 488–99 deaf, 217–24, 353–65
food processing, 484–93, 494 false-belief understanding, 4, 181,
gaze following, 506–7 191–5, 197, 208–10
gestures, 496–8, 506–7, 515 gestures, 365–70
imitation by, 235–6, 484–92, 515–16 interpretive abilities, 195–6
language use, 492 language development, 179, 182–98,
maimed, 230, 233, 487–8 207, 210–14, 353–4
monkey comparisons, 492–4 linguistic invention, 353–71
nettle processing, 25, 488–92 mental-state awareness, 185–9
pointing and, 26, 154, 507–10 mental-state vocabulary, 212–13,
sociality, 2 215–17, 221–2
social transmission, 230–6 metarepresentational abilities, 189–95
Theory of Mind (ToM), 498–9 psychological development, 4–5
tool use, 230–2, 234 syntax use, 213–14, 215–17
vocalizations, 481–2 Theory of Mind (ToM), 180–2, 185
see also monkeys see also infants
aphasia, 14–15, 40, 98, 103–17, 119–22, chimpanzees, see apes
361 Christianity, spread of, 466–7
artifacts, cognitive, 378 cognition
Asperger’s syndrome, 41 common ground, 407–8
attentiveness, conversation, 99–100 culture and, 431–2
autism, 4, 41 distributed cognition, 376–95, 434
evolutionary theories, 496–8
Baka, Cameroon, 181 in interaction, 21–4, 87–9, 172–3
behavior representations, 432–4
parsing, 25, 478–99 see also social cognition
see also group behavior Cognitive Causal Chains (CCCs), 23–4,
binding problem, 46, 53, 388 434–6
Index

cognitive labor, 377–8 cooking


cognitive opacity, 236–8, 241, 246–8 cultural transmission, 237–8
collaboration development of, 333
human interaction, 15, 97–8, 517 see also feeding behavior
see also joint commitment cooperation
collateral communication, 129 group benefits, 454–9
commitments instinctive, 24–5, 471–2
hierarchies, 136–7 interaction, 45, 46–7, 50, 471
individual, 130 large-scale, 460–1
induced, 302, 323–5 opportunities for, 456–8
see also joint commitments reciprocity, 459–61
common ground selection and, 458–9
audience, 409–12 see also joint commitment
cognition, 407–8 coordination problem, 131
conversation, 401–12 Cultural Cognitive Causal Chains
exploiting, 23 (CCCCs), 436–9, 440
inference, 401–7 cultural conventions
interaction, 49, 112 common ground, 401
production of, 300–1 evolution of, 20–1
sign language, 335 cultural epidemiology, 439, 445–6
social relationships, 399–400, 412–23 cultural evolution
sources of, 400–1 biology and, 472–3
speech, 401–5 environmental factors, 443–6
communication group selection, 464–7
cognitive understanding, 154–6 human interaction and, 28–9, 453–73
collateral, 129 novel behaviors, 467–8
cultural transmission, 440–3 prosocial instincts, 469–70
goals, 155, 160–9 psychological factors, 431–2, 443–6
intentional, 155, 159–60, 508–10 stability, 440–3
linguistic, 518–20 cultural variation
motives, 155–6, 160–9 interaction, 18–21, 55–62, 72–3, 83–7,
and technological developments, 20–1, 279–95
329, 333–5 Theory of Mind (ToM), 181–2
see also pointing cultures, stability, 440–6
competition, group behavior, 464–5
computer-mediated communication Darwinian theories, 440–1, 464
(CMC), 332–47 deaf adults, computer-mediated
conflicts, cooperation and, 453, 458 communication (CMC), 332–47
conventions, conversation, 268–70 deaf children
conversation false-belief understanding, 217–24
action recognition, 87–9 gestures, 354–65
analysis, 7, 47–8 interaction with parents, 360–5
children’s development of, 211, 286–7 linguistic invention, 353–65
common ground, 401–12 narrative, 361–2
conventions, 268–70 Nicaragua, 219–24
cultural variations, 56–61, 72–3 parents of, 359
false belief and, 193 reflexive language, 362–4
implicature, 265–8 dialect, common ground, 401
minimization, 85–6 Dinka, Sudan, 464–5
nextness, 86 discourse, see conversation
overall structure, 82–3 disgust, 445–6
progressivity, 86–7 distributed artificial intelligence (DAI), 330
repair practices, 78–9, 100–3, 118–19, distributed cognition
211 human interaction, 376–80, 388–95,
sequential, 73–7 434
structure, 41 ship navigation, 380–8
trouble problem, 77–9 divination
turn taking, 51–2, 53, 71–2, 84 interaction, 304–23
word selection problem, 79–82 language of, 301
see also language setting of, 302–4
Index

division of labor, 453, 456–7 group behavior


Down’s syndrome, 41 competition, 464–5
cooperation, 24–5, 454–61
emotion, joint commitments, 146–7 evolutionary theories, 462–8
emulation learning, 233 Guatemala
environmental factors, cultural evolution, intentionality, 262
443–6 Quiche language interactions, 84
epidemiological approach, culture, 439, Gusii, infant interaction, 282, 286, 287, 292
445–6
Erasmus, 445–6 Hamilton’s rule, 459, 462
Euro-American culture, infant interaction, hierarchies
280, 283–4, 288–9 behavior parsing, 479–80, 484–8, 491–2
evolutionary theories commitments, 136–7
cognition, 496–8 hominins
cultural evolution, 28–9, 431–2, 440–6 language evolution, 497–8
group behavior, 462–8 teleological reasoning, 246–8
social interaction, 24–6, 453–73 human interaction
exploitation, joint commitments, 139–40, collaboration and commitment, 15,
144–5 126–7, 517
cooperation, 471
false-belief understanding cultural evolution, 431–2, 453–73
deaf children, 217–24 cultural variation, 18–21, 55–62, 72–3,
development of, 4, 181, 191–5, 197, 83–7, 279–95
208–10 definitions, 375
language and, 192–5, 210–14, 222–3 distributed cognition, 376–80
Mopan Maya, 181, 261–2 generic problems and solutions, 14, 71–83
Nicaragua sign language (NSL), 219–24 gestures, 365–71
typically developing children, 214–17 infants, 40
feeding behavior language and, 299–302, 354
evolution of, 479–80, 494 linguistic meanings, 263–5
gorillas, 485–92 local richness, 14–15
see also cooking macrolevel and microlevel, 330–2
furniture assembly, joint commitment, multimodal, 375
127–38 non-verbal, 14–15, 40, 98, 441
origins of, 391–4
game theory, 454–6 principles of, 40–4
gaze following properties of, 13–15
apes, 506–7 psychology of, 4–5, 15–18
infants, 159–60, 170–1 rituals, 304–25
gestures, 22, 55–6 social structure and, 70–1
apes, 496–8, 506–7, 515 time spent, 41–2
deaf children, 354–9 understanding, 394–5
environmentally coupled, 385–90 see also interaction engine; social
hearing parents, 359 interaction
iconic, 496 human pedagogy, 241–6, 248
narrative, 361–2 human sociality
present and nonpresent, 360–1 culture and, 18–21
reflexive language, 362–4 distinctive properties of, 2–3
speech and, 300, 365–70 framework, 26–9
teachers and, 367–70 interactional practices, 83–9
see also pointing; sign language interdisciplinarity, 9–12
Goffman, Erving, 90 learning skills, 344–5
gorillas, see apes
grammar, gesture languages, 357 imitation
Gricean pragmatics by apes, 235–6, 484–92, 515–16
common ground, 401–5 by monkeys, 433, 483
conversational implicature, 265–6, 270 human pedagogy, 240–6, 248
flouting of, 271 learning by, 229, 238–40, 485–92,
intentions, 49–50, 54, 264–5, 273 515–16
outline of, 5–6 social transmission, 17–18, 229, 238–46
implicature irony, understanding of, 195–6
conversational, 265–6 Italy, infant interaction, 282–3
natural, 267–8
individuals, interaction engine, 26–7 Japan, infant interaction, 289
induced commitments, 302, 323–5 joint actions, 133–4
infants joint attention, 400–1
attention directing, 167–8, 170–1 joint commitments
attention sharing, 161–7, 171–2 emergence of, 134–9
communicative goals and motives, emotion, 146–7
160–9 entanglements, 138–9
declarative pointing, 157, 158–9, establishing, 129–34
510–11 exploitation, 139–40, 144–5
false-belief understanding, 222–3 hierarchies, 136–7
gaze alternation, 159–60, 170–1 morality, 146–7
imitative learning, 238–40, 243–6 overcommitment, 140, 145–6
imperative pointing, 157–8, 159, partitioning, 127–9
510–12 projective pairs, 131–3
influencing another person, 283–8 risks, 139–47
information about the world, 288–91 social actions, 126–7
inner experience, 281–3 stacking and persistence, 137–8
intentional communication, 159–60
interaction with caregivers, 19–20, kin selection, 459, 462
279–95
learning by, 242–3 language
play, 290 acquisition of, 99–100, 207, 354
pointing, 16, 153–74, 508–16 evolution of, 470, 471–2
social-cognitive abilities, 170–3, 184–5, false belief and, 192–5, 210–14, 222–3
224–5 interaction role, 42–4, 299–302
social interaction, 40, 292–5 motherese, 19, 180, 287–8
see also children national, 471–2
inference, common ground, 401–7 origins, 480
informational imperative, interaction, philosophy of, 261–2
399, 411 pointing and, 518–20
inner experience, expression of, 281–3 properties of, 356
institutional organization, cognition, ritual, 300–25
377–8 sociocultural role, 28
integration, induced commitment and, Theory of Mind (ToM) and, 16–17, 179,
323–5 182–98
intelligence, evolution of, 492–4 see also conversation; sign language;
intentionality speech
and cause, 480, 494–6 Laos, common ground, 402–5
communications, 155, 159–60, 508–10 learning
Gricean, 49–50, 54, 264–5, 273 cultural evolution, 468
Mopan Maya, 19, 262–3, 272 emulation learning, 233
shared, 516–18 human pedagogy, 241–6, 248
intention attribution, 18–19, 25, 48, 50, 54 imitation role, 229, 238–40, 485–92,
interaction 515–16
human, see human interaction pointing, 513–16
social, see social interaction linguistic anthropology, 9
interaction engine linguistic invention, deaf children, 353–65
culture and, 55–62, 273
ingredients, 48–55 macrolevel interactions, 330–2
joint commitment and, 126 material environment, interaction with,
outline of, 13 378–9
properties of, 44–8 Maya, see Mopan Maya; Yucatec Maya
in sociality framework, 26–7 memetic evolution, 440–1
interaction matrix, 27, 28–30 memory, collective, 377–8
internet, see computer-mediated mental state
communication (CMC) awareness, 185–9, 346, 498–9
interpretive abilities, children, 195–6 vocabulary, 212–13, 215–17, 221–2
metarepresentational abilities, development of, 189–95
Mexico, interactions, 56
microlevel interactions, 330–2
Milgram Experiments, 15, 140–7, 148
“mind reading,” 9, 16, 54, 407
minimization, conversation, 85–6
mirror neurons, 48, 54, 433, 483–4
monkeys
  ape comparisons, 492–4
  group behavior, 464
  imitation, 433, 483
  vocalizations, 481
  see also apes
Mopan Maya
  false belief, 181, 260–1
  intentionality, 19, 262–3, 272
  philosophy of language, 261–2
  wrongdoing, 259–61
morality
  joint commitments, 146–7
  punishment, 458, 460–1
motherese, 19, 180, 287–8
multimodal communications, 46, 53, 98, 118–22
multimodal interaction, 375
mutual knowledge, 49
mutual salience, 49, 54

narrative, deaf children, 361–2
nations, language, 471–2
nettles, processing by gorillas, 25, 488–92
New Guinea, group selection, 465–6
nextness, conversation, 86
Nicaraguan Sign Language (NSL), 17, 195, 219–24
  cohort differences, 221–2
  development of, 219–20
  false-belief understanding, 220–1
Nichols’s hypothesis, 445–6
nonce signals, 45
Nuer, Sudan, 464–5

ontogeny
  and interaction, 29
  pointing, 154, 173–4
ostensive signals, learning, 242, 243–6
overcommitment, joint commitments, 140, 145–6

parents
  of deaf children, 359
  see also caregivers
parsing behavior, see behavior
participation structure, 52
pedagogy
  evolution of, 246–8
  human, 241–6, 248
perception, representation and, 433–4
performance errors, 100, 118, 120
Philippines, intentionality, 262
pointing
  apes and, 26, 154, 507–10
  aphasia and, 98, 103–17, 119–22, 361
  communication and, 25–6, 153–4
  comprehension of, 507–10
  declarative pointing, 157, 158–9, 510–11
  imperative pointing, 157–8, 159, 510–12
  infants and, 16, 153–74, 507–16
  language and, 518–20
  learning to, 513–16
  motives for, 510–13
  ontogeny, 154, 173–4
  see also gestures
Price, George, 462
primates
  group cooperation, 463
  non-human, see apes
prisoner’s dilemma, 456, 457
problem solving, gestures and, 366–70
progressivity, conversation, 86–7
projective pairs, joint commitments, 131–3
prosody, non-verbal, 104–5, 108–9, 111, 112
psychology
  cultural evolution, 431–2, 443–6
  human interaction, 4–5, 15–18
  reverse psychology, 271
punishment, moralistic, 458, 460–1

rational learning, 17–18
recipient design, 6, 45, 49, 89, 519
reciprocity, group behavior, 459–61
referential actions, 242
reflexive language, deaf children, 362–4
reflexive thinking, 49
repair practices, conversation, 78–9, 100–3, 118–19, 211
representation, cognition and, 432–4
representational redescription (RR) theory, 197
research traditions
  anthropology, 7–9
  Gricean pragmatics, 5–6
  social interaction analysis, 7
  Theory of Mind (ToM), 4–5
response facilitation, 483–4
reverse psychology, 271
risks, joint commitments, 139–47
rituals, interaction, 300–25
Rossel Island, Papua New Guinea
  interactions, 56, 58–61
  sign language, 42–4

Samoa, intentionality, 262
SCCCs, see Social Cognitive Causal Chains
Schelling mirror world, 49, 51–2, 54, 62
scissors, use of, 444
second-order beliefs, 195–6
sedan-chair principle, 131
semiosis, 407
sentences, gesture languages, 358–9
sequence-organization, conversation, 73–7
shaman, language of, 301–25
ship navigation, distributed cognition, 380–95
sign language
  adaptations to technology, 336–47
  American Sign Language (ASL), 20, 342–4, 354, 480
  British Sign Language (BSL), 217–18
  communication technology, 333–5
  evolution of, 20–1, 219–24
  false belief and, 217–18, 220–2
  “home-sign” systems, 21–2, 42–4, 45, 354–71
  learning, 354–5
  Nicaragua, 17, 195, 219–24
  webcam transmission, 334–47
  see also gestures
social actions, joint commitments, 126–7, 147–8
social cognition
  development of, 224–5
  infants, 170–3
Social Cognitive Causal Chains (SCCCs), 434–6, 437–9
social instincts
  cooperation, 471–2
  cultural evolution, 469–71
social intelligence, 40–1
social interaction
  affiliational imperative, 399–400, 412–22
  animals, 453
  cognition and, 21–4, 87–9, 172–3
  common ground, 399–423
  culture and, 18–21, 279–95
  development of, 293–5
  evolutionary perspectives, 24–6, 453–4
  reciprocity, 459–61
  systematics of, 7
  see also human interaction
sociality
  apes, 2
  see also human sociality
social learning, 468
social relationships, common ground, 412–22
social structure, interaction and, 70–1
social transmission
  apes, 230–6
  cognitive opacity, 236–8, 241
  cultural evolution, 440–3
  emulation learning, 233
  imitation, 17–18, 229
sociocultural anthropology, 8–9
sociocultural frame, 27, 30
specificity hypothesis, 212
speech
  common ground, 401–5
  evolution of, 481–2
  gestures and, 300, 365–70
  see also language
stag hunt game, 454–6
Sudan, group selection, 464–5
syntax, language development, 193–4, 213–14, 215–17

talk interaction, see conversation
teachers, gestures and, 367–70
teaching, see pedagogy
technology
  communication developments, 20–1, 329, 333–5
  tools, 332–3
teleological reasoning
  apes, 230–6
  early hominids, 246–8
telephone communication
  common ground, 415–16
  deaf people, 333–4
Test-Operate-Test-Exit (TOTE), 51
Theory of Mind (ToM)
  action recognition, 87–9
  apes, 498–9
  cultural diversity, 181–2
  deaf children, 217–19
  features of, 4–5, 180–1, 208–10
  infant pointing, 172–3
  interaction and, 48–9, 73–4
  language and, 16–17, 179, 182–98
ToM, see Theory of Mind
tools
  development of, 332–3
  hominid use, 246–8
  use by apes, 230–2, 234
tribal society, instincts, 469–70
trouble problem, interaction, 77–9
turn-constructional unit (TCU), 77–8, 79
turn taking
  conversation, 51–2, 53, 84
  problem of, 71–3

unexpected-contents task, 209–10
unseen-displacement task, 209
utterances, multimodal, 118–22

Vygotskian concept, 187, 190

webcam, sign-language use, 334–47
word selection, conversation, 79–82
words, sign language, 355–7

Yucatec Maya
  infant interaction, 282–3, 284, 286, 289–90, 291
  ritual interaction, 300–25

zone of proximal development, 190
